A Hands-On Introduction to Scrapy, the Python Crawling Framework

Introduction: Scrapy is an application framework written for crawling websites and extracting structured data from them. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.


Scrapy components: Scheduler, Downloader, Spider, Middleware, Item Pipeline, Scrapy Engine
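To make the roles of these components more concrete, here is a minimal toy sketch of how requests and items might flow between them. It is not Scrapy code: every name in it (FAKE_WEB, downloader, spider_parse, run_engine) is a hypothetical stand-in, the "web" is just a dictionary, and middleware is left out entirely.

```python
from collections import deque

# Hypothetical stand-in for the network; none of these names are Scrapy APIs.
FAKE_WEB = {
    "page/1": {"quotes": ["q1", "q2"], "next": "page/2"},
    "page/2": {"quotes": ["q3"], "next": None},
}


def downloader(url):
    """Downloader: fetch a 'response' for a request (here, a dict lookup)."""
    return FAKE_WEB[url]


def spider_parse(response):
    """Spider: turn a response into items and follow-up requests."""
    for quote in response["quotes"]:
        yield {"item": quote}
    if response["next"]:
        yield {"request": response["next"]}


def run_engine(start_url):
    """Engine: move requests and items between the other components."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    stored = []                     # Item Pipeline: here just a list
    while scheduler:
        url = scheduler.popleft()
        response = downloader(url)
        for result in spider_parse(response):
            if "request" in result:
                scheduler.append(result["request"])  # back to the scheduler
            else:
                stored.append(result["item"])        # on to the pipeline
    return stored


print(run_engine("page/1"))  # prints ['q1', 'q2', 'q3']
```

The real engine is asynchronous and passes requests and responses through middleware on the way, so treat this only as a mental model.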
def parse(self, response):
    pass
python3 -m pip install scrapy  # may take a while depending on your connection; if it times out before finishing, run the command again
scrapy startproject lab  # create a new Scrapy project; if the scrapy command is not found, add Scrapy to your PATH environment variable
cd lab  # enter the newly created project directory
scrapy genspider labs http://lab.scrapyd.cn/page/1/  # generate the spider code

# items.py
import scrapy

class LabItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()

# spiders/labs.py (method of the generated spider class)
def parse(self, response):
    # LabItem is defined in items.py (import it with: from lab.items import LabItem)
    # Each quote on the page sits in its own <div class="quote post">
    for quote in response.xpath('//div[@class="col-mb-12 col-8"]//div[@class="quote post"]'):
        item = LabItem()  # create a fresh data object for every quote
        # the leading "." keeps each query relative to the current quote node
        item["title"] = quote.xpath('.//span[@class="text"]/text()').get()
        item["author"] = quote.xpath('.//small[@class="author"]/text()').get()
        yield item  # return each extracted data object
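
The extraction idea behind parse (find each quote block, then read the text and the author inside that block) can be tried without Scrapy or a network connection. The sketch below is only an approximation: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and the SAMPLE markup and the extract_quotes helper are made up for illustration, reusing only the class names that appear in the XPath expressions above.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mimics the class names used in the spider.
SAMPLE = """
<div class="col-mb-12 col-8">
  <div class="quote post">
    <span class="text">First quote</span>
    <small class="author">Alice</small>
  </div>
  <div class="quote post">
    <span class="text">Second quote</span>
    <small class="author">Bob</small>
  </div>
</div>
"""


def extract_quotes(markup):
    """Return one dict per quote block, pairing each text with its author."""
    root = ET.fromstring(markup.strip())
    rows = []
    for quote in root.findall(".//div[@class='quote post']"):
        # The leading "." makes each lookup relative to the current quote block.
        title = quote.findtext(".//span[@class='text']")
        author = quote.findtext(".//small[@class='author']")
        rows.append({"title": title, "author": author})
    return rows


for row in extract_quotes(SAMPLE):
    print(row)
```

Searching inside each block, rather than across the whole page, keeps a title and its author paired even if one block happens to be missing a field.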
# pipelines.py
import json

from itemadapter import ItemAdapter


class FilePipeline:
    def __init__(self):
        # Runs once, when Scrapy creates the pipeline object
        print("Creating the file that stores the crawled data")
        self.file = open('test.json', "w", encoding="utf-8")

    def open_spider(self, spider):
        print("Callback when the spider starts: open_spider")

    def process_item(self, item, spider):
        print("Processing an extracted item")
        # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable
        content = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        print("Callback when the spider finishes: close_spider")
        self.file.close()
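
The order of the callbacks can be checked by driving a pipeline by hand, without running a crawl. The following is a minimal sketch under stated assumptions: JsonLinesPipeline and run_by_hand are hypothetical names, the class is a simplified stand-in for FilePipeline that takes the output path as an argument, and passing spider=None only works because this toy never uses the spider.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical stand-in that mirrors the callbacks of FilePipeline."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Open the output file when the crawl starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Write one JSON object per line and pass the item on.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Close the file when the crawl ends.
        self.file.close()


def run_by_hand(items, path):
    """Call the hooks in crawl order: open, process each item, close."""
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()


path = os.path.join(tempfile.mkdtemp(), "test.json")
lines = run_by_hand([{"title": "t1", "author": "a1"}], path)
print(lines)  # prints ['{"title": "t1", "author": "a1"}']
```

Opening the file in open_spider instead of __init__ is one possible design choice; it keeps the file handle tied to the lifetime of the crawl.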
FilePipeline is a custom pipeline, so it also has to be registered in the project's settings.py file, as follows:
ITEM_PIPELINES = {
   'lab.pipelines.FilePipeline': 300,
}
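
The number 300 is the pipeline's order value: when several pipelines are enabled, items pass through them from the lowest value to the highest, and values are conventionally chosen in the 0 to 1000 range. The sketch below only illustrates that ordering rule with plain Python. EXAMPLE_PIPELINES and run_order are hypothetical names, and CleanupPipeline is an invented second pipeline that does not exist in this project.

```python
# Illustration only: CleanupPipeline is a hypothetical second pipeline.
EXAMPLE_PIPELINES = {
    "lab.pipelines.FilePipeline": 300,
    "lab.pipelines.CleanupPipeline": 100,
}


def run_order(pipelines):
    """Return the pipeline paths sorted by ascending order value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(run_order(EXAMPLE_PIPELINES))
# prints ['lab.pipelines.CleanupPipeline', 'lab.pipelines.FilePipeline']
```

With these values, a cleanup step would see every item before FilePipeline writes it to disk.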



