Hands-on | How to Write a Complete Crawler with Scrapy!

Hi everyone, I'm An Guo!
When it comes to crawler frameworks, Scrapy deserves a mention: it is a very powerful asynchronous crawler framework (distributed crawling can be added via extensions), which makes it a great fit for enterprise-grade scraping!
項目地址:
https://github.com/scrapy/scrapy
In this article, we'll walk through the complete workflow of writing a Scrapy crawler using a simple, real example.
1. Hands-on
Target:
aHR0cHMlM0EvL2dvLmNxbW1nby5jb20vZm9ydW0tMjMzLTEuaHRtbA==
We need to scrape basic information about the posts on the target site (the address above is base64- and percent-encoded).
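Since the target is obfuscated, here is a quick sketch of how to recover the real address locally; it simply reverses the base64 and percent encoding.
# decode_target.py -- recover the obfuscated target address
import base64
from urllib.parse import unquote

encoded = "aHR0cHMlM0EvL2dvLmNxbW1nby5jb20vZm9ydW0tMjMzLTEuaHRtbA=="
# base64 -> percent-encoded string -> plain URL
url = unquote(base64.b64decode(encoded).decode("utf-8"))
print(url)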
2-1 Install dependencies
# Install Scrapy
pip3 install Scrapy
# MySQL driver
pip3 install mysqlclient
2-2 Create the project and the spider
Analyze the target URL to get the site HOST and the listing address to crawl, then run the following commands in a working folder to create a crawler project and a spider:
# Create a crawler project
scrapy startproject cqmmgo
# Enter the project folder
cd cqmmgo
# Create a spider (replace HOST with the site's domain)
scrapy genspider talk HOST
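After these two commands, the generated scaffold looks roughly like this (a standard Scrapy layout; the files edited in the following sections all live inside the inner cqmmgo package):
cqmmgo/
├── scrapy.cfg          # project deployment/config entry point
└── cqmmgo/
    ├── items.py        # Item definitions (section 2-3)
    ├── middlewares.py  # downloader middlewares (section 2-5)
    ├── pipelines.py    # item pipelines (section 2-6)
    ├── settings.py     # project settings (section 2-7)
    └── spiders/
        └── talk.py     # the spider created by genspider (section 2-4)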
2-3 Define the Item entity
In items.py, define the data you want to scrape as an Item.
Here we need the post title, author, view count, comment count, post URL, and publish time.
# items.py
import scrapy


# Items from the 雜談 (chit-chat) board
class CqTalkItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # Author
    author = scrapy.Field()
    # View count
    watch_num = scrapy.Field()
    # Comment count
    comment_num = scrapy.Field()
    # Post URL
    address_url = scrapy.Field()
    # Publish time
    create_time = scrapy.Field()
2-4 Write the spider
Write the actual crawling logic in the spider file under the spiders folder.
Analysis shows that the post data is rendered server-side from a template rather than loaded dynamically, so we can parse the response directly.
PS: XPath is the recommended way to parse it here.
The parsed data is packed into the Item defined above and yielded from the parse method.
# spiders/talk.py
import scrapy

from cqmmgo.items import CqTalkItem
from cqmmgo.settings import talk_hour_before
from cqmmgo.utils import calc_interval_hour


class TalkSpider(scrapy.Spider):
    name = 'talk'
    allowed_domains = ['HOST']

    # Pages 1-5 of the listing
    start_urls = ['https://HOST/forum-233-{}.html'.format(i + 1) for i in range(5)]

    def parse(self, response):
        # Parse the response directly with XPath
        elements = response.xpath('//div[contains(@class,"list-data-item")]')
        for element in elements:
            item = CqTalkItem()

            # Several XPath expressions were tried; the @title attribute below is the one that works
            title = element.xpath('.//*[@class="subject"]/a/@title').extract_first()
            author = element.xpath(".//span[@itemprop='帖子作者']/text()").extract_first()
            watch_num = element.xpath(".//span[@class='num-read']/text()").extract_first()
            comment_num = element.xpath(".//span[@itemprop='回復數(shù)']/text()").extract_first()
            address_url = "https:" + element.xpath('.//*[@class="subject"]/a/@href').extract_first()
            create_time = element.xpath('.//span[@class="author-time"]/text()').extract_first().strip()

            # Skip posts older than the configured number of hours
            if calc_interval_hour(create_time) > talk_hour_before:
                continue

            print(
                f"Title: {title}, author: {author}, views: {watch_num}, comments: {comment_num}, "
                f"URL: {address_url}, published: {create_time}")

            item['title'] = title
            item['author'] = author
            item['watch_num'] = watch_num
            item['comment_num'] = comment_num
            item['address_url'] = address_url
            item['create_time'] = create_time
            yield item
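The spider imports calc_interval_hour from cqmmgo.utils, and the MySQL pipeline later calls a current_date helper; neither is shown in the original post. A minimal sketch of what they might look like, assuming the forum prints absolute timestamps such as 2022-11-01 12:30 (adjust the format string to whatever author-time actually contains):
# utils.py (not shown in the original; a possible implementation)
from datetime import datetime


def calc_interval_hour(time_str, fmt="%Y-%m-%d %H:%M"):
    """Return how many hours ago the given timestamp was (format is an assumption)."""
    delta = datetime.now() - datetime.strptime(time_str, fmt)
    return delta.total_seconds() / 3600


def current_date(fmt="%Y-%m-%d %H:%M:%S"):
    """Current time as a string, used as insert_time by the MySQL pipeline."""
    return datetime.now().strftime(fmt)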
2-5 A custom random User-Agent downloader middleware
In middlewares.py, define a downloader middleware that sets a random User-Agent on every request.
# middlewares.py
import random  # used to pick a UA at random

# Pool of User-Agent strings
USER_AGENT_LIST = [
    'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
    'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
    'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
    'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
    'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
    'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
    'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
]


class RandomUADownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random UA and set it on the request headers
        # (note the header name is "User-Agent", not "User_Agent")
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
2-6 Custom item pipelines
In pipelines.py, define two pipelines that write the scraped data to a local CSV file and to a MySQL database respectively.
PS: For simplicity, only the synchronous way of writing to MySQL is shown here (see the asynchronous sketch further below).
# pipelines.py
import os

import MySQLdb  # MySQL driver (mysqlclient)
from scrapy.exporters import CsvItemExporter

from cqmmgo.items import CqTalkItem
from cqmmgo.utils import current_date  # assumed helper, see the utils sketch above


class TalkPipeline(object):
    """Export talk items to a local CSV file"""

    def __init__(self):
        # Make sure the output folder exists before opening the file
        os.makedirs("./result", exist_ok=True)
        self.file = open("./result/talk.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, fields_to_export=[
            'title', 'author', 'watch_num', 'comment_num', 'create_time', 'address_url'
        ])
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        if isinstance(item, CqTalkItem):
            self.exporter.export_item(item)
        return item

    # Release resources
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()


# Store items in the database (synchronously)
class MysqlPipeline(object):
    def __init__(self):
        # Connect to MySQL (replace host/user/password/db with your own)
        self.conn = MySQLdb.connect("host", "root", "pwd", "cq", charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        table_name = 'talk'
        # Parameterized INSERT statement
        insert_sql = """
            insert into {}(title,author,watch_num,comment_num,address_url,create_time,insert_time)
            values(%s,%s,%s,%s,%s,%s,%s)
        """.format(table_name)

        # Collect the values from the item as a tuple and insert them
        params = list()
        params.append(item.get("title", ""))
        params.append(item.get("author", ""))
        params.append(item.get("watch_num", 0))
        params.append(item.get("comment_num", 0))
        params.append(item.get("address_url", ""))
        params.append(item.get("create_time", ""))
        params.append(current_date())

        # Execute the insert and commit
        self.cursor.execute(insert_sql, tuple(params))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        """Release database resources"""
        self.cursor.close()
        self.conn.close()
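Before running, the talk table that MysqlPipeline writes to must already exist. A minimal one-off creation script; the column types are my own assumptions, so adjust them to your data:
# create_table.py -- create the target table (assumed schema)
import MySQLdb

conn = MySQLdb.connect("host", "root", "pwd", "cq", charset="utf8")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS talk (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        author VARCHAR(64),
        watch_num VARCHAR(32),
        comment_num VARCHAR(32),
        address_url VARCHAR(512),
        create_time VARCHAR(64),
        insert_time VARCHAR(64)
    ) DEFAULT CHARSET = utf8mb4
""")
conn.commit()
cursor.close()
conn.close()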
Of course, we can also add a deduplication pipeline that simply drops any item whose post title has already been seen:
# pipelines.py
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    """Drop duplicate items based on the post title"""

    def __init__(self):
        self.talk_set = set()

    def process_item(self, item, spider):
        name = item['title']
        if name in self.talk_set:
            raise DropItem("Duplicate item dropped: %s" % item)
        self.talk_set.add(name)
        return item
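As noted in section 2-6, only the synchronous MySQL write is shown above. For reference, an asynchronous variant based on Twisted's adbapi connection pool might look roughly like this; the connection parameters are placeholders, and current_date is the assumed helper from the utils sketch:
# pipelines.py -- asynchronous MySQL pipeline sketch using Twisted's adbapi
from twisted.enterprise import adbapi

from cqmmgo.utils import current_date  # assumed helper, see the utils sketch above


class AsyncMysqlPipeline(object):
    def __init__(self):
        # Pool of MySQL connections; queries run in a thread pool off the reactor
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb", host="host", user="root", passwd="pwd", db="cq",
            charset="utf8", use_unicode=True,
        )

    def process_item(self, item, spider):
        # Schedule the insert and log any failure via an errback
        deferred = self.dbpool.runInteraction(self.do_insert, item)
        deferred.addErrback(self.handle_error, item, spider)
        return item

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into talk(title,author,watch_num,comment_num,address_url,create_time,insert_time)
            values(%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(insert_sql, (
            item.get("title", ""), item.get("author", ""),
            item.get("watch_num", 0), item.get("comment_num", 0),
            item.get("address_url", ""), item.get("create_time", ""),
            current_date(),
        ))

    def handle_error(self, failure, item, spider):
        spider.logger.error("MySQL insert failed: %s", failure)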
2-7 Configure the spider settings
Open settings.py and configure the download delay, default request headers, downloader middlewares, and item pipelines:
# settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Wait 3 seconds between requests
DOWNLOAD_DELAY = 3

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Host': 'HOST',
    'Referer': 'https://HOST/forum-233-1.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
}

# Enable the custom random-UA middleware and disable the built-in one
DOWNLOADER_MIDDLEWARES = {
    'cqmmgo.middlewares.RandomUADownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

# Lower numbers run first: deduplicate before exporting/storing
ITEM_PIPELINES = {
    'cqmmgo.pipelines.DuplicatesPipeline': 1,
    'cqmmgo.pipelines.TalkPipeline': 100,
    'cqmmgo.pipelines.MysqlPipeline': 200,
    'cqmmgo.pipelines.CqmmgoPipeline': 300,
}

# Custom setting: only keep posts published within the last N hours
# (imported directly from this module by the spider)
talk_hour_before = 24
2-8 Spider entry point
Create a file in the project root directory and run a single spider like this:
# main.py
import os
import sys

from scrapy.cmdline import execute


def start():
    sys.path.append(os.path.dirname(__file__))
    # Run a single spider by name
    execute(["scrapy", "crawl", "talk"])


if __name__ == '__main__':
    start()
2. Final thoughts
If a Scrapy project contains multiple spiders, we can use the CrawlerProcess class to run several of them in the same process:
# main.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


# Run several spiders of the project in the same process
def start():
    setting = get_project_settings()
    process = CrawlerProcess(setting)

    # Spiders to skip
    spider_besides = ['other']

    # Schedule every other spider registered in the project
    for spider_name in process.spider_loader.list():
        if spider_name in spider_besides:
            continue
        print("Now scheduling spider: %s" % spider_name)
        process.crawl(spider_name)
    process.start()


if __name__ == '__main__':
    start()
Of course, besides Scrapy, another crawler framework worth considering is Feapder.
For how to use it, see the earlier article:
Introducing feapder, a crawler framework that can stand in for Scrapy
END
