Scrapy Source Code Analysis: How Does Scrapy Start Up?
Reading this article takes roughly 15 minutes. It is fairly code-heavy, so if the mobile reading experience is poor, consider bookmarking it and coming back to it on a PC.
Where does the scrapy command come from?

When we run a Scrapy crawler, we usually type something like scrapy crawl <spider_name>. But where does this scrapy command actually come from? We can locate it with which:

$ which scrapy
/usr/local/bin/scrapy

Opening /usr/local/bin/scrapy shows that it is just a short Python script:
import re
import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())
Where does this script come from? The answer is in Scrapy's setup.py. Open that file and you will find that the program's entry point is already declared there:

from os.path import dirname, join
from setuptools import setup, find_packages

setup(
    name='Scrapy',
    version=version,
    url='http://scrapy.org',
    ...
    entry_points={      # the entry point is here: scrapy.cmdline:execute
        'console_scripts': ['scrapy = scrapy.cmdline:execute']
    },
    classifiers=[
        ...
    ],
    install_requires=[
        ...
    ],
)
The key is the entry_points configuration: it marks where Scrapy starts, namely the execute method of cmdline.py. When Scrapy is installed, the setuptools packaging tool generates the shim script shown above and places it on the executable path, so whenever we invoke the scrapy command, the execute method in Scrapy's cmdline.py module is called.

In fact, we can build a command-line program of our own in exactly the same spirit:

- write a Python module with a main entry (the first line must declare the Python interpreter path);
- remove the .py suffix;
- make the file executable (chmod +x <filename>);
- run it directly by file name.

For example, create a file named mycmd and write a main function containing whatever logic you want to run, then execute chmod +x mycmd to make it executable. After that, ./mycmd is enough to run the code, with no need to invoke it through python anymore. Simple, right? A minimal sketch of such a script is shown below.
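Here is what such a file could look like; the file name mycmd and the logic it runs are made up purely for illustration:

#!/usr/bin/env python
# mycmd: a hand-rolled command-line tool (illustrative only).
# Save it without the .py suffix, run `chmod +x mycmd`, then invoke it as `./mycmd`.
import sys


def main():
    # put whatever logic you want the command to run here
    print('running mycmd with args:', sys.argv[1:])
    return 0


if __name__ == '__main__':
    sys.exit(main())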
The entry point (execute.py)

As we saw above, the scrapy command ultimately calls the execute method in scrapy/cmdline.py, so let's look at that method.

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv
    # --- backwards compatibility for the old scrapy.conf.settings ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ----------------------------------------------------------------
    # initialize the environment and load the project configuration; returns a Settings object
    if settings is None:
        settings = get_project_settings()
    # warn about deprecated settings
    check_deprecated_settings(settings)

    # --- backwards compatibility for the old scrapy.conf.settings ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ----------------------------------------------------------------

    # are we running inside a project? mainly checks whether scrapy.cfg exists
    inproject = inside_project()

    # scan the commands package and build a {cmd_name: cmd_instance} dict
    cmds = _get_commands_dict(settings, inproject)
    # figure out which command was requested on the command line
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    # look up the command instance by name
    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    # apply the command's default settings at 'command' priority
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    # register the command's option parsing rules
    cmd.add_options(parser)
    # parse the command-line arguments and hand them to the command instance
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    # create the CrawlerProcess instance and attach it to the command as crawler_process
    cmd.crawler_process = CrawlerProcess(settings)
    # run the command instance's run method
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)
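As a side note, execute can also be called directly from a script of your own, which is a handy way to step through this whole startup path in a debugger. A minimal sketch, assuming a run.py placed next to scrapy.cfg and a made-up spider name 'myspider':

# run.py (a sketch, not part of Scrapy itself)
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'myspider'])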
Initializing the project settings
The first step is to initialize the project configuration, which is tied to the scrapy.cfg file: by calling get_project_settings, a Settings instance is eventually produced.

def get_project_settings():
    # is SCRAPY_SETTINGS_MODULE already present in the environment?
    if ENVVAR not in os.environ:
        project = os.environ.get('SCRAPY_PROJECT', 'default')
        # initialize the environment: locate the user's settings.py and export it as SCRAPY_SETTINGS_MODULE
        init_env(project)
    # load the default configuration file default_settings.py and create the Settings instance
    settings = Settings()
    # get the user's settings module
    settings_module_path = os.environ.get(ENVVAR)
    # if the user has a settings module, let it override the defaults
    if settings_module_path:
        settings.setmodule(settings_module_path, priority='project')
    # environment variables carrying other scrapy-related settings also override
    pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
    if pickled_settings:
        settings.setdict(pickle.loads(pickled_settings), priority='project')

    env_overrides = {k[7:]: v for k, v in os.environ.items() if
                     k.startswith('SCRAPY_')}
    if env_overrides:
        settings.setdict(env_overrides, priority='project')
    return settings
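What init_env boils down to is simple: scrapy.cfg is an INI file whose [settings] section maps a project name to a settings module, and that module path is exported through the SCRAPY_SETTINGS_MODULE environment variable. A rough sketch of the idea (myproject is a made-up project name; the real helper lives in scrapy.utils.conf and handles several search paths):

import os
from configparser import ConfigParser

# scrapy.cfg typically contains:
# [settings]
# default = myproject.settings
cfg = ConfigParser()
cfg.read('scrapy.cfg')
settings_module = cfg.get('settings', 'default')     # e.g. 'myproject.settings'
os.environ['SCRAPY_SETTINGS_MODULE'] = settings_module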
The most important part here is loading the default configuration file default_settings.py; the main logic lives in the Settings class.

class Settings(BaseSettings):
    def __init__(self, values=None, priority='project'):
        # call the parent constructor
        super(Settings, self).__init__()
        # set every option from default_settings.py onto this settings instance
        self.setmodule(default_settings, 'default')
        # dict-valued options are also set as BaseSettings instances
        for name, val in six.iteritems(self):
            if isinstance(val, dict):
                self.set(name, BaseSettings(val, 'default'), 'default')
        self.update(values, priority)
As you can see, every option in default_settings.py is set onto the Settings instance, and each setting carries a priority.

default_settings.py matters a great deal: when reading the source code it is worth studying carefully, because it holds the default configuration of every component as well as the class path of each component, for example the scheduler class, the spider middleware classes, the downloader middleware classes, the download handler classes, and so on.

# downloader class
DOWNLOADER = 'scrapy.core.downloader.Downloader'
# scheduler class
SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# scheduler queue classes
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'
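The priority mechanism can be illustrated with a small sketch (the exact default value printed depends on your Scrapy version; CONCURRENT_REQUESTS is just a convenient example):

from scrapy.settings import Settings

settings = Settings()                                # loads default_settings.py at 'default' priority
print(settings.getint('CONCURRENT_REQUESTS'))        # default value, e.g. 16

# a project-level value (what your settings.py provides) overrides the default
settings.set('CONCURRENT_REQUESTS', 32, priority='project')
# a command-line value is higher still and wins over the project value
settings.set('CONCURRENT_REQUESTS', 8, priority='cmdline')

print(settings.getint('CONCURRENT_REQUESTS'))        # 8
print(settings.getpriority('CONCURRENT_REQUESTS'))   # numeric priority of 'cmdline'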
Checking whether we are running inside a project
Some scrapy commands depend on being run inside a project, while others are global. The check works mainly by looking for the nearest scrapy.cfg file; the logic lives in the inside_project method.

def inside_project():
    # check whether the environment variable exists (it was set above)
    scrapy_module = os.environ.get('SCRAPY_SETTINGS_MODULE')
    if scrapy_module is not None:
        try:
            import_module(scrapy_module)
        except ImportError as exc:
            warnings.warn("Cannot import scrapy settings module %s: %s" % (scrapy_module, exc))
        else:
            return True
    # no environment variable: look for the nearest scrapy.cfg; if found, we are inside a project
    return bool(closest_scrapy_cfg())
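closest_scrapy_cfg simply walks up the directory tree looking for scrapy.cfg. A minimal sketch of that idea (the real helper in scrapy.utils.conf has a slightly richer signature):

import os


def find_scrapy_cfg(path='.'):
    """Walk upwards from `path` until a scrapy.cfg is found; return '' if there is none."""
    path = os.path.abspath(path)
    cfg = os.path.join(path, 'scrapy.cfg')
    if os.path.exists(cfg):
        return cfg
    parent = os.path.dirname(path)
    if parent == path:            # reached the filesystem root
        return ''
    return find_scrapy_cfg(parent)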
In other words, it searches upward for the scrapy.cfg file; if the file can be found, we are inside a crawler project, otherwise the command is treated as a global one.

Assembling the command instances
scrapy supports many commands, such as scrapy crawl, scrapy fetch and so on. Where do these commands come from? The answer is in the _get_commands_dict method.

def _get_commands_dict(settings, inproject):
    # import every module under the commands package and build a {cmd_name: cmd} dict
    cmds = _get_commands_from_module('scrapy.commands', inproject)
    cmds.update(_get_commands_from_entry_points(inproject))
    # if the user's settings define COMMANDS_MODULE, load the custom command classes too
    cmds_module = settings['COMMANDS_MODULE']
    if cmds_module:
        cmds.update(_get_commands_from_module(cmds_module, inproject))
    return cmds

def _get_commands_from_module(module, inproject):
    d = {}
    # find every command class (ScrapyCommand subclass) in this module
    for cmd in _iter_command_classes(module):
        if inproject or not cmd.requires_project:
            # build the {cmd_name: cmd} dict
            cmdname = cmd.__module__.split('.')[-1]
            d[cmdname] = cmd()
    return d

def _iter_command_classes(module_name):
    # iterate over every module in this package and yield the ScrapyCommand subclasses
    for module in walk_modules(module_name):
        for obj in vars(module).values():
            if inspect.isclass(obj) and \
                    issubclass(obj, ScrapyCommand) and \
                    obj.__module__ == module.__name__:
                yield obj
This walks through every module under the commands package and builds a {cmd_name: cmd} dictionary; if the user has configured custom command classes in the settings file, those are appended as well. In other words, we can write our own command classes, register them in the settings, and then use our own commands; a small example follows below.
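For illustration (the module path myproject.commands and the command name hello are made up), a custom command could look roughly like this, enabled by adding COMMANDS_MODULE = 'myproject.commands' to settings.py; the file name becomes the command name:

# myproject/commands/hello.py  (hypothetical module)
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = False            # allow running outside a project

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Print a greeting (demo command)'

    def run(self, args, opts):
        print('hello from a custom scrapy command')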
Parsing the command

def _pop_command_name(argv):
    i = 0
    for arg in argv[1:]:
        if not arg.startswith('-'):
            del argv[i]
            return arg
        i += 1
If we ran scrapy crawl <spider_name>, this method parses out crawl. With the command dictionary built above, Scrapy can then locate the crawl.py file under the commands directory, and what ultimately gets executed is its Command class.

Parsing the command-line arguments
Next, the command instance's process_options method parses our arguments:

def process_options(self, args, opts):
    # first call the parent's process_options to parse the common, fixed options
    ScrapyCommand.process_options(self, args, opts)
    try:
        # convert the -a command-line arguments into a dict
        opts.spargs = arglist_to_dict(opts.spargs)
    except ValueError:
        raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
    if opts.output:
        if opts.output == '-':
            self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
        else:
            self.settings.set('FEED_URI', opts.output, priority='cmdline')
        feed_exporters = without_none_values(
            self.settings.getwithbase('FEED_EXPORTERS'))
        valid_output_formats = feed_exporters.keys()
        if not opts.output_format:
            opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
        if opts.output_format not in valid_output_formats:
            raise UsageError("Unrecognized output format '%s', set one"
                             " using the '-t' switch or as a file extension"
                             " from the supported list %s" % (opts.output_format,
                                                              tuple(valid_output_formats)))
        self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')
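For example, if we run scrapy crawl <spider_name> -o quotes.csv without an explicit -t, the output format is inferred from the file extension, roughly like this (quotes.csv is a made-up file name used purely as a sketch):

import os

output = 'quotes.csv'                                  # the value passed to -o
output_format = os.path.splitext(output)[1].replace('.', '')
print(output_format)                                   # 'csv', stored as FEED_FORMAT at 'cmdline' priority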
Initializing the CrawlerProcess
Finally, a CrawlerProcess instance is created, and the run method of the corresponding command instance is executed.

cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)
Our command was scrapy crawl <spider_name>, which means what finally runs is the run method of commands/crawl.py:

def run(self, args, opts):
    if len(args) < 1:
        raise UsageError()
    elif len(args) > 1:
        raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
    spname = args[0]

    self.crawler_process.crawl(spname, **opts.spargs)
    self.crawler_process.start()
The run method calls the crawl and start methods of the CrawlerProcess instance, and with that the whole crawler program starts running.

First, the CrawlerProcess initialization:

class CrawlerProcess(CrawlerRunner):
    def __init__(self, settings=None):
        # call the parent class's initializer
        super(CrawlerProcess, self).__init__(settings)
        # install shutdown signal handlers and set up logging
        install_shutdown_handlers(self._signal_shutdown)
        configure_logging(self.settings)
        log_scrapy_info(self.settings)
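Incidentally, these are the same two calls you make when running Scrapy from a script of your own instead of the scrapy command. A minimal sketch, where 'myspider' is a made-up spider name:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('myspider')     # spider name, resolved through the spider loader
process.start()               # starts the Twisted reactor and blocks until crawling finishes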
Then look at the constructor of its parent class CrawlerRunner:

class CrawlerRunner(object):
    def __init__(self, settings=None):
        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        self.settings = settings
        # create the spider loader
        self.spider_loader = _get_spider_loader(settings)
        self._crawlers = set()
        self._active = set()
And the _get_spider_loader method:

def _get_spider_loader(settings):
    # read the deprecated SPIDER_MANAGER_CLASS setting from the configuration
    if settings.get('SPIDER_MANAGER_CLASS'):
        warnings.warn(
            'SPIDER_MANAGER_CLASS option is deprecated. '
            'Please use SPIDER_LOADER_CLASS.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    cls_path = settings.get('SPIDER_MANAGER_CLASS',
                            settings.get('SPIDER_LOADER_CLASS'))
    loader_cls = load_object(cls_path)
    try:
        verifyClass(ISpiderLoader, loader_cls)
    except DoesNotImplement:
        warnings.warn(
            'SPIDER_LOADER_CLASS (previously named SPIDER_MANAGER_CLASS) does '
            'not fully implement scrapy.interfaces.ISpiderLoader interface. '
            'Please add all missing methods to avoid unexpected runtime errors.',
            category=ScrapyDeprecationWarning, stacklevel=2
        )
    return loader_cls.from_settings(settings.frozencopy())
This reads the spider loader class from the settings; the default is the spiderloader.SpiderLoader class. As the name suggests, this class is responsible for loading the spider classes we have written. Let's look at its implementation.

@implementer(ISpiderLoader)
class SpiderLoader(object):
    def __init__(self, settings):
        # read the paths of the modules that hold spider code from the settings
        self.spider_modules = settings.getlist('SPIDER_MODULES')
        self._spiders = {}
        # load all spiders
        self._load_all_spiders()

    def _load_spiders(self, module):
        # build a {spider_name: spider_cls} dict
        for spcls in iter_spider_classes(module):
            self._spiders[spcls.name] = spcls

    def _load_all_spiders(self):
        for name in self.spider_modules:
            for module in walk_modules(name):
                self._load_spiders(module)
The spider loader scans the configured spider modules and builds a {spider_name: spider_cls} dictionary, which is how Scrapy finds our spider class when we run scrapy crawl <spider_name>.
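The key used in that dictionary is each spider's name attribute. As an illustration (a hypothetical spider; the name quotes and the URL are made up for the example), the spider below would be registered under the key 'quotes' and picked up by scrapy crawl quotes:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'                               # this is the key SpiderLoader registers
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('span.text::text').extract():
            yield {'quote': quote}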
Running the crawler

Once the CrawlerProcess is initialized, its crawl method is called:

def crawl(self, crawler_or_spidercls, *args, **kwargs):
    # create the crawler
    crawler = self.create_crawler(crawler_or_spidercls)
    return self._crawl(crawler, *args, **kwargs)

def _crawl(self, crawler, *args, **kwargs):
    self.crawlers.add(crawler)
    # call the Crawler's crawl method
    d = crawler.crawl(*args, **kwargs)
    self._active.add(d)

    def _done(result):
        self.crawlers.discard(crawler)
        self._active.discard(d)
        return result

    return d.addBoth(_done)

def create_crawler(self, crawler_or_spidercls):
    if isinstance(crawler_or_spidercls, Crawler):
        return crawler_or_spidercls
    return self._create_crawler(crawler_or_spidercls)

def _create_crawler(self, spidercls):
    # if a string was passed, load the spider class of that name via the spider loader
    if isinstance(spidercls, six.string_types):
        spidercls = self.spider_loader.load(spidercls)
    # then create a Crawler from the spider class
    return Crawler(spidercls, self.settings)
This creates a Crawler instance and then calls its crawl method:

@defer.inlineCallbacks
def crawl(self, *args, **kwargs):
    assert not self.crawling, "Crawling already taking place"
    self.crawling = True
    try:
        # only at this point is the spider instance actually created
        self.spider = self._create_spider(*args, **kwargs)
        # create the execution engine
        self.engine = self._create_engine()
        # call the spider's start_requests method
        start_requests = iter(self.spider.start_requests())
        # call the engine's open_spider, passing in the spider instance and the initial requests
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)
    except Exception:
        if six.PY2:
            exc_info = sys.exc_info()

        self.crawling = False
        if self.engine is not None:
            yield self.engine.close()

        if six.PY2:
            six.reraise(*exc_info)
        raise

def _create_spider(self, *args, **kwargs):
    return self.spidercls.from_crawler(self, *args, **kwargs)
Here the spider's start_requests method is called to obtain the seed URLs, which are then handed over to the engine. Next, let's see how the CrawlerProcess actually starts running, i.e. its start method:

def start(self, stop_after_crawl=True):
    if stop_after_crawl:
        d = self.join()
        if d.called:
            return
        d.addBoth(self._stop_reactor)

    reactor.installResolver(self._get_dns_resolver())
    # configure the reactor thread pool size (tunable via REACTOR_THREADPOOL_MAXSIZE)
    tp = reactor.getThreadPool()
    tp.adjustPoolsize(maxthreads=self.settings.getint('REACTOR_THREADPOOL_MAXSIZE'))
    reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
    # start running
    reactor.run(installSignalHandlers=False)
This method uses a module called reactor. What is reactor? It is the event manager of the Twisted framework: we register the events we want to run with the reactor and then call its run method, and it executes the registered events for us; whenever one of them is waiting on network I/O, it automatically switches to another event that is ready to run, which is very efficient. If you are not familiar with how the reactor works, you can roughly picture it as a thread pool that executes events through registered callbacks. From here on, control is handed over to the ExecutionEngine, which coordinates the various components so that they cooperate to carry out the whole crawl.
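To make the reactor model concrete, here is a minimal, self-contained sketch (not Scrapy code): two callbacks are registered with Twisted's reactor, and reactor.run() drives them until we stop the loop.

from twisted.internet import reactor


def say(msg):
    print(msg)


# register two events (callbacks) with the reactor
reactor.callLater(0, say, 'first event')
reactor.callLater(1, say, 'second event, one second later')
# stop the event loop after two seconds so the script exits
reactor.callLater(2, reactor.stop)

# run() blocks and keeps dispatching registered events until stop() is called
reactor.run()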
Summary

To sum up: the scrapy command is just a console-script shim generated by setuptools that calls execute in scrapy/cmdline.py. execute loads the project settings, checks whether it is running inside a project, collects all command classes into a dictionary, parses the command name and options, and then creates a CrawlerProcess and runs the chosen command's run method. For scrapy crawl, this means CrawlerProcess.crawl and CrawlerProcess.start are invoked: the spider class is located through the SpiderLoader, a Crawler and the execution engine are created, and finally Twisted's reactor event loop starts driving the whole crawl.