
Crawling an entire beauty-photo site with Scrapy: as many images as you like

5,583 words · 12 min read

2021-10-21 05:32

It's 2021 and we still haven't crawled the beauty photos everyone loves, so let's start with a look at what the crawl produces.

Introduction

A crawler for the beauty-photo site meinv.hk, built on the Scrapy framework.

Crawler entry URL: http://www.meinv.hk/?cat=2

If the crawler runs normally but produces no data, the likely cause is that the site is only reachable through a VPN or proxy.

This post focuses on two techniques: a custom image pipeline and a custom CSV data pipeline.

Implementation

Creating the project is standard boilerplate, so I won't walk through it.
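For completeness, the usual Scrapy commands would be roughly the following (the project and spider names come from the code later in this post; the crawl template matches the CrawlSpider used below):

scrapy startproject meinv
cd meinv
scrapy genspider -t crawl mv www.meinv.hk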

Open the site in a browser.

Clicking around gives you a concrete crawling strategy: first crawl the "popular recommendations" tag pages, then follow through to the page holding each model's photos.

That is exactly what CrawlSpider's rules are for.

rules defines how URLs are extracted from the responses the spider receives. Each extracted URL is requested again, and each Rule's callback and follow settings decide whether the new response is parsed, followed for more links, or both.

Two points worth emphasizing (a sketch follows the list):

• First, link extraction runs on every response that comes back, including the response to the initial start_urls request;
• Second, every Rule in the rules list is applied, not just the first one that matches.
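As an illustration, a hypothetical rules tuple could pair a follow-only Rule for listing pages with a parsing Rule for detail pages (the cat=\d+ pattern here is an assumption for illustration; the spider below uses only the p=\d+ Rule):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Follow tag/category listing pages without parsing them (hypothetical pattern)
    Rule(LinkExtractor(allow=r'cat=\d+'), follow=True),
    # Parse every detail page, and keep extracting links from it as well
    Rule(LinkExtractor(allow=r'p=\d+'), callback='parse_item', follow=True),
)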

Now open pipelines.py, import the image pipeline with from scrapy.pipelines.images import ImagesPipeline, and subclass ImagesPipeline.

A custom pipeline can be built on top of Scrapy's stock ImagesPipeline by overriding three of its methods: get_media_requests(), file_path(), and item_completed().
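The shape of such a subclass, as a minimal sketch of Scrapy's documented hooks (newer Scrapy versions also accept an item keyword argument on file_path):

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class CustomImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Called once per item: yield a Request for every image URL
        for url in item['image_urls']:
            yield Request(url)

    def file_path(self, request, response=None, info=None):
        # Called per downloaded file: return a path relative to IMAGES_STORE
        return 'full/' + request.url.split('/')[-1]

    def item_completed(self, results, item, info):
        # Called after all downloads for the item finish; results is a list
        # of (succeeded, file_info_or_failure) tuples
        return item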


Full code

Because we use custom pipelines (images and CSV) with plain dict items, there is no need to write items.py.

          mv.py

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MvSpider(CrawlSpider):
    name = 'mv'
    allowed_domains = ['www.meinv.hk']
    start_urls = ['http://www.meinv.hk/?cat=2']

    # Each href extracted from <a> tags is turned into a new Request, and
    # the response is handed to the callback named below.
    rules = (
        # allow takes a regular expression matched against each <a> href.
        # follow=True re-applies the rules to pages downloaded via this rule.
        Rule(LinkExtractor(allow=r'p=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        info = response.xpath('//div[@class="wshop wshop-layzeload"]/text()').extract_first()
        try:
            item['hometown'] = info.split('/')[2].strip().split()[1]
            item['birthday'] = info.split('/')[1].strip().split()[1]
        except (AttributeError, IndexError):
            # info is missing or not in the expected "a / b / c" layout
            item['birthday'] = 'unknown'
            item['hometown'] = 'unknown'
        item['name'] = response.xpath('//h1[@class="title"]/text()').extract_first()
        # All photo URLs inside the post body
        item['image_urls'] = response.xpath('//div[@class="post-content"]//img/@src').extract()
        item['images'] = ''  # filled in later by the image pipeline
        item['detail_url'] = response.url
        yield item
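A yielded item then has roughly this shape (all values are hypothetical placeholders):

{
    'name': '<model name>',
    'hometown': '<hometown or unknown>',
    'birthday': '<birthday or unknown>',
    'image_urls': ['http://www.meinv.hk/...jpg', '...'],
    'images': '',
    'detail_url': 'http://www.meinv.hk/?p=<id>',
}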

          middlewares.py

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent_list):
        super().__init__()
        self.user_agent_list = user_agent_list

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the pool of user agents from settings (USER_AGENT_LIST below)
        return cls(user_agent_list=crawler.settings.get('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request
        user_agent = random.choice(self.user_agent_list)
        if user_agent:
            request.headers['User-Agent'] = user_agent
        return None
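A note on the design: from_crawler is Scrapy's standard factory hook, so the middleware stays configuration-driven instead of hard-coding the user-agent pool. Returning None from process_request tells Scrapy to keep passing the request through the remaining downloader middlewares.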

          pipelines.py

import csv
import os
from hashlib import sha1

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

from meinv import settings


class MvImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for url in item['image_urls']:
            # Pass the model's name along so file_path() can group her images
            yield Request(url, meta={'name': item['name']})

    def item_completed(self, results, item, info):
        # Store the download results on the item; each successful result is
        # a dict with 'url', 'path' and 'checksum' keys
        item['images'] = [x for ok, x in results if ok]
        return item

    def file_path(self, request, response=None, info=None):
        # One directory per model, holding all of her images
        author_name = request.meta['name']
        author_dir = os.path.join(settings.IMAGES_STORE, author_name)
        if not os.path.exists(author_dir):
            os.makedirs(author_dir)
        # Derive the file name and extension from the URL, falling back to
        # a hash of the URL with a .jpg extension
        parts = request.url.split('/')[-1].rsplit('.', 1)
        if len(parts) == 2:
            filename, ext_name = parts
        else:
            filename = sha1(request.url.encode('utf-8')).hexdigest()
            ext_name = 'jpg'

        # Path returned is relative to IMAGES_STORE
        return '%s/%s.%s' % (author_name, filename, ext_name)


class MeinvPipeline(object):
    def __init__(self):
        self.csv_filename = 'meinv.csv'
        self.existed_header = False

    def process_item(self, item, spider):
        # item is the dict yielded by MvSpider.parse_item()
        with open(self.csv_filename, 'a', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=(
                'name', 'hometown', 'birthday', 'detail_url'))
            if not self.existed_header:
                # First write in this run: emit the header row
                writer.writeheader()
                self.existed_header = True
            writer.writerow({
                'name': item['name'].strip(),
                # Normalize Chinese dates: 1990年1月1日 -> 1990-1-1
                'birthday': item['birthday'].replace('年', '-').replace('月', '-').replace('日', ''),
                'hometown': item['hometown'],
                'detail_url': item['detail_url'],
            })
        return item
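As a design note, opening the file once per run is more idiomatic than reopening it for every item; a minimal sketch using Scrapy's open_spider/close_spider hooks (my variant, not the author's code) could look like:

import csv


class MeinvCsvPipeline:
    def open_spider(self, spider):
        self.file = open('meinv.csv', 'w', encoding='utf-8', newline='')
        self.writer = csv.DictWriter(
            self.file, fieldnames=('name', 'hometown', 'birthday', 'detail_url'))
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write only the CSV columns, ignoring the image fields
        self.writer.writerow(
            {k: item[k] for k in ('name', 'hometown', 'birthday', 'detail_url')})
        return item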

settings.py

import os

BOT_NAME = 'meinv'

SPIDER_MODULES = ['meinv.spiders']
NEWSPIDER_MODULE = 'meinv.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DOWNLOADER_MIDDLEWARES = {
    'meinv.middlewares.RandomUserAgentMiddleware': 543,
}

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
# Directory where the image pipeline stores downloaded images
IMAGES_STORE = os.path.join(BASE_DIR, 'images')

ITEM_PIPELINES = {
    'meinv.pipelines.MeinvPipeline': 300,
    'meinv.pipelines.MvImagePipeline': 100,
}

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
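With the spider, middleware, pipelines, and settings in place, start the crawl from the project root with the standard Scrapy command:

scrapy crawl mv

Images land under images/<model name>/, and the CSV rows accumulate in meinv.csv.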

Results (screenshots):

1. Project file layout

2. One model's downloaded images

3. CSV file contents

And finally, a few of the crawled photos.

