
Hands-On Web Scraping | Scraping a Collection of Quotes


          2021-01-09 15:09

Goal

Scrape quotes from the site and save them to local files in bulk.

Project setup

Software: PyCharm

Libraries: requests, fake_useragent, and lxml (third-party), plus re (standard library)

Site: http://www.yuluju.com

Site analysis

Open the site.

It has many categories, each holding a different type of quote.

Click "Love quotes" and the address bar changes to http://www.yuluju.com/aiqingyulu/

Clicking "Funny quotes" changes the address in the same way.

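Determine whether the page is static.

Pages with numbered pagination are usually static. To confirm, press Ctrl+U to view the page source, press Ctrl+F to open the search box, and search for some text that appears on the page. If the text is found in the source, the content is served as static HTML.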
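
The manual Ctrl+U / Ctrl+F check can also be expressed in code. The following is a minimal sketch, not part of the original article: `appears_in_source` and `sample_source` are hypothetical names, and the sample HTML is invented for illustration.

```python
def appears_in_source(html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is present in the raw HTML."""
    return visible_text in html


# Hypothetical page source, standing in for what Ctrl+U would show.
sample_source = (
    '<html><body><div class="list">'
    '<a href="/a/1.html">First quote</a>'
    '</div></body></html>'
)

# Text that is in the raw source suggests static content.
print(appears_in_source(sample_source, 'First quote'))
# Text that is missing from the source was probably filled in by JavaScript.
print(appears_in_source(sample_source, 'Loaded later'))
```

In practice the `html` argument would be the decoded body of a real response rather than a hard-coded string.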

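Anti-scraping analysis

Repeated requests from the same IP address risk getting that address blocked. A random User-Agent does not change the IP address, but it makes the requests look less uniform, so fake_useragent is used here to generate a random User-Agent request header.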
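
To show the idea behind a random User-Agent without depending on fake_useragent, here is a small sketch using only the standard library. The `USER_AGENTS` list and the `random_headers` helper are assumptions made for illustration; fake_useragent draws from a much larger pool of real browser strings.

```python
import random

# Illustrative User-Agent strings (assumed values, not taken from the article).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.0.2 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0',
]


def random_headers(rng=random):
    """Build a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': rng.choice(USER_AGENTS)}


headers = random_headers()
print(headers['User-Agent'])
```

Calling `random_headers()` before each request would rotate the header per request, rather than fixing one value for the whole run.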

Per-page link analysis

Page 1: http://www.yuluju.com/aiqingyulu/list_18_1.html
Page 2: http://www.yuluju.com/aiqingyulu/list_18_2.html
Page 3: http://www.yuluju.com/aiqingyulu/list_18_3.html

The only part of the URL that changes from page to page is the trailing number. This analysis covers the love quotes category; the other categories follow the same pattern.
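
The page pattern above can be turned into a list of URLs with `str.format`. This is a small sketch; `PAGE_TEMPLATE` and `page_urls` are hypothetical names chosen for illustration.

```python
# URL template for the love quotes category; {} is the page number.
PAGE_TEMPLATE = 'http://www.yuluju.com/aiqingyulu/list_18_{}.html'


def page_urls(start, end, template=PAGE_TEMPLATE):
    """Return the listing-page URLs for pages start..end inclusive."""
    return [template.format(page) for page in range(start, end + 1)]


for url in page_urls(1, 3):
    print(url)
```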

Code implementation

1. Import the required libraries, define a class that inherits from object, and give it an __init__ method and a main method, each taking self as its first parameter.
import re

import requests
from fake_useragent import UserAgent
from lxml import etree


class yulu(object):
    def __init__(self):
        # Trailing slash so that category paths can be appended directly
        self.url = 'http://www.yuluju.com/'
        ua = UserAgent(verify_ssl=False)
        # Pick one random User-Agent for this session
        self.headers = {
            'User-Agent': ua.random
        }

    def main(self):
        pass


if __name__ == '__main__':
    spider = yulu()
    spider.main()

2. Interactive menu (this code goes inside main)
        print('1. Inspirational quotes\n'
              '2. Love quotes\n'
              '3. Funny quotes\n'
              '4. Life quotes\n'
              '5. Emotional quotes\n'
              '6. Classic quotes\n'
              '7. Sad quotes\n'
              '8. Famous quotes\n'
              '9. Mood quotes\n')
        select = int(input('Enter your choice: '))
        # Each category has its own path and list id; {} is the page number
        if select == 1:
            url = self.url + 'lizhimingyan/list_1_{}.html'
        elif select == 2:
            url = self.url + 'aiqingyulu/list_18_{}.html'
        elif select == 3:
            url = self.url + 'gaoxiaoyulu/list_19_{}.html'
        elif select == 4:
            url = self.url + 'renshenggeyan/list_14_{}.html'
        elif select == 5:
            url = self.url + 'qingganyulu/list_23_{}.html'
        elif select == 6:
            url = self.url + 'jingdianyulu/list_12_{}.html'
        elif select == 7:
            url = self.url + 'shangganyulu/list_21_{}.html'
        elif select == 8:
            url = self.url + 'mingrenmingyan/list_2_{}.html'
        else:
            url = self.url + 'xinqingyulu/list_22_{}.html'
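
As an alternative to the if/elif chain, the same mapping could be held in a dictionary, which also makes it easy to reject an invalid choice instead of silently falling through to the last category. This is a sketch, not the article's code; `BASE_URL`, `CATEGORY_PATHS`, and `category_url` are hypothetical names, while the paths are copied from the menu code.

```python
BASE_URL = 'http://www.yuluju.com/'

# Menu number -> listing path, copied from the if/elif chain.
CATEGORY_PATHS = {
    1: 'lizhimingyan/list_1_{}.html',
    2: 'aiqingyulu/list_18_{}.html',
    3: 'gaoxiaoyulu/list_19_{}.html',
    4: 'renshenggeyan/list_14_{}.html',
    5: 'qingganyulu/list_23_{}.html',
    6: 'jingdianyulu/list_12_{}.html',
    7: 'shangganyulu/list_21_{}.html',
    8: 'mingrenmingyan/list_2_{}.html',
    9: 'xinqingyulu/list_22_{}.html',
}


def category_url(select):
    """Return the URL template for a menu choice, rejecting unknown numbers."""
    try:
        return BASE_URL + CATEGORY_PATHS[select]
    except KeyError:
        raise ValueError('choice must be between 1 and 9, got %r' % select) from None


# Page 1 of the love quotes category
print(category_url(2).format(1))
```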

3. Send the request and fetch the page.
    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        # Testing showed that the site is encoded as gb2312
        html = response.content.decode('gb2312')
        return html
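
Decoding with a fixed 'gb2312' raises UnicodeDecodeError as soon as a page contains a character outside that character set. Since gbk is a superset of gb2312 and gb18030 is a superset of gbk, one possible safeguard is to try them in order. The sketch below is an assumption about how that could look; `decode_page` is a hypothetical helper, not part of the original code.

```python
def decode_page(raw: bytes, encodings=('gb2312', 'gbk', 'gb18030')) -> str:
    """Decode page bytes, trying progressively larger GB encodings."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode(encodings[-1], errors='replace')


# Simplified characters decode with gb2312; traditional ones need gbk.
print(decode_page('语录'.encode('gb2312')))
print(decode_page('語錄'.encode('gbk')))
```

Inside `get_html`, this would replace the direct call to `response.content.decode('gb2312')`.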

4. Parse the page and extract the text.
    def parse_html(self, html):
        # Collect each post's link path and title from the listing page.
        # NOTE: the pattern below is incomplete as printed; it must capture
        # two groups, the link path (data[0]) and the title (data[1]).
        datas = re.compile('(.*?)').findall(html)
        for data in datas:
            host = 'http://www.yuluju.com' + data[0]
            res = requests.get(host, headers=self.headers)
            con = res.content.decode('gb2312')
            target = etree.HTML(con)
            # Extract the text content of the post
            results = target.xpath('//div[@class="content"]/div/div/span/text()')
            filename = data[1]
            # Save to a local file named after the post title
            with open('F:/pycharm文件/document/' + filename + '.txt', 'a', encoding='utf-8') as f:
                for result in results:
                    f.write(result + '\n')
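
`parse_html` relies on `findall` returning (link, title) pairs, which is what `re` does when a pattern has two capture groups. The sketch below shows that behavior on invented markup. The HTML, the paths, and the pattern are all hypothetical and are not the site's actual markup or the article's original pattern.

```python
import re

# Hypothetical listing markup, invented for illustration only.
sample_html = '''
<ul>
  <li><a href="/aiqingyulu/101.html" title="First title">First title</a></li>
  <li><a href="/aiqingyulu/102.html" title="Second title">Second title</a></li>
</ul>
'''

# Two capture groups: group 1 is the link path, group 2 is the title.
LINK_PATTERN = re.compile(r'<a href="(.*?)" title="(.*?)">')

# With two groups, findall yields one (path, title) tuple per match.
for path, title in LINK_PATTERN.findall(sample_html):
    print('http://www.yuluju.com' + path, title)
```

A pattern for the real site would have to be written against its actual source, as seen with Ctrl+U.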

5. Fetch multiple pages and call everything from the main function.
    def main(self):
        print('1. Inspirational quotes\n'
              '2. Love quotes\n'
              '3. Funny quotes\n'
              '4. Life quotes\n'
              '5. Emotional quotes\n'
              '6. Classic quotes\n'
              '7. Sad quotes\n'
              '8. Famous quotes\n'
              '9. Mood quotes\n')
        select = int(input('Enter your choice: '))
        if select == 1:
            url = self.url + 'lizhimingyan/list_1_{}.html'
        elif select == 2:
            url = self.url + 'aiqingyulu/list_18_{}.html'
        elif select == 3:
            url = self.url + 'gaoxiaoyulu/list_19_{}.html'
        elif select == 4:
            url = self.url + 'renshenggeyan/list_14_{}.html'
        elif select == 5:
            url = self.url + 'qingganyulu/list_23_{}.html'
        elif select == 6:
            url = self.url + 'jingdianyulu/list_12_{}.html'
        elif select == 7:
            url = self.url + 'shangganyulu/list_21_{}.html'
        elif select == 8:
            url = self.url + 'mingrenmingyan/list_2_{}.html'
        else:
            url = self.url + 'xinqingyulu/list_22_{}.html'
        start = int(input('Enter the start page: '))
        end = int(input('Enter the end page: '))
        # Walk the requested page range, inclusive of the end page
        for page in range(start, end + 1):
            print('Page %s started...' % page)
            newUrl = url.format(page)
            html = self.get_html(newUrl)
            self.parse_html(html)
            print('Page %s scraped!' % page)
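
The page loop sends requests back to back and stops at the first exception. One possible refinement, sketched here under assumptions, is to pause between pages and record failures instead of aborting. `crawl_pages` and its `fetch`, `handle`, `delay`, and `sleep` parameters are hypothetical; in the spider, `fetch` and `handle` would be `self.get_html` and `self.parse_html`.

```python
import time


def crawl_pages(template, start, end, fetch, handle, delay=1.0, sleep=time.sleep):
    """Fetch pages start..end, pausing between requests; return failed pages."""
    failed = []
    for page in range(start, end + 1):
        try:
            handle(fetch(template.format(page)))
        except Exception as exc:
            # Keep going so that one bad page does not end the whole run.
            print('Page %s failed: %s' % (page, exc))
            failed.append(page)
        if page != end:
            sleep(delay)
    return failed


# Stub fetch/handle functions so the sketch runs without network access.
def demo_fetch(url):
    return '<html>%s</html>' % url


failed_pages = crawl_pages('list_18_{}.html', 1, 3, demo_fetch, print, delay=0)
print('failed pages:', failed_pages)
```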

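Results

Open the output directory to see the saved files.

The other categories can be scraped in exactly the same way, so they are not shown here.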
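
Each output file is named after a post title, and titles can contain characters that Windows does not allow in file names, such as `?` or `:`. The following sketch shows one way to clean a title first; `safe_filename` is a hypothetical helper and the fallback name 'untitled' is an arbitrary choice.

```python
import re

# Characters Windows does not allow in file names, plus control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, replacement='_'):
    """Replace characters that are invalid in Windows file names."""
    # Windows also rejects names ending in a space or a dot, so trim those.
    cleaned = _INVALID.sub(replacement, title).strip(' .')
    return cleaned or 'untitled'


print(safe_filename('What is love?'))
print(safe_filename('a/b:c'))
```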

Complete code

import re

import requests
from fake_useragent import UserAgent
from lxml import etree


class yulu(object):
    def __init__(self):
        self.url = 'http://www.yuluju.com/'
        ua = UserAgent(verify_ssl=False)
        # Pick one random User-Agent for this session
        self.headers = {
            'User-Agent': ua.random
        }

    def get_html(self, url):
        response = requests.get(url, headers=self.headers)
        # Testing showed that the site is encoded as gb2312
        html = response.content.decode('gb2312')
        return html

    def parse_html(self, html):
        # NOTE: the pattern below is incomplete as printed; it must capture
        # two groups, the link path (data[0]) and the title (data[1]).
        datas = re.compile('(.*?)').findall(html)
        for data in datas:
            host = 'http://www.yuluju.com' + data[0]
            res = requests.get(host, headers=self.headers)
            con = res.content.decode('gb2312')
            target = etree.HTML(con)
            # Extract the text content of the post
            results = target.xpath('//div[@class="content"]/div/div/span/text()')
            filename = data[1]
            # Save to a local file named after the post title
            with open('F:/pycharm文件/document/' + filename + '.txt', 'a', encoding='utf-8') as f:
                for result in results:
                    f.write(result + '\n')

    def main(self):
        print('1. Inspirational quotes\n'
              '2. Love quotes\n'
              '3. Funny quotes\n'
              '4. Life quotes\n'
              '5. Emotional quotes\n'
              '6. Classic quotes\n'
              '7. Sad quotes\n'
              '8. Famous quotes\n'
              '9. Mood quotes\n')
        select = int(input('Enter your choice: '))
        if select == 1:
            url = self.url + 'lizhimingyan/list_1_{}.html'
        elif select == 2:
            url = self.url + 'aiqingyulu/list_18_{}.html'
        elif select == 3:
            url = self.url + 'gaoxiaoyulu/list_19_{}.html'
        elif select == 4:
            url = self.url + 'renshenggeyan/list_14_{}.html'
        elif select == 5:
            url = self.url + 'qingganyulu/list_23_{}.html'
        elif select == 6:
            url = self.url + 'jingdianyulu/list_12_{}.html'
        elif select == 7:
            url = self.url + 'shangganyulu/list_21_{}.html'
        elif select == 8:
            url = self.url + 'mingrenmingyan/list_2_{}.html'
        else:
            url = self.url + 'xinqingyulu/list_22_{}.html'
        start = int(input('Enter the start page: '))
        end = int(input('Enter the end page: '))
        # Walk the requested page range, inclusive of the end page
        for page in range(start, end + 1):
            print('Page %s started...' % page)
            newUrl = url.format(page)
            html = self.get_html(newUrl)
            self.parse_html(html)
            print('Page %s scraped!' % page)


if __name__ == '__main__':
    spider = yulu()
    spider.main()


