
          Crawl Anything! A Comprehensive Open-Source Crawler Toolbox


          2020-11-02 08:24

          Compiled from 開源最前線 and 數據管道


          Recently, a developer in China open-sourced InfoSpider on GitHub, a crawler toolbox that brings many data sources together, and it quickly took off.


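          How popular is it? Within days of release it reached fourth place on GitHub's weekly trending list, with 1.3K stars and 172 forks. The author has published all of the project's code and documentation, and there is a video walkthrough on Bilibili.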

          Project code:

          https://github.com/kangvcar/InfoSpider


          Project documentation:

          https://infospider.vercel.app


          Video demo:

          https://www.bilibili.com/video/BV14f4y1R7oF/


          In an age of information overload, everyone has many accounts. As the accounts pile up, personal data ends up scattered across different companies, forming data silos that cannot be combined. This project merges that multi-dimensional data and analyzes it, giving you a clearer and deeper view of your own information.

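          InfoSpider is a crawler toolbox that brings many data sources together. It aims to help users retrieve their own data safely and quickly; the code is open source and the process is transparent. It also provides data analysis, generating chart files from the user's data so that users can understand their own information more intuitively and in more depth.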

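          Supported data sources currently include GitHub, QQ Mail, NetEase Mail, Alibaba Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs, and Jianshu.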

          According to its creator, InfoSpider has the following features:
          • Safe and reliable: the project is open source with concise code; all source is visible and it runs locally.
          • Easy to use: it provides a GUI; click the data source you want and follow the prompts.
          • Clear structure: the data sources are independent of one another and highly portable; all crawler scripts live in the project's Spiders folder.
          • Many data sources: 24+ data sources are supported, with more being added.
          • Uniform data format: all crawled data is stored as JSON, which simplifies later analysis.
          • Rich personal data: the project crawls as much of your personal data as it can; trim it during post-processing as needed.
          • Data analysis: the project provides visual analysis of personal data; only some sources are supported so far.
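Because every spider writes plain JSON, the output folder can be inspected with a few lines of Python. The following is a minimal sketch and not part of InfoSpider: the `summarize_json_dir` helper is hypothetical, and it assumes each file holds a single JSON document (a list or an object).

```python
import json
from pathlib import Path


def summarize_json_dir(directory):
    """Return {file name: number of top-level records} for each *.json file.

    Hypothetical helper for illustration; assumes every file contains
    exactly one JSON document.
    """
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # A list counts its elements; anything else counts as one record.
        summary[path.name] = len(data) if isinstance(data, list) else 1
    return summary


if __name__ == "__main__":
    import tempfile

    # Demo with made-up data written to a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        sample = [{"title": "item A", "price": "9.90"},
                  {"title": "item B", "price": "19.90"}]
        Path(tmp, "shoucang_item.json").write_text(
            json.dumps(sample, ensure_ascii=False), encoding="utf-8")
        print(summarize_json_dir(tmp))
```

Pointing the helper at the folder chosen in the GUI gives a quick count of what each spider saved before any deeper analysis.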

          InfoSpider is also simple to use. Install Python 3 and the Chrome browser, run python3 main.py, click a data source button in the window that opens, choose where to save the data when prompted, and enter your account and password. The data is then crawled automatically, and you can view it in the folder you chose.

          If you would rather practice and learn crawling yourself, the author has open-sourced all of the crawler code, which makes good hands-on material.


          As an example, here is the Taobao crawler:


          import json
          import os
          import random
          import sys
          import time
          from tkinter.filedialog import askdirectory

          import numpy as np
          import requests
          from lxml import etree
          from pyquery import PyQuery as pq
          from selenium import webdriver
          from selenium.webdriver import ActionChains, ChromeOptions
          from selenium.webdriver.common.by import By
          from selenium.webdriver.support import expected_conditions as EC
          from selenium.webdriver.support.wait import WebDriverWait
          from tqdm import trange


          def ease_out_quad(x):
              """Quadratic ease-out: fast start, slow finish."""
              return 1 - (1 - x) * (1 - x)

          def ease_out_quart(x):
              """Quartic ease-out: decelerates more sharply than the quadratic curve."""
              return 1 - pow(1 - x, 4)

          def ease_out_expo(x):
              """Exponential ease-out."""
              if x == 1:
                  return 1
              return 1 - pow(2, -10 * x)

          def get_tracks(distance, seconds, ease_func):
              """Return cumulative offsets and per-step moves for a drag of `distance` pixels."""
              ease = globals()[ease_func]  # look up the easing function by name
              tracks = [0]
              offsets = [0]
              for t in np.arange(0.0, seconds, 0.1):
                  # Cumulative position at time t, rounded to a whole pixel
                  offset = int(round(ease(t / seconds) * distance))
                  tracks.append(offset - offsets[-1])
                  offsets.append(offset)
              return offsets, tracks

          def drag_and_drop(browser, offset=26.5):
              """Drag the slider knob (element id 'nc_1_n1z') along an eased track."""
              knob = browser.find_element(By.ID, 'nc_1_n1z')
              offsets, tracks = get_tracks(offset, 12, 'ease_out_expo')
              ActionChains(browser).click_and_hold(knob).perform()
              for x in tracks:
                  ActionChains(browser).move_by_offset(x, 0).perform()
              ActionChains(browser).pause(0.5).release().perform()

          def gen_session(cookie):
              """Create a requests session from a 'k1=v1; k2=v2' cookie string."""
              session = requests.session()
              cookie_dict = {}
              for pair in cookie.split(';'):
                  # Split on the first '=' only, since cookie values may contain '='
                  name, sep, value = pair.strip().partition('=')
                  if sep:
                      cookie_dict[name] = value
                  else:
                      cookie_dict[''] = pair
              requests.utils.add_dict_to_cookiejar(session.cookies, cookie_dict)
              return session

          class TaobaoSpider(object):
              def __init__(self, cookies_list):
                  self.path = askdirectory(title='Choose a folder to save the data')
                  if str(self.path) == "":
                      sys.exit(1)
                  self.headers = {
                      'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
                  }
                  option = ChromeOptions()
                  option.add_experimental_option('excludeSwitches', ['enable-automation'])
                  # Do not load images, to speed up page loads
                  option.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})
                  option.add_argument('--headless')
                  self.driver = webdriver.Chrome(options=option)
                  # Open the site once so cookies can be set for its domain, then reload with them
                  self.driver.get('https://i.taobao.com/my_taobao.htm')
                  for i in cookies_list:
                      self.driver.add_cookie(cookie_dict=i)
                  self.driver.get('https://i.taobao.com/my_taobao.htm')
                  self.wait = WebDriverWait(self.driver, 20)  # timeout: 20 s

              # Simulate scrolling down the page
              def swipe_down(self, second):
                  for i in range(int(second / 0.1)):
                      # Alternate between two scroll positions based on i to mimic up-and-down scrolling
                      if i % 2 == 0:
                          js = "var q=document.documentElement.scrollTop=" + str(300 + 400 * i)
                      else:
                          js = "var q=document.documentElement.scrollTop=" + str(200 * i)
                      self.driver.execute_script(js)
                      time.sleep(0.1)

                  # Finish by jumping to the bottom of the page
                  js = "var q=document.documentElement.scrollTop=100000"
                  self.driver.execute_script(js)
                  time.sleep(0.1)

              # Crawl Taobao "Items I've bought" order data; pn is the number of pages to crawl
              def crawl_good_buy_data(self, pn=3):
                  # Open the "Items I've bought" page
                  self.driver.get("https://buyertrade.taobao.com/trade/itemlist/list_bought_items.htm")

                  orders = []
                  # Iterate over the requested pages
                  for page in trange(1, pn + 1):
                      # Wait until the order list on this page has loaded
                      self.wait.until(
                          EC.presence_of_element_located((By.CSS_SELECTOR, '#tp-bought-root > div.js-order-container')))

                      # Parse the page source with pyquery
                      doc = pq(self.driver.page_source)

                      # All orders on this page
                      good_items = doc('#tp-bought-root .js-order-container').items()

                      for item in good_items:
                          # Purchase time and order number
                          good_time_and_id = item.find('.bought-wrapper-mod__head-info-cell___29cDO').text().replace('\n', "").replace('\r', "")
                          # Seller name
                          good_merchant = item.find('.bought-wrapper-mod__seller-container___3dAK3').text().replace('\n', "").replace('\r', "")
                          # Item name
                          good_name = item.find('.sol-mod__no-br___3Ev-2').text().replace('\n', "").replace('\r', "")
                          # Item price
                          good_price = item.find('.price-mod__price___cYafX').text().replace('\n', "").replace('\r', "")
                          # Only these four fields are collected; try extracting the rest yourself
                          orders.append({
                              'time_and_id': good_time_and_id,
                              'merchant': good_merchant,
                              'name': good_name,
                              'price': good_price,
                          })

                      # Scroll for a random 1-3 seconds to mimic a person browsing,
                      # which lowers the risk of being flagged as a bot
                      swipe_time = random.randint(1, 3)
                      self.swipe_down(swipe_time)

                      if page < pn:
                          # Wait for the "next page" button, then click it
                          next_button = self.wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.pagination-next')))
                          next_button.click()
                          time.sleep(2)

                  # Write once at the end so the file holds a single valid JSON array
                  with open(self.path + os.sep + 'user_orders.json', 'w', encoding='utf-8') as f:
                      json.dump(orders, f, ensure_ascii=False)

              # Favorited items; `page` is the number of pages to crawl (default 3).
              # startRow in the URL advances by 30 per request.
              def get_choucang_item(self, page=3):
                  url = 'https://shoucang.taobao.com/nodejs/item_collect_chunk.htm?ifAllTag=0&tab=0&tagId=&categoryCount=0&type=0&tagName=&categoryName=&needNav=false&startRow={}'
                  pn = 0
                  json_list = []
                  for _ in trange(page):
                      self.driver.get(url.format(pn))
                      pn += 30
                      html_str = self.driver.page_source
                      if html_str == '':
                          break
                      # '登录' ("log in") in the page means the session is not authenticated
                      if '登录' in html_str:
                          raise Exception('login required')
                      obj_list = etree.HTML(html_str).xpath('//li')
                      for obj in obj_list:
                          item = {}
                          item['title'] = ''.join([s.strip() for s in obj.xpath('./div[@class="img-item-title"]//text()')])
                          item['url'] = ''.join([s.strip() for s in obj.xpath('./div[@class="img-item-title"]/a/@href')])
                          item['price'] = ''.join([s.strip() for s in obj.xpath('./div[@class="price-container"]//text()')])
                          if item['price'] == '':
                              item['price'] = '失效'  # no price shown: the listing is no longer available
                          json_list.append(item)
                  with open(self.path + os.sep + 'shoucang_item.json', 'w', encoding='utf-8') as f:
                      json.dump(json_list, f, ensure_ascii=False)

              # Browsing history ("footprints"); `page` is the number of scroll batches to read (default 3)
              def get_footmark_item(self, page=3):
                  url = 'https://www.taobao.com/markets/footmark/tbfoot'
                  self.driver.get(url)
                  item_num = 0
                  json_list = []
                  for _ in trange(page):
                      html_str = self.driver.page_source
                      # Skip items already collected in earlier batches
                      obj_list = etree.HTML(html_str).xpath('//div[@class="item-list J_redsList"]/div')[item_num:]
                      for obj in obj_list:
                          item_num += 1
                          item = {}
                          item['date'] = ''.join([s.strip() for s in obj.xpath('./@data-date')])
                          item['url'] = ''.join([s.strip() for s in obj.xpath('./a/@href')])
                          item['name'] = ''.join([s.strip() for s in obj.xpath('.//div[@class="title"]//text()')])
                          item['price'] = ''.join([s.strip() for s in obj.xpath('.//div[@class="price-box"]//text()')])
                          json_list.append(item)
                      # Scroll to the bottom to trigger loading of the next batch
                      self.driver.execute_script('window.scrollTo(0,1000000)')
                      time.sleep(1)  # give the next batch a moment to load
                  with open(self.path + os.sep + 'footmark_item.json', 'w', encoding='utf-8') as f:
                      json.dump(json_list, f, ensure_ascii=False)

              # Shipping addresses
              def get_addr(self):
                  url = 'https://member1.taobao.com/member/fresh/deliver_address.htm'
                  self.driver.get(url)
                  html_str = self.driver.page_source
                  obj_list = etree.HTML(html_str).xpath('//tbody[@class="next-table-body"]/tr')
                  data_list = []
                  for obj in obj_list:
                      # Each field is the list of text nodes found in that table cell
                      item = {}
                      item['name'] = obj.xpath('.//td[1]//text()')
                      item['area'] = obj.xpath('.//td[2]//text()')
                      item['detail_area'] = obj.xpath('.//td[3]//text()')
                      item['youbian'] = obj.xpath('.//td[4]//text()')  # postal code
                      item['mobile'] = obj.xpath('.//td[5]//text()')
                      data_list.append(item)
                  with open(self.path + os.sep + 'address.json', 'w', encoding='utf-8') as f:
                      json.dump(data_list, f, ensure_ascii=False)


          if __name__ == '__main__':
              # Load previously exported cookies (a list of cookie dicts)
              with open('taobao_cookies.json', 'r', encoding='utf-8') as f:
                  cookie_list = json.load(f)
              t = TaobaoSpider(cookie_list)
              t.crawl_good_buy_data()
              # t.get_addr()
              # t.get_choucang_item()
              # t.get_footmark_item()
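
The slider helpers in the listing turn one total drag distance into many small mouse moves whose sizes follow an easing curve, so the knob starts fast and slows near the end. Here is a minimal sketch of that idea without Selenium or NumPy. It assumes a hypothetical `build_tracks` helper that takes the easing function as a callable and a fixed step count, rather than a function name and a duration as in the listing:

```python
def ease_out_expo(x):
    """Exponential ease-out: maps [0, 1] to [0, 1], fast start, slow finish."""
    return 1 if x == 1 else 1 - pow(2, -10 * x)


def build_tracks(distance, steps, ease):
    """Split `distance` pixels into `steps` integer moves shaped by `ease`.

    Hypothetical helper for illustration. Returns the per-step deltas;
    because ease(1) == 1, they add up to round(distance).
    """
    deltas = []
    previous = 0
    for step in range(1, steps + 1):
        # Cumulative position after this step, rounded to whole pixels.
        position = round(ease(step / steps) * distance)
        deltas.append(position - previous)
        previous = position
    return deltas


if __name__ == "__main__":
    deltas = build_tracks(distance=260, steps=10, ease=ease_out_expo)
    print(deltas)       # large moves first, small moves last
    print(sum(deltas))  # total distance covered
```

Each delta would be passed to one `move_by_offset(delta, 0)` call. Unlike the listing's time-based loop, this variant evaluates the curve at its end point, so the moves always cover the full distance.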


          GitHub: https://github.com/kangvcar/InfoSpider


          Bilibili walkthrough: https://www.bilibili.com/video/BV14f4y1R7oF/
