<kbd id="afajh"><form id="afajh"></form></kbd>
<strong id="afajh"><dl id="afajh"></dl></strong>
    <del id="afajh"><form id="afajh"></form></del>
        1. <th id="afajh"><progress id="afajh"></progress></th>
          <b id="afajh"><abbr id="afajh"></abbr></b>
          <th id="afajh"><progress id="afajh"></progress></th>

          WebCollector-Python: an open-source web crawler framework based on Python


          WebCollector-Python

          WebCollector-Python is a Python crawler framework (kernel) that requires no configuration and is easy to build on. It provides a concise API, so a powerful crawler can be implemented with only a small amount of code.

          WebCollector (Java version)

          Compared with WebCollector-Python, the Java version of WebCollector is more efficient: https://github.com/CrawlScript/WebCollector

          Installation

          Install with pip:

          pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip
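
          To confirm the install, a quick import check is enough (a minimal sketch; it only assumes the package is importable as webcollector, as in the examples below):

          # coding=utf-8
          import webcollector as wc

          print("webcollector loaded from:", wc.__file__)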

          Examples

          Basic

          Quick Start

          Automatic URL detection

          demo_auto_news_crawler.py:

          # coding=utf-8
          import webcollector as wc
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=True)
                  self.num_threads = 10
                  self.add_seed("https://github.blog/")
                  self.add_regex("+https://github.blog/[0-9]+.*")
                  self.add_regex("-.*#.*")  # do not detect urls that contain "#"
          
              def visit(self, page, detected):
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
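
          In add_regex, the leading sign marks the rule type: patterns prefixed with "+" whitelist URLs for detection, patterns prefixed with "-" blacklist them, and a URL is followed only if it matches a positive rule and no negative rule. The standalone sketch below only illustrates that filtering idea; it is not the framework's implementation, and url_allowed is a hypothetical helper:

          # coding=utf-8
          import re


          def url_allowed(url, rules):
              """Keep a URL only if it matches some "+" rule and no "-" rule (illustrative)."""
              positives = [r[1:] for r in rules if r.startswith("+")]
              negatives = [r[1:] for r in rules if r.startswith("-")]
              if not any(re.fullmatch(p, url) for p in positives):
                  return False
              return not any(re.fullmatch(n, url) for n in negatives)


          rules = ["+https://github.blog/[0-9]+.*", "-.*#.*"]
          print(url_allowed("https://github.blog/2019-03-29-example-post/", rules))           # True
          print(url_allowed("https://github.blog/2019-03-29-example-post/#comments", rules))  # False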

          Manual URL detection

          With auto_detect=False the crawler does not discover follow-up URLs on its own; instead, the visit method adds them to detected explicitly via page.links:

          demo_manual_news_crawler.py:

          # coding=utf-8
          import webcollector as wc
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=False)
                  self.num_threads = 10
                  self.add_seed("https://github.blog/")
          
              def visit(self, page, detected):
          
                  detected.extend(page.links("https://github.blog/[0-9]+.*"))
          
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)

          Filtering detected URLs with a detected_filter plugin

          demo_detected_filter.py:

          # coding=utf-8
          import webcollector as wc
          from webcollector.filter import Filter
          import re
          
          
          class RegexDetectedFilter(Filter):
              def filter(self, crawl_datum):
                  if re.fullmatch("https://github.blog/2019-02.*", crawl_datum.url):
                      return crawl_datum
                  else:
                      print("filtered by detected_filter: {}".format(crawl_datum.brief_info()))
                      return None
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=True, detected_filter=RegexDetectedFilter())
                  self.num_threads = 10
                  self.add_seed("https://github.blog/")
          
              def visit(self, page, detected):
          
                  detected.extend(page.links("https://github.blog/[0-9]+.*"))
          
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
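
          A detected_filter only needs a filter(crawl_datum) method that returns the datum to keep it or None to drop it, so other policies can be plugged in the same way. A minimal sketch of a prefix-based filter (PrefixDetectedFilter and its allow-list are hypothetical, not part of the library):

          # coding=utf-8
          from webcollector.filter import Filter


          class PrefixDetectedFilter(Filter):
              # hypothetical allow-list; adjust it to the site being crawled
              allowed_prefixes = ("https://github.blog/2019-",)

              def filter(self, crawl_datum):
                  if crawl_datum.url.startswith(self.allowed_prefixes):
                      return crawl_datum
                  print("filtered by PrefixDetectedFilter: {}".format(crawl_datum.brief_info()))
                  return None

          It is passed to the crawler exactly like RegexDetectedFilter above, via the detected_filter argument of the constructor.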

          Resumable crawling with RedisCrawler (the crawl can be resumed after a shutdown)

          demo_redis_crawler.py:

          # coding=utf-8
          from redis import StrictRedis
          import webcollector as wc
          
          
          class NewsCrawler(wc.RedisCrawler):
          
              def __init__(self):
                  super().__init__(redis_client=StrictRedis("127.0.0.1"),
                                   db_prefix="news",
                                   auto_detect=True)
                  self.num_threads = 10
                  self.resumable = True # you can resume crawling after shutdown
                  self.add_seed("https://github.blog/")
                  self.add_regex("+https://github.blog/[0-9]+.*")
                  self.add_regex("-.*#.*")  # do not detect urls that contain "#"
          
              def visit(self, page, detected):
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
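
          RedisCrawler keeps its crawl state in Redis, so a Redis server has to be reachable before the crawler starts. A small pre-flight check using redis-py, which the example above already depends on (the host and port are assumptions for a default local install):

          # coding=utf-8
          import redis
          from redis import StrictRedis


          def redis_available(host="127.0.0.1", port=6379):
              """Return True if a Redis server answers PING (defaults assume a local install)."""
              try:
                  return StrictRedis(host, port).ping()
              except redis.exceptions.ConnectionError:
                  return False


          if not redis_available():
              raise SystemExit("Redis is not reachable; start it before running the crawler.")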
          

          Customizing HTTP requests with Requests

          demo_custom_http_request.py:

          # coding=utf-8
          
          import webcollector as wc
          from webcollector.model import Page
          from webcollector.plugin.net import HttpRequester
          
          import requests
          
          
          class MyRequester(HttpRequester):
              def get_response(self, crawl_datum):
                  # custom http request
                  headers = {
                      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
                  }
          
                  print("sending request with MyRequester")
          
                  # send request and get response
                  response = requests.get(crawl_datum.url, headers=headers)
          
                  # update code
                  crawl_datum.code = response.status_code
          
                  # wrap http response as a Page object
                  page = Page(crawl_datum,
                              response.content,
                              content_type=response.headers["Content-Type"],
                              http_charset=response.encoding)
          
                  return page
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=True)
                  self.num_threads = 10
          
                  # set requester to enable MyRequester
                  self.requester = MyRequester()
          
                  self.add_seed("https://github.blog/")
                  self.add_regex("+https://github.blog/[0-9]+.*")
                  self.add_regex("-.*#.*")  # do not detect urls that contain "#"
          
              def visit(self, page, detected):
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
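
          Because a requester only has to set crawl_datum.code and return a Page built from the response, variations are easy to write. Below is a sketch of a requester that reuses one requests.Session with retries and a timeout; SessionRequester, its retry settings and the timeout value are illustrative choices rather than WebCollector-Python defaults, and it assumes HttpRequester needs no constructor arguments (as MyRequester() above suggests):

          # coding=utf-8
          import requests
          from requests.adapters import HTTPAdapter
          from urllib3.util.retry import Retry

          from webcollector.model import Page
          from webcollector.plugin.net import HttpRequester


          class SessionRequester(HttpRequester):
              def __init__(self):
                  super().__init__()
                  # one session for all requests: connection reuse plus retries on transient errors
                  self.session = requests.Session()
                  retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[502, 503, 504])
                  self.session.mount("https://", HTTPAdapter(max_retries=retries))
                  self.session.headers["User-Agent"] = "Mozilla/5.0"

              def get_response(self, crawl_datum):
                  # send request with the shared session and a timeout
                  response = self.session.get(crawl_datum.url, timeout=10)

                  # update code
                  crawl_datum.code = response.status_code

                  # wrap http response as a Page object
                  return Page(crawl_datum,
                              response.content,
                              content_type=response.headers["Content-Type"],
                              http_charset=response.encoding)

          It is enabled the same way as MyRequester, by assigning an instance to self.requester in the crawler's constructor.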
          <kbd id="afajh"><form id="afajh"></form></kbd>
          <strong id="afajh"><dl id="afajh"></dl></strong>
            <del id="afajh"><form id="afajh"></form></del>
                1. <th id="afajh"><progress id="afajh"></progress></th>
                  <b id="afajh"><abbr id="afajh"></abbr></b>
                  <th id="afajh"><progress id="afajh"></progress></th>
                  亚洲污视频在线观看 | 国产精品久久久久久久久毛毛 | 蜜桃四季春秘 一区二区三区 | 91成长视频蘑菇视频在线观看 | 操穴网|