<kbd id="afajh"><form id="afajh"></form></kbd>
<strong id="afajh"><dl id="afajh"></dl></strong>
    <del id="afajh"><form id="afajh"></form></del>
        1. <th id="afajh"><progress id="afajh"></progress></th>
          <b id="afajh"><abbr id="afajh"></abbr></b>
          <th id="afajh"><progress id="afajh"></progress></th>

          WebCollector-Python: an open-source web crawler framework based on Python


          WebCollector-Python

          WebCollector-Python is a Python crawler framework (kernel) that requires no configuration and is easy to build on. It provides a concise API, so a powerful crawler can be implemented with only a small amount of code.

          WebCollector (Java version)

          Compared with WebCollector-Python, the Java version of WebCollector is more efficient: https://github.com/CrawlScript/WebCollector

          Installation

          Install with pip:

          pip install https://github.com/CrawlScript/WebCollector-Python/archive/master.zip
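
          To confirm the install, a quick import check is enough (a minimal sketch; it only assumes the package is importable as webcollector, as in the examples below):

          # coding=utf-8
          import webcollector as wc

          print("webcollector loaded from:", wc.__file__)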

          Examples

          Basic

          Quick Start

          Automatic URL detection

          demo_auto_news_crawler.py:

          # coding=utf-8
          import webcollector as wc
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=True)
                  self.num_threads = 10
                  self.add_seed("https://github.blog/")
                  self.add_regex("+https://github.blog/[0-9]+.*")
                  self.add_regex("-.*#.*")  # do not detect urls that contain "#"
          
              def visit(self, page, detected):
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
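
          In add_regex, the leading sign marks the rule type: patterns prefixed with "+" whitelist URLs for detection, patterns prefixed with "-" blacklist them, and a URL is followed only if it matches a positive rule and no negative rule. The standalone sketch below only illustrates that filtering idea; it is not the framework's implementation, and url_allowed is a hypothetical helper:

          # coding=utf-8
          import re


          def url_allowed(url, rules):
              """Keep a URL only if it matches some "+" rule and no "-" rule (illustrative)."""
              positives = [r[1:] for r in rules if r.startswith("+")]
              negatives = [r[1:] for r in rules if r.startswith("-")]
              if not any(re.fullmatch(p, url) for p in positives):
                  return False
              return not any(re.fullmatch(n, url) for n in negatives)


          rules = ["+https://github.blog/[0-9]+.*", "-.*#.*"]
          print(url_allowed("https://github.blog/2019-03-29-example-post/", rules))           # True
          print(url_allowed("https://github.blog/2019-03-29-example-post/#comments", rules))  # False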

          Manual URL detection

          With auto_detect=False the crawler does not discover follow-up URLs on its own; instead, the visit method adds them to detected explicitly via page.links:

          demo_manual_news_crawler.py:

          # coding=utf-8
          import webcollector as wc
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=False)
                  self.num_threads = 10
                  self.add_seed("https://github.blog/")
          
              def visit(self, page, detected):
          
                  detected.extend(page.links("https://github.blog/[0-9]+.*"))
          
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)

          Filtering detected URLs with a detected_filter plugin

          demo_detected_filter.py:

          # coding=utf-8
          import webcollector as wc
          from webcollector.filter import Filter
          import re
          
          
          class RegexDetectedFilter(Filter):
              def filter(self, crawl_datum):
                  if re.fullmatch("https://github.blog/2019-02.*", crawl_datum.url):
                      return crawl_datum
                  else:
                      print("filtered by detected_filter: {}".format(crawl_datum.brief_info()))
                      return None
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=True, detected_filter=RegexDetectedFilter())
                  self.num_threads = 10
                  self.add_seed("https://github.blog/")
          
              def visit(self, page, detected):
          
                  detected.extend(page.links("https://github.blog/[0-9]+.*"))
          
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
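
          A detected_filter only needs a filter(crawl_datum) method that returns the datum to keep it or None to drop it, so other policies can be plugged in the same way. A minimal sketch of a prefix-based filter (PrefixDetectedFilter and its allow-list are hypothetical, not part of the library):

          # coding=utf-8
          from webcollector.filter import Filter


          class PrefixDetectedFilter(Filter):
              # hypothetical allow-list; adjust it to the site being crawled
              allowed_prefixes = ("https://github.blog/2019-",)

              def filter(self, crawl_datum):
                  if crawl_datum.url.startswith(self.allowed_prefixes):
                      return crawl_datum
                  print("filtered by PrefixDetectedFilter: {}".format(crawl_datum.brief_info()))
                  return None

          It is passed to the crawler exactly like RegexDetectedFilter above, via the detected_filter argument of the constructor.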

          Resumable crawling with RedisCrawler (the crawl can be resumed after a shutdown)

          demo_redis_crawler.py:

          # coding=utf-8
          from redis import StrictRedis
          import webcollector as wc
          
          
          class NewsCrawler(wc.RedisCrawler):
          
              def __init__(self):
                  super().__init__(redis_client=StrictRedis("127.0.0.1"),
                                   db_prefix="news",
                                   auto_detect=True)
                  self.num_threads = 10
                  self.resumable = True # you can resume crawling after shutdown
                  self.add_seed("https://github.blog/")
                  self.add_regex("+https://github.blog/[0-9]+.*")
                  self.add_regex("-.*#.*")  # do not detect urls that contain "#"
          
              def visit(self, page, detected):
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
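
          RedisCrawler keeps its crawl state in Redis, so a Redis server has to be reachable before the crawler starts. A small pre-flight check using redis-py, which the example above already depends on (the host and port are assumptions for a default local install):

          # coding=utf-8
          import redis
          from redis import StrictRedis


          def redis_available(host="127.0.0.1", port=6379):
              """Return True if a Redis server answers PING (defaults assume a local install)."""
              try:
                  return StrictRedis(host, port).ping()
              except redis.exceptions.ConnectionError:
                  return False


          if not redis_available():
              raise SystemExit("Redis is not reachable; start it before running the crawler.")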
          

          Customizing HTTP requests with Requests

          demo_custom_http_request.py:

          # coding=utf-8
          
          import webcollector as wc
          from webcollector.model import Page
          from webcollector.plugin.net import HttpRequester
          
          import requests
          
          
          class MyRequester(HttpRequester):
              def get_response(self, crawl_datum):
                  # custom http request
                  headers = {
                      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
                  }
          
                  print("sending request with MyRequester")
          
                  # send request and get response
                  response = requests.get(crawl_datum.url, headers=headers)
          
                  # update code
                  crawl_datum.code = response.status_code
          
                  # wrap http response as a Page object
                  page = Page(crawl_datum,
                              response.content,
                              content_type=response.headers["Content-Type"],
                              http_charset=response.encoding)
          
                  return page
          
          
          class NewsCrawler(wc.RamCrawler):
              def __init__(self):
                  super().__init__(auto_detect=True)
                  self.num_threads = 10
          
                  # set requester to enable MyRequester
                  self.requester = MyRequester()
          
                  self.add_seed("https://github.blog/")
                  self.add_regex("+https://github.blog/[0-9]+.*")
                  self.add_regex("-.*#.*")  # do not detect urls that contain "#"
          
              def visit(self, page, detected):
                  if page.match_url("https://github.blog/[0-9]+.*"):
                      title = page.select("h1.lh-condensed")[0].text.strip()
                      content = page.select("div.markdown-body")[0].text.replace("\n", " ").strip()
                      print("\nURL: ", page.url)
                      print("TITLE: ", title)
                      print("CONTENT: ", content[:50], "...")
          
          
          crawler = NewsCrawler()
          crawler.start(10)
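
          Because a requester only has to set crawl_datum.code and return a Page built from the response, variations are easy to write. Below is a sketch of a requester that reuses one requests.Session with retries and a timeout; SessionRequester, its retry settings and the timeout value are illustrative choices rather than WebCollector-Python defaults, and it assumes HttpRequester needs no constructor arguments (as MyRequester() above suggests):

          # coding=utf-8
          import requests
          from requests.adapters import HTTPAdapter
          from urllib3.util.retry import Retry

          from webcollector.model import Page
          from webcollector.plugin.net import HttpRequester


          class SessionRequester(HttpRequester):
              def __init__(self):
                  super().__init__()
                  # one session for all requests: connection reuse plus retries on transient errors
                  self.session = requests.Session()
                  retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[502, 503, 504])
                  self.session.mount("https://", HTTPAdapter(max_retries=retries))
                  self.session.headers["User-Agent"] = "Mozilla/5.0"

              def get_response(self, crawl_datum):
                  # send request with the shared session and a timeout
                  response = self.session.get(crawl_datum.url, timeout=10)

                  # update code
                  crawl_datum.code = response.status_code

                  # wrap http response as a Page object
                  return Page(crawl_datum,
                              response.content,
                              content_type=response.headers["Content-Type"],
                              http_charset=response.encoding)

          It is enabled the same way as MyRequester, by assigning an instance to self.requester in the crawler's constructor.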
          <kbd id="afajh"><form id="afajh"></form></kbd>
          <strong id="afajh"><dl id="afajh"></dl></strong>
            <del id="afajh"><form id="afajh"></form></del>
                1. <th id="afajh"><progress id="afajh"></progress></th>
                  <b id="afajh"><abbr id="afajh"></abbr></b>
                  <th id="afajh"><progress id="afajh"></progress></th>
                  亚洲污视频在线观看 | 国产精品久久久久久久久毛毛 | 蜜桃四季春秘 一区二区三区 | 91成长视频蘑菇视频在线观看 | 操穴网|