Building a Java Concurrent Crawler Framework from 0 to 1 (Complete)
Tip: reading this article thoroughly takes about 15 minutes.
Table of Contents
Preface
Crawler Principles and Workflow
Building the Crawler Framework
I. Create a Spring Boot Project
II. Free Proxy Service
III. HTTP Request Wrapper
IV. Business Crawling Logic
V. Cloud Database
VI. Task Scheduling
Preface
"Data is the most important means of production in this round of technological revolution." In the internet industry it is roughly true that whoever owns the data owns the market. Crawlers are everywhere on the internet: the major search engines at home and abroad all crawl information before indexing it, so the claim that "50% of internet traffic is generated by crawlers" is no exaggeration. Over the past two years, riding the wave of big data, crawler technology has moved from backstage to center stage; it is increasingly well known and increasingly widely applied, from individuals scraping data for analysis and modeling, to companies building their content and data ecosystems on crawlers. Crawling has become a "democratized" skill that is useful to almost everyone at work and in study.
Crawler Principles and Workflow
This Chat focuses on how to build a crawler framework, but since readers' familiarity with crawlers varies, let's first briefly go over the principles and workflow.
The principle of a crawler is very simple: it is the same thing that happens when you type a URL into a browser and press Enter (readers interested in that question can look up "What happens after you enter a URL in the browser and press Enter?"); the only difference is that a crawler performs those steps in code.
A crawler's main goal is to fetch the target data, but reaching that goal takes a lot of supporting work: extracting seed URLs and analyzing pages up front, getting past login restrictions, hiding the crawler's identity, scheduling and fault tolerance, and finally data cleaning and storage. A complete crawler program covers the whole pipeline from analysis through crawling to storage; to make it concrete, here it is as a diagram:
[Figure: end-to-end single-path crawler workflow, from page analysis and crawling through to storage]
Of course, that is only a single-threaded crawl flow. Once you bring in distribution and concurrency, you also need task scheduling on top. For example, suppose the URLs have already been parsed and found to be paginated, with pageNumber as the page index: you can then run 10 tasks on 10 concurrent threads, each task crawling the data of one specific URL, which greatly improves crawling throughput.
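To make this concrete, here is a minimal, self-contained sketch of that idea; the URL pattern and page count are illustrative only, not from the project:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PageCrawlDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10); // 10 concurrent crawl tasks
        for (int pageNumber = 1; pageNumber <= 10; pageNumber++) {
            final int page = pageNumber;
            pool.execute(() -> {
                // each task crawls one page of the paginated URL
                String url = "https://example.com/list?pageNumber=" + page;
                System.out.println(Thread.currentThread().getName() + " crawling " + url);
                // ... issue the HTTP request and parse the response here ...
            });
        }
        pool.shutdown();                             // stop accepting new tasks
        pool.awaitTermination(60, TimeUnit.SECONDS); // wait for the batch to finish
    }
}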
I also mentioned fault tolerance earlier, so a word on that too. While crawling, you may hit unexpected failures: the target server misbehaves, the network drops, or the other side applies anti-crawling measures. Each case needs handling, and the usual answer is a retry mechanism. You can decide which exception types deserve a retry; a failed request, for instance, may mean you were blocked, or the network merely flickered, so retry twice and see. If it still fails, give up on this crawl, record the failed task, and hand it back later when the system is idle or once failed-task rescheduling is in place, waiting for the next scheduling round. Also set a maximum number of scheduling failures (say 3); past that limit, remove the task for good.
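As a sketch of this retry idea (the limits below just mirror the numbers used above; the failing task is illustrative):
import java.util.concurrent.Callable;

public class RetryDemo {
    // run a task, retrying up to maxRetries extra times on failure; returns null when it
    // still fails, so the caller can record the task for a later re-schedule
    static <T> T runWithRetry(Callable<T> task, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                System.err.println("attempt " + (attempt + 1) + " failed: " + e.getMessage());
            }
        }
        return null; // give up; persist the task and let the scheduler retry later
    }

    public static void main(String[] args) {
        String page = runWithRetry(() -> {
            // a request that may fail because of blocking or network jitter
            throw new RuntimeException("connection reset");
        }, 2);
        System.out.println(page == null ? "task recorded for re-schedule" : page);
    }
}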
Page parsing, meanwhile, splits into two cases. If the seed URL's response comes back as JSON or XML, FastJson parses the JSON directly. Usually it is not that convenient, though, and we have to parse the page's HTML document, which is where jsoup comes in to extract the data we need. All of these steps and their pitfalls will come up again while building the framework.
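A quick sketch of both cases, using the FastJson and jsoup libraries from the project's pom (the sample JSON and HTML are made up for illustration):
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseDemo {
    public static void main(String[] args) {
        // case 1: the response is JSON -- FastJson maps it straight to objects
        JSONObject obj = JSON.parseObject("{\"title\":\"hello\",\"id\":42}");
        System.out.println(obj.getString("title") + " / " + obj.getIntValue("id"));

        // case 2: the response is an HTML document -- jsoup queries it with CSS selectors
        Document doc = Jsoup.parse("<html><body><td data-title=\"IP\">1.2.3.4</td></body></html>");
        System.out.println(doc.select("td[data-title=IP]").text()); // 1.2.3.4
    }
}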
Building the Crawler Framework
The main framework is Spring Boot, the currently popular microservice framework; the HTTP layer is wrapped around HttpClient; page parsing uses FastJson plus jsoup; and storage uses the Bmob backend cloud database. The overall architecture:
[Figure: overall project architecture]
The framework splits into 4 layers/modules by function and business: the proxy layer, the HTTP request layer, the business layer, and the scheduling layer. We will walk through each layer's implementation in turn, starting with creating the project.
Note: while studying and reproducing the framework, it helps to keep the source project open in your editor and read the article against it. Not all the code can be pasted here, and the omitted parts can be found in the source, which makes the framework much easier to follow.
Framework source code:
https://github.com/AlanYangs/spiders/tree/master/spider-framework
I. Create a Spring Boot Project
This project is built in IDEA (the "strongest IDE in the universe"). Create a new Spring Boot project and pick JDK 1.8 (readers familiar with this can skip this subsection):

Next, enter the project's group and artifact:

Next, choose the dependencies; for now Web is enough:

Next, enter the module name, and the project is created.

Then create the packages we will need later; the overall project structure and the core package names are shown below:
[Figure: project structure and core package names]
Here are the dependencies the project needs, given directly as the corresponding pom entries:
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.5</version>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>18.0</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.2</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.51</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
Next we implement each layer's functions and services, following the layering in the architecture diagram.
II. Free Proxy Service
For a crawler to stay healthy over the long run, it must not run directly on the local IP, which is where a proxy pool comes in: every request goes out through a proxy, hiding the crawler's identity. Plenty of companies sell proxies, and they are not cheap; with a cap of 1,000 IPs per day and tiers by stability duration, the monthly prices look like this:
[Screenshot: monthly pricing of a commercial proxy service]
This pricing is not pretty, which is why I wanted to build a proxy pool myself out of free proxies. The idea: the big proxy sites all publish some free IP lists (for example, Kuaidaili), so we can scrape a batch, verify their usability, and add the good ones to our own pool. Since both fetching and validating IPs take time, we need a cache to hold them. Redis would work, but to keep deployment simple this project uses Google's Guava Cache as a local store for the proxy IPs.
Let's wrap a CacheService to make it convenient to use:
@Service
public class CacheService<K, V> {
    private static final Logger LOGGER = LoggerFactory.getLogger(CacheService.class);
    private int cacheMaxSize = 3000;

    public void setCacheMaxSize(int cacheMaxSize) {
        this.cacheMaxSize = cacheMaxSize; // note: no effect once the cache field below has been initialized
    }

    // the local cache holding the proxies
    private Cache<K, V> cache = CacheBuilder.newBuilder()
            .maximumSize(cacheMaxSize)
            .removalListener(new RemovalListener<K, V>() { // callback invoked when an entry leaves the cache
                @Override
                public void onRemoval(RemovalNotification<K, V> notification) {
                    if (notification.wasEvicted()) {
                        LOGGER.debug("key[" + notification.getKey() + "] was removed with expired.");
                    } else {
                        LOGGER.debug("key[" + notification.getKey() + "] was updated with put operation.");
                    }
                }
            })
            .build();

    public boolean containsKey(K key) {
        try {
            return cache.getIfPresent(key) != null;
        } catch (Exception e) {
            return false;
        }
    }

    public long size() {
        return cache.size();
    }

    public V get(K key) {
        return cache.getIfPresent(key);
    }

    public ConcurrentMap<K, V> getAll() {
        return cache.asMap();
    }

    public List<V> getValues() {
        return new ArrayList<>(getAll().values());
    }

    public void put(K key, V value) {
        cache.put(key, value);
        LOGGER.debug(String.format("put key %s with value %s to cache...", key, JSON.toJSONString(value)));
    }

    public void remove(K key) {
        if (containsKey(key)) {
            cache.invalidate(key);
        }
    }
}
Since proxies are gathered from several free-proxy sites, the Factory pattern fits here; the class hierarchy looks like this:
[Figure: class diagram of the proxy-fetcher hierarchy]
Each concrete proxy fetcher implements its own crawl method and registers itself with the factory class ProxyFetchFactory at initialization; the factory then builds the pool and validates the proxies on multiple threads. Here is the ProxyFetchFactory implementation:
@Service
public class ProxyFetchFactory {
    private static final Logger LOGGER = LoggerFactory.getLogger(ProxyFetchFactory.class);
    private static final int MAX_TIMEOUT_MS = 2000; // ms
    private List<AbstractProxyFetcher> proxyServices = new ArrayList<>();
    private CacheService<String, HttpHost> cacheService = SpringContext.getBean(CacheService.class);

    public List<HttpHost> getProxyHosts() {
        return cacheService.getValues();
    }

    public void addService(AbstractProxyFetcher proxyService) {
        proxyServices.add(proxyService);
    }

    public void buildProxy() {
        List<HttpProxy> proxies = null;
        List<Future<?>> futures = new ArrayList<>();
        for (AbstractProxyFetcher proxyService : proxyServices) {
            if (!proxyService.getBusiness().equals("github")) { // only the "github" fetcher is enabled here; drop this check to run every registered fetcher
                continue;
            }
            try {
                proxies = proxyService.fetchProxy();
            } catch (Exception e) {
                LOGGER.error("fetch " + proxyService.getBusiness() + " proxy meet error", e);
                continue;
            }
            if (proxies != null && !proxies.isEmpty()) {
                ExecutorService es = Executors.newFixedThreadPool(10);
                for (final HttpProxy proxy : proxies) {
                    futures.add(es.submit(new Runnable() {
                        @Override
                        public void run() {
                            proxy.setValid(checkHost(proxy)); // validate each candidate proxy concurrently
                        }
                    }));
                }
                es.shutdown();
                for (Future<?> future : futures) {
                    try {
                        future.get();
                    } catch (Exception e) {
                        // ignore: a failed check simply leaves that proxy out of the cache
                    }
                }
            }
        }
        LOGGER.info("valid host total size: " + getProxyHosts().size());
    }

    private boolean checkHost(final HttpProxy httpProxy) {
        boolean isValid = false;
        ExecutorService executor = Executors.newSingleThreadExecutor();
        FutureTask<String> future = new FutureTask<>(new Callable<String>() { // wrap the probe in a Callable so it can be timed out
            public String call() {
                // run the probe request asynchronously
                return request(httpProxy);
            }
        });
        executor.execute(future);
        String content = null;
        try {
            content = future.get(MAX_TIMEOUT_MS, TimeUnit.MILLISECONDS); // wait at most MAX_TIMEOUT_MS for the result
        } catch (Exception e) {
            future.cancel(true);
        } finally {
            executor.shutdown();
        }
        // the probe fetches the Baidu homepage, so finding "百度" in the body proves the proxy relayed real content
        isValid = StringUtils.isNotBlank(content) && content.indexOf("百度") > 0;
        LOGGER.info(httpProxy.getProvider() + " - " + httpProxy.getAddress() + " : " + isValid);
        if (isValid) {
            cacheService.put(httpProxy.getAddress() + ":" + httpProxy.getPort(), httpProxy.toHost());
            LOGGER.info("cache size:" + cacheService.size());
        }
        return isValid;
    }

    private String request(HttpProxy httpProxy) {
        String content = null;
        InetSocketAddress addr = null;
        URLConnection conn = null;
        InputStream in = null;
        try {
            // probe through java.net.Proxy
            URL url = new URL("http://www.baidu.com");
            // build the proxy endpoint
            addr = new InetSocketAddress(httpProxy.getAddress(), httpProxy.getPort());
            Proxy proxy = new Proxy(Proxy.Type.HTTP, addr); // HTTP proxy
            conn = url.openConnection(proxy);
            in = conn.getInputStream();
            content = IOUtils.toString(in);
        } catch (Exception e) {
            return null;
        } finally {
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return content;
    }
}
The pool is built and the proxies are checked concurrently in buildProxy(). Every fetched free proxy still has to be verified, because some are stale or were never usable and must be filtered out; checkHost() does the validation, and the logic is simply to attach the candidate proxy to an HTTP request and see whether it goes through, with a timeout or failure marking the proxy invalid.
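The abstract parent AbstractProxyFetcher is not shown in this article. Here is a minimal sketch of what it might look like, with the self-registration described above wired through @PostConstruct; the field and method names are assumptions, so check the source project for the real class:
public abstract class AbstractProxyFetcher {

    @Autowired
    private ProxyFetchFactory proxyFetchFactory;

    // register this fetcher with the factory once Spring has built the bean
    @PostConstruct
    public void register() {
        proxyFetchFactory.addService(this);
    }

    /** business name of the proxy site, e.g. "kuaidaili" */
    public abstract String getBusiness();

    /** fetch a batch of candidate proxies from the site */
    public abstract List<HttpProxy> fetchProxy() throws Exception;

    /** pick a random User-Agent; assumed to delegate to the UserAgentService shown later */
    protected String getRandomUserAgent() {
        return "Mozilla/5.0";
    }
}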
Each concrete fetcher's crawl logic is just a small crawler of its own: start from a seed URL and extract what the page offers. As an example, here is the page-parsing logic for Kuaidaili (the KuaidailiProxyFetcher class):
@Service("kuaidailiNewProxyService")
public class KuaidailiProxyFetcher extends AbstractProxyFetcher {
private static final Logger LOGGER = LoggerFactory.getLogger(KuaidailiProxyFetcher.class);
private static final String KUAIDAILI_FREE_PROXY_URL = "https://www.kuaidaili.com/proxylist/%d";
@Override
public String getBusiness() {
return "kuaidaili";
}
@Override
public List fetchProxy() {
Document doc = null;
List list = new ArrayList<>();
Random random = new Random();
for (int i = 1; i < 10; i++) {
String url = String.format(KUAIDAILI_FREE_PROXY_URL, i);
String refererUrl = i > 1 ? String.format(KUAIDAILI_FREE_PROXY_URL, i-1) : "";
try {
doc = jsoup.connect(url)
.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")
.header("Accept-Encoding", "gzip, deflate, br")
.header("Accept-Language", "zh-CN,zh;q=0.9")
.header("User-Agent", getRandomUserAgent())
.header("Host", "www.kuaidaili.com")
.header("Referer", refererUrl)
.timeout(30 * 1000)
.get();
Elements ipElms = doc.select("td[data-title=\"IP\"]");
Elements portElms = doc.select("td[data-title=\"PORT\"]");
Elements typeElms = doc.select("td[data-title=\"類型\"]");
if (ipElms != null) {
for (int j = 0, length = ipElms.size(); j < length; j++) {
String ip = ipElms.get(j).text();
String port = portElms.get(j).text();
if (StringUtil.isBlank(ip) || StringUtil.isBlank(port)) {
continue;
}
HttpProxy httpProxy = new HttpProxy(ip, Integer.parseInt(port));
String type = typeElms.get(j).text();
if (StringUtils.isNotBlank(type) && !type.contains(",")) {
httpProxy.setType(type.toLowerCase());
}
httpProxy.setProvider(getBusiness());
list.add(httpProxy);
}
}
Thread.sleep(random.nextInt(5) * 1000);
} catch (Exception e) {
LOGGER.error("fetch proxy meet error with url["+ url +"]: ", e);
}
}
LOGGER.info("fetch [" + getBusiness() + "] proxy list size=" + list.size());
return list;
}
}
The work happens in fetchProxy(), and the returned page is parsed with jsoup. For space reasons I won't go into jsoup's usage in detail here; readers unfamiliar with it can consult the jsoup documentation. The key is simply locating the elements in the DOM document and reading their values.
With this part in place, let's run a test:
@RunWith(SpringRunner.class)
@SpringBootTest
public class SpiderApplicationTests {
@Autowired
private ProxyFetchFactory proxyFetchFactory;
@Test
public void testProxyFetch() {
proxyFetchFactory.buildProxy();
}
}
Here is the result: the run took 12 minutes in total and collected 346 usable proxy IPs. Most of those 12 minutes went into validating the proxies, which is exactly why a cache or queue should hold the proxy data so later tasks don't block on it; otherwise every business crawl would have to wait until proxy fetching finished before it could run.
[Screenshot: proxy-fetch test output]
III. HTTP Request Wrapper
A crawler is a network-request business at heart, so it makes a great many HTTP calls, and it pays to wrap HttpClient for convenience. Parameters and methods are wrapped in a fluent, chainable style; here is the HttpRequest class for reference:
@Service("httpRequest")
public class HttpRequest {
private static Logger logger = LoggerFactory.getLogger(HttpRequest.class);
private boolean useProxy = true; //默認(rèn)請求時(shí)使用代理
private String url;
private Map headers;
private Map params;
private String content;
private ContentType contentType;
private ProxyFetchFactory proxyFetchFactory = SpringContext.getBean(ProxyFetchFactory.class);
public HttpRequest() {
}
public HttpRequest(String url) {
this.url = url;
}
public HttpRequest setUrl(String url) {
this.url = url;
return this;
}
public HttpRequest setHeaders(Map headers) {
this.headers = headers;
return this;
}
public HttpRequest setParams(Map params) {
this.params = params;
return this;
}
public HttpRequest setContent(String content, ContentType contentType) {
this.content = content;
this.contentType = contentType;
return this;
}
public HttpRequest setUseProxy(boolean useProxy) {
this.useProxy = useProxy;
return this;
}
public CloseableHttpClient getSSLHttpClient() throws Exception {
//設(shè)置代理IP、端口、協(xié)議
RequestConfig config = null;
if (useProxy && proxyFetchFactory != null){
List proxyHosts = proxyFetchFactory.getProxyHosts();
if (proxyHosts != null && !proxyHosts.isEmpty()) {
int index = new Random().nextInt(proxyHosts.size());
HttpHost host = proxyHosts.get(index);
logger.info("url: " + url + ", proxy: " + host.toHostString());
//請求配置,設(shè)置鏈接超時(shí)和讀取超時(shí)
config = RequestConfig.custom().setProxy(host).setConnectTimeout(30000).setSocketTimeout(30000).build();
}
}
if (config == null) {
config = RequestConfig.custom().setConnectTimeout(30000).setSocketTimeout(30000).build();
}
try {
SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
public boolean isTrusted(X509Certificate[] paramArrayOfX509Certificate, String paramString) throws CertificateException {
return true;
}
}).build();
SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext, SSLConnectionSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
return HttpClients.custom().setSSLSocketFactory(sslsf).setDefaultRequestConfig(config).build();
} catch (Exception e) {
throw e;
}
}
public HttpResult doGet() {
try {
//HttpClient client = HttpClients.createDefault();
HttpClient client = getSSLHttpClient();
//發(fā)送get請求
HttpGet request = new HttpGet(url);
if (headers != null && headers.size() > 0) {
for (Map.Entry header : headers.entrySet()) {
request.setHeader(header.getKey(), String.valueOf(header.getValue()));
}
}
if (params != null && params.size() > 0) {
StringBuilder sb = new StringBuilder();
if (!url.contains("?")) {
sb.append("?");
}
for (Map.Entry param : params.entrySet()) {
sb.append(param.getKey() + "=" + param.getValue() + "&");
}
sb.deleteCharAt(sb.lastIndexOf("&"));
url = url + sb.toString();
}
HttpResponse response = client.execute(request, new BasicHttpContext());
return new HttpResult(response.getStatusLine().getStatusCode(), EntityUtils.toString(response.getEntity()));
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public HttpResult doPost() {
try {
//HttpClient client = HttpClients.createDefault();
HttpClient client = getSSLHttpClient();
//發(fā)送get請求
HttpPost request = new HttpPost(url);
//設(shè)置url
request.setURI(new URI(url));
setParams(request);
HttpResponse response = client.execute(request);
return new HttpResult(response.getStatusLine().getStatusCode(), EntityUtils.toString(response.getEntity()));
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
public HttpResult doPut(){
//參看項(xiàng)目源碼
}
public HttpResult doDelete() {
//參看項(xiàng)目源碼
}
private void setParams(HttpEntityEnclosingRequestBase request) throws UnsupportedEncodingException {
if (headers != null && headers.size() > 0) {
for (Map.Entry header : headers.entrySet()) {
request.setHeader(header.getKey(), String.valueOf(header.getValue()));
}
}
List nvps = new ArrayList();
//設(shè)置參數(shù)
if (params != null && params.size() > 0) {
for (Iterator iter = params.keySet().iterator(); iter.hasNext(); ) {
String name = (String) iter.next();
String value = String.valueOf(params.get(name));
nvps.add(new BasicNameValuePair(name, value));
}
request.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
}
//設(shè)置內(nèi)容
if (content != null) {
contentType = contentType == null ? ContentType.APPLICATION_JSON : contentType;
request.setEntity(new StringEntity(content, contentType));
}
}
}
With setters defined for every parameter, calling it is as simple as chaining:
httpRequest.setUrl(url).setHeaders(headers).doGet();
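Here is a slightly fuller usage sketch; the URLs and values are illustrative only:
Map<String, Object> headers = new HashMap<>();
headers.put("User-Agent", "Mozilla/5.0");

// GET with query parameters
Map<String, Object> params = new HashMap<>();
params.put("pageNumber", 1);
HttpResult getResult = httpRequest.setUrl("https://example.com/list")
        .setHeaders(headers)
        .setParams(params)
        .doGet();

// POST with a JSON body
HttpResult postResult = httpRequest.setUrl("https://example.com/save")
        .setHeaders(headers)
        .setContent("{\"title\":\"hello\"}", ContentType.APPLICATION_JSON)
        .doPost();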
One more thing to note: requests should carry a User-Agent header. This is the simplest crawler disguise; without it you are flagged as a crawler or bot immediately. The simple approach used here is to collect some known UA strings into a TXT file, load them into a List at initialization, and pick a random entry for each request. I won't paste that code; see the UserAgentService class in the source.
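As a rough illustration of that idea (the file name useragent.txt and the method names are assumptions; commons-io's IOUtils does the reading, and the real class is UserAgentService in the source project):
@Service
public class UserAgentService {

    private final List<String> userAgentList = new ArrayList<>();
    private final Random random = new Random();

    // load the UA list from a classpath text file, one user agent per line
    @PostConstruct
    public void init() throws IOException {
        try (InputStream in = getClass().getResourceAsStream("/useragent.txt")) {
            if (in != null) {
                userAgentList.addAll(IOUtils.readLines(in, StandardCharsets.UTF_8));
            }
        }
    }

    public List<String> getUserAgentList() {
        return userAgentList;
    }

    // pick one UA at random for each request
    public String getRandomUserAgent() {
        return userAgentList.isEmpty() ? "Mozilla/5.0"
                : userAgentList.get(random.nextInt(userAgentList.size()));
    }
}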
IV. Business Crawling Logic
This part is the business side of the crawler, so it depends on what you are crawling. Sometimes we want a page's full content, such as news articles; sometimes we want specific parts of a page, such as product prices on JD or Taobao, in which case the page content has to be located and filtered. This article takes crawling CSDN's popular articles as the example. First, the relationships among the classes:
[Figure: class diagram of the business-crawling classes]
The class diagram shows 2 abstract classes, which exist to make the business easy to extend: to crawl another site later, say Zhihu's hot list, you can stay inside the current framework and just add 2 matching implementation classes that parse the Zhihu pages (a hypothetical sketch is given at the end of this section). Let's look at how these 2 classes are used.
1. The AbstractListProcessor class
The abstract list processor builds a batch of page-content (item) processors, usually one per page of the pagination.
public abstract class AbstractListProcessor {
    /**
     * The list of item processors to run, usually one per page
     * @return
     */
    public abstract List<AbstractItemProcessor> getItemProcessors();

    /**
     * Business name
     * @return
     */
    public abstract String getBusiness();
}
It declares 2 abstract methods for subclasses to implement; here is its implementation class:
@Service
public class CSDNListProcessor extends AbstractListProcessor {
    private static final String API_NEWS_URL = "https://www.csdn.net/api/articles?type=more&category=news&shown_offset=%s";
    private long offset = System.currentTimeMillis() * 1000; // initial offset: a 16-digit microsecond-style timestamp
    private int maxPageNumber = 10; // maximum number of pages

    @Override
    public List<AbstractItemProcessor> getItemProcessors() {
        List<AbstractItemProcessor> list = new ArrayList<>();
        for (int i = 0; i < maxPageNumber; i++) {
            ExecuteContent executeContent = new ExecuteContent();
            long currOffset = offset + i * 10; // the API returns 10 records per call, so the offset advances by 10
            executeContent.setUrl(String.format(API_NEWS_URL, "")); // repeated testing showed the offset can be left out of the URL; the API advances it automatically
            executeContent.setBusiness(getBusiness());
            Map<String, Object> params = new HashMap<>();
            params.put("offset", currOffset);
            executeContent.setParams(JSON.toJSONString(params));
            CSDNNewsProcessor csdnNewsProcessor = new CSDNNewsProcessor();
            csdnNewsProcessor.setExecuteContent(executeContent);
            list.add(csdnNewsProcessor);
        }
        return list;
    }

    @Override
    public String getBusiness() {
        return "CSDN";
    }
}
Notes on the code:
The implementation makes the intent fairly clear: getItemProcessors() builds the parameters the downstream item processors need, including the entry URL, the business name, and some parameters used later, and passes them along in an ExecuteContent object.
The loop here could instead live in the concrete item processor's hasMore() method, which comes up again below.
For readers with less background, in case it is unclear where the entry URL comes from: it is the product of the early analysis phase of the crawling workflow. While inspecting the CSDN home page, press F12 to open Chrome's DevTools; the Network tab shows the page's requests, and there you can see that this URL is the backend API that returns the article data:
[Screenshot: Chrome DevTools Network tab showing the CSDN article API request]
2. The AbstractItemProcessor class
The abstract page-content (item) processor defines the sequence of operations needed to crawl a page, including requesting, parsing, and storing. It is also a Runnable, so the task scheduler can execute it concurrently on multiple threads.
public abstract class AbstractItemProcessor implements Runnable {
    private static final Logger LOGGER = LoggerFactory.getLogger(AbstractItemProcessor.class);
    public List<String> userAgentList = new ArrayList<>();
    public SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    public ExecuteContent executeContent;

    public void setExecuteContent(ExecuteContent executeContent) {
        this.executeContent = executeContent;
    }

    /**
     * Request
     * @return
     */
    protected abstract HttpResult request();

    /**
     * Parse (the element type of the returned list depends on the business bean)
     */
    protected abstract List parse(HttpResult result);

    /**
     * Store
     */
    protected abstract void store(List list);

    /**
     * Whether there is another page to crawl
     */
    protected abstract boolean hasMore();

    /**
     * Entry point: ties the whole flow together
     */
    public void process() throws InterruptedException {
        UserAgentService userAgentService = SpringContext.getBean(UserAgentService.class);
        userAgentList = userAgentService.getUserAgentList();
        HttpResult result = null;
        List list = null;
        int retryTimes = 0;
        do {
            result = null;  // reset so each page issues a fresh request
            retryTimes = 0;
            while (result == null && retryTimes < 3) { // retry a failed request up to 3 times
                result = request();
                retryTimes++;
                if (result == null) {
                    Thread.sleep(5000); // back off before retrying
                }
            }
            list = parse(result);
            store(list);
        } while (hasMore());
    }

    @Override
    public void run() {
        LOGGER.info("executing processor: " + executeContent.toString());
        try {
            process();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
The steps that may differ by business (request, parse, store, and the pagination check) are abstracted out for the business classes to implement, with the request step wrapped in a retry-on-failure mechanism (3 attempts by default). Because the class implements Runnable, it and all its subclasses are thread classes, ready for multithreaded task execution later. Here is a business implementation, CSDNNewsProcessor:
@SpiderProcessor
@Service("csdnNewsProcessor")
public class CSDNNewsProcessor extends AbstractItemProcessor {
    private static Logger logger = LoggerFactory.getLogger(CSDNNewsProcessor.class);
    private ArticleCouldDBService articleCouldDBService = SpringContext.getBean(ArticleCouldDBService.class);
    private HttpRequest httpRequest = SpringContext.getBean(HttpRequest.class);

    @Override
    public HttpResult request() {
        Map<String, Object> headers = new HashMap<>();
        if (userAgentList != null && !userAgentList.isEmpty()) {
            int index = new Random().nextInt(userAgentList.size());
            headers.put("User-Agent", userAgentList.get(index)); // random User-Agent per request
        }
        HttpResult result = httpRequest.setUrl(executeContent.getUrl()).setHeaders(headers).doGet();
        return result;
    }

    @Override
    public List<Article> parse(HttpResult result) {
        List<Article> list = new ArrayList<>();
        if (result != null && StringUtils.isNotEmpty(result.getContent())) {
            try {
                JSONObject root = JSON.parseObject(result.getContent());
                JSONArray articles = root.getJSONArray("articles");
                if (articles != null && !articles.isEmpty()) {
                    for (int i = 0, length = articles.size(); i < length; i++) {
                        JSONObject item = articles.getJSONObject(i);
                        Article article = new Article();
                        article.setSourceName(executeContent.getBusiness());
                        article.setTitle(item.getString("title"));
                        article.setArticleId(item.getString("id"));
                        article.setAuthor(item.getString("nickname"));
                        article.setUrl(item.getString("url"));
                        String category = item.getString("category");
                        if (StringUtils.isEmpty(category)) {
                            category = item.getString("tag");
                        }
                        article.setCategory(category);
                        article.setDescription(item.getString("desc"));
                        article.setImgUrl(item.getString("avatar"));
                        Date now = new Date();
                        article.setPublishTime(sdf.format(now));
                        if (StringUtils.isNotBlank(article.getTitle()) && StringUtils.isNotBlank(article.getUrl())) {
                            list.add(article);
                        }
                    }
                }
            } catch (Exception e) {
                logger.error("parse json to bean meet error:", e);
            }
        }
        return list;
    }

    @Override
    public void store(List<Article> list) {
        if (list != null && !list.isEmpty()) {
            for (Article article : list) {
                articleCouldDBService.insert(article);
            }
        }
    }

    @Override
    protected boolean hasMore() {
        // Alternative pagination check, for when the list processor does not pre-build the offsets:
        // long currMs = System.currentTimeMillis();
        // Map paramsMap = (Map) JSON.parse(executeContent.getParams());
        // long offset = Long.parseLong(String.valueOf(paramsMap.get("offset")).substring(0, Long.toString(currMs).length())); // compare on the millisecond part of the timestamp
        // return offset < currMs;
        return false;
    }
}
Notes on the code:
The item processor's class is tagged with a custom annotation, @SpiderProcessor, which the task scheduler relies on later.
request() is the request step; the only thing to watch is adding a random User-Agent to the headers.
parse() is the page-parsing step, applied to the request's response. The CSDN example above happens to return JSON, so FastJson handles it directly (for HTML text, use jsoup instead, as in the proxy module earlier); it is plain JSON parsing, so nothing more needs saying.
store() is the storage step; here it writes to the cloud database described in the next section.
hasMore() decides whether the crawl is finished, i.e., whether another page exists. Since the list processor already advances the offset, returning false is enough here; if the list processor did not handle that, the commented-out code in hasMore() shows how to keep looping through the pages instead.
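As promised above, here is a hypothetical sketch of extending the framework to a new site. The class names, URL, and paging scheme are illustrative assumptions only, and a matching ZhihuHotProcessor subclass of AbstractItemProcessor would implement request/parse/store/hasMore for the site:
@Service
public class ZhihuListProcessor extends AbstractListProcessor {

    // assumed endpoint for illustration; analyze the real page to find the actual API
    private static final String HOT_LIST_URL =
            "https://www.zhihu.com/api/v3/feed/topstory/hot-lists/total?limit=%d&offset=%d";
    private int maxPageNumber = 5;

    @Override
    public List<AbstractItemProcessor> getItemProcessors() {
        List<AbstractItemProcessor> list = new ArrayList<>();
        for (int i = 0; i < maxPageNumber; i++) {
            ExecuteContent executeContent = new ExecuteContent();
            executeContent.setUrl(String.format(HOT_LIST_URL, 10, i * 10)); // 10 records per page
            executeContent.setBusiness(getBusiness());
            // hypothetical item processor implementing request/parse/store/hasMore for Zhihu
            ZhihuHotProcessor processor = new ZhihuHotProcessor();
            processor.setExecuteContent(executeContent);
            list.add(processor);
        }
        return list;
    }

    @Override
    public String getBusiness() {
        return "zhihu";
    }
}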
V. Cloud Database
For storage you could of course use a local MySQL database, or even plain files, but here I want to introduce cloud databases: as the name suggests, the data lives on a third party's cloud servers. For individual developers building an app or mini-program out of interest or for learning, the backend is the biggest headache; writing your own backend service costs too much in servers and operations. A cloud database solves that, and I recommend Bmob's (official site). I have used it since 2017 to hold the backend data of my apps and mini-programs, and it still offers a free tier. The steps to use it, briefly:
1. Register an account on the Bmob site.
2. Create an application; choose the development edition, and "other" as the application type:
[Screenshot: creating a Bmob application]
3. Open the application's secret-key settings page, where you will find the AppId and several keys the code needs later:
[Screenshot: Bmob application secret-key page]
4. Create the table schema; note that the fields created in the Bmob console must match the properties of the table bean in code;
[Screenshot: creating the table schema in the Bmob console]
5. With the preparation done, the coding can start, following the API docs (docs link) to wrap the calls; the class diagram:
[Figure: class diagram of the cloud-database service classes]
CouldDBService is an interface defining the methods to implement.
AbstractCouldDBService is an abstract class implementing the shared logic.
ArticleCouldDBService is the concrete class for one specific table. The main logic lives in AbstractCouldDBService:
public abstract class AbstractCouldDBService<T> implements CouldDBService<T> {
    private static Logger logger = LoggerFactory.getLogger(AbstractCouldDBService.class);
    private static final String APP_ID = "68ab2056a549e0f640dfc1e801bf6915"; // the Application ID from the app's secret-key page
    private static final String API_KEY = "f8eee065e9e011ecd7e98e4256373629"; // the REST API Key from the app's secret-key page
    private static final String API_HOST_BASE = "https://api2.bmob.cn/1/";
    private static final String API_TABLE_URL = API_HOST_BASE + "classes/%s";
    private HttpRequest httpRequest = SpringContext.getBean(HttpRequest.class);
    private static Map<String, Object> headers;

    static {
        headers = new HashMap<>();
        headers.put("X-Bmob-Application-Id", APP_ID);
        headers.put("X-Bmob-REST-API-Key", API_KEY);
        headers.put("Content-Type", "application/json");
    }

    public abstract Class<T> setClass();

    public abstract String setTable();

    @Override
    public List<T> query(String whereAs) {
        String url = String.format(API_TABLE_URL, setTable());
        if (whereAs != null && !whereAs.isEmpty()) {
            try {
                whereAs = URLEncoder.encode(whereAs, "utf-8");
            } catch (UnsupportedEncodingException e) {
                logger.error("encode params[" + whereAs + "] meet error: ", e);
            }
            url += "?where=" + whereAs;
        }
        logger.info("query url: " + url);
        HttpResult response = httpRequest.setUrl(url).setHeaders(headers).setUseProxy(false).doGet();
        if (response == null) {
            throw new RuntimeException("request error: response is null");
        }
        if (200 == response.getCode()) {
            try {
                String res = response.getContent();
                JSONArray dataArray = JSON.parseObject(res).getJSONArray("results");
                if (dataArray != null && !dataArray.isEmpty()) {
                    List<T> list = new ArrayList<>();
                    for (int i = 0, length = dataArray.size(); i < length; i++) {
                        JSONObject data = dataArray.getJSONObject(i);
                        list.add(data.toJavaObject(setClass()));
                    }
                    logger.info("query success and result size=" + list.size());
                    return list;
                }
            } catch (Exception e) {
                logger.error("parse response meet error: ", e);
            }
        } else {
            logger.error("request error: response=" + response.toString());
        }
        return null;
    }

    @Override
    public boolean insert(T bean) {
        String url = String.format(API_TABLE_URL, setTable());
        if (bean == null) {
            logger.error("insert failed: bean is null");
            return false;
        }
        Map beanMap = bean.toMap(); // T is assumed to declare toMap(); see the bean classes in the source project
        HttpResult response = httpRequest.setUrl(url).setUseProxy(false).setHeaders(headers)
                .setContent(JSON.toJSONString(beanMap), ContentType.APPLICATION_JSON).doPost();
        if (response == null) {
            throw new RuntimeException("request error: response is null");
        }
        if (201 == response.getCode()) { // 201 Created signals a successful insert
            logger.info("success to insert bean [" + JSON.toJSONString(beanMap) + "]");
            return true;
        } else {
            logger.error("failed to insert bean [" + JSON.toJSONString(beanMap) + "], msg:" + response.getContent());
            return false;
        }
    }

    @Override
    public void insertBatch(List<T> list) {
        String url = API_HOST_BASE + "batch";
        if (list == null || list.isEmpty()) {
            return;
        }
        JSONArray reqArray = new JSONArray();
        for (T t : list) {
            JSONObject item = new JSONObject();
            item.put("method", "POST");
            item.put("path", "1/classes/" + setTable());
            item.put("body", JSON.toJSONString(t.toMap()));
            reqArray.add(item);
        }
        JSONObject contentJson = new JSONObject();
        contentJson.put("requests", reqArray);
        HttpResult response = httpRequest.setUrl(url).setUseProxy(false).setHeaders(headers)
                .setContent(contentJson.toJSONString(), ContentType.APPLICATION_JSON).doPost();
        if (response == null) {
            throw new RuntimeException("request error: response is null");
        }
        System.out.println(response.toString());
    }

    @Override
    public void update() {
    }

    @Override
    public void replace() {
    }

    @Override
    public void replaceBatch() {
    }

    @Override
    public boolean delete(String objectId) {
        if (objectId == null || "".equals(objectId)) {
            return false;
        }
        String url = String.format(API_TABLE_URL, setTable()) + "/" + objectId;
        HttpResult response = httpRequest.setUrl(url).setUseProxy(false).setHeaders(headers).doDelete();
        if (response == null) {
            throw new RuntimeException("request error: response is null");
        }
        JSONObject retJson = JSON.parseObject(response.getContent());
        boolean isSuccess = "ok".equalsIgnoreCase(retJson.getString("msg"));
        if (isSuccess) {
            logger.info("success to delete row [objectId=" + objectId + "]");
        } else {
            logger.error("failed to delete row [objectId=" + objectId + "], msg:" + response.getContent());
        }
        return isSuccess;
    }
}
Notes on the code:
Remember to replace the APPID and key above with the ones from your own Bmob application.
The abstract cloud-database service class is generic, and the benefit of that design is that it supports any number of tables.
The API docs only expose RESTful endpoints, so the insert/delete/query methods here are wrapped over HTTP requests.
Here is the implementation class for one concrete table:
@Service("articleCouldDBService")
public class ArticleCouldDBService extends AbstractCouldDBService {
@Override
public Class setClass() {
return Article.class;
}
@Override
public String setTable() {
return "t_articles";
}
}
The table implementation class is tiny: it only names the table and the bean type. The payoff of this design shows as soon as there are several tables, since no existing code changes; you just add another table implementation class, as in the sketch below.
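Hypothetical illustration: to persist another bean (say, a Comment class mapped to a t_comments table; both names are made up here), one more small subclass is all it takes:
@Service("commentCouldDBService")
public class CommentCouldDBService extends AbstractCouldDBService<Comment> {
    @Override
    public Class<Comment> setClass() {
        return Comment.class;
    }

    @Override
    public String setTable() {
        return "t_comments";
    }
}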
Now a test method for the ArticleCouldDBService class:
@Test
public void testArticleCouldDBService() {
    List<Article> articles = new ArrayList<>();
    Article article = new Article();
    article.setArticleId("test12345678");
    article.setTitle("insert test");
    article.setSourceName("測試數(shù)據(jù)");
    articles.add(article);
    articleCouldDBService.insert(article);
    articles.clear();
    articles = articleCouldDBService.query("{\"articleId\":\"test12345678\"}");
    System.out.println(articles);
}
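The test and the processors above rely on an Article bean that this article never shows. Here is a minimal sketch of what it might look like, assuming plain properties (taken from the field names visible in the crawl logs further below) plus the toMap() helper the cloud-database layer calls; the real class lives in the source project:
public class Article {
    private String articleId;
    private String title;
    private String author;
    private String url;
    private String category;
    private String description;
    private String imgUrl;
    private String sourceName;
    private String publishTime;
    private int status;

    // getters and setters omitted for brevity ...

    // flatten the bean into a map for the Bmob REST payload
    public Map<String, Object> toMap() {
        Map<String, Object> map = new HashMap<>();
        map.put("articleId", articleId);
        map.put("title", title);
        map.put("author", author);
        map.put("url", url);
        map.put("category", category);
        map.put("description", description);
        map.put("imgUrl", imgUrl);
        map.put("sourceName", sourceName);
        map.put("publishTime", publishTime);
        map.put("status", status);
        return map;
    }
}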
The printed logs show the insert and the query both succeeding, and the record also shows up in the table in the Bmob console.
VI. Task Scheduling
A multi-layer crawler task architecture should look at least like the diagram below (interested readers can try building a task module or microservice along these lines, which makes the architecture more complete and cleaner). Given the implementation complexity, though, this Chat skips task layering and a separate task store, keeps things simple, and follows the project architecture diagram shown earlier, treating task scheduling minimally.
[Figure: a fuller multi-layer crawler task architecture]
Setting that aside, we return to the original architecture diagram (flip back if you need a refresher). We need a task scheduler that can launch tasks by business and on a timer, which takes some Spring knowledge. The code first:
@Service("asyncProcessTask")
public class AsyncProcessTask {
private static final Logger LOGGER = LoggerFactory.getLogger(AsyncProcessTask.class);
private static SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
private ApplicationContext context = SpringContext.getApplicationContext();
/**
* 初始化代理爬取,觸發(fā)代理爬取
*/
public void initProxy() {
Map serviceBeanMap = context.getBeansOfType(ProxyFetchFactory.class);
if (serviceBeanMap != null && !serviceBeanMap.isEmpty()) {
for (Object serviceBean : serviceBeanMap.values()) {
ProxyFetchFactory service = (ProxyFetchFactory) serviceBean;
service.buildProxy();
}
}
}
/**
* 按定時(shí)調(diào)度觸發(fā)所有業(yè)務(wù)爬取任務(wù)
*/
@Scheduled(cron = "0 0 0 1 * ?") //每天凌晨0點(diǎn)執(zhí)行一次
public void scheduleAllProcessors() {
LOGGER.info("schedule task start with " + sdf.format(new Date()));
startAllProcessors();
LOGGER.info("schedule task end with " + sdf.format(new Date()));
}
/**
* 觸發(fā)所有業(yè)務(wù)爬取任務(wù)
*/
public void startAllProcessors() {
ExecutorService es = Executors.newFixedThreadPool(10);
List processors = getAllListProcessors();
if (processors == null || processors.isEmpty()) {
return;
}
for (AbstractListProcessor listProcessor : processors) {
for (AbstractItemProcessor itemProcessor : listProcessor.getItemProcessors()) {
es.execute(itemProcessor);
}
}
try {
es.awaitTermination(60, TimeUnit.SECONDS);//最大等待60s
} catch (InterruptedException e) {
} finally {
es.shutdown();
}
}
/**
* 按業(yè)務(wù)名稱觸發(fā)爬取任務(wù)
*
* @param business
*/
public void startProcessorsByBusiness(String business) {
List processors = getAllListProcessors();
if (processors == null || processors.isEmpty()) {
return;
}
ExecutorService es = Executors.newFixedThreadPool(10);
for (AbstractListProcessor listProcessor : processors) {
if (business.equalsIgnoreCase(listProcessor.getBusiness())) {
for (AbstractItemProcessor itemProcessor : listProcessor.getItemProcessors()) {
es.execute(itemProcessor);
}
break;
}
}
try {
es.awaitTermination(60, TimeUnit.SECONDS);//最大等待60s
} catch (InterruptedException e) {
} finally {
es.shutdown();
}
}
private List getAllListProcessors() {
List processors = new ArrayList<>();
Map serviceBeanMap = context.getBeansOfType(AbstractListProcessor.class);
if (serviceBeanMap != null && !serviceBeanMap.isEmpty()) {
for (Object serviceBean : serviceBeanMap.values()) {
AbstractListProcessor processor = (AbstractListProcessor) serviceBean;
processors.add(processor);
}
}
LOGGER.info("fetch list processor size=" + processors.size());
return processors;
}
private List getAllItemProcessors() {
List processors = new ArrayList<>();
Map serviceBeanMap = context.getBeansWithAnnotation(SpiderProcessor.class);
if (serviceBeanMap != null && !serviceBeanMap.isEmpty()) {
for (Object serviceBean : serviceBeanMap.values()) {
AbstractItemProcessor processor = (AbstractItemProcessor) serviceBean;
processors.add(processor);
}
}
LOGGER.info("fetch processor size=" + processors.size());
return processors;
}
}
The main methods are annotated in the code: there are methods for the proxy-fetch task and for the business crawl tasks. Recall that the business item processors are thread classes, so scheduling simply drops the relevant item processors into a thread pool, which is what gives us concurrent crawling.
The full end-to-end invocation goes in the Spring Boot Application class. As a rule, the proxy-fetch task must run before the business crawl tasks, or the business crawls would have no proxies to use. Here is the Application class:
@ComponentScan(basePackages = {"com.alany.spider"}) // scan every Spring component under this package
@SpringBootApplication
public class SpiderApplication {
    public static void main(String[] args) throws InterruptedException {
        SpringApplication.run(SpiderApplication.class, args);
        AsyncProcessTask asyncProcessTask = SpringContext.getBean(AsyncProcessTask.class);
        asyncProcessTask.initProxy();
        Thread.sleep(1000 * 60); // wait 60s so the proxy pool has time to fill
        asyncProcessTask.startAllProcessors();
    }
}
Notes on the code:
Because Spring Boot's runtime context lives inside the embedded Tomcat web container, we also need a small ApplicationContext helper class to fetch components out of the Spring container:
@Component
public class SpringContext implements ApplicationContextAware {
    private static ApplicationContext context;

    @Override
    public void setApplicationContext(ApplicationContext applicationContext) throws BeansException {
        if (context == null) {
            context = applicationContext;
        }
        //startAllItemProcessors();
    }

    // get the ApplicationContext
    public static ApplicationContext getApplicationContext() {
        return context;
    }

    // get a bean by name
    public static Object getBean(String name) {
        return getApplicationContext().getBean(name);
    }

    // get a bean by class
    public static <T> T getBean(Class<T> clazz) {
        return getApplicationContext().getBean(clazz);
    }

    // get a bean by both name and class
    public static <T> T getBean(String name, Class<T> clazz) {
        return getApplicationContext().getBean(name, clazz);
    }
}
Run SpiderApplication's main method and the crawler happily gets going, successfully writing into the cloud database. Here is a short chunk of successful crawl log as proof that the framework really does run:
2019-08-01 16:56:32.786 INFO 9020 --- [pool-1-thread-3] c.a.spider.core.proxy.ProxyFetchFactory : github - 185.6.138.28 : false
2019-08-01 16:56:32.925 INFO 9020 --- [pool-1-thread-8] c.a.spider.core.proxy.ProxyFetchFactory : github - 78.186.237.112 : false
2019-08-01 16:56:32.944 INFO 9020 --- [pool-1-thread-2] c.a.spider.core.proxy.ProxyFetchFactory : github - 78.186.237.112 : false
2019-08-01 16:56:32.947 INFO 9020 --- [ main] c.a.spider.core.proxy.ProxyFetchFactory : valid host total size: 299
2019-08-01 16:57:02.951 INFO 9020 --- [ main] c.a.spider.core.task.AsyncProcessTask : fetch list processor size=1
2019-08-01 16:57:02.960 INFO 9020 --- [l-3809-thread-2] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524010}', httpResult=null}
2019-08-01 16:57:02.960 INFO 9020 --- [l-3809-thread-3] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524020}', httpResult=null}
2019-08-01 16:57:02.961 INFO 9020 --- [l-3809-thread-4] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524030}', httpResult=null}
2019-08-01 16:57:02.962 INFO 9020 --- [l-3809-thread-5] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524040}', httpResult=null}
2019-08-01 16:57:02.962 INFO 9020 --- [l-3809-thread-6] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524050}', httpResult=null}
2019-08-01 16:57:02.960 INFO 9020 --- [l-3809-thread-1] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524000}', httpResult=null}
2019-08-01 16:57:02.964 INFO 9020 --- [l-3809-thread-9] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524080}', httpResult=null}
2019-08-01 16:57:02.964 INFO 9020 --- [l-3809-thread-7] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524060}', httpResult=null}
2019-08-01 16:57:02.965 INFO 9020 --- [-3809-thread-10] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524090}', httpResult=null}
2019-08-01 16:57:02.966 INFO 9020 --- [l-3809-thread-3] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 51.158.99.51:8811
2019-08-01 16:57:02.966 INFO 9020 --- [l-3809-thread-5] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 39.137.107.98:8080
2019-08-01 16:57:02.966 INFO 9020 --- [l-3809-thread-9] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 49.48.144.160:8080
2019-08-01 16:57:02.966 INFO 9020 --- [l-3809-thread-8] c.a.s.c.process.AbstractItemProcessor : executing processor: ExecuteContent{start=0, end=0, business='CSDN', url='https://www.csdn.net/api/articles?type=more&category=news&shown_offset=', params='{"offset":1564649090524070}', httpResult=null}
2019-08-01 16:57:02.967 INFO 9020 --- [l-3809-thread-8] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 178.128.229.122:8080
2019-08-01 16:57:02.968 INFO 9020 --- [l-3809-thread-7] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 92.49.188.250:8080
2019-08-01 16:57:02.969 INFO 9020 --- [-3809-thread-10] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 36.90.158.178:8080
2019-08-01 16:57:02.970 INFO 9020 --- [l-3809-thread-2] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 159.138.22.112:80
2019-08-01 16:57:02.972 INFO 9020 --- [l-3809-thread-6] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 110.74.199.125:35604
2019-08-01 16:57:02.972 INFO 9020 --- [l-3809-thread-1] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 13.115.109.245:8080
2019-08-01 16:57:02.973 INFO 9020 --- [l-3809-thread-4] com.alany.spider.core.http.HttpRequest : url: https://www.csdn.net/api/articles?type=more&category=news&shown_offset=, proxy: 185.80.130.17:80
2019-08-01 16:57:10.161 INFO 9020 --- [l-3809-thread-1] c.a.s.c.dao.impl.AbstractCouldDBService : success to insert bean [{"imgUrl":"https://profile.csdnimg.cn/B/0/7/1_l659292998","publishTime":"2019-08-01 16:57:09","author":"恒指交易指導(dǎo)","articleId":"95589302","description":"恒指7.12號開盤前瞻:美股再創(chuàng)新高,恒指能否踏上新臺階。\n美國減息預(yù)期持續(xù)升溫,推動道指和標(biāo)普500再創(chuàng)新高。周四(7月11日)美國三大股指漲跌不一,道指收報(bào)27088點(diǎn),升227點(diǎn)或0.85%,歷史首次突破27000點(diǎn)大關(guān),創(chuàng)收市新高;標(biāo)指收報(bào)2999點(diǎn),升6點(diǎn)或0.23%,同創(chuàng)收市新高;以科技股為主的納指微跌6點(diǎn)或0.08%,收報(bào)8196點(diǎn)。\n近日,美聯(lián)儲主席鮑威爾一展“鴿”喉,7月降息之門...","sourceName":"CSDN","title":"進(jìn)退博弈:恒指7.12號早盤資訊及交易計(jì)劃","category":"恒指,恒指早盤,進(jìn)退博弈","url":"https://blog.csdn.net/L659292998/article/details/95589302","status":0}]
2019-08-01 16:57:10.381 INFO 9020 --- [l-3809-thread-1] c.a.s.c.dao.impl.AbstractCouldDBService : success to insert bean [{"imgUrl":"https://profile.csdnimg.cn/F/1/1/1_win_le","publishTime":"2019-08-01 16:57:09","author":"吾昂王","articleId":"94658782","description":"linux目錄結(jié)構(gòu):\n類似倒掛的樹. \n/ : 根目錄,所有數(shù)據(jù)都在此目錄下(Windows中,通常說文件夾,文件. LINUX中通常為 目錄 和 文件 )\nlinux中 一切 皆文件!!!!\n例 : /dev : 存放系統(tǒng)設(shè)備相關(guān)的數(shù)據(jù)\n\nlinux磁盤表示:\nhd,表示IDE設(shè)備\nsd:表示SCSI設(shè)備\n\n命令行提示符:\n[當(dāng)前用戶名@當(dāng)前主機(jī)名 當(dāng)前所在位置] # $ (#為此用戶為超級...","sourceName":"CSDN","title":"Linux基礎(chǔ)中的基礎(chǔ) 一","url":"https://blog.csdn.net/Win_Le/article/details/94658782","status":0}]
2019-08-01 16:57:10.590 INFO 9020 --- [l-3809-thread-1] c.a.s.c.dao.impl.AbstractCouldDBService : success to insert bean [{"imgUrl":"https://profile.csdnimg.cn/0/E/1/1_qq_42391904","publishTime":"2019-08-01 16:57:09","author":"Maker星蔚","articleId":"95529565","description":"Android四大組件分別為:1.活動Activity,2.服務(wù)Service,3.廣播Broadcast Receiver,4.內(nèi)容提供器Content Provider\n1.Service概念與要點(diǎn)\n1.定義:服務(wù)(Service)是Android中使程序在后臺運(yùn)行的方法。\n(1) 服務(wù)適合去執(zhí)行不需要用戶交互卻要長期執(zhí)行的任務(wù),例如一個(gè)音樂app可在當(dāng)程序在后臺時(shí)仍然播放音樂,用戶使用其他程...","sourceName":"CSDN","title":"Android基礎(chǔ)-四大組件之Service(基礎(chǔ))","category":"安卓開發(fā),Android四大組件,Android入門","url":"https://blog.csdn.net/qq_42391904/article/details/95529565","status":0}]

The architecture diagram also mentioned triggering tasks through a RESTful interface, so let's implement the corresponding controller too:
@RestController
@RequestMapping("/exec")
public class ProcessExecController {
private AsyncProcessTask asyncProcessTask = SpringContext.getBean(AsyncProcessTask.class);
@RequestMapping(value = "/all")
public void execAll(){
asyncProcessTask.startAllProcessors();
}
@RequestMapping(value = "/biz")
public void execByBiz(String business){
asyncProcessTask.startProcessorsByBusiness(business);
}
}
The code is trivial, just a thin call into AsyncProcessTask; once the project is running, opening the URL http://localhost:8080/exec/all in a browser also triggers all the business crawl tasks.
With that, the whole crawler framework is in place. It covers a lot of ground: Spring, Spring Boot, Java concurrency, HTTP, JSON parsing, jsoup page parsing, design patterns, and more, so do shore up whichever topics feel unfamiliar. The article also works best when you code along as you read. Finally, I hope everyone takes something useful away from it.
