Big-tech engineering | Senior front end | Advanced Node

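Preface

Today we will build a crawler tool in TypeScript. Which site should it target? After searching around and weighing a few candidates, we settled on this one:

https://www.hanju.run/

It is a video site, and our main goal is to scrape the playback links of its videos. Let's start with the first step.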

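Step 1

As the saying goes, the first step is always the hardest. For this project it is quite the opposite. You need to do the following:

Create a project folder.
Run the following command to initialize the project:
```
npm init -y
```
Install typescript locally:
```
npm install typescript -D
```
Then generate the TS configuration file (npx is needed because tsc was installed locally rather than globally):
```
npx tsc --init
```
Install ts-node locally so that TypeScript files can be run directly from the command line:
```
npm install -D ts-node
```
Create a src folder inside the project folder.

Then create a crawler.ts file inside src.

Add a shortcut start command to package.json:
```
"scripts": {
    "dev-t": "ts-node ./src/crawler.ts"
  }
```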

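Step 2

Next comes the hands-on part. The crawler.ts file created above is where all the work happens.

First, these are the dependencies we need to import:
```
import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";
```
So we install them as follows.

superagent fetches the HTML content of a remote URL.
```
npm install superagent
```
cheerio lets us query nodes in the page using jQuery-style syntax.
```
npm install cheerio
```
The remaining two, fs and path, are Node built-in modules and can be imported directly.

With the dependencies installed, you will notice red error underlines on the imports. The reason is that superagent and cheerio are written in JS rather than TS, while our environment is TS. We therefore need a translation layer, known as type definition files (with the .d.ts suffix). They can be installed with the following commands.
```
npm install -D @types/superagent
```
```
npm install -D @types/cheerio
```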

Now let's go through the source code carefully.

With both dependencies installed, we create a Crawler class and instantiate it.
```
import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";

class Crawler {
  constructor() {}
}

const crawler = new Crawler();
```

We decide on the URL to crawl and assign it to a private property. Then we wrap the request in a getRawHtml method that fetches the content of that URL.

getRawHtml uses the async/await keywords to fetch the page content asynchronously and return it.
```
import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";

class Crawler {
  private url = "https://www.hanju.run/play/39221-4-0.html";

  // Fetch the page and return its HTML as a string
  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  // Entry point of the crawl; later steps are added here
  async initSpiderProcess() {
    const html = await this.getRawHtml();
  }

  constructor() {
    this.initSpiderProcess();
  }
}

const crawler = new Crawler();
```
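The constructor above calls an async method without awaiting it, so a failed request would surface as an unhandled promise rejection. The following is a minimal sketch of one way to handle that, not part of the original project. The `Fetcher` type, `getRawHtmlSafe`, and the two fake fetchers are hypothetical names introduced only for illustration; `Fetcher` stands in for `superagent.get(url)` followed by `.text` so the sketch runs without network access.

```typescript
// Hypothetical stand-in for `superagent.get(url)` plus `.text`;
// it is injected so the sketch runs without touching the network.
type Fetcher = (url: string) => Promise<string>;

// Resolves to the page HTML, or to null when the request fails.
async function getRawHtmlSafe(url: string, fetcher: Fetcher): Promise<string | null> {
  try {
    return await fetcher(url);
  } catch (err) {
    console.error(`Request to ${url} failed:`, err);
    return null;
  }
}

// Two fake fetchers used only for demonstration.
const okFetcher: Fetcher = async () => "<title>demo</title>";
const failingFetcher: Fetcher = async () => {
  throw new Error("network down");
};

// Runs one successful and one failing request through the wrapper.
async function demo(): Promise<[string | null, string | null]> {
  const ok = await getRawHtmlSafe("https://example.com/ok", okFetcher);
  const failed = await getRawHtmlSafe("https://example.com/fail", failingFetcher);
  return [ok, failed];
}
```

With a wrapper like this, the calling code can check for null and skip the parsing step instead of crashing.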

Use cheerio's built-in methods to get the content of the relevant nodes.

getRawHtml fetches the page content asynchronously, and we pass the result, a string, to the getJsonInfo method. Calling cheerio.load(html) lets us select nodes with jQuery-style syntax. We extract the video's title and link from the page and add them to an object as key-value pairs. Note: we define an interface here to describe the type of those key-value pairs.
```
import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";

interface Info {
  name: string;
  url: string;
}

class Crawler {
  private url = "https://www.hanju.run/play/39221-4-0.html";

  getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    // The play URL lives in the first inline script under .play
    const scpt: string = String($(".play>script:nth-child(1)").html());
    // Take the 4th statement, pull out the quoted argument, then decode it
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    // Tag the result with the time of the crawl
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const info = this.getJsonInfo(html);
  }

  constructor() {
    this.initSpiderProcess();
  }
}

const crawler = new Crawler();
```
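The chain of split calls in getJsonInfo relies on the URL being the quoted argument of a function call in the fourth statement of the script, and it throws if the page layout changes. As a sketch of a more defensive alternative, the function below uses a regular expression and returns null when nothing matches. The sample script string is an assumed shape, not copied from the real page, and `extractPlayUrl` is a hypothetical helper. It also swaps `unescape` for `decodeURIComponent`, which behaves the same for ordinary percent-encoded URLs but does not understand the legacy `%uXXXX` form.

```typescript
// Assumed shape of the inline script; the real page may differ.
const sampleScript =
  'var a="1";var b="2";var c="3";var now=unescape("https%3A%2F%2Fexample.com%2Findex.m3u8");';

// Finds the first function call whose only argument is a quoted string
// and returns that string decoded, or null when no such call exists.
function extractPlayUrl(script: string): string | null {
  const match = /\(\s*["']([^"']+)["']\s*\)/.exec(script);
  return match ? decodeURIComponent(match[1]) : null;
}

const found = extractPlayUrl(sampleScript);
const missing = extractPlayUrl("var nothing = 1;");

console.log(found);
console.log(missing);
```

Returning null instead of throwing lets the crawler log the page and move on when a script does not have the expected form.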

First create a data folder in the project root. We then store the scraped content in url.json inside that folder (the file is generated automatically).
We wrap this in a getJsonContent method, using path.resolve to build the file path, fs.readFileSync to read the file, and fs.writeFileSync to write to it. Note: we define two more interfaces, objJson and InfoResult.
```
import superagent from "superagent";
import cheerio from "cheerio";
import fs from "fs";
import path from "path";

// Shape of url.json: crawl timestamp -> scraped entries
interface objJson {
  [propName: number]: Info[];
}

interface Info {
  name: string;
  url: string;
}

interface InfoResult {
  time: number;
  data: Info[];
}

class Crawler {
  private url = "https://www.hanju.run/play/39221-4-0.html";

  getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    const scpt: string = String($(".play>script:nth-child(1)").html());
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  // Merge the new result into url.json, keyed by crawl time
  getJsonContent(info: InfoResult) {
    const filePath = path.resolve(__dirname, "../data/url.json");
    let fileContent: objJson = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));
    }
    fileContent[info.time] = info.data;
    fs.writeFileSync(filePath, JSON.stringify(fileContent));
  }

  async initSpiderProcess() {
    const html = await this.getRawHtml();
    const info = this.getJsonInfo(html);
    this.getJsonContent(info);
  }

  constructor() {
    this.initSpiderProcess();
  }
}

const crawler = new Crawler();
```
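The read, merge, and write steps in getJsonContent can also be tried in isolation. The sketch below is an illustration under assumptions rather than the project's code: `appendResult` and `UrlStore` are hypothetical names, the data is made up, and the file is written to a throwaway temp directory. It also creates the parent folder when it is missing, which the original code leaves to the reader.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

interface Info {
  name: string;
  url: string;
}

// Crawl timestamp (as a string key) -> entries scraped in that run.
type UrlStore = Record<string, Info[]>;

// Merges one crawl result into the JSON file, creating the folder if needed.
function appendResult(filePath: string, time: number, data: Info[]): UrlStore {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const store: UrlStore = fs.existsSync(filePath)
    ? JSON.parse(fs.readFileSync(filePath, "utf-8"))
    : {};
  store[String(time)] = data;
  fs.writeFileSync(filePath, JSON.stringify(store, null, 2));
  return store;
}

// Demonstration in a temp directory with made-up data.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "crawler-demo-"));
const demoFile = path.join(demoDir, "data", "url.json");
appendResult(demoFile, 1, [{ name: "first", url: "https://example.com/1.m3u8" }]);
const merged = appendResult(demoFile, 2, [{ name: "second", url: "https://example.com/2.m3u8" }]);

console.log(Object.keys(merged));
```

Because every run reads the existing file first, earlier results are kept and the new one is added under its own timestamp.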

Run the command
```
npm run dev-t
```

Check the generated file
```
{
  "1610738046569": [
    {
      "name": "《復仇者聯盟4：終局之戰》HD1080P中字m3u8在線觀看-韓劇網",
      "url": "https://wuxian.xueyou-kuyun.com/20190728/16820_302c7858/index.m3u8"
    }
  ],
  "1610738872042": [
    {
      "name": "《鋼鐵俠2》HD高清m3u8在線觀看-韓劇網",
      "url": "https://www.yxlmbbs.com:65/20190920/54uIR9hI/index.m3u8"
    }
  ],
  "1610739069969": [
    {
      "name": "《鋼鐵俠2》中英特效m3u8在線觀看-韓劇網",
      "url": "https://tv.youkutv.cc/2019/11/12/mjkHyHycfh0LyS4r/playlist.m3u8"
    }
  ]
}
```

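Interim conclusion

Is this really the end?

No!

Not quite.

Looking at the pile of code above, it really does smell.

We will refactor it with the Composite pattern and then the Singleton pattern.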

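Optimization 1: Composite pattern

The Composite pattern, also known as the part-whole pattern, treats a group of similar objects as a single object. It composes objects into a tree structure to represent part-whole hierarchies. It is a structural design pattern that builds a tree of object groups.

The pattern creates a class that contains a group of its own objects and provides ways to modify that group.

In short, it lets you handle complex elements the same way you handle simple ones.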
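As a minimal sketch of the definition above, the following builds a small tree in which a folder and a file answer the same `size()` call. The names `Item`, `FileItem`, and `FolderItem` are hypothetical and are not part of the crawler project.

```typescript
// Common interface shared by leaves and containers.
interface Item {
  label: string;
  size(): number;
}

// Leaf: reports its own size.
class FileItem implements Item {
  constructor(public label: string, private bytes: number) {}

  size(): number {
    return this.bytes;
  }
}

// Container: holds other items and reports the sum of their sizes.
class FolderItem implements Item {
  private children: Item[] = [];

  constructor(public label: string) {}

  add(child: Item): this {
    this.children.push(child);
    return this;
  }

  size(): number {
    return this.children.reduce((sum, child) => sum + child.size(), 0);
  }
}

const tree = new FolderItem("root")
  .add(new FileItem("a.txt", 10))
  .add(new FolderItem("sub").add(new FileItem("b.txt", 5)).add(new FileItem("c.txt", 7)));

const total = tree.size();
console.log(total); // prints 22
```

The caller never needs to know whether it holds a single file or a whole folder, which is the "complex elements handled like simple ones" idea.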

First, create a combination folder under src, then create two files in it: crawler.ts and urlAnalyzer.ts.

crawler.ts

crawler.ts is mainly responsible for fetching the page content and writing it to the file.

```
import superagent from "superagent";
import fs from "fs";
import path from "path";
// Import without the .ts extension; TypeScript rejects it by default
import UrlAnalyzer from "./urlAnalyzer";

// Any analyzer turns raw HTML into the string that is written to disk
export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

class Crawler {
  private filePath = path.resolve(__dirname, "../../data/url.json");

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }

  async initSpiderProcess() {
    const html = await this.getRawHtml();
    // Parsing is delegated to whichever analyzer was passed in
    const fileContent = this.analyzer.analyze(html, this.filePath);
    this.writeFile(fileContent);
  }

  constructor(private analyzer: Analyzer, private url: string) {
    this.initSpiderProcess();
  }
}

const url = "https://www.hanju.run/play/39257-1-1.html";

const analyzer = new UrlAnalyzer();
new Crawler(analyzer, url);
```

urlAnalyzer.ts

urlAnalyzer.ts holds the concrete logic for extracting content from the page nodes.
```
import cheerio from "cheerio";
import fs from "fs";
// Type-only import, so no runtime dependency back on crawler.ts
import type { Analyzer } from "./crawler";

interface objJson {
  [propName: number]: Info[];
}
interface InfoResult {
  time: number;
  data: Info[];
}
interface Info {
  name: string;
  url: string;
}

export default class UrlAnalyzer implements Analyzer {
  // Extract the title and play URL from the page
  private getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    const scpt: string = String($(".play>script:nth-child(1)").html());
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }

  // Merge the new result with whatever is already stored in the file
  private getJsonContent(info: InfoResult, filePath: string) {
    let fileContent: objJson = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));
    }
    fileContent[info.time] = info.data;
    return fileContent;
  }

  public analyze(html: string, filePath: string) {
    const info = this.getJsonInfo(html);
    console.log(info);
    const fileContent = this.getJsonContent(info, filePath);
    return JSON.stringify(fileContent);
  }
}
```

You can define a shortcut start command in package.json.
```
  "scripts": {
    "dev-c": "ts-node ./src/combination/crawler.ts"
  },
```
Then start it with npm run dev-c.
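In the code above, the crawler receives its analyzer through the `Analyzer` interface, so any object with a matching `analyze` method can be plugged in without touching the crawler. The sketch below illustrates that idea with two hypothetical analyzers and a simplified `MiniCrawler` that takes HTML directly instead of fetching it. None of these three classes exist in the original project.

```typescript
// Same contract as the Analyzer interface in crawler.ts.
interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

// Hypothetical analyzer: extracts only the <title> text.
class TitleAnalyzer implements Analyzer {
  analyze(html: string, _filePath: string): string {
    const match = /<title>([^<]*)<\/title>/i.exec(html);
    return JSON.stringify({ title: match ? match[1] : "" });
  }
}

// Hypothetical analyzer: reports the length of the page.
class LengthAnalyzer implements Analyzer {
  analyze(html: string, _filePath: string): string {
    return JSON.stringify({ length: html.length });
  }
}

// Simplified crawler: HTML is passed in rather than fetched.
class MiniCrawler {
  constructor(private analyzer: Analyzer) {}

  run(html: string): string {
    return this.analyzer.analyze(html, "unused.json");
  }
}

const page = "<title>demo</title>";
const titleResult = new MiniCrawler(new TitleAnalyzer()).run(page);
const lengthResult = new MiniCrawler(new LengthAnalyzer()).run(page);

console.log(titleResult);
console.log(lengthResult);
```

Swapping the analyzer changes what gets extracted while the fetching and writing logic stays the same.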

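Optimization 2: Singleton pattern

**The Singleton pattern** is one of the simplest design patterns in Java. It is a creational pattern: it governs how an object gets created.

The pattern involves a single class that is responsible for creating its own object while ensuring that only one instance is ever created. The class exposes a way to access that unique instance directly, without callers instantiating the class themselves.

Examples:

1. A class of students has only one head teacher.
2. Windows is multi-process and multi-threaded, so several processes or threads will inevitably try to operate on the same file at the same time; all handling of that file therefore has to go through a single instance.
3. Device managers are often designed as singletons. For example, a computer with two printers must make sure that both printers do not print the same file.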
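A minimal sketch of the pattern in TypeScript, assuming a hypothetical `Registry` class that is unrelated to the crawler: the constructor is private, so the only way to obtain an object is through `getInstance`, which always hands back the same one.

```typescript
class Registry {
  // The single shared instance, created lazily on first use.
  private static instance: Registry | undefined;

  private values = new Map<string, string>();

  // A private constructor prevents `new Registry()` from outside the class.
  private constructor() {}

  static getInstance(): Registry {
    if (!Registry.instance) {
      Registry.instance = new Registry();
    }
    return Registry.instance;
  }

  set(key: string, value: string): void {
    this.values.set(key, value);
  }

  get(key: string): string | undefined {
    return this.values.get(key);
  }
}

const first = Registry.getInstance();
const second = Registry.getInstance();
first.set("site", "hanju");

const sameInstance = first === second;
const sharedValue = second.get("site");

console.log(sameInstance, sharedValue);
```

A value stored through one reference is visible through the other because both point at the same object.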

Likewise, create a singleton folder under src, then create two files in it: crawler1.ts and urlAnalyzer.ts.

These two files play the same roles as before; only the code is written differently.

crawler1.ts

```
import superagent from "superagent";
import fs from "fs";
import path from "path";
// Import without the .ts extension; TypeScript rejects it by default
import UrlAnalyzer from "./urlAnalyzer";

export interface Analyzer {
  analyze: (html: string, filePath: string) => string;
}

class Crawler {
  private filePath = path.resolve(__dirname, "../../data/url.json");

  async getRawHtml() {
    const result = await superagent.get(this.url);
    return result.text;
  }

  private writeFile(content: string) {
    fs.writeFileSync(this.filePath, content);
  }

  private async initSpiderProcess() {
    const html = await this.getRawHtml();
    const fileContent = this.analyzer.analyze(html, this.filePath);
    // analyze() already returns a JSON string, so write it as is
    this.writeFile(fileContent);
  }

  constructor(private analyzer: Analyzer, private url: string) {
    this.initSpiderProcess();
  }
}

const url = "https://www.hanju.run/play/39257-1-1.html";

// The analyzer can only be obtained through getInstance()
const analyzer = UrlAnalyzer.getInstance();
new Crawler(analyzer, url);
```

urlAnalyzer.ts
```
import cheerio from "cheerio";
import fs from "fs";
// Type-only import, so no runtime dependency back on crawler1.ts
import type { Analyzer } from "./crawler1";

interface objJson {
  [propName: number]: Info[];
}
interface InfoResult {
  time: number;
  data: Info[];
}
interface Info {
  name: string;
  url: string;
}

export default class UrlAnalyzer implements Analyzer {
  // The single shared instance, created on first use
  private static instance: UrlAnalyzer;

  static getInstance() {
    if (!UrlAnalyzer.instance) {
      UrlAnalyzer.instance = new UrlAnalyzer();
    }
    return UrlAnalyzer.instance;
  }

  private getJsonInfo(html: string) {
    const $ = cheerio.load(html);
    const info: Info[] = [];
    const scpt: string = String($(".play>script:nth-child(1)").html());
    const url = unescape(
      scpt.split(";")[3].split("(")[1].split(")")[0].replace(/"/g, "")
    );
    const name: string = String($("title").html());
    info.push({
      name,
      url,
    });
    const result = {
      time: new Date().getTime(),
      data: info,
    };
    return result;
  }

  private getJsonContent(info: InfoResult, filePath: string) {
    let fileContent: objJson = {};
    if (fs.existsSync(filePath)) {
      fileContent = JSON.parse(fs.readFileSync(filePath, "utf-8"));
    }
    fileContent[info.time] = info.data;
    return fileContent;
  }

  public analyze(html: string, filePath: string) {
    const info = this.getJsonInfo(html);
    console.log(info);
    const fileContent = this.getJsonContent(info, filePath);
    return JSON.stringify(fileContent);
  }

  // Private constructor: instances can only come from getInstance()
  private constructor() {}
}
```

You can define a shortcut start command in package.json.
```
  "scripts": {
    "dev-s": "ts-node ./src/singleton/crawler1.ts"
  },
```
Then start it with npm run dev-s.

Conclusion

This time it really is the end. Thanks for reading, and I hope this helps.

Full source code:

https://github.com/maomincoding/TsCrawler


Building a Crawler Tool from Scratch with TypeScript/Node
