How Easy Is Automated Data Scraping, Really?
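When people hear "automated data scraping", many think of web crawlers and instinctively assume it is beyond their abilities. But when we find interesting data on the internet, knowing how to save that web data automatically for later analysis spares us the manual copying and downloading. Spending the saved time on a cup of coffee or a scroll through Douyin sounds pretty good, doesn't it?

Web development has two basic concepts: front end and back end. Simply put, the browser (the client) is the front end, and the server that provides the actual service is the back end. Web pages are built in one of two main ways: with the front end and back end separated, or not separated. In the separated model, the front end sends data requests to a back-end API service via AJAX, the back end responds with data, and the front end uses that data (usually JSON) to render the page. In the non-separated model, the server renders the HTML page and sends it directly to the browser for display.

The separated model is the mainstream approach today, and almost all large internet companies use it. The websites of some government agencies and public institutions, however, are mostly outsourced and slow to update, so the non-separated model is still common there. It is enough to treat these two models as background knowledge.

CSV vs JSON

JSON is a data format used constantly in web development. Many people work with CSV regularly but are less familiar with JSON. Plenty of online tools now convert between CSV and JSON with ease. To make JSON easier to understand, here is a brief comparison of the two.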
CSV format:

    ID,Name,Season,Points
    1,LeBron James,2019-20,25.2
    1,LeBron James,2018-19,27.4
    1,LeBron James,2017-18,27.5

JSON format:

    [
      { "id": 1, "name": "LeBron James", "season": "2019-20", "points": 25.2 },
      { "id": 1, "name": "LeBron James", "season": "2018-19", "points": 27.4 },
      { "id": 1, "name": "LeBron James", "season": "2017-18", "points": 27.5 }
    ]
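To make the relationship between the two formats concrete, here is a minimal sketch of converting the CSV sample above into the JSON form. The function name `csvToObjects` is hypothetical, and the sketch assumes simple CSV with no quoted fields or embedded commas; real-world CSV usually calls for a proper parser.

```javascript
// Minimal sketch: convert simple CSV text (no quoted fields, no embedded commas)
// into an array of objects keyed by the lowercased header names.
function csvToObjects(csvText) {
  const [headerLine, ...dataLines] = csvText.trim().split(/\r?\n/);
  const keys = headerLine.split(",").map((h) => h.trim().toLowerCase());

  return dataLines.map((line) => {
    const cells = line.split(",");
    const record = {};
    keys.forEach((key, i) => {
      const cell = (cells[i] ?? "").trim();
      // Convert purely numeric cells to numbers; keep everything else as text
      // (for example "2019-20" is not a number and stays a string).
      const asNumber = Number(cell);
      record[key] = cell !== "" && !Number.isNaN(asNumber) ? asNumber : cell;
    });
    return record;
  });
}

const csv = `ID,Name,Season,Points
1,LeBron James,2019-20,25.2
1,LeBron James,2018-19,27.4
1,LeBron James,2017-18,27.5`;

const records = csvToObjects(csv);
console.log(JSON.stringify(records, null, 2));
```

Each CSV row becomes one object, which is why the field names repeat in the JSON version.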
As you can see, JSON has more punctuation than CSV and repeats the field names for every record. This is a very common JSON layout, but JSON is sometimes written in a CSV-like style instead, such as the one used by stats.nba.com:

    {
      "headers": ["id", "name", "season", "points"],
      "rows": [
        [1, "LeBron James", "2019-20", 25.2],
        [1, "LeBron James", "2018-19", 27.4],
        [1, "LeBron James", "2017-18", 27.5]
      ]
    }

Scraping data directly from the API

In the now-popular separated model, a web application typically sends data requests to a back-end API via AJAX, and the front end renders the page from the data the API returns (usually JSON). So once we find the API endpoint that returns the data and imitate the request sent to it, we can get the data we want. Here we use stats.nba.com as the example and Node.js to demonstrate how to scrape an NBA player's year-by-year career stats through the API.

1. Check whether the data is loaded dynamically

First, open LeBron James's stats page on stats.nba.com in Chrome: https://www.nba.com/stats/player/2544/



2. Find the API endpoint that returns the data

The data is not in the HTML page, so where is it? The browser's developer tools can give us clues. Right-click on the page and choose Inspect to open the developer tools.
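Once an endpoint is found, its JSON response can be reshaped into whatever form is convenient for analysis. The sketch below assumes a payload in the compact headers/rows layout shown earlier; the field names `headers` and `rows` come from that illustrative example, and the real response may name or nest them differently. The function name `rowsToObjects` is hypothetical.

```javascript
// Sketch: expand a compact { headers, rows } payload into one object per row.
// The payload shape is an assumption taken from the illustrative example.
function rowsToObjects(table) {
  return table.rows.map((row) =>
    Object.fromEntries(table.headers.map((header, i) => [header, row[i]]))
  );
}

const compact = {
  headers: ["id", "name", "season", "points"],
  rows: [
    [1, "LeBron James", "2019-20", 25.2],
    [1, "LeBron James", "2018-19", 27.4],
    [1, "LeBron James", "2017-18", 27.5]
  ]
};

const players = rowsToObjects(compact);
console.log(players);
```

The compact layout saves bandwidth by listing the field names once; expanding it trades that saving for records that are easier to read and filter.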




3. Write a Node.js script to scrape data for multiple pages automatically

You can scrape data automatically in any language; Node.js is used here only as an example.

Install the request library:

    npm init -y
    npm install --save request request-promise-native

Copy the following code into an index.js file:

    const rp = require("request-promise-native");
    const fs = require("fs");

    async function main() {
      console.log("Making API Request...");
      // request the data from the JSON API
      const results = await rp({
        uri: "https://stats.nba.com/stats/playerdashboardbyyearoveryear?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=2544&PlusMinus=N&Rank=N&Season=2020-21&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&Split=yoy&VsConference=&VsDivision=",
        headers: {
          "Connection": "keep-alive",
          "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
          "x-nba-stats-origin": "stats",
          "Referer": "https://stats.nba.com/player/2544/"
        },
        json: true
      });
      console.log("Got results =", results);
      // save the JSON to disk
      await fs.promises.writeFile("output.json", JSON.stringify(results, null, 2));
      console.log("Done!");
    }

    // start the main script
    main();

Run the script from the command line:
    node index.js

This gives us LeBron James's career data, but that is not enough. The real efficiency gain comes from fetching other players' career data in the same run. Look at the API request path we found; it contains a PlayerID parameter:

    https://stats.nba.com/stats/playerdashboardbyyearoveryear?
    DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&
    Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&
    PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&
    PlayerID=2544&
    PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&
    SeasonType=Regular+Season&ShotClockRange=&Split=yoy&
    VsConference=&VsDivision=

Comparing this with the API paths for other players' career data shows that changing PlayerID alone yields the endpoint for the corresponding player. The player ID can be read from the URL of each player's profile page:
| Player ID | Player | URL |
| 2544 | LeBron James | https://stats.nba.com/player/2544/ |
| 1629029 | Luka Doncic | https://stats.nba.com/player/1629029/ |
| 201935 | James Harden | https://stats.nba.com/player/201935/ |
| 202695 | Kawhi Leonard | https://stats.nba.com/player/202695/ |
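Since only PlayerID changes between players, the long query string can be generated instead of edited by hand. The following is a sketch using Node's built-in `URLSearchParams`; the names `DEFAULT_PARAMS` and `buildPlayerUrl` are hypothetical, and the parameter values are copied from the request path shown above rather than from any official API documentation.

```javascript
// Sketch: build the year-over-year dashboard URL for any player ID.
// Parameter names and values are copied from the request observed above.
const BASE_URL = "https://stats.nba.com/stats/playerdashboardbyyearoveryear";

const DEFAULT_PARAMS = {
  DateFrom: "", DateTo: "", GameSegment: "", LastNGames: "0", LeagueID: "00",
  Location: "", MeasureType: "Base", Month: "0", OpponentTeamID: "0", Outcome: "",
  PORound: "0", PaceAdjust: "N", PerMode: "PerGame", Period: "0",
  PlayerID: "",
  PlusMinus: "N", Rank: "N", Season: "2019-20", SeasonSegment: "",
  SeasonType: "Regular Season", ShotClockRange: "", Split: "yoy",
  VsConference: "", VsDivision: ""
};

function buildPlayerUrl(playerId, season = "2019-20") {
  // URLSearchParams handles the encoding, e.g. the space in "Regular Season".
  const params = new URLSearchParams({
    ...DEFAULT_PARAMS,
    PlayerID: String(playerId),
    Season: season
  });
  return `${BASE_URL}?${params.toString()}`;
}

// IDs taken from the table above.
const playerUrls = [2544, 1629029, 201935, 202695].map((id) => buildPlayerUrl(id));
console.log(playerUrls.join("\n"));
```

Keeping the parameters in an object also makes it easy to vary other fields, such as Season, without touching the string by hand.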
We extract the request into a shared function, fetchPlayerYearOverYear, then iterate over an array of IDs and call it for each one to fetch all the data we need in a single run. In addition, to protect the server and as a matter of good practice, we add a delay between requests so that we do not overload the server and affect other users.

    const rp = require("request-promise-native");
    const fs = require("fs");

    // helper to delay execution by 300ms to 1100ms
    async function delay() {
      const durationMs = Math.random() * 800 + 300;
      return new Promise(resolve => {
        setTimeout(() => resolve(), durationMs);
      });
    }

    async function fetchPlayerYearOverYear(playerId) {
      console.log(`Making API Request for ${playerId}...`);
      // add the playerId to the URI and the Referer header
      // NOTE: we could also have used the `qs` option for the
      // query parameters.
      const results = await rp({
        uri: "https://stats.nba.com/stats/playerdashboardbyyearoveryear?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&" +
          `PlayerID=${playerId}` +
          "&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&Split=yoy&VsConference=&VsDivision=",
        headers: {
          "Connection": "keep-alive",
          "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
          "x-nba-stats-origin": "stats",
          "Referer": `https://stats.nba.com/player/${playerId}/`
        },
        json: true
      });
      // save to disk with playerID as the file name
      await fs.promises.writeFile(`${playerId}.json`, JSON.stringify(results, null, 2));
    }

    async function main() {
      // PlayerIDs for LeBron, Harden, Kawhi, Luka
      const playerIds = [2544, 201935, 202695, 1629029];
      console.log("Starting script for players", playerIds);
      // make an API request for each player
      for (const playerId of playerIds) {
        await fetchPlayerYearOverYear(playerId);
        // be polite to our friendly data hosts and
        // don't crash their servers
        await delay();
      }
      console.log("Done!");
    }

    main();

With the steps above, we have scraped data automatically through an API. But what about a web application that does not separate front end and back end and has no API endpoint? How do we scrape that automatically?

Scraping data from server-rendered HTML pages

A non-separated web application puts the data directly into the HTML page. In this case, downloading and parsing the HTML is all it takes to extract the target data. Again we use an example; the goal is to scrape NBA game data from espn.com.

1. Verify that the data is in the HTML page

As in the previous example, our goal is to scrape basketball box score data, this time from: https://www.espn.com/nba/boxscore?gameId=401160888


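2. Find a CSS selector that locates the target data

We need some marker to identify the target data in the HTML, and a CSS selector serves that purpose well. Right-click the page and choose Inspect to open the developer tools.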


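In the console, start with a selector that matches every table row on the page:

    document.querySelectorAll('tr')

Narrow it to the rows inside the body of the table wrapped by .gamepackage-away-wrap:

    document.querySelectorAll('.gamepackage-away-wrap tbody tr')

Then exclude the rows marked with the highlight class:

    document.querySelectorAll('.gamepackage-away-wrap tbody tr:not(.highlight)')
    > NodeList(13) [tr, tr, tr, tr, tr, tr, tr, tr, tr, tr, tr, tr, tr]

OK, that does the job! Next, we use the Node.js request library to download the HTML page and the cheerio library to parse it and extract the target data.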
3. Download the HTML page

As in the previous example, first install the request library. In a command-line window, enter:

    npm init -y
    npm install --save request request-promise-native

Then create an index.js file with the following code:

    const rp = require('request-promise-native');
    const fs = require('fs');

    async function downloadBoxScoreHtml() {
      // where to download the HTML from
      const uri = 'https://www.espn.com/nba/boxscore?gameId=401160888';
      // the output filename
      const filename = 'boxscore.html';
      // check if we already have the file
      const fileExists = fs.existsSync(filename);
      if (fileExists) {
        console.log(`Skipping download for ${uri} since ${filename} already exists.`);
        return;
      }
      // download the HTML from the web server
      console.log(`Downloading HTML from ${uri}...`);
      const results = await rp({ uri: uri });
      // save the HTML to disk
      await fs.promises.writeFile(filename, results);
    }

    async function main() {
      console.log('Starting...');
      await downloadBoxScoreHtml();
      console.log('Done!');
    }

    main();

Run the program:
    node index.js

You have now downloaded the HTML file!

4. Parse the HTML with Cheerio

Cheerio is a library well suited to extracting data from HTML pages, and its API is very similar to jQuery.

Install Cheerio:

    npm install --save cheerio

How is Cheerio used? First, load the contents of the HTML file with the cheerio.load function:

    // the input filename
    const htmlFilename = 'boxscore.html';
    // read the HTML from disk
    const html = await fs.promises.readFile(htmlFilename);
    // parse the HTML with Cheerio
    const $ = cheerio.load(html);

Second, extract the data with a CSS selector, using the same syntax as jQuery:

    const $trs = $('.gamepackage-away-wrap tbody tr:not(.highlight)')

This gives us all the HTML elements for the table rows, for example:

    <tr>
      <td class="name">
        <a name="&lpos=nba:game:boxscore:playercard" href="https://www.espn.com/nba/player/_/id/6440/tobias-harris" data-player-uid="s:40~l:46~a:6440">
          <span>T. Harris</span><span class="abbr">T. Harris</span>
        </a>
        <span class="position">SF</span>
      </td>
      <td class="min">38</td>
      <td class="fg">7-17</td>
      <td class="3pt">3-7</td>
      <td class="ft">1-1</td>
      <td class="oreb">0</td>
      <td class="dreb">5</td>
      <td class="reb">5</td>
      <td class="ast">2</td>
      <td class="stl">1</td>
      <td class="blk">0</td>
      <td class="to">2</td>
      <td class="pf">3</td>
      <td class="plusminus">-10</td>
      <td class="pts">18</td>
    </tr>
    ...

Iterating over these fragments with further selectors yields the target data:

    const values = $trs.toArray().map(tr => {
      // find all children <td>
      const tds = $(tr).find('td').toArray();
      // create a player object based on the <td> values
      const player = {};
      for (const td of tds) {
        // parse the <td>
        const $td = $(td);
        // map the td class attr to its value
        const key = $td.attr('class');
        const value = $td.text();
        player[key] = value;
      }
      return player;
    });

The resulting data:

    [
      {
        "name": "T. HarrisT. HarrisSF",
        "min": "38",
        "fg": "7-17",
        "3pt": "3-7",
        "ft": "1-1",
        "oreb": "0",
        "dreb": "5",
        "reb": "5",
        "ast": "2",
        "stl": "1",
        "blk": "0",
        "to": "2",
        "pf": "3",
        "plusminus": "-10",
        "pts": "18"
      }
      ...
    ]

Now that you know how to use cheerio, create a new index.js file with the following code and run it to obtain the target data. Compared with the snippet above, it takes the player name from the first span only and converts numeric text to numbers.

    const rp = require('request-promise-native');
    const fs = require('fs');
    const cheerio = require('cheerio');

    async function downloadBoxScoreHtml() {
      // where to download the HTML from
      const uri = 'https://www.espn.com/nba/boxscore?gameId=401160888';
      // the output filename
      const filename = 'boxscore.html';
      // check if we already have the file
      const fileExists = fs.existsSync(filename);
      if (fileExists) {
        console.log(`Skipping download for ${uri} since ${filename} already exists.`);
        return;
      }
      // download the HTML from the web server
      console.log(`Downloading HTML from ${uri}...`);
      const results = await rp({ uri: uri });
      // save the HTML to disk
      await fs.promises.writeFile(filename, results);
    }

    async function parseBoxScore() {
      console.log('Parsing box score HTML...');
      // the input filename
      const htmlFilename = 'boxscore.html';
      // read the HTML from disk
      const html = await fs.promises.readFile(htmlFilename);
      // parse the HTML with Cheerio
      const $ = cheerio.load(html);
      // Get our rows
      const $trs = $('.gamepackage-away-wrap tbody tr:not(.highlight)');
      const values = $trs.toArray().map(tr => {
        // find all children <td>
        const tds = $(tr).find('td').toArray();
        // create a player object based on the <td> values
        const player = {};
        for (const td of tds) {
          const $td = $(td);
          // map the td class attr to its value
          const key = $td.attr('class');
          let value;
          if (key === 'name') {
            value = $td.find('a span:first-child').text();
          } else {
            value = $td.text();
          }
          player[key] = isNaN(+value) ? value : +value;
        }
        return player;
      });
      return values;
    }

    async function main() {
      console.log('Starting...');
      await downloadBoxScoreHtml();
      const boxScore = await parseBoxScore();
      // save the scraped results to disk
      await fs.promises.writeFile('boxscore.json', JSON.stringify(boxScore, null, 2));
      console.log('Done!');
    }

    main();

Summary

To sum up, there are two ways to scrape data automatically, depending on how the web application is built. When front end and back end are separated, we can scrape the data directly from the API. When they are not, we first download the HTML page, then parse the HTML and extract the data.

Reference: https://beshaimakes.com/js-scrape-data
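A note on the `isNaN(+value) ? value : +value` conversion used in parseBoxScore: in JavaScript, `+""` evaluates to 0, so an empty table cell would be stored as the number 0 rather than as missing text. Below is a minimal sketch of a stricter conversion. The helper names `toNumberIfNumeric` and `parseMadeAttempted` are hypothetical, and reading cells such as "7-17" as made-attempted pairs is an assumption about the box score columns, not something the page markup states.

```javascript
// Sketch: convert cell text to a number only when it really is numeric.
// Unlike `isNaN(+value) ? value : +value`, an empty string stays a string.
function toNumberIfNumeric(text) {
  const trimmed = text.trim();
  if (trimmed === "") return text;
  const asNumber = Number(trimmed);
  return Number.isNaN(asNumber) ? text : asNumber;
}

// Sketch: split a "made-attempted" cell such as "7-17" into two numbers.
// Returns null when the text does not match that pattern.
function parseMadeAttempted(text) {
  const match = /^(\d+)-(\d+)$/.exec(text.trim());
  return match ? { made: Number(match[1]), attempted: Number(match[2]) } : null;
}

// Sample values taken from the scraped row shown earlier.
const sample = { min: "38", fg: "7-17", plusminus: "-10", blank: "" };
const cleaned = {
  min: toNumberIfNumeric(sample.min),
  fg: parseMadeAttempted(sample.fg),
  plusminus: toNumberIfNumeric(sample.plusminus),
  blank: toNumberIfNumeric(sample.blank)
};
console.log(cleaned);
```

Whether an empty cell should become 0, null, or stay empty depends on the analysis; the point is to make that choice deliberately.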

