
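Mention automated data scraping and many people think of web crawlers, then quietly conclude that it lies beyond their abilities. But suppose we find some data on the internet that interests us. If we know how to save that web data automatically for later analysis, we spare ourselves the manual copying and downloading, and can spend the saved time on a cup of coffee or a scroll through Douyin. Doesn't that sound nice?

Web development has two basic concepts: the front end and the back end. Simply put, the browser or client side is the front end, and the server side that provides the actual service is the back end. Web pages are built in two main ways: with the front end and back end separated, or not separated. In the separated model, the front end sends data requests to a back-end API service via AJAX, the back end responds with data, and the front end uses that data (usually JSON) to render the page. In the non-separated model, the server renders the HTML page and sends it directly to the browser for display. The separated model is the mainstream today and is used by almost all large internet companies, whereas some government and public-institution websites, which are mostly outsourced and updated slowly, still tend to use the non-separated model. For our purposes, these two models only need to be understood as background knowledge.

CSV vs JSON

JSON is a data format used constantly in web development. Many people use CSV regularly but are less familiar with JSON. Plenty of online tools can convert between CSV and JSON. To make JSON easier to understand, here is a brief comparison of the two.

CSV format: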

ID,Name,Season,Points
1,LeBron James,2019-20,25.2
1,LeBron James,2018-19,27.4
1,LeBron James,2017-18,27.5

JSON format:

[
  { "id": 1, "name": "LeBron James", "season": "2019-20", "points": 25.2 },
  { "id": 1, "name": "LeBron James", "season": "2018-19", "points": 27.4 },
  { "id": 1, "name": "LeBron James", "season": "2017-18", "points": 27.5 }
]

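As you can see, compared with CSV, JSON has more punctuation and repeats the header field names in every record. This example is a fairly common JSON layout, but JSON is sometimes written in a style closer to CSV, such as the one used by stats.nba.com: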

{
  "headers": ["id", "name", "season", "points"],
  "rows": [
    [1, "LeBron James", "2019-20", 25.2],
    [1, "LeBron James", "2018-19", 27.4],
    [1, "LeBron James", "2017-18", 27.5]
  ]
}
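The two JSON layouts carry the same information, so converting between them is mechanical. The following is a minimal sketch, not part of the original scripts: the function names rowsToObjects and objectsToCsv are made up for illustration, and the CSV step assumes that no value contains a comma, quote, or newline, so it does no escaping.

```javascript
// Convert { headers: [...], rows: [[...], ...] } into an array of objects.
function rowsToObjects(table) {
  return table.rows.map((row) => {
    const record = {};
    table.headers.forEach((header, i) => {
      record[header] = row[i];
    });
    return record;
  });
}

// Convert an array of flat objects into CSV text.
// Assumption: values contain no commas, quotes, or newlines.
function objectsToCsv(records) {
  if (records.length === 0) return "";
  const headers = Object.keys(records[0]);
  const lines = records.map((r) => headers.map((h) => String(r[h])).join(","));
  return [headers.join(","), ...lines].join("\n");
}

const table = {
  headers: ["id", "name", "season", "points"],
  rows: [
    [1, "LeBron James", "2019-20", 25.2],
    [1, "LeBron James", "2018-19", 27.4],
  ],
};

const records = rowsToObjects(table);
console.log(records);
console.log(objectsToCsv(records));
```

Real-world CSV with quoted fields would need a proper CSV library rather than this join-based approach.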

Scraping data directly from an API

In the now-popular model of separated front end and back end, a web application generally sends data requests to a back-end API via AJAX, and the front end renders the page from the data the API returns (usually JSON). So if we can find the API endpoint that returns the data and send it a simulated request, we get the data we want. Here we use stats.nba.com as an example and Node.js to demonstrate how to scrape an NBA player's year-by-year career statistics through an API.

1. Check whether the data is loaded dynamically

First, use Chrome to open LeBron James's stats page on stats.nba.com: https://www.nba.com/stats/player/2544/

The data in the table is our target. We need to determine whether it is static data embedded in the HTML page or data loaded dynamically through an API. How do we check? In Chrome, open the page, right-click it, and choose View Page Source to inspect the page's source code.

Searching with Ctrl+F shows that the table data we need (for example 2020-21 or LAL) does not appear in the source code, which means the table data is loaded dynamically through an API.
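The same check can be scripted once the page source is available as a string. This is a minimal sketch under stated assumptions: the helper name missingMarkers and the sample HTML are invented for illustration and do not come from the NBA site.

```javascript
// Return the marker strings that do NOT appear in the HTML source.
// If values visible in the rendered table are missing from the source,
// the table is probably filled in later by an API call.
function missingMarkers(html, markers) {
  return markers.filter((marker) => !html.includes(marker));
}

// Hypothetical page source: an empty container that a script fills in later.
const sampleHtml = '<html><body><div id="stats-table"></div></body></html>';

const missing = missingMarkers(sampleHtml, ["2020-21", "LAL"]);
console.log(
  missing.length > 0
    ? `Not in static HTML: ${missing.join(", ")} (likely loaded dynamically)`
    : "All markers are present in the static HTML"
);
```

Picking two or three distinctive values from the rendered table keeps the check from matching unrelated text elsewhere on the page.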
2. Find the API endpoint that returns the data

If the data is not in the HTML page, where is it? The browser's developer tools can give us clues. Right-click the page and choose Inspect to open the developer tools.

The Elements tab is usually open by default; we need to select the Network tab.

With the Network tab open, select the XHR filter and reload the page. We can now see all of the API data requests this page makes.

Next comes the part that tests our analytical skills: click through these requests one by one and use the Preview pane to work out which request is the API endpoint we are looking for.

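Considering the table's name, its data, and other characteristics, we determine that the request marked with a red box in the figure is the API endpoint we need.

3. Write a Node.js script to scrape data for multiple pages automatically

You can scrape data in any language; Node.js is used here only as an example. Install the request library: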

npm init -y
npm install --save request request-promise-native

Copy the following code into a file named index.js:

const rp = require("request-promise-native");
const fs = require("fs");

async function main() {
  console.log("Making API Request...");
  // request the data from the JSON API
  const results = await rp({
    uri: "https://stats.nba.com/stats/playerdashboardbyyearoveryear?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerID=2544&PlusMinus=N&Rank=N&Season=2020-21&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&Split=yoy&VsConference=&VsDivision=",
    headers: {
      "Connection": "keep-alive",
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
      "x-nba-stats-origin": "stats",
      "Referer": "https://stats.nba.com/player/2544/"
    },
    json: true
  });
  console.log("Got results =", results);
  // save the JSON to disk
  await fs.promises.writeFile("output.json", JSON.stringify(results, null, 2));
  console.log("Done!");
}

// start the main script
main();

Run the script from the command line:

node index.js
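The request package used above has since been deprecated by its maintainers. As a possible alternative, here is a sketch that uses the fetch function built into Node.js 18 and later. This is a swapped-in technique rather than the method used in this article: the helper name fetchJson and its injectable fetchImpl parameter are assumptions made for illustration, and whether the NBA endpoint accepts such a request still depends on sending suitable headers.

```javascript
// Fetch a URL and parse the response body as JSON.
// fetchImpl defaults to the global fetch available in Node.js 18+;
// passing a different function makes the helper easy to try offline.
async function fetchJson(url, headers, fetchImpl = fetch) {
  const response = await fetchImpl(url, { headers });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Offline demonstration with a stand-in for fetch (no network access).
const stubFetch = async (url, options) => ({
  ok: true,
  status: 200,
  json: async () => ({ requestedUrl: url, sentHeaders: options.headers }),
});

fetchJson("https://example.invalid/data", { Referer: "https://example.invalid/" }, stubFetch)
  .then((data) => console.log("Got results =", data));
```

In the script above, a call such as fetchJson(uri, headers) would stand in for the rp({ uri, headers, json: true }) call, with the rest of the script unchanged.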

This gives us LeBron James's career data, but that is not enough. The real efficiency gain comes from fetching other players' career data in the same run. Looking at the API request path we found, there is a PlayerID parameter:

https://stats.nba.com/stats/playerdashboardbyyearoveryear?
DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&
Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&
PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&
PlayerID=2544&
PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&
SeasonType=Regular+Season&ShotClockRange=&Split=yoy&
VsConference=&VsDivision=

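Comparing this with the API paths for other players' career data shows that changing the PlayerID is all it takes to get the endpoint for another player. The player ID can be read from the URL of the player's home page.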

Player ID	Player	URL
2544	LeBron James	https://stats.nba.com/player/2544/
1629029	Luka Doncic	https://stats.nba.com/player/1629029/
201935	James Harden	https://stats.nba.com/player/201935/
202695	Kawhi Leonard	https://stats.nba.com/player/202695/
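Both steps, reading the ID out of a player page URL and putting it into the API path, can be sketched with the URL and URLSearchParams classes built into Node.js. The function names playerIdFromUrl and buildStatsUrl are illustrative only; the parameter list is copied from the API path shown above, and building the query string this way is one option among several.

```javascript
// Extract the numeric player ID from a player page URL such as
// https://stats.nba.com/player/2544/ (returns null if none is found).
function playerIdFromUrl(pageUrl) {
  const match = new URL(pageUrl).pathname.match(/\/player\/(\d+)/);
  return match ? Number(match[1]) : null;
}

// Build the API URL for one player from the query parameters shown above.
// URLSearchParams encodes the space in "Regular Season" as "+".
function buildStatsUrl(playerId, season = "2019-20") {
  const params = new URLSearchParams({
    DateFrom: "", DateTo: "", GameSegment: "", LastNGames: "0",
    LeagueID: "00", Location: "", MeasureType: "Base", Month: "0",
    OpponentTeamID: "0", Outcome: "", PORound: "0", PaceAdjust: "N",
    PerMode: "PerGame", Period: "0", PlayerID: String(playerId),
    PlusMinus: "N", Rank: "N", Season: season, SeasonSegment: "",
    SeasonType: "Regular Season", ShotClockRange: "", Split: "yoy",
    VsConference: "", VsDivision: "",
  });
  return `https://stats.nba.com/stats/playerdashboardbyyearoveryear?${params}`;
}

const id = playerIdFromUrl("https://stats.nba.com/player/1629029/");
console.log(id);
console.log(buildStatsUrl(id));
```

Keeping the parameters in an object makes it easier to vary other values, such as Season, without editing a long string by hand.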

With that, we can change the script above to take the player ID as a parameter and meet our ultimate goal.

const rp = require("request-promise-native");
const fs = require("fs");

// helper to delay execution by 300ms to 1100ms
async function delay() {
  const durationMs = Math.random() * 800 + 300;
  return new Promise(resolve => {
    setTimeout(() => resolve(), durationMs);
  });
}

async function fetchPlayerYearOverYear(playerId) {
  console.log(`Making API Request for ${playerId}...`);
  // add the playerId to the URI and the Referer header
  // NOTE: we could also have used the `qs` option for the
  // query parameters.
  const results = await rp({
    uri: "https://stats.nba.com/stats/playerdashboardbyyearoveryear?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&" +
      `PlayerID=${playerId}` +
      "&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&Split=yoy&VsConference=&VsDivision=",
    headers: {
      "Connection": "keep-alive",
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
      "x-nba-stats-origin": "stats",
      "Referer": `https://stats.nba.com/player/${playerId}/`
    },
    json: true
  });
  // save to disk with playerID as the file name
  await fs.promises.writeFile(
    `${playerId}.json`,
    JSON.stringify(results, null, 2)
  );
}

async function main() {
  // PlayerIDs for LeBron, Harden, Kawhi, Luka
  const playerIds = [2544, 201935, 202695, 1629029];
  console.log("Starting script for players", playerIds);
  // make an API request for each player
  for (const playerId of playerIds) {
    await fetchPlayerYearOverYear(playerId);
    // be polite to our friendly data hosts and
    // don't crash their servers
    await delay();
  }
  console.log("Done!");
}

main();

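We extracted the request logic into a shared function, fetchPlayerYearOverYear, and loop over the array of IDs, calling it for each one to fetch all the data we need in a single run. In addition, to protect the server and as a matter of good conduct, we generally add a delay between requests so that we do not overload the server and affect other users.

With the steps above, we have successfully scraped data automatically through an API. But what about a web application that does not separate front end and back end and has no API? How do we scrape it automatically?

Scraping data from server-rendered HTML pages

A web application that does not separate front end and back end renders the data directly into the HTML page. In that case we only need to download and parse the HTML to extract the target data. Again we demonstrate with an example: the goal is to scrape NBA game data from espn.com.

1. Verify that the data is in the HTML page

As in the previous example, our goal is to scrape the box score of a basketball game, at https://www.espn.com/nba/boxscore?gameId=401160888.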

Right-click the page, choose View Page Source, and press Ctrl+F to search for data from the table, such as the value 0-11 in the J. Embiid row.

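OK, the target data is present in the HTML page.

2. Find a CSS selector that locates the target data

We need some marker to identify the target data in the HTML, and CSS selectors serve that purpose very well. Right-click the page and choose Inspect to open the developer tools.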

With the developer tools we can find the HTML elements that correspond to the table data, as shown in the figure below.

Now we can switch to the Console tab and use the document.querySelectorAll() function to test selectors interactively until we find the right one. For example, we can start by selecting all table rows (<tr>):

document.querySelectorAll('tr')

The output shows that tr alone is too broad: it returns 41 rows, while the target table has only 13 rows of data. We can narrow the result by using the ID or classes of the tr elements' parent element.

So we can write the selector like this:

document.querySelectorAll('.gamepackage-away-wrap tbody tr')

This narrows the result to 15 table rows, much closer to our target, with only two extra tr.highlight rows. We refine the selector further:

document.querySelectorAll('.gamepackage-away-wrap tbody tr:not(.highlight)')
> NodeList(13) [tr, tr, tr, tr, tr, tr, tr, tr, tr, tr, tr, tr, tr]

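OK, that does it! Next, we use the Node.js request library to download the HTML page and the cheerio library to parse it and extract the target data.

3. Download the HTML page

As in the previous example, first install the request library by entering the following at the command line: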

npm init -y
npm install --save request request-promise-native

Then create a new index.js file and enter the following code:

const rp = require('request-promise-native');
const fs = require('fs');

async function downloadBoxScoreHtml() {
  // where to download the HTML from
  const uri = 'https://www.espn.com/nba/boxscore?gameId=401160888';
  // the output filename
  const filename = 'boxscore.html';
  // check if we already have the file
  const fileExists = fs.existsSync(filename);
  if (fileExists) {
    console.log(`Skipping download for ${uri} since ${filename} already exists.`);
    return;
  }
  // download the HTML from the web server
  console.log(`Downloading HTML from ${uri}...`);
  const results = await rp({ uri: uri });
  // save the HTML to disk
  await fs.promises.writeFile(filename, results);
}

async function main() {
  console.log('Starting...');
  await downloadBoxScoreHtml();
  console.log('Done!');
}

main();

Run the program:

node index.js

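And with that, you have downloaded the HTML file!

4. Parse the HTML with Cheerio

Cheerio is a widely used library for extracting data from HTML pages, with an API very similar to jQuery. Install Cheerio: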

npm install --save cheerio

How do we use Cheerio?
Step one: use the cheerio.load function to parse the contents of the HTML file.

// the input filename
const htmlFilename = 'boxscore.html';
// read the HTML from disk
const html = await fs.promises.readFile(htmlFilename);
// parse the HTML with Cheerio
const $ = cheerio.load(html);

Step two: extract the data with CSS selectors, using the same syntax as jQuery.

const $trs = $('.gamepackage-away-wrap tbody tr:not(.highlight)')

This gives us the HTML fragments for every data row of the table, like this:

<tr>
  <td class="name">
    <a name="&amp;lpos=nba:game:boxscore:playercard" href="https://www.espn.com/nba/player/_/id/6440/tobias-harris" data-player-uid="s:40~l:46~a:6440">
      <span>T. Harris</span>
      <span class="abbr">T. Harris</span>
    </a>
    <span class="position">SF</span>
  </td>
  <td class="min">38</td>
  <td class="fg">7-17</td>
  <td class="3pt">3-7</td>
  <td class="ft">1-1</td>
  <td class="oreb">0</td>
  <td class="dreb">5</td>
  <td class="reb">5</td>
  <td class="ast">2</td>
  <td class="stl">1</td>
  <td class="blk">0</td>
  <td class="to">2</td>
  <td class="pf">3</td>
  <td class="plusminus">-10</td>
  <td class="pts">18</td>
</tr>
...

Iterating over the extracted HTML fragments with further selectors yields the target data.

const values = $trs.toArray().map(tr => {
  // find all children <td>
  const tds = $(tr).find('td').toArray();
  // create a player object based on the <td> values
  const player = {};
  // declare td with const so it does not leak as an implicit global
  for (const td of tds) {
    // parse the <td>
    const $td = $(td);
    // map the td class attr to its value
    const key = $td.attr('class');
    const value = $td.text();
    player[key] = value;
  }
  return player;
});

Result:

[
  {
    "name": "T. HarrisT. HarrisSF",
    "min": "38",
    "fg": "7-17",
    "3pt": "3-7",
    "ft": "1-1",
    "oreb": "0",
    "dreb": "5",
    "reb": "5",
    "ast": "2",
    "stl": "1",
    "blk": "0",
    "to": "2",
    "pf": "3",
    "plusminus": "-10",
    "pts": "18"
  }
  ...
]

Now you know how to use the cheerio library. Next, create an index.js file with the following code and run it to obtain our target data.

const rp = require('request-promise-native');
const fs = require('fs');
const cheerio = require('cheerio');

async function downloadBoxScoreHtml() {
  // where to download the HTML from
  const uri = 'https://www.espn.com/nba/boxscore?gameId=401160888';
  // the output filename
  const filename = 'boxscore.html';
  // check if we already have the file
  const fileExists = fs.existsSync(filename);
  if (fileExists) {
    console.log(`Skipping download for ${uri} since ${filename} already exists.`);
    return;
  }
  // download the HTML from the web server
  console.log(`Downloading HTML from ${uri}...`);
  const results = await rp({ uri: uri });
  // save the HTML to disk
  await fs.promises.writeFile(filename, results);
}

async function parseBoxScore() {
  console.log('Parsing box score HTML...');
  // the input filename
  const htmlFilename = 'boxscore.html';
  // read the HTML from disk
  const html = await fs.promises.readFile(htmlFilename);
  // parse the HTML with Cheerio
  const $ = cheerio.load(html);
  // Get our rows
  const $trs = $('.gamepackage-away-wrap tbody tr:not(.highlight)');
  const values = $trs.toArray().map(tr => {
    // find all children <td>
    const tds = $(tr).find('td').toArray();
    // create a player object based on the <td> values
    const player = {};
    // declare td with const so it does not leak as an implicit global
    for (const td of tds) {
      const $td = $(td);
      // map the td class attr to its value
      const key = $td.attr('class');
      let value;
      if (key === 'name') {
        // take only the first <span> to avoid the duplicated name and position
        value = $td.find('a span:first-child').text();
      } else {
        value = $td.text();
      }
      // store numeric-looking values as numbers, everything else as strings
      player[key] = isNaN(+value) ? value : +value;
    }
    return player;
  });
  return values;
}

async function main() {
  console.log('Starting...');
  await downloadBoxScoreHtml();
  const boxScore = await parseBoxScore();
  // save the scraped results to disk
  await fs.promises.writeFile(
    'boxscore.json',
    JSON.stringify(boxScore, null, 2)
  );
  console.log('Done!');
}

main();
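The expression isNaN(+value) ? value : +value in the script above decides which cells are stored as numbers. The sketch below isolates that rule in a helper (the name toNumberIfNumeric is illustrative, not part of the script) so its behavior on typical box score values is easier to see. Two edge cases are worth knowing: unary plus turns an empty string into 0, and a value such as "+5" loses its explicit sign and becomes 5.

```javascript
// Return a number when the text is numeric, otherwise the original string.
// Mirrors the rule used in the script: isNaN(+value) ? value : +value
function toNumberIfNumeric(value) {
  const n = +value;
  return isNaN(n) ? value : n;
}

const samples = ["38", "-10", "7-17", "T. Harris", "", "+5"];
for (const sample of samples) {
  const converted = toNumberIfNumeric(sample);
  console.log(JSON.stringify(sample), "->", typeof converted, JSON.stringify(converted));
}
```

If blank cells should stay blank rather than become 0, checking for an empty string before converting would be one way to handle it.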

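Summary

To sum up, there are two approaches to automated data scraping, depending on how the web application is built. With a separated front end and back end, we can scrape the data directly from the API; without that separation, we first download the HTML page, then parse the HTML and extract the data. Have you got the hang of it?

Reference: https://beshaimakes.com/js-scrape-data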


How easy is automated data scraping, really?