Python in Practice: Cracking the Anti-Scraping Mechanism of Pear Video (梨視頻)
Author | Li Yunchen (李運辰)    Source | Python爬蟲數據分析挖掘
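1
Preface
Earlier installments covered scraping, data analysis, and data visualization. Scraping is the key link in that chain: if the data cannot be fetched, there is nothing to analyze or visualize.
This article therefore looks at anti-scraping mechanisms and explains how to deal with this kind of protection.
Pear Video (梨視頻) is used below as a real-world case study.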
2
Getting the video list
1. Identify the type of anti-scraping measure


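The figure above shows the asynchronously loaded link: the video list is not part of the initial page and is filled in afterwards through asynchronous loading.
2. Analyze the asynchronously loaded link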

The two links above behave the same way (they return the same data).

Changing the value of the start parameter is all it takes to get more video page links.
Pattern: start increases in steps of 12.
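The pagination rule can be sketched as a small URL generator. This is only an illustration: the helper name `list_page_urls` is hypothetical, the query parameters are copied from the list URL used in this article, and it is an assumption that the first page starts at start=0.

```python
# Sketch: build the paginated list URLs by stepping `start` in increments of 12.
# Assumption: the first page corresponds to start=0.
BASE_URL = "https://www.pearvideo.com/category_loading.jsp"


def list_page_urls(pages, category_id=5, step=12):
    """Return the list URLs for the first `pages` pages (hypothetical helper)."""
    return [
        f"{BASE_URL}?reqType=5&categoryId={category_id}&start={page * step}"
        for page in range(pages)
    ]


if __name__ == "__main__":
    for url in list_page_urls(3):
        print(url)
```

Each generated URL can then be requested and parsed in the same way as a single list page.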
3. Page analysis

In the returned page:
class="categoryem" marks each entry in the list.
//div[@class="vervideo-title"]/text() gives the video title.
//div[@class="vervideo-bd"]/a/@href gives the link to the video's web page (not the real playback link).
4. Implementation
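### Get the video list
import requests
from lxml import etree

# `headers` is not defined in the original snippet; a browser-like user-agent is assumed here
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'}

def getlist():
    url = "https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=5&start=36"
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    text = res.text
    selector = etree.HTML(text)
    # Every element with class "categoryem" is one video entry
    list = selector.xpath('//*[@class="categoryem"]')
    for i in list:
        href = i.xpath('.//div[@class="vervideo-bd"]/a/@href')[0]
        title = i.xpath('.//div[@class="vervideo-title"]/text()')[0]
        print(title)
        print(href)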

3
Parsing the mp4 playback address
1. Page analysis
The following video page link is used as the example:
https://www.pearvideo.com/video_1721926
Inside the element with class=main-video-box you can see an mp4 address, but it is generated by JavaScript.

The original page source contains no mp4 playback address, so we have to fetch it asynchronously.
2. Analyze the data packet
https://www.pearvideo.com/videoStatus.jsp?contId=1721926&mrd=0.7353562335379842
This response contains an mp4 address, but requesting the URL directly runs into another anti-scraping check.

Reason:
contId is the id of the video.
mrd is a random number (this is the anti-scraping restriction).
So we need to construct the random number ourselves.
3. Construct the random number
# countid holds the video id, e.g. "1721926"
url = "https://www.pearvideo.com/videoStatus.jsp?contId=" + str(countid) + "&mrd=" + str(random.random())
headers_id = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    'cookie': '__secdyid=d95e39d0b5a512e1c35c5fdea59dcdd21b2758d39550e2c8021615635656; JSESSIONID=2DAE0DABD2DE9BB5D05335B2DE3AF8FF; PEAR_UUID=c32ee57d-4445-49f6-859a-a0a0fc054f1b; _uab_collina=161563565701204982722716; p_h5_u=C94D957E-E1CB-4130-9DF3-DDA387A42A8B; Hm_lvt_9707bc8d5f6bba210e7218b8496f076a=1615635658; UM_distinctid=1782b63b68f38-0331dcb7c2707d-5771133-100200-1782b63b690362; acw_tc=76b20f7116156407099735155e6bf791512b521ca0b97962f9cc398bc56d3a; CNZZDATA1260553744=1236015902-1615633517-%7C1615639181; Hm_lpvt_9707bc8d5f6bba210e7218b8496f076a=1615641462; SERVERID=ed8d5ad7d9b044d0dd5993c7c771ef48|1615641673|1615635656',
    'Host': 'www.pearvideo.com',
    'Referer': 'https://www.pearvideo.com/video_' + str(countid),
}
res = requests.get(url, headers=headers_id)
res.encoding = 'utf-8'
random.random() supplies the random value for the mrd parameter.
Pitfall:
The Referer entry in the request headers must not be left out!
import json
import random
import requests

countid = "1721926"
url = "https://www.pearvideo.com/videoStatus.jsp?contId=" + str(countid) + "&mrd=" + str(random.random())
headers_id = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
    'cookie': '__secdyid=d95e39d0b5a512e1c35c5fdea59dcdd21b2758d39550e2c8021615635656; JSESSIONID=2DAE0DABD2DE9BB5D05335B2DE3AF8FF; PEAR_UUID=c32ee57d-4445-49f6-859a-a0a0fc054f1b; _uab_collina=161563565701204982722716; p_h5_u=C94D957E-E1CB-4130-9DF3-DDA387A42A8B; Hm_lvt_9707bc8d5f6bba210e7218b8496f076a=1615635658; UM_distinctid=1782b63b68f38-0331dcb7c2707d-5771133-100200-1782b63b690362; acw_tc=76b20f7116156407099735155e6bf791512b521ca0b97962f9cc398bc56d3a; CNZZDATA1260553744=1236015902-1615633517-%7C1615639181; Hm_lpvt_9707bc8d5f6bba210e7218b8496f076a=1615641462; SERVERID=ed8d5ad7d9b044d0dd5993c7c771ef48|1615641673|1615635656',
    'Host': 'www.pearvideo.com',
    # The Referer must point at the video's own page
    'Referer': 'https://www.pearvideo.com/video_' + str(countid),
}
res = requests.get(url, headers=headers_id)
res.encoding = 'utf-8'
text = json.loads(res.text)
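The two ingredients discussed so far, the random mrd value and the Referer header, can be isolated in a small helper that only builds the request without sending it. This is a minimal sketch: the function name `build_status_request` and the shortened user-agent are assumptions for illustration, and whether the cookie from the article is also required is not checked here.

```python
import random


def build_status_request(cont_id):
    """Return (url, headers) for the videoStatus.jsp request (hypothetical helper)."""
    # mrd is a fresh random float in [0, 1) on every call
    url = (
        "https://www.pearvideo.com/videoStatus.jsp"
        f"?contId={cont_id}&mrd={random.random()}"
    )
    headers = {
        # Placeholder user-agent; any browser-like value is assumed to work
        "user-agent": "Mozilla/5.0",
        # The Referer must point at the page of the same video
        "Referer": f"https://www.pearvideo.com/video_{cont_id}",
    }
    return url, headers


if __name__ == "__main__":
    url, headers = build_status_request("1721926")
    print(url)
    print(headers["Referer"])
```

The returned pair can be passed straight to `requests.get(url, headers=headers)`.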

Parsing the JSON response gives us the mp4 address:
videoInfo = text['videoInfo']['videos']['srcUrl']
This gives us the video's mp4 address! (Unfortunately, this address is only a fake one and needs one more step to crack.)

Next, we restore the real mp4 address from this fake one.
4. Restore the real mp4 address

The real mp4 playback address contains cont-1721926 (the video id).
So the fake address has to be spliced into the real one.
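# Everything before the first "-", minus its trailing 13 characters
s1 = videoInfo.split("-")[0][0:-13]
# Replacement segment that carries the video id
s2 = "cont-" + str(countid) + "-"
murl = videoInfo.split("-")
# The two segments that follow, plus the fixed suffix
s3 = murl[1] + "-" + murl[2] + "-hd.mp4"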
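To make the splice concrete, here is a minimal sketch run on a made-up address. The sample URL is hypothetical and only mimics the shape implied by the slicing above (a file name that starts with a 13-character segment followed by "-"); the helper name `restore_real_url` is also an assumption. Unlike the slicing code, it keeps whatever follows the first segment instead of hard-coding "-hd.mp4".

```python
def restore_real_url(fake_url, cont_id):
    """Swap the leading segment of the file name for 'cont-<id>' (hypothetical helper)."""
    # Split off the file name from the directory part of the URL
    directory, filename = fake_url.rsplit("/", 1)
    # Drop everything up to the first "-" in the file name
    _, _, rest = filename.partition("-")
    return f"{directory}/cont-{cont_id}-{rest}"


if __name__ == "__main__":
    # Made-up address for illustration only
    fake = "https://video.pearvideo.com/mp4/third/20210313/1615641673000-11905134-160553-hd.mp4"
    print(restore_real_url(fake, "1721926"))
```

For an address of this shape the result is the same string that the concatenation s1+s2+s3 produces.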
Finally, for convenience, wrap everything in a function that returns the real playback address for a given video id.
import json
import random
import requests

# Get the real mp4 address
def getmp4(countid):
    # countid = "1721926"
    url = "https://www.pearvideo.com/videoStatus.jsp?contId=" + str(countid) + "&mrd=" + str(random.random())
    headers_id = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
        'cookie': '__secdyid=d95e39d0b5a512e1c35c5fdea59dcdd21b2758d39550e2c8021615635656; JSESSIONID=2DAE0DABD2DE9BB5D05335B2DE3AF8FF; PEAR_UUID=c32ee57d-4445-49f6-859a-a0a0fc054f1b; _uab_collina=161563565701204982722716; p_h5_u=C94D957E-E1CB-4130-9DF3-DDA387A42A8B; Hm_lvt_9707bc8d5f6bba210e7218b8496f076a=1615635658; UM_distinctid=1782b63b68f38-0331dcb7c2707d-5771133-100200-1782b63b690362; acw_tc=76b20f7116156407099735155e6bf791512b521ca0b97962f9cc398bc56d3a; CNZZDATA1260553744=1236015902-1615633517-%7C1615639181; Hm_lpvt_9707bc8d5f6bba210e7218b8496f076a=1615641462; SERVERID=ed8d5ad7d9b044d0dd5993c7c771ef48|1615641673|1615635656',
        'Host': 'www.pearvideo.com',
        'Referer': 'https://www.pearvideo.com/video_' + str(countid),
    }
    res = requests.get(url, headers=headers_id)
    res.encoding = 'utf-8'
    text = json.loads(res.text)
    # Fake address returned by the interface
    videoInfo = text['videoInfo']['videos']['srcUrl']
    # Splice the fake address into the real one
    s1 = videoInfo.split("-")[0][0:-13]
    s2 = "cont-" + str(countid) + "-"
    murl = videoInfo.split("-")
    s3 = murl[1] + "-" + murl[2] + "-hd.mp4"
    return s1 + s2 + s3
4
Testing the result
for i in list:
    href = i.xpath('.//div[@class="vervideo-bd"]/a/@href')[0]
    title = i.xpath('.//div[@class="vervideo-title"]/text()')[0]
    # href looks like "video_1721926"; strip the prefix to get the video id
    mp4url = getmp4(href.replace("video_", ""))
    print("title=" + str(title))
    print("mp4 playback address=" + str(mp4url))

This gives us, for each video, (1) the title and (2) the real mp4 link.
5
Downloading the video
### Download a video
import requests

def down(name, url):
    headers_down = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36',
        'cookie': '__secdyid=d95e39d0b5a512e1c35c5fdea59dcdd21b2758d39550e2c8021615635656; JSESSIONID=2DAE0DABD2DE9BB5D05335B2DE3AF8FF; PEAR_UUID=c32ee57d-4445-49f6-859a-a0a0fc054f1b; _uab_collina=161563565701204982722716; p_h5_u=C94D957E-E1CB-4130-9DF3-DDA387A42A8B; Hm_lvt_9707bc8d5f6bba210e7218b8496f076a=1615635658; UM_distinctid=1782b63b68f38-0331dcb7c2707d-5771133-100200-1782b63b690362; acw_tc=76b20f7116156407099735155e6bf791512b521ca0b97962f9cc398bc56d3a; CNZZDATA1260553744=1236015902-1615633517-%7C1615639181; Hm_lpvt_9707bc8d5f6bba210e7218b8496f076a=1615641462; SERVERID=ed8d5ad7d9b044d0dd5993c7c771ef48|1615641673|1615635656',
        'Host': 'video.pearvideo.com',
    }
    r = requests.get(url, headers=headers_down)
    # Note: the "lyc/" directory must already exist
    with open("lyc/" + str(name) + ".mp4", 'wb+') as f:
        f.write(r.content)
With the download code wrapped in a function, a video name and a video link are all that is needed to save the video locally.
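Because the video title is used directly as the file name, a title containing characters such as "/" or "?" could make the save step fail, and so could a missing target folder. The following is a minimal sketch of one way to guard against both; the helper names `safe_filename` and `build_save_path` are hypothetical and not part of the original code.

```python
import os
import re


def safe_filename(title):
    """Replace characters that are not allowed in Windows file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip() + ".mp4"


def build_save_path(title, folder="lyc"):
    """Create the target folder if needed and return the full save path."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe_filename(title))


if __name__ == "__main__":
    print(safe_filename('a/b:c?'))
```

The path returned by `build_save_path(name)` could be passed to `open(...)` in place of the hand-built "lyc/" string.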

OK, that is how to get around Pear Video's anti-scraping mechanism and download videos in bulk with little effort.
Source code: https://gitee.com/lyc96/pear-video-anti-climbing
6
Summary


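1. Get the video list (anti-scraping measure 1: asynchronous loading).
2. Resolve the real mp4 playback link (anti-scraping measure 2: fetch a fake mp4 address by video id, then splice it into the real mp4 address).
3. Download the videos from their mp4 addresses to achieve bulk downloading.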
