This article uses Python to scrape the Douban Movie Top 250 list and analyze the data. The fields collected are rank, movie title, director, release year, country of production, genre, rating, number of ratings, and short quote.

Copyright notice: this is an original article by the blogger, and it took real effort to write. Article link: https://beishan.blog.csdn.net/article/details/112735850

Data scraping

Pagination

Page 1: https://movie.douban.com/top250
Page 2: https://movie.douban.com/top250?start=25&filter=
Page 3: https://movie.douban.com/top250?start=50&filter=

The only thing that changes from page to page is the start parameter, which grows by 25 each time, so that is all we need to modify.
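As a minimal sketch of that observation, the full list of page URLs can be generated up front. The helper name `build_page_urls` and its `total` and `page_size` parameters are illustrative and not part of the original script; the sketch assumes the list holds 250 entries at 25 per page.

```python
BASE_URL = "https://movie.douban.com/top250?start={}&filter="

def build_page_urls(total=250, page_size=25):
    """Return one URL per page; start is the offset of the first movie on that page."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]

urls = build_page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page
print(urls[-1])    # last page
```

Note that the range stops before 250: an offset of 250 would request a page past the end of the list.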

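Anti-scraping checks based on headers fields

The request headers contain many fields, and the server may inspect any of them to decide whether a request comes from a crawler.

1.1 Anti-scraping via the User-Agent field

- How it works: by default a crawler does not send a browser User-Agent; it sends the default value set by the HTTP module.
- Solution: add a User-Agent before sending the request. A better approach is a User-Agent pool (collect a batch of User-Agent strings, or generate them at random).

1.2 Anti-scraping via the Referer field or other fields

- How it works: by default a crawler does not send a Referer field. The server looks at where the request originated to decide whether it is legitimate.
- Solution: add the Referer field.

1.3 Anti-scraping via cookies

- How it works: the server checks the cookies to see whether the user making the request has the required permissions.
- Solution: simulate a login, and scrape the data once the cookies have been obtained.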
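The User-Agent pool and Referer ideas can be sketched as follows. The pool contents and the helper name `build_headers` are assumptions made for illustration; in practice the pool would hold User-Agent strings collected from real browsers.

```python
import random

# Hypothetical pool of browser User-Agent strings.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0",
]

def build_headers(referer="https://movie.douban.com/top250"):
    """Pick a random User-Agent from the pool and attach a Referer."""
    return {
        "User-Agent": random.choice(USER_AGENT_POOL),
        "Referer": referer,
    }

headers = build_headers()
print(headers)
```

The resulting dict can be passed as the `headers` argument of `requests.get`, so each request may present a different User-Agent.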

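Anti-scraping checks based on request parameters

There are many ways to obtain request parameters. Requests sent to a server often have to carry parameters, and the server can usually check whether they are correct to decide whether the client is a crawler.

2.1 Request data taken from static HTML files (GitHub login data)

- How it works: the site makes the request parameters harder to obtain.
- Solution: analyze every captured packet carefully and work out how the requests relate to each other.

2.2 Request data obtained by sending another request

- How it works: the site makes the request parameters harder to obtain.
- Solution: analyze every captured packet carefully, work out how the requests relate to each other, and find out where the request parameters come from.

2.3 Request parameters generated by JavaScript

- How it works: JavaScript generates the request parameters.
- Solution: analyze the JavaScript, follow how the encryption is implemented, and use js2py to obtain the result of running the JavaScript, or use selenium.

2.4 Anti-scraping via captchas

- How it works: the server pops up a captcha to force verification of the user's browsing behavior.
- Solution: recognize the captcha with a captcha-solving platform or a machine-learning approach. Captcha-solving platforms are cheap and easy to use, so they are the more recommended option.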
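To illustrate case 2.1, the sketch below pulls a hidden form field out of a static HTML page so that it can be sent along with a later request. The HTML snippet, the field name `authenticity_token`, and the class name `HiddenInputParser` are made up for the example; a real login page has to be inspected to find its actual fields. The standard-library parser is used here only to keep the example self-contained; lxml would work equally well.

```python
from html.parser import HTMLParser

class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.hidden[attrs["name"]] = attrs.get("value", "")

# Made-up login form, standing in for a downloaded static page.
SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""

parser = HiddenInputParser()
parser.feed(SAMPLE_HTML)

# Merge the hidden fields with the credentials to form the POST payload.
form_data = dict(parser.hidden, login="user", password="secret")
print(form_data)
```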

For this site, adding request headers is all we need.

Locating the data

# -*- coding: utf-8 -*-
# @Author: Kun
import requests
from lxml import etree
import pandas as pd

rows = []  # one row per movie, collected across all pages
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
           'Referer': 'https://movie.douban.com/top250'}
columns = ['排名','電影名稱','導演','上映年份','制作國家','類型','評分','評價分數','短評']

def get_data(html):
    """Parse one list page and append a row for each movie to rows."""
    xp = etree.HTML(html)
    lis = xp.xpath('//*[@id="content"]/div/div[1]/ol/li')
    for li in lis:
        # rank and title
        rank = li.xpath('div/div[1]/em/text()')[0]
        title = li.xpath('div/div[2]/div[1]/a/span[1]/text()')[0]
        # first text node holds director and cast separated by three non-breaking spaces; keep the director part
        director = li.xpath('div/div[2]/div[2]/p[1]/text()')[0].strip().split("\xa0\xa0\xa0")[0]
        # second text node holds "year / country / genres"
        infos = li.xpath('div/div[2]/div[2]/p[1]/text()')[1].strip().replace('\xa0','').split('/')
        dates, areas, genres = infos[0], infos[1], infos[2]
        rating = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
        # drop the three-character suffix that follows the number of ratings
        scores = li.xpath('.//div[@class="star"]/span[4]/text()')[0][:-3]
        # some movies have no short quote
        quotes = li.xpath('.//p[@class="quote"]/span/text()')
        quote = quotes[0] if quotes else None
        rows.append([rank, title, director, dates, areas, genres, rating, scores, quote])

for i in range(0, 250, 25):  # start = 0, 25, ..., 225 covers all 10 pages
    url = "https://movie.douban.com/top250?start={}&filter=".format(i)
    res = requests.get(url, headers=headers)
    get_data(res.text)

# write the file once, after every page has been parsed
pd.DataFrame(rows, columns=columns).to_excel('Top250.xlsx', index=False)

The result is as follows:

Using object-oriented code + threads

# -*- coding: utf-8 -*-
"""
Created on Tue Feb  2 15:19:29 2021

@author: 北山啦
"""
import pandas as pd
import time
import requests
from lxml import etree
from queue import Queue, Empty
from threading import Thread, Lock

class Movie():
    def __init__(self):
        self.df = []
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
                        'Referer': 'https://movie.douban.com/top250'}
        self.columns = ['排名','電影名稱','導演','上映年份','制作國家','類型','評分','評價分數','短評']
        self.lock = Lock()
        self.url_list = Queue()

    def get_url(self):
        # queue up the 10 list pages (start = 0, 25, ..., 225)
        url = 'https://movie.douban.com/top250?start={}&filter='
        for i in range(0, 250, 25):
            self.url_list.put(url.format(i))

    def get_html(self):
        # worker loop: keep taking URLs until the queue is empty
        while True:
            try:
                # get_nowait avoids blocking forever if another thread empties the queue first
                url = self.url_list.get_nowait()
            except Empty:
                break
            resp = requests.get(url, headers=self.headers)
            self.xpath_parse(resp.text)

    def xpath_parse(self, html):
        xp = etree.HTML(html)
        lis = xp.xpath('//*[@id="content"]/div/div[1]/ol/li')
        for li in lis:
            # rank and title
            rank = li.xpath('div/div[1]/em/text()')[0]
            title = li.xpath('div/div[2]/div[1]/a/span[1]/text()')[0]
            # first text node holds director and cast separated by three non-breaking spaces; keep the director part
            director = li.xpath('div/div[2]/div[2]/p[1]/text()')[0].strip().split("\xa0\xa0\xa0")[0]
            # second text node holds "year / country / genres"
            infos = li.xpath('div/div[2]/div[2]/p[1]/text()')[1].strip().replace('\xa0','').split('/')
            dates, areas, genres = infos[0], infos[1], infos[2]
            rating = li.xpath('.//div[@class="star"]/span[2]/text()')[0]
            # drop the three-character suffix that follows the number of ratings
            scores = li.xpath('.//div[@class="star"]/span[4]/text()')[0][:-3]
            # some movies have no short quote
            quotes = li.xpath('.//p[@class="quote"]/span/text()')
            quote = quotes[0] if quotes else None
            with self.lock:  # several threads append to the same list
                self.df.append([rank, title, director, dates, areas, genres, rating, scores, quote])

    def main(self):
        start_time = time.time()
        self.get_url()

        th_list = []
        for i in range(5):
            th = Thread(target=self.get_html)
            th.start()
            th_list.append(th)

        for th in th_list:
            th.join()

        # pages finish in arbitrary order, so restore the ranking order and save once
        self.df.sort(key=lambda row: int(row[0]))
        d = pd.DataFrame(self.df, columns=self.columns)
        d.to_excel('douban.xlsx', index=False)
        end_time = time.time()
        print(end_time - start_time)

if __name__ == '__main__':
    spider = Movie()
    spider.main()

Data analysis

Once the data has been collected, you can analyze whatever interests you.

Data preprocessing

import pandas as pd
df = pd.read_excel("Top250.xlsx")
df.head()

The release year format is inconsistent

# keep only the first four characters, e.g. "1961(中國大陸)" -> "1961"
df["上映年份"] = df["上映年份"].astype(str).str[:4]
# number of films per release year, ordered by year
year_counts = df["上映年份"].value_counts().sort_index()
x1 = year_counts.index.tolist()
y1 = [int(n) for n in year_counts.values]  # plain Python ints for pyecharts
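To make the effect of this step concrete, here is a small plain-Python sketch of the same idea: truncate each value to its four-digit year, then count films per year in ascending order. The sample values are hypothetical and only mimic the mixed formats found in the scraped column.

```python
from collections import Counter

# Hypothetical raw values, mimicking the mixed formats in the scraped column.
raw_years = ["1994", "1993", "1994", "1961(中國大陸)", "1997"]

years = [value[:4] for value in raw_years]   # keep the four-digit year
counts = Counter(years)

x = sorted(counts)                 # years in ascending order
y = [counts[year] for year in x]   # number of films per year
print(x)
print(y)
```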

Release year distribution

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.faker import Faker
c1 = (
    Bar()
    .add_xaxis(x1)
    .add_yaxis("Number of films", y1)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Top 250 by release year"),
        datazoom_opts=opts.DataZoomOpts(),  # zoom slider for the year axis
    )
    .render("1.html")  # write the chart to 1.html
)

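- The release years of the films in the Douban Top 250 mostly fall in the 1980s and later, and several years have 10 or more films each.
- Looking at the distribution, most of the highly rated films were released after 1987, and the number grows over time, while the most recent two years have relatively few highly rated films.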

Rating distribution

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.hist(list(df["評分"]),bins=8,facecolor="blue", edgecolor="black", alpha=0.7)
plt.show()

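The chart shows that the ratings are concentrated roughly between 8.4 and 9.2, and films with higher ratings generally rank closer to the top. pandas can also compute the mean, the mode, and the correlation coefficient: the mean rating is 8.83, the mode is 8.7, and the correlation coefficient between rating and rank is -0.6882, a strong negative correlation (a higher rating goes with a smaller rank number).

- Most ratings fall between 8.5 and 9.2. The lowest is 8.3 and the highest is 9.6.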
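The three statistics mentioned here can be sketched in plain Python to show what is being computed. The rank and rating values below are hypothetical stand-ins, not the scraped data, so the printed numbers will differ from the ones quoted in the text. With pandas, `Series.mean()`, `Series.mode()` and `Series.corr()` cover the same ground.

```python
from collections import Counter
from math import sqrt

# Hypothetical (rank, rating) values standing in for the scraped data.
ranks = [1, 2, 3, 4, 5, 6]
ratings = [9.7, 9.6, 9.5, 9.5, 9.4, 9.5]

def mean(values):
    return sum(values) / len(values)

def mode(values):
    """Most frequent value (the first one encountered wins a tie)."""
    return Counter(values).most_common(1)[0][0]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(round(mean(ratings), 2))
print(mode(ratings))
print(round(pearson(ranks, ratings), 4))
```

A negative coefficient means that the rating falls as the rank number grows, which is the sense in which rating and rank are strongly related.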

Rank vs. rating

plt.figure(figsize=(10,5), dpi=100)
# the rows are stored in ranking order, so the row index stands in for the rank
plt.scatter(df.index,df['評分'])
plt.show()

Overall, the higher a film ranks, the more ratings it has and the higher its score.

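Top 10 by number of ratings

# Assumption: df1 is not defined in the original; here it is taken to be
# the 10 films with the most ratings.
df["評價分數"] = pd.to_numeric(df["評價分數"])
df1 = df.sort_values("評價分數", ascending=False).head(10)
c2 = (
    Bar()
    .add_xaxis(df1["電影名稱"].to_list())
    .add_yaxis("Number of ratings", df1["評價分數"].to_list(), color=Faker.rand_color())
    .reversal_axis()  # horizontal bars
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
    .set_global_opts(title_opts=opts.TitleOpts(title="Top 10 films by number of ratings"))
    .render("2.html")
)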

Let's see which films are the most popular. How many of them have you watched?

Director ranking

As you can see, these directors are really impressive.
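The code behind the director ranking is not shown, so the following is only a sketch of one way to produce it with `collections.Counter`. The sample values are hypothetical, and the assumption that each stored value starts with a label followed by a colon comes from how the scraper keeps the raw text; the helper name `clean` is illustrative.

```python
from collections import Counter

# Hypothetical values in the assumed stored format: "<label>: <name>".
directors = [
    "導演: 克里斯托弗·諾蘭 Christopher Nolan",
    "導演: 宮崎駿 Hayao Miyazaki",
    "導演: 克里斯托弗·諾蘭 Christopher Nolan",
    "導演: 史蒂文·斯皮爾伯格 Steven Spielberg",
    "導演: 宮崎駿 Hayao Miyazaki",
    "導演: 克里斯托弗·諾蘭 Christopher Nolan",
]

def clean(name):
    """Drop the leading label before the first colon, if there is one."""
    return name.split(":", 1)[-1].strip()

# Count films per director and keep the three most frequent.
ranking = Counter(clean(d) for d in directors).most_common(3)
for name, count in ranking:
    print(name, count)
```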

Genre chart

from collections import Counter
# each row holds space-separated genres; flatten them into one list of tokens
genres = ' '.join(df['類型']).strip().split()
c = dict(Counter(genres))
c

One erroneous value shows up:

d = c.pop('1978(中國大陸)')

Simply delete it.
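The stray key most likely comes from a film whose info line lists several release years: splitting on "/" and taking the first three items then puts a year into the genre slot. The sample string below is illustrative, and indexing from the right is a suggested alternative rather than what the original script does.

```python
# Illustrative info line with three release years before the country and genres.
info = "1961(中國大陸) / 1964(中國大陸) / 1978(中國大陸) / 中國大陸 / 劇情 動畫 奇幻 古裝"
parts = [p.strip() for p in info.split("/")]

# Positional indexing from the left, as the scraper does:
year_l, country_l, genres_l = parts[0], parts[1], parts[2]
print(genres_l)   # a year ends up in the genre slot

# Indexing from the right keeps country and genres aligned;
# everything before them is a release year.
country_r, genres_r = parts[-2], parts[-1]
years_r = parts[:-2]
print(country_r, "|", genres_r, "|", years_r[0])
```

With this change the bad value would never reach the genre counts, so no manual cleanup would be needed afterwards.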

There are several ways to delete entries from a dictionary.

Method 1: pop(key[, default])

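d = {'a':1,'b':2,'c':3}
# remove the item with key 'a' and assign its value to e1
e1 = d.pop('a')
print(e1)
# if the key does not exist, a default return value can be supplied
e2 = d.pop('m','404')
print(e2)
# if the key does not exist and no default is given, KeyError is raised
try:
    e3 = d.pop('m')
except KeyError as err:
    print('KeyError:', err)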

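Method 2: del d[key]

d = {'a':1,'b':2,'c':3}
# delete the item with the given key
del d['a']
print(d)
# deleting a key that does not exist raises KeyError
try:
    del d['m']
except KeyError as err:
    print('KeyError:', err)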

clear removes all dictionary items at once

d = {'a':1,'b':2,'c':3}
print(d)
# remove all items, leaving d as {}
d.clear()
print(d)

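Statistics

Visualization

from pyecharts.charts import WordCloud
# Assumption: words is not defined in the original; here it is built from
# the genre counts in c as (word, count) pairs.
words = list(c.items())
c3 = (
    WordCloud()
    .add(
        "",
        words,
        word_size_range=[20, 100],
        textstyle_opts=opts.TextStyleOpts(font_family="cursive"),
    )
    .set_global_opts(title_opts=opts.TitleOpts(title="WordCloud: custom text style"))
    .render("wordcloud_custom_font_style.html")
)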