TMDB Movie Data Analysis Report

Preface

The basic data analysis workflow:

1. Define the questions
2. Understand the data
3. Clean the data
4. Build models
5. Visualize the data
6. Write the report

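I. Defining the Questions

The main task of this report is to use historical movie data to analyze which kinds of movies are more profitable, to identify likely trends in movie popularity, and to offer suggestions for film production. This breaks down into the following questions:

1. How have movie genres changed over time?
2. How profitable are the different genres?
3. How popular are the different genres?
4. How do ratings compare across genres?
5. How do original movies compare with adaptations?
6. Which factors influence box-office revenue?
7. How do two industry giants, Universal Pictures and Paramount Pictures, compare?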

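II. Understanding the Data

The raw datasets, tmdb_5000_movies and tmdb_5000_credits, were downloaded from Kaggle. The former holds basic movie information in 20 variables; the latter holds cast and crew information in 4 variables. After importing and inspecting the data with the questions above in mind, the following 9 variables were selected for closer analysis:

|No.|Variable|Description|
|------|------|------|
|1|budget|Movie budget (USD)|
|2|genres|Movie genres|
|3|keywords|Movie keywords|
|4|popularity|Popularity|
|5|production_companies|Production companies|
|6|release_year|Release year (derived from release_date)|
|7|revenue|Box-office revenue (USD)|
|8|vote_average|Average rating|
|9|vote_count|Number of ratings|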

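III. Data Cleaning

For this dataset, data cleaning consists of three steps: data preprocessing, feature extraction, and feature selection.

1. Data preprocessing:
   (1) Inspecting the dataset shows missing values in the 'runtime' column (the code below fills two rows), one missing value in 'release_date', and 30 missing values in 'homepage'. Only 'release_date' and 'runtime' are filled. The method is to locate the affected movie by its index, look up the correct value online, and fill it in (see the code below).
   (2) Convert the 'release_date' column to a date type and extract the year from it.
2. Feature extraction:
   (1) In the credits dataset, the cast and crew columns are stored as JSON, so the actors and the director have to be extracted from them. In the movies dataset, genres, keywords, and production_companies are also JSON and need to be converted to plain strings.
   (2) Procedure: json.loads turns each JSON string into a list of dictionaries of the form "[{},{},{}]". Each dictionary is then visited, the value stored under the key 'name' is taken, and the values are joined with "," to form a "multiple-choice" string. When a specific question is analyzed, the multiple-choice string is encoded as dummy variables: every distinct option becomes a new variable that is 1 for an observation containing that option and 0 otherwise.
3. Feature selection: before analyzing each question, select the variables best suited to it. In practice this means building a new DataFrame that holds only the variables under analysis instead of scribbling on the original DataFrame.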
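
The feature-extraction procedure above can be sketched without pandas. This is a minimal illustration, not the report's own code: `raw_genres` is a made-up cell shaped like the JSON in the genres column, and `names_from_json` and `to_dummies` are hypothetical helper names.

```python
import json

# Hypothetical raw cell, shaped like the JSON strings stored in the genres column
raw_genres = '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]'

def names_from_json(cell):
    """Parse a JSON list of dicts and join their 'name' values with commas."""
    return ','.join(item['name'] for item in json.loads(cell))

def to_dummies(cells):
    """Encode comma-separated 'multiple-choice' strings as 0/1 indicator rows."""
    # Every distinct non-empty option becomes one indicator variable
    options = sorted({opt for cell in cells for opt in cell.split(',') if opt})
    return [{opt: int(opt in cell.split(',')) for opt in options} for cell in cells]

joined = names_from_json(raw_genres)
rows = to_dummies([joined, 'Drama', ''])
print(joined)
print(rows)
```

Membership is tested against the split list rather than the raw string, so an option never matches merely because it is a substring of another option.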

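IV. Data Visualization

This analysis covers only basic descriptive and correlation analysis of the dataset. What would otherwise be the model-building step is done together with feature selection and the construction of new DataFrames; the case study does not involve machine learning, so no model is built. The chart types used are line charts, bar charts, histograms, pie charts, scatter plots, and word clouds (see the code below).

V. Writing the Analysis Report

Code:

Import the packages and read the datasets: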

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from pandas import DataFrame, Series
import json
from wordcloud import WordCloud, STOPWORDS
# SimHei is a CJK-capable font; it is only needed when plot labels contain Chinese text
plt.rcParams['font.sans-serif'] = ['SimHei']
# Read the datasets: movie information and cast/crew information
movies = pd.read_csv('tmdb_5000_movies.csv', encoding='utf_8')
credits = pd.read_csv('tmdb_5000_credits.csv', encoding='utf_8')

Parse the JSON columns, merge the two tables into one, and drop the fields that are not needed:

# Convert the JSON columns to plain strings
# credits: parse the JSON columns
json_cols = ['cast', 'crew']
for col in json_cols:
    credits[col] = credits[col].apply(json.loads)

def get_names(x):
    """Join the 'name' value of every dict in x into one comma-separated string."""
    return ','.join([i['name'] for i in x])

# Extract the cast
credits['cast'] = credits['cast'].apply(get_names)
credits.head()

def get_directors(x):
    """Return the first crew member whose job is 'Director', or None if there is none."""
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return None

# Extract the director
credits['crew'] = credits['crew'].apply(get_directors)
# Rename the 'crew' column to 'director'
credits.rename(columns={'crew': 'director'}, inplace=True)

# movies: parse the JSON columns and reuse get_names to flatten them
json_cols = ['genres', 'keywords', 'spoken_languages', 'production_companies', 'production_countries']
for col in json_cols:
    movies[col] = movies[col].apply(json.loads).apply(get_names)

# Merge the data
# Both tables carry an id and a title; check row by row that the titles match
(movies['title'] == credits['title']).describe()
# Drop the duplicate field
del movies['title']
# Inner-join the two tables on the movie id
df = credits.merge(right=movies, how='inner', left_on='movie_id', right_on='id')
# Drop the fields that the analysis does not need
df.drop(columns=['overview', 'original_title', 'id', 'homepage', 'spoken_languages', 'tagline'], inplace=True)

Fill in the missing values and extract the year:

# Fill in the missing values
# First locate the record with a missing release date
df[df.release_date.isnull()]
# Then fill in the release date found by looking the movie up online
df['release_date'] = df['release_date'].fillna('2014-06-01')
# Runtime is handled the same way; assign only the runtime cell of each affected row
df.loc[2656, 'runtime'] = 94
df.loc[4140, 'runtime'] = 81
# Convert the date format and keep only the year
df['release_year'] = pd.to_datetime(df.release_date, format='%Y-%m-%d').dt.year
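
When a single known cell is being filled, the choice between a row-wise `fillna` and a label-based assignment matters. The sketch below uses a made-up two-row frame (`toy` and its columns are illustrative, not taken from the dataset) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is missing both its runtime and its tagline
toy = pd.DataFrame({
    'title': ['A', 'B'],
    'runtime': [np.nan, 120.0],
    'tagline': [None, 'x'],
})

# Row-wise fillna replaces every missing cell in the row, not just runtime
row_filled = toy.loc[0].fillna(94)

# Label-based assignment changes exactly one cell and leaves the rest alone
toy.loc[0, 'runtime'] = 94

print(row_filled)
print(toy)
```

The row-wise version also writes 94 into the tagline, which is why addressing the cell by row label and column name is the safer pattern here.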

Number of movies in each genre, and how the counts change over time:

# Collect the set of genre names
genre = set()
for i in df['genres'].str.split(','):
    genre = genre.union(i)
# Drop the empty string produced by movies without a genre, then convert to a sorted list
genre.discard('')
genre = sorted(genre)
# One-hot encode the genres
for genr in genre:
    df[genr] = df['genres'].str.contains(genr).astype(int)
df_gy = df.loc[:, genre]
df_gy.index = df['release_year']
# Total number of movies in each genre
df_gysum = df_gy.sum().sort_values(ascending=True)
df_gysum.plot.barh(label='genre', figsize=(10,6))
plt.xlabel('Count', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.title('Total number of movies by genre', fontsize=20)
plt.grid(False)
# Genre counts over time (groupby already returns the years in ascending order)
df_gys = df_gy.groupby('release_year').sum()
df_sub_gys = df_gys[['Drama', 'Comedy', 'Thriller', 'Romance',
                     'Adventure', 'Crime', 'Science Fiction', 'Horror']].loc[1960:, :]
plt.figure(figsize=(10,6))
plt.plot(df_sub_gys)
plt.legend(df_sub_gys.columns)
# The plotted data starts in 1960, so the ticks start there as well
plt.xticks(range(1960, 2018, 10))
plt.xlabel('Year', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Movie genres over time', fontsize=20)
plt.show()
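
As an alternative to the explicit loop, pandas can build the indicator columns in one call with `Series.str.get_dummies`. The sketch below runs on a made-up three-row frame (`toy` is illustrative); whether it is preferable to the loop above is a matter of taste.

```python
import pandas as pd

# Toy data: comma-separated genre strings, one of them empty
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2001],
    'genres': ['Drama,Comedy', 'Drama', ''],
})

# One 0/1 column per distinct genre; empty strings produce no column
dummies = toy['genres'].str.get_dummies(sep=',')

# Yearly genre counts, analogous to df_gys in the code above
per_year = dummies.groupby(toy['release_year']).sum()

print(dummies)
print(per_year)
```

Because `get_dummies` splits on the separator before matching, it compares whole genre names and needs no separate step to remove the empty string.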

Popularity of the different movie genres:

# Mean popularity of the movies in each genre, in a DataFrame indexed by genre
mean_popularity = [df.loc[df[genr] == 1, 'popularity'].mean() for genr in genre]
df_gen_popu = pd.DataFrame({'mean_popularity': mean_popularity}, index=genre)
df_gen_popu.sort_values(by='mean_popularity', ascending=True).plot.barh(figsize=(10,6))
plt.xlabel('Popularity', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.title('Popularity by genre', fontsize=20)
plt.grid(False)
plt.show()
# Keyword analysis
# Join the per-movie keyword strings into one long string, keeping a separator between movies
lis = ','.join(df['keywords'])
# str.replace returns a new string, so the result has to be assigned back
lis = lis.replace('\'s', '')
# Set the stop words
stopwords = set(STOPWORDS)
stopwords.add('film')
wordcloud = WordCloud(
                background_color='white',
                stopwords=stopwords,
                max_words=3000,
                scale=1).generate(lis)
plt.figure(figsize=(10,6))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
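
The separator used when joining the keyword strings affects the result. The sketch below, on made-up keyword cells, shows how joining with an empty string glues the last keyword of one movie to the first keyword of the next, and how whole keywords can be counted instead.

```python
from collections import Counter

# Toy keyword cells in the same comma-separated form as df['keywords']
keyword_cells = ['space,alien', 'alien,based on novel', '']

# Joining with '' fuses neighbouring keywords across movie boundaries
glued = ''.join(keyword_cells)

# Splitting each cell keeps multi-word keywords intact and skips empty cells
tokens = [kw for cell in keyword_cells for kw in cell.split(',') if kw]
counts = Counter(tokens)

print(glued)
print(counts.most_common(3))
```

A frequency mapping like `counts` could also be handed to `WordCloud.generate_from_frequencies`, which would keep multi-word keywords such as "based on novel" together in the cloud.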

Profitability of the different movie genres:

# Profitability by genre
# Add a profit column
df['profit'] = df['revenue'] - df['budget']
# Profit DataFrame: genre indicator columns plus profit
profit_df = pd.concat([df.loc[:, genre], df['profit']], axis=1)
# Series indexed by genre: total profit of the movies in each genre
profit_sum_by_genre = pd.Series({genr: profit_df.loc[profit_df[genr] == 1, 'profit'].sum() for genr in genre})
# Series indexed by genre: total budget of the movies in each genre
budget_df = pd.concat([df.loc[:, genre], df['budget']], axis=1)
budget_by_genre = pd.Series({genr: budget_df.loc[budget_df[genr] == 1, 'budget'].sum() for genr in genre})
# Combine the two Series side by side
profit_rate = pd.concat([profit_sum_by_genre, budget_by_genre], axis=1)
profit_rate.columns = ['profit', 'budget']
# Add the profit rate: profit as a percentage of budget
profit_rate['profit_rate'] = (profit_rate['profit'] / profit_rate['budget']) * 100
profit_rate.sort_values(by=['profit', 'profit_rate'], ascending=False, inplace=True)
# xl holds the genre labels and x their bar positions
xl = profit_rate.index
x = range(len(xl))
# Plot profit (bars) and profit rate (line) by genre
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(1,1,1)
ax1.bar(x, profit_rate['profit'], label='profit', alpha=0.7)
plt.xticks(x, xl, rotation=60, fontsize=12)
plt.yticks(fontsize=12)
ax1.set_xlabel('Genre', fontsize=15)
ax1.set_ylabel('Profit', fontsize=15)
ax1.set_title('Profitability by genre', fontsize=20)
ax1.set_ylim(0, 1.2e11)

# Show the secondary y-axis tick labels as percentages
import matplotlib.ticker as mtick
ax2 = ax1.twinx()
ax2.plot(x, profit_rate['profit_rate'], 'ro-', lw=2, label='profit_rate')
ax2.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2f%%'))
plt.xticks(x, xl, rotation=60, fontsize=12)
plt.yticks(fontsize=15)
ax2.set_ylabel('Profit rate', fontsize=15)
plt.grid(False)
plt.show()

Average profitability of the different movie genres:

# Average profitability by genre
# Series indexed by genre: mean profit of the movies in each genre
profit_mean_by_genre = pd.Series({genr: profit_df.loc[profit_df[genr] == 1, 'profit'].mean() for genr in genre})
# Series indexed by genre: mean budget of the movies in each genre
budget_df = pd.concat([df.loc[:, genre], df['budget']], axis=1)
budget_mean_by_genre = pd.Series({genr: budget_df.loc[budget_df[genr] == 1, 'budget'].mean() for genr in genre})
# Combine the two Series side by side
profit_rate_mean = pd.concat([profit_mean_by_genre, budget_mean_by_genre], axis=1)
profit_rate_mean.columns = ['mean_profit', 'mean_budget']
# Add the profit rate: mean profit as a percentage of mean budget
profit_rate_mean['mean_profit_rate'] = (profit_rate_mean['mean_profit'] / profit_rate_mean['mean_budget']) * 100
profit_rate_mean.sort_values(by=['mean_profit', 'mean_profit_rate'], ascending=False, inplace=True)
# xl holds the genre labels and x their bar positions
xl = profit_rate_mean.index
x = range(len(xl))
# Plot mean profit (bars) and profit rate (line) by genre
fig = plt.figure(figsize=(10,6))
ax3 = fig.add_subplot(1,1,1)
ax3.bar(x, profit_rate_mean['mean_profit'], label='mean_profit', alpha=0.7)
plt.xticks(x, xl, rotation=60, fontsize=12)
plt.yticks(fontsize=12)
ax3.set_xlabel('Genre', fontsize=15)
ax3.set_ylabel('Mean profit', fontsize=15)
ax3.set_title('Average profitability by genre', fontsize=20)

# Show the secondary y-axis tick labels as percentages
import matplotlib.ticker as mtick
ax4 = ax3.twinx()
ax4.plot(x, profit_rate_mean['mean_profit_rate'], 'ro-', lw=2, label='mean_profit_rate')
ax4.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2f%%'))
plt.xticks(x, xl, rotation=60, fontsize=12)
plt.yticks(fontsize=15)
ax4.set_ylabel('Mean profit rate', fontsize=15)
plt.grid(False)
plt.show()
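
One property of the rate computed above is worth checking by hand: dividing mean profit by mean budget gives the same number as dividing total profit by total budget, because the film count cancels. The sketch below uses made-up (budget, revenue) pairs to show this, and contrasts it with the average of per-film rates, which is a different quantity.

```python
# Hypothetical (budget, revenue) pairs for the films of one genre
films = [(100, 300), (50, 60), (10, 100)]

budgets = [b for b, _ in films]
profits = [r - b for b, r in films]
n = len(films)

# Rate from totals, as in the total-profit analysis
rate_of_totals = sum(profits) / sum(budgets) * 100
# Rate from means, as in the average-profit analysis
rate_of_means = (sum(profits) / n) / (sum(budgets) / n) * 100
# Average of the per-film rates, which weights every film equally
mean_of_rates = sum(p / b for p, b in zip(profits, budgets)) / n * 100

print(rate_of_totals, rate_of_means, mean_of_rates)
```

If the intent is the typical return of a single film in a genre, the per-film average (or a median, which is less sensitive to tiny budgets) may be the more informative figure.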

Budget of the different movie genres:

# Plot the mean budget by genre
profit_rate_mean.sort_values(by=['mean_budget'], ascending=False, inplace=True)
xl = profit_rate_mean.index
x = range(len(xl))
fig = plt.figure(figsize=(10,6))
ax5 = fig.add_subplot(1,1,1)
ax5.bar(x, profit_rate_mean['mean_budget'], label='mean_budget', alpha=0.7)
plt.xticks(x, xl, rotation=60, fontsize=12)
plt.yticks(fontsize=12)
ax5.set_xlabel('Genre', fontsize=15)
ax5.set_ylabel('Mean budget', fontsize=15)
ax5.set_title('Mean budget by genre', fontsize=20)
plt.show()


Average rating of the different movie genres:

# Mean vote_average of the movies in each genre
vote_avg_df = pd.concat([df.loc[:, genre], df['vote_average']], axis=1)
voteavg_mean_list = [vote_avg_df.loc[vote_avg_df[genr] == 1, 'vote_average'].mean() for genr in genre]
# DataFrame of mean rating by genre
voteavg_mean_by_genre = pd.DataFrame({'voteavg_mean': voteavg_mean_list}, index=genre)
# Correlation between popularity and average rating
df['popularity'].corr(df['vote_average'])
# Plot the mean rating by genre
voteavg_mean_by_genre.sort_values(by='voteavg_mean', ascending=False, inplace=True)
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
voteavg_mean_by_genre.plot.bar(ax=ax)
plt.title('Mean rating by genre', fontsize=20)
plt.xlabel('Genre', fontsize=15)
plt.ylabel('Mean rating', fontsize=15)
plt.xticks(rotation=45)
# ylim takes only the lower and upper limit
plt.ylim(5, 7)

# Plot the rating distribution over all movies
# histplot replaces the deprecated distplot; the x-axis is the rating, the y-axis the density
fig, ax = plt.subplots(figsize=(8,5))
sns.histplot(df['vote_average'], bins=10, kde=True, stat='density', ax=ax)
plt.title('Distribution of average ratings', fontsize=20)
plt.xlabel('Average rating', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.xticks(np.arange(11))
plt.grid(True)
plt.show()

Original movies versus adaptations:

# Original movies vs. adaptations
# A movie counts as an adaptation when its keywords contain 'based on'
original_novel = pd.DataFrame()
original_novel['keywords'] = df['keywords'].str.contains('based on').astype(int)
original_novel['profit'] = df['profit']
novel_count = original_novel['keywords'].sum()
original_count = original_novel['keywords'].count() - novel_count
original_novel = original_novel.groupby('keywords', as_index=False).mean()
# Comparison DataFrame; row 0 is original (keywords == 0), row 1 is adapted (keywords == 1)
org_vs_novel = pd.DataFrame()
org_vs_novel['count'] = [original_count, novel_count]
org_vs_novel['profit'] = original_novel['profit']
org_vs_novel.index = ['original works', 'based_on_novel']
# Plot the share by count (pie chart) and the mean profit per movie (bar chart)
fig = plt.figure(figsize=(12,5))
ax1 = plt.subplot(1, 2, 1)
ax1.pie(org_vs_novel['count'], labels=org_vs_novel.index, autopct='%.2f%%', startangle=90, pctdistance=0.6)
plt.title('Original vs. adapted movies: share by count', fontsize=15)
ax2 = plt.subplot(1, 2, 2)
org_vs_novel['profit'].plot.bar(ax=ax2)
plt.xticks(rotation=0)
plt.ylabel('Mean profit', fontsize=12)
plt.title('Original vs. adapted movies: mean profit', fontsize=15)
plt.grid(False)
plt.show()

Factors that influence box-office revenue:

# Use correlations to see which factors are related to revenue
df[['budget', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']].corr()
# The correlations show that budget, popularity, and vote count are the most strongly related to revenue
revenue_corr = df[['popularity', 'vote_count', 'budget', 'revenue']]
# Scatter plots with regression lines: revenue against popularity (blue), vote count (green), and budget (red)
fig = plt.figure(figsize=(12,5))
ax1 = plt.subplot(1,3,1)
sns.regplot(x='popularity', y='revenue', data=revenue_corr, x_jitter=0.1, ax=ax1)
ax1.text(400, 3e9, 'r=0.64', fontsize=15)
plt.title('Popularity vs. revenue', fontsize=15)
plt.xlabel('Popularity', fontsize=12)
plt.ylabel('Box-office revenue', fontsize=12)

ax2 = plt.subplot(1,3,2)
sns.regplot(x='vote_count', y='revenue', data=revenue_corr, x_jitter=0.1, color='g', marker='+', ax=ax2)
ax2.text(5800, 2.2e9, 'r=0.78', fontsize=15)
plt.title('Vote count vs. revenue', fontsize=15)
plt.xlabel('Vote count', fontsize=12)
plt.ylabel('Box-office revenue', fontsize=12)

ax3 = plt.subplot(1,3,3)
sns.regplot(x='budget', y='revenue', data=revenue_corr, x_jitter=0.1, color='r', marker='^', ax=ax3)
ax3.text(1.6e8, 2.2e9, 'r=0.73', fontsize=15)
plt.title('Budget vs. revenue', fontsize=15)
plt.xlabel('Budget', fontsize=12)
plt.ylabel('Box-office revenue', fontsize=12)
plt.show()
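
The r values written onto the plots above are Pearson correlation coefficients. As a reminder of what that number measures, here is a small stand-alone sketch of the formula; `pearson_r` is a hypothetical helper and the two lists are made-up data, not values from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy data: revenue rises almost linearly with budget
budget = [1.0, 2.0, 3.0, 4.0]
revenue = [2.0, 4.1, 5.9, 8.2]

r = pearson_r(budget, revenue)
print(round(r, 3))
```

In the analysis itself, the same quantity comes from `Series.corr`, for example `df['budget'].corr(df['revenue'])`, so the plot annotations could be formatted from the computed values rather than typed in by hand.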

Comparison of the two industry giants:

# Company indicator columns
company_list = ['Universal Pictures', 'Paramount Pictures']
df_company = pd.DataFrame()
for company in company_list:
    df_company[company] = df['production_companies'].str.contains(company).astype(int)
df_company = pd.concat([df['release_year'], df_company, df.loc[:, genre], df['revenue'], df['profit']], axis=1)
# Comparison DataFrame: one row per company with its genre counts, total revenue, and total profit
Uni_vs_Para = pd.DataFrame({
    company: df_company.loc[df_company[company] == 1, genre + ['revenue', 'profit']].sum()
    for company in company_list
}).T
# Plot the total box-office revenue of the two companies
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(1,1,1)
Uni_vs_Para['revenue'].plot(ax=ax, kind='bar')
plt.title('Universal vs. Paramount: total box-office revenue', fontsize=15)
plt.xticks(rotation=0)
plt.ylabel('Box-office revenue', fontsize=12)
plt.grid(False)
plt.show()
# Profit DataFrame for the two companies, indexed by release year
df_company_profit = df_company[['Universal Pictures', 'Paramount Pictures', 'profit']].reset_index(drop=True)
df_company_profit.index = df['release_year']
# Keep each company's own profit and total it by year
df_company_profit['Universal Pictures_profit'] = df_company_profit['Universal Pictures'] * df_company_profit['profit']
df_company_profit['Paramount Pictures_profit'] = df_company_profit['Paramount Pictures'] * df_company_profit['profit']
company1 = df_company_profit['Universal Pictures_profit'].groupby('release_year').sum()
company2 = df_company_profit['Paramount Pictures_profit'].groupby('release_year').sum()
# Line chart of each company's total profit per year
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(1,1,1)
company1.plot(label='Universal Pictures', ax=ax1)
company2.plot(label='Paramount Pictures', ax=ax1)
plt.title('Universal vs. Paramount: total profit per year', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Profit', fontsize=12)
plt.legend(fontsize=12)
plt.show()

def top_genres(counts, n=8):
    """Keep the n largest genre counts and pool the remainder as 'others'."""
    counts = counts.sort_values(ascending=False)
    top = counts.iloc[:n].copy()
    top['others'] = counts.iloc[n:].sum()
    return top.sort_values(ascending=True)

# Use only the genre columns so that revenue and profit do not end up in the pie charts
Universal = top_genres(Uni_vs_Para.loc['Universal Pictures', genre])
Paramount = top_genres(Uni_vs_Para.loc['Paramount Pictures', genre])
# Plot the genre mix of the two companies
fig = plt.figure(figsize=(13, 6))
ax1 = plt.subplot(1,2,1)
ax1.pie(Universal, labels=Universal.index, autopct='%.2f%%', startangle=90, pctdistance=0.75)
plt.title('Universal Pictures', fontsize=20)
ax2 = plt.subplot(1,2,2)
ax2.pie(Paramount, labels=Paramount.index, autopct='%.2f%%', startangle=90, pctdistance=0.75)
plt.title('Paramount Pictures', fontsize=20)
plt.show()
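
The company indicators above rely on substring matching, which also flags any company whose name merely contains the target name. The sketch below contrasts substring matching with exact matching on the split list; the cells are illustrative strings and have not been checked against the dataset.

```python
# Illustrative production_companies cells in the comma-separated form used above
cells = [
    'Universal Pictures,Amblin Entertainment',
    'Universal Pictures International',
    'Paramount Pictures',
]
target = 'Universal Pictures'

# Substring test: also matches the longer company name in the second cell
substring_hits = [int(target in cell) for cell in cells]

# Exact test: the target must be one of the comma-separated names
exact_hits = [int(target in cell.split(',')) for cell in cells]

print(substring_hits)
print(exact_hits)
```

Which behaviour is wanted depends on whether subsidiaries and regional branches should count toward the parent studio; the point is only that the choice is worth making deliberately.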