[Python Freelance Case] Finding the Most Similar Employees in a Company with Pandas (100 yuan)
Today, in Teacher Ant's VIP group, one of the members posted a request.

I immediately thought it could be solved with the pandas knowledge we have covered.
Let's look at the member's specific requirement.

Based on that requirement, we built a matching table of data.

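Here I pre-filled the name field with some values.
Time to write the code!
First, import numpy and pandas, two modules that are indispensable for data analysis.
import numpy as np
import pandas as pd
# Fix the random seed so that you and I generate the same random numbers
np.random.seed(666)
# Read our Excel file
data = pd.read_excel("data.xlsx")
Looking at the data, we have sixteen employees; apart from the name, every value is empty.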
data
| | name | Java | Python | AS400 | ITID | Oracle | Bigdata | SQL | Leadership | Management | Creativity | Communication |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 劉備 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 關羽 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 張飛 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 諸葛亮 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 趙云 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 司馬懿 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 孫權 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 曹操 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 張角 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 姜維 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | 司馬昭 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | 司馬師 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | 魏延 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | 徐庶 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | 陸遜 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 15 | 魯肅 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
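Replace the empty values with random numbers between 0 and 1.
def initialize(item):
    # If the current element is empty, replace it with a random number
    if pd.isnull(item):
        return np.random.random()
    # Otherwise return it unchanged (this is the case for the name column)
    else:
        return item
Use applymap to apply the custom function above to every cell in the table (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which is called the same way).
new_data = data.applymap(initialize)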
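As a point of comparison, the same fill can be done without calling a Python function per cell. The sketch below is only an illustration under stated assumptions: the small `data` frame stands in for data.xlsx, and the names `rng`, `skill_cols`, `random_fill` and `filled` are hypothetical. It uses a seeded `np.random.default_rng` generator, so its numbers will not match the ones shown in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.xlsx: a name column plus mostly empty skill columns.
data = pd.DataFrame({
    "name": ["A", "B", "C"],
    "Java": [np.nan, np.nan, np.nan],
    "Python": [np.nan, 0.5, np.nan],
})

rng = np.random.default_rng(666)  # seeded generator, so reruns give the same values

skill_cols = [c for c in data.columns if c != "name"]
# One random value in [0, 1) per cell, aligned to the same index and columns.
random_fill = pd.DataFrame(
    rng.random((len(data), len(skill_cols))),
    index=data.index,
    columns=skill_cols,
)

filled = data.copy()
# fillna with a DataFrame fills only the missing cells; existing values are kept.
filled[skill_cols] = filled[skill_cols].fillna(random_fill)
print(filled)
```

The cell that already held 0.5 is left alone, which mirrors the "return it unchanged" branch of `initialize`.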
Looking at the constructed data new_data, we can see it now holds the expected values.
new_data
| | name | Java | Python | AS400 | ITID | Oracle | Bigdata | SQL | Leadership | Management | Creativity | Communication |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | 0.577906 | 0.273414 |
| 1 | 關羽 | 0.155675 | 0.440437 | 0.554369 | 0.147062 | 0.121957 | 0.378483 | 0.868364 | 0.670942 | 0.700497 | 0.719190 | 0.035426 |
| 2 | 張飛 | 0.138446 | 0.237509 | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | 0.802278 | 0.424132 |
| 3 | 諸葛亮 | 0.045878 | 0.295690 | 0.875218 | 0.014569 | 0.041040 | 0.406339 | 0.696076 | 0.306823 | 0.178275 | 0.006340 | 0.132899 |
| 4 | 趙云 | 0.926776 | 0.158741 | 0.212268 | 0.999313 | 0.060130 | 0.593934 | 0.296281 | 0.425722 | 0.665509 | 0.910555 | 0.824788 |
| 5 | 司馬懿 | 0.158174 | 0.686519 | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | 0.236982 | 0.178578 |
| 6 | 孫權 | 0.234794 | 0.854272 | 0.953005 | 0.973668 | 0.483947 | 0.904404 | 0.803289 | 0.522370 | 0.949673 | 0.096465 | 0.439397 |
| 7 | 曹操 | 0.218246 | 0.707425 | 0.881069 | 0.376662 | 0.801466 | 0.755265 | 0.291798 | 0.938050 | 0.485826 | 0.346437 | 0.066456 |
| 8 | 張角 | 0.498340 | 0.563020 | 0.019941 | 0.686956 | 0.438963 | 0.942209 | 0.535298 | 0.068399 | 0.169205 | 0.638146 | 0.246636 |
| 9 | 姜維 | 0.866382 | 0.336534 | 0.459816 | 0.121275 | 0.117527 | 0.086200 | 0.724672 | 0.041077 | 0.792852 | 0.620457 | 0.794938 |
| 10 | 司馬昭 | 0.387280 | 0.911111 | 0.361018 | 0.024711 | 0.352168 | 0.874307 | 0.560931 | 0.066454 | 0.303852 | 0.267894 | 0.524910 |
| 11 | 司馬師 | 0.069331 | 0.654498 | 0.133358 | 0.349297 | 0.092552 | 0.527592 | 0.559391 | 0.041385 | 0.411011 | 0.391010 | 0.340453 |
| 12 | 魏延 | 0.057241 | 0.201021 | 0.491702 | 0.580754 | 0.186123 | 0.807221 | 0.324736 | 0.729737 | 0.920088 | 0.080613 | 0.537976 |
| 13 | 徐庶 | 0.036221 | 0.205992 | 0.953927 | 0.383384 | 0.494315 | 0.416861 | 0.089345 | 0.640795 | 0.513479 | 0.412078 | 0.759755 |
| 14 | 陸遜 | 0.570870 | 0.410881 | 0.757321 | 0.260093 | 0.916110 | 0.689473 | 0.087644 | 0.199368 | 0.570718 | 0.741445 | 0.307660 |
| 15 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | 0.135349 | 0.436991 |
Next, I want to join the table with itself to make the calculation easier, so I add a shared column, one, to join on.
new_data["one"] = 1
new_data.head(5)
| | name | Java | Python | AS400 | ITID | Oracle | Bigdata | SQL | Leadership | Management | Creativity | Communication | one |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | 0.577906 | 0.273414 | 1 |
| 1 | 關羽 | 0.155675 | 0.440437 | 0.554369 | 0.147062 | 0.121957 | 0.378483 | 0.868364 | 0.670942 | 0.700497 | 0.719190 | 0.035426 | 1 |
| 2 | 張飛 | 0.138446 | 0.237509 | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | 0.802278 | 0.424132 | 1 |
| 3 | 諸葛亮 | 0.045878 | 0.295690 | 0.875218 | 0.014569 | 0.041040 | 0.406339 | 0.696076 | 0.306823 | 0.178275 | 0.006340 | 0.132899 | 1 |
| 4 | 趙云 | 0.926776 | 0.158741 | 0.212268 | 0.999313 | 0.060130 | 0.593934 | 0.296281 | 0.425722 | 0.665509 | 0.910555 | 0.824788 | 1 |
Run the self-join. The joined table has 256 rows, that is 16*16, as expected.
Note that, to tell the two sources apart, pandas appends _x to columns that come from the left table and _y to columns that come from the right table.
new_data_merge = pd.merge(left=new_data, right=new_data, left_on="one", right_on="one")
new_data_merge
| | name_x | Java_x | Python_x | AS400_x | ITID_x | Oracle_x | Bigdata_x | SQL_x | Leadership_x | Management_x | ... | Python_y | AS400_y | ITID_y | Oracle_y | Bigdata_y | SQL_y | Leadership_y | Management_y | Creativity_y | Communication_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | ... | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | 0.577906 | 0.273414 |
| 1 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | ... | 0.440437 | 0.554369 | 0.147062 | 0.121957 | 0.378483 | 0.868364 | 0.670942 | 0.700497 | 0.719190 | 0.035426 |
| 2 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | ... | 0.237509 | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | 0.802278 | 0.424132 |
| 3 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | ... | 0.295690 | 0.875218 | 0.014569 | 0.041040 | 0.406339 | 0.696076 | 0.306823 | 0.178275 | 0.006340 | 0.132899 |
| 4 | 劉備 | 0.036712 | 0.354109 | 0.897044 | 0.510947 | 0.847097 | 0.819443 | 0.788641 | 0.285868 | 0.994770 | ... | 0.158741 | 0.212268 | 0.999313 | 0.060130 | 0.593934 | 0.296281 | 0.425722 | 0.665509 | 0.910555 | 0.824788 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 251 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.654498 | 0.133358 | 0.349297 | 0.092552 | 0.527592 | 0.559391 | 0.041385 | 0.411011 | 0.391010 | 0.340453 |
| 252 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.201021 | 0.491702 | 0.580754 | 0.186123 | 0.807221 | 0.324736 | 0.729737 | 0.920088 | 0.080613 | 0.537976 |
| 253 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.205992 | 0.953927 | 0.383384 | 0.494315 | 0.416861 | 0.089345 | 0.640795 | 0.513479 | 0.412078 | 0.759755 |
| 254 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.410881 | 0.757321 | 0.260093 | 0.916110 | 0.689473 | 0.087644 | 0.199368 | 0.570718 | 0.741445 | 0.307660 |
| 255 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | 0.135349 | 0.436991 |
256 rows × 25 columns
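The helper column is one way to get every pair of rows. As an alternative sketch, assuming pandas 1.2 or later, `merge` accepts `how="cross"`, which builds the same Cartesian product without a join key. The three-row `people` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of new_data, before any helper column is added.
people = pd.DataFrame({"name": ["A", "B", "C"], "Java": [0.1, 0.5, 0.9]})

# how="cross" pairs every left row with every right row; overlapping column
# names still receive the default _x / _y suffixes.
pairs = people.merge(people, how="cross")

print(pairs.shape)          # (9, 4): 3 * 3 rows, two columns from each side
print(list(pairs.columns))  # ['name_x', 'Java_x', 'name_y', 'Java_y']
```

With this form there is no one column to remove from the feature list afterwards.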
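Next, collect the feature columns, dropping the name field and the one field that was only used for the self-join.
We can also edit the values in columns to choose which fields to take into account.
columns = list(new_data.columns)
columns.remove("name")
columns.remove("one")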
columns
['Java',
'Python',
'AS400',
'ITID',
'Oracle',
'Bigdata',
'SQL',
'Leadership',
'Management',
'Creativity',
'Communication']
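The similarity function
We use Euclidean distance to represent similarity: the larger the distance, the more the two employees differ and the less similar they are, and vice versa. Strictly speaking, the function below returns the squared Euclidean distance (it skips the square root), which does not change the ranking.
def similarity(row):
    # row is a single row of the merged table
    # Start the accumulated value at 0
    sim_value = 0.0
    # Go through every feature in the row:
    # subtract the paired feature values, square the difference, and sum the squares,
    # which gives the squared Euclidean distance
    for column in columns:
        sim_value += (float(row[column+"_x"]) - float(row[column+"_y"]))**2
    return sim_value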
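Written out for two employees with feature vectors $x$ and $y$ over the $K$ selected columns, the value that `similarity` accumulates is the sum of squared differences, which is the square of the Euclidean distance $d(x, y)$. The numbers in the worked line are made up for illustration and are not taken from the table.

$$
\mathrm{sim}(x, y) = \sum_{k=1}^{K} (x_k - y_k)^2 = d(x, y)^2
$$

$$
x = (0.2,\ 0.9),\quad y = (0.5,\ 0.5):\qquad \mathrm{sim}(x, y) = (0.2 - 0.5)^2 + (0.9 - 0.5)^2 = 0.09 + 0.16 = 0.25,\qquad d(x, y) = 0.5
$$

Because the square root is monotonic, sorting by $\mathrm{sim}$ gives the same order as sorting by $d$.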
Setting axis=1 tells apply to run the similarity calculation on each row of the table.
new_data_merge["sim"] = new_data_merge.apply(similarity, axis=1)
Looking at the table after the calculation, there is a new column, sim (a smaller value means a more similar pair).
new_data_merge.sample(15)
| | name_x | Java_x | Python_x | AS400_x | ITID_x | Oracle_x | Bigdata_x | SQL_x | Leadership_x | Management_x | ... | AS400_y | ITID_y | Oracle_y | Bigdata_y | SQL_y | Leadership_y | Management_y | Creativity_y | Communication_y | sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218 | 徐庶 | 0.036221 | 0.205992 | 0.953927 | 0.383384 | 0.494315 | 0.416861 | 0.089345 | 0.640795 | 0.513479 | ... | 0.361018 | 0.024711 | 0.352168 | 0.874307 | 0.560931 | 0.066454 | 0.303852 | 0.267894 | 0.524910 | 2.002230 |
| 62 | 諸葛亮 | 0.045878 | 0.295690 | 0.875218 | 0.014569 | 0.041040 | 0.406339 | 0.696076 | 0.306823 | 0.178275 | ... | 0.757321 | 0.260093 | 0.916110 | 0.689473 | 0.087644 | 0.199368 | 0.570718 | 0.741445 | 0.307660 | 2.315648 |
| 245 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | 0.236982 | 0.178578 | 1.749616 |
| 129 | 張角 | 0.498340 | 0.563020 | 0.019941 | 0.686956 | 0.438963 | 0.942209 | 0.535298 | 0.068399 | 0.169205 | ... | 0.554369 | 0.147062 | 0.121957 | 0.378483 | 0.868364 | 0.670942 | 0.700497 | 0.719190 | 0.035426 | 1.935264 |
| 46 | 張飛 | 0.138446 | 0.237509 | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | ... | 0.757321 | 0.260093 | 0.916110 | 0.689473 | 0.087644 | 0.199368 | 0.570718 | 0.741445 | 0.307660 | 1.645900 |
| 149 | 姜維 | 0.866382 | 0.336534 | 0.459816 | 0.121275 | 0.117527 | 0.086200 | 0.724672 | 0.041077 | 0.792852 | ... | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | 0.236982 | 0.178578 | 3.124456 |
| 65 | 趙云 | 0.926776 | 0.158741 | 0.212268 | 0.999313 | 0.060130 | 0.593934 | 0.296281 | 0.425722 | 0.665509 | ... | 0.554369 | 0.147062 | 0.121957 | 0.378483 | 0.868364 | 0.670942 | 0.700497 | 0.719190 | 0.035426 | 2.615905 |
| 20 | 關羽 | 0.155675 | 0.440437 | 0.554369 | 0.147062 | 0.121957 | 0.378483 | 0.868364 | 0.670942 | 0.700497 | ... | 0.212268 | 0.999313 | 0.060130 | 0.593934 | 0.296281 | 0.425722 | 0.665509 | 0.910555 | 0.824788 | 2.615905 |
| 135 | 張角 | 0.498340 | 0.563020 | 0.019941 | 0.686956 | 0.438963 | 0.942209 | 0.535298 | 0.068399 | 0.169205 | ... | 0.881069 | 0.376662 | 0.801466 | 0.755265 | 0.291798 | 0.938050 | 0.485826 | 0.346437 | 0.066456 | 2.136879 |
| 246 | 魯肅 | 0.412343 | 0.426180 | 0.444081 | 0.238445 | 0.200476 | 0.665652 | 0.848165 | 0.510707 | 0.965271 | ... | 0.953005 | 0.973668 | 0.483947 | 0.904404 | 0.803289 | 0.522370 | 0.949673 | 0.096465 | 0.439397 | 1.155614 |
| 115 | 曹操 | 0.218246 | 0.707425 | 0.881069 | 0.376662 | 0.801466 | 0.755265 | 0.291798 | 0.938050 | 0.485826 | ... | 0.875218 | 0.014569 | 0.041040 | 0.406339 | 0.696076 | 0.306823 | 0.178275 | 0.006340 | 0.132899 | 1.806937 |
| 166 | 司馬昭 | 0.387280 | 0.911111 | 0.361018 | 0.024711 | 0.352168 | 0.874307 | 0.560931 | 0.066454 | 0.303852 | ... | 0.953005 | 0.973668 | 0.483947 | 0.904404 | 0.803289 | 0.522370 | 0.949673 | 0.096465 | 0.439397 | 2.016103 |
| 194 | 魏延 | 0.057241 | 0.201021 | 0.491702 | 0.580754 | 0.186123 | 0.807221 | 0.324736 | 0.729737 | 0.920088 | ... | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | 0.802278 | 0.424132 | 1.701496 |
| 86 | 司馬懿 | 0.158174 | 0.686519 | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | ... | 0.953005 | 0.973668 | 0.483947 | 0.904404 | 0.803289 | 0.522370 | 0.949673 | 0.096465 | 0.439397 | 1.376950 |
| 61 | 諸葛亮 | 0.045878 | 0.295690 | 0.875218 | 0.014569 | 0.041040 | 0.406339 | 0.696076 | 0.306823 | 0.178275 | ... | 0.953927 | 0.383384 | 0.494315 | 0.416861 | 0.089345 | 0.640795 | 0.513479 | 0.412078 | 0.759755 | 1.505521 |
15 rows × 26 columns
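Row-wise `apply` calls a Python function once per pair, which is fine for 256 rows but gets slow as the table grows. A possible vectorized sketch of the same sum of squared differences is shown below. The `people` frame, the `features` list and the use of `how="cross"` (pandas 1.2+) are assumptions made for the example, not part of the original code.

```python
import pandas as pd

# Hypothetical miniature data set with two feature columns.
people = pd.DataFrame({
    "name": ["A", "B", "C"],
    "Java": [0.0, 1.0, 0.0],
    "Python": [0.0, 0.0, 2.0],
})
features = ["Java", "Python"]

pairs = people.merge(people, how="cross")  # all 3 * 3 pairs, suffixed _x / _y

# Pull the left-hand and right-hand features out as aligned NumPy arrays.
x = pairs[[c + "_x" for c in features]].to_numpy(dtype=float)
y = pairs[[c + "_y" for c in features]].to_numpy(dtype=float)

# Squared differences summed across the feature axis: one value per pair.
pairs["sim"] = ((x - y) ** 2).sum(axis=1)
print(pairs[["name_x", "name_y", "sim"]])
```

Here A and B differ only in Java, so their value is 1.0, while B and C differ in both columns and get 1.0 + 4.0 = 5.0.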
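A self-join always contains rows where a person is paired with themselves,
that is, rows where name_x equals name_y.
These rows carry no information, so we drop them.
new_data_merge = new_data_merge[new_data_merge["name_x"] != new_data_merge["name_y"]].copy()
Pick 5 random rows from the cleaned data to check.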
new_data_merge.sample(5)
| | name_x | Java_x | Python_x | AS400_x | ITID_x | Oracle_x | Bigdata_x | SQL_x | Leadership_x | Management_x | ... | AS400_y | ITID_y | Oracle_y | Bigdata_y | SQL_y | Leadership_y | Management_y | Creativity_y | Communication_y | sim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 曹操 | 0.218246 | 0.707425 | 0.881069 | 0.376662 | 0.801466 | 0.755265 | 0.291798 | 0.938050 | 0.485826 | ... | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | 0.802278 | 0.424132 | 1.327640 |
| 37 | 張飛 | 0.138446 | 0.237509 | 0.688492 | 0.986374 | 0.515409 | 0.910156 | 0.170967 | 0.799865 | 0.020661 | ... | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | 0.236982 | 0.178578 | 1.572089 |
| 156 | 姜維 | 0.866382 | 0.336534 | 0.459816 | 0.121275 | 0.117527 | 0.086200 | 0.724672 | 0.041077 | 0.792852 | ... | 0.491702 | 0.580754 | 0.186123 | 0.807221 | 0.324736 | 0.729737 | 0.920088 | 0.080613 | 0.537976 | 2.417639 |
| 92 | 司馬懿 | 0.158174 | 0.686519 | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | ... | 0.491702 | 0.580754 | 0.186123 | 0.807221 | 0.324736 | 0.729737 | 0.920088 | 0.080613 | 0.537976 | 1.720346 |
| 90 | 司馬懿 | 0.158174 | 0.686519 | 0.989481 | 0.328145 | 0.843783 | 0.894061 | 0.314048 | 0.292965 | 0.305031 | ... | 0.361018 | 0.024711 | 0.352168 | 0.874307 | 0.560931 | 0.066454 | 0.303852 | 0.267894 | 0.524910 | 1.065205 |
5 rows × 26 columns
Find the 10 most similar employees for each employee
def get_top_student(df_sub):
    # df_sub is one group of the cleaned data after groupby: all rows for a single employee,
    # for example every row pairing 劉備 with someone else, or every row pairing 曹操 with someone else

    # Sort this group by sim in ascending order (smallest distance first) and keep the first 10 rows
    df_sort = df_sub.sort_values(by="sim", ascending=True).head(10)
    # Collect the names of those ten people
    names = ",".join(list(df_sort["name_y"]))
    # Collect the similarity values of those ten people
    sims = ",".join([str(x) for x in list(df_sort["sim"])])
    # Pack both into a Series and return it to the caller
    return pd.Series({"names": names, "sims": sims})
Group the cleaned table by name_x and call the function above on each group.
result = new_data_merge.groupby("name_x").apply(get_top_student)
The result is shown below.
For example, the ten employees most similar to 關羽 are, in order, 魯肅, 司馬師, 諸葛亮, 劉備, 曹操, 魏延, 姜維, 徐庶, 司馬昭, 陸遜.
Their similarity values are, in order, 0.7735408527877448,1.0708763430514607,1.113466...
result
| name_x | names | sims |
|---|---|---|
| 關羽 | 魯肅,司馬師,諸葛亮,劉備,曹操,魏延,姜維,徐庶,司馬昭,陸遜 | 0.7735408527877448,1.0708763430514607,1.113466... |
| 劉備 | 孫權,司馬懿,陸遜,魯肅,曹操,關羽,魏延,徐庶,司馬師,司馬昭 | 0.9632560988057848,0.9990413893821443,1.099269... |
| 司馬師 | 司馬昭,張角,魯肅,關羽,諸葛亮,魏延,姜維,司馬懿,陸遜,徐庶 | 0.5730340940360173,0.7408606894082557,0.994871... |
| 司馬懿 | 曹操,陸遜,劉備,司馬昭,徐庶,諸葛亮,孫權,張飛,司馬師,張角 | 0.5130746529222359,0.75366469445295,0.99904138... |
| 司馬昭 | 司馬師,張角,司馬懿,魯肅,諸葛亮,陸遜,關羽,姜維,曹操,魏延 | 0.5730340940360173,0.9338616306533777,1.065204... |
| 姜維 | 魯肅,司馬師,趙云,關羽,司馬昭,陸遜,徐庶,諸葛亮,張角,魏延 | 1.1997951508166318,1.522733479990684,1.5613112... |
| 孫權 | 劉備,魯肅,魏延,曹操,司馬懿,徐庶,司馬昭,關羽,司馬師,陸遜 | 0.9632560988057848,1.1556135829895424,1.206446... |
| 張角 | 司馬師,司馬昭,陸遜,張飛,趙云,司馬懿,魯肅,關羽,劉備,魏延 | 0.7408606894082557,0.9338616306533777,1.439048... |
| 張飛 | 徐庶,曹操,張角,司馬懿,陸遜,魏延,趙云,劉備,司馬師,孫權 | 1.2290138375053403,1.3276402597791603,1.527254... |
| 徐庶 | 魏延,曹操,陸遜,司馬懿,張飛,劉備,諸葛亮,魯肅,關羽,司馬師 | 0.8881419714636151,1.1138655046817798,1.144920... |
| 曹操 | 司馬懿,陸遜,徐庶,劉備,張飛,孫權,魏延,關羽,魯肅,司馬昭 | 0.5130746529222359,1.067403409629831,1.1138655... |
| 諸葛亮 | 關羽,司馬師,魯肅,司馬懿,司馬昭,徐庶,魏延,曹操,劉備,姜維 | 1.1134665626313842,1.143403490075634,1.2731639... |
| 趙云 | 姜維,張角,張飛,魏延,魯肅,陸遜,司馬師,徐庶,關羽,司馬昭 | 1.561311245891365,1.5859586070097609,1.9056652... |
| 陸遜 | 司馬懿,曹操,劉備,徐庶,司馬昭,張角,張飛,司馬師,魯肅,關羽 | 0.75366469445295,1.067403409629831,1.099269594... |
| 魏延 | 魯肅,徐庶,孫權,劉備,曹操,司馬師,關羽,諸葛亮,張飛,司馬懿 | 0.6536739741495201,0.8881419714636151,1.206446... |
| 魯肅 | 魏延,關羽,司馬師,司馬昭,劉備,孫權,姜維,諸葛亮,徐庶,曹操 | 0.6536739741495201,0.7735408527877448,0.994871... |
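The per-group function above is easy to read. For comparison, here is a sketch of the same "k closest per person" selection using one sort followed by `groupby(...).head(k)`, then a named aggregation to rebuild the names/sims layout. The tiny `pairs` table, `top_k = 2` and the four-decimal formatting are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical, already cleaned pair table (no self-pairs) with a sim column.
pairs = pd.DataFrame({
    "name_x": ["A", "A", "A", "B", "B", "B"],
    "name_y": ["B", "C", "D", "A", "C", "D"],
    "sim":    [0.9, 0.2, 0.5, 0.9, 0.4, 0.1],
})

top_k = 2
top = (
    pairs.sort_values(["name_x", "sim"])  # smallest distance first within each person
         .groupby("name_x")
         .head(top_k)                     # keep the k closest rows per person
)

# Collapse to one row per person: comma-joined names and formatted distances.
result = top.groupby("name_x").agg(
    names=("name_y", ",".join),
    sims=("sim", lambda s: ",".join(f"{v:.4f}" for v in s)),
)
print(result)
```

Keeping the intermediate `top` table in long form (one row per pair) can also be handy if the distances are needed as numbers later rather than as a joined string.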
Write the result to an Excel file so it is easier to read.
result.to_excel("相似計算結果.xlsx", index=True)
Finally, I recommend Teacher Ant's Pandas data analysis course.
Course name: "Getting Started with Data Analysis in Python Using Pandas"
After buying the course, add me on WeChat (ant_learn_python) and I will invite you to the paid VIP group.
