<kbd id="afajh"><form id="afajh"></form></kbd>
<strong id="afajh"><dl id="afajh"></dl></strong>
    <del id="afajh"><form id="afajh"></form></del>
        1. <th id="afajh"><progress id="afajh"></progress></th>
          <b id="afajh"><abbr id="afajh"></abbr></b>
          <th id="afajh"><progress id="afajh"></progress></th>

          【Python私活案例】Pandas找到公司內(nèi)最相似的員工(100元)

          共 12248字,需瀏覽 25分鐘

           ·

          2021-11-12 21:58

          今日份在螞蟻老師的vip群里,有一位群友提出一個(gè)需求

          我馬上想到可以用所學(xué)的pandas相關(guān)知識(shí)解決

          讓我們來(lái)看看群友的具體需求

          我們根據(jù)群友的需求,構(gòu)造了出了相應(yīng)的表格數(shù)據(jù)

          在這里,我在name字段下先設(shè)定了一些值

          開始寫代碼!

          首先引入numpy和pandas,這倆是數(shù)據(jù)分析領(lǐng)域必不可少的模塊

          import?numpy?as?np
          #?這行代碼的意思是通過(guò)設(shè)置固定的隨機(jī)數(shù)種子,讓你我生成的隨機(jī)數(shù)是一樣的
          np.random.seed(666)
          import?pandas?as?pd
          #?讀取我們的excel文件
          data?=?pd.read_excel("data.xlsx")

          觀察數(shù)據(jù),我們有十六名員工,除了姓名,其它都是空值

          data

          nameJavaPythonAS400ITIDOracleBigdataSQLLeadershipManagementCreativityCommunication
          0劉備NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          1關(guān)羽NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          2張飛NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          3諸葛亮NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          4趙云NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          5司馬懿NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          6孫權(quán)NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          7曹操NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          8張角NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          9姜維NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          10司馬昭NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          11司馬師NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          12魏延NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          13徐庶NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          14陸遜NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
          15魯肅NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

          將空值用0到1之間的隨機(jī)數(shù)代替

          def?initialize(item):
          ????#?如果發(fā)現(xiàn)當(dāng)前這個(gè)元素是空值,就用隨機(jī)數(shù)代替
          ????if?pd.isnull(item):
          ????????return?np.random.random()
          ????#?否則就原樣返回(對(duì)應(yīng)name那列)
          ????else:
          ????????return?item

          通過(guò)applymap函數(shù),將上面的自定義函數(shù)應(yīng)用到表格中的每一個(gè)數(shù)據(jù)上

          new_data?=?data.applymap(initialize)

          再來(lái)看看我們構(gòu)造好的數(shù)據(jù)new_data,發(fā)現(xiàn)已經(jīng)得到預(yù)期值了

          new_data

          nameJavaPythonAS400ITIDOracleBigdataSQLLeadershipManagementCreativityCommunication
          0劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.9947700.5779060.273414
          1關(guān)羽0.1556750.4404370.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.035426
          2張飛0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.424132
          3諸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.132899
          4趙云0.9267760.1587410.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.824788
          5司馬懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.178578
          6孫權(quán)0.2347940.8542720.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.439397
          7曹操0.2182460.7074250.8810690.3766620.8014660.7552650.2917980.9380500.4858260.3464370.066456
          8張角0.4983400.5630200.0199410.6869560.4389630.9422090.5352980.0683990.1692050.6381460.246636
          9姜維0.8663820.3365340.4598160.1212750.1175270.0862000.7246720.0410770.7928520.6204570.794938
          10司馬昭0.3872800.9111110.3610180.0247110.3521680.8743070.5609310.0664540.3038520.2678940.524910
          11司馬師0.0693310.6544980.1333580.3492970.0925520.5275920.5593910.0413850.4110110.3910100.340453
          12魏延0.0572410.2010210.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.537976
          13徐庶0.0362210.2059920.9539270.3833840.4943150.4168610.0893450.6407950.5134790.4120780.759755
          14陸遜0.5708700.4108810.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.307660
          15魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.9652710.1353490.436991

          接下來(lái)我想將上面的表格進(jìn)行自表連接,便于計(jì)算,所以設(shè)置了一個(gè)用于連接的公共字段one

          new_data["one"]?=?1
          new_data.head(5)

          nameJavaPythonAS400ITIDOracleBigdataSQLLeadershipManagementCreativityCommunicationone
          0劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.9947700.5779060.2734141
          1關(guān)羽0.1556750.4404370.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.0354261
          2張飛0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.4241321
          3諸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.1328991
          4趙云0.9267760.1587410.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.8247881

          進(jìn)行自表連接,可以發(fā)現(xiàn)我們連接后的表有256行,也就是16*16,符合預(yù)期

          需要注意的是,為了區(qū)分一左右兩個(gè)表的數(shù)據(jù)來(lái)源,會(huì)將來(lái)源于左表的字段加_x,來(lái)源于右表則加_y

          new_data_merge?=?pd.merge(left=new_data,?right=new_data,?left_on="one",?right_on="one")
          new_data_merge

          name_xJava_xPython_xAS400_xITID_xOracle_xBigdata_xSQL_xLeadership_xManagement_x...Python_yAS400_yITID_yOracle_yBigdata_ySQL_yLeadership_yManagement_yCreativity_yCommunication_y
          0劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.3541090.8970440.5109470.8470970.8194430.7886410.2858680.9947700.5779060.273414
          1劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.4404370.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.035426
          2劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.2375090.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.424132
          3劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.2956900.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.132899
          4劉備0.0367120.3541090.8970440.5109470.8470970.8194430.7886410.2858680.994770...0.1587410.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.824788
          ..................................................................
          251魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.6544980.1333580.3492970.0925520.5275920.5593910.0413850.4110110.3910100.340453
          252魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.2010210.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.537976
          253魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.2059920.9539270.3833840.4943150.4168610.0893450.6407950.5134790.4120780.759755
          254魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.4108810.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.307660
          255魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.4261800.4440810.2384450.2004760.6656520.8481650.5107070.9652710.1353490.436991

          256 rows × 25 columns

          緊接著,我們看一下特征屬性,要剔除name字段以及用于自表連接的one字段

          我們也可以通過(guò)修改columns里面的值,自定義關(guān)注哪些字段

          columns?=?list(new_data.columns)
          columns.remove("name")
          columns.remove("one")
          columns
          ['Java',
          'Python',
          'AS400',
          'ITID',
          'Oracle',
          'Bigdata',
          'SQL',
          'Leadership',
          'Management',
          'Creativity',
          'Communication']

          計(jì)算相似度的函數(shù)

          我們通過(guò)“歐式距離”來(lái)表征相似度,歐式距離越大,說(shuō)明兩者之間差距越大,相似度越小,反之亦然

          def?similarity(row):
          ????#?傳進(jìn)來(lái)的row是表中的一行
          ????#?設(shè)定相似度的初始值是0
          ????sim_value?=?0.0
          ????#?取出一行當(dāng)中的每一個(gè)特征值
          ????#?相應(yīng)的特征值相減之后的結(jié)果再進(jìn)行平方,最后全部加起來(lái),也就是“歐氏距離”的概念
          ????for?column?in?columns:
          ????????sim_value?+=?(float(row[column+"_x"])?-?float(row[column+"_y"]))**2
          ????return?sim_value

          通過(guò)設(shè)定axis=1來(lái)指定對(duì)表格中的每一行進(jìn)行計(jì)算相似度的操作

          new_data_merge["sim"]?=?new_data_merge.apply(similarity,?axis=1)

          來(lái)看看計(jì)算后的表格,發(fā)現(xiàn)多了一個(gè)字段sim

          new_data_merge.sample(15)

          name_xJava_xPython_xAS400_xITID_xOracle_xBigdata_xSQL_xLeadership_xManagement_x...AS400_yITID_yOracle_yBigdata_ySQL_yLeadership_yManagement_yCreativity_yCommunication_ysim
          218徐庶0.0362210.2059920.9539270.3833840.4943150.4168610.0893450.6407950.513479...0.3610180.0247110.3521680.8743070.5609310.0664540.3038520.2678940.5249102.002230
          62諸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.178275...0.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.3076602.315648
          245魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.1785781.749616
          129張角0.4983400.5630200.0199410.6869560.4389630.9422090.5352980.0683990.169205...0.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.0354261.935264
          46張飛0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.020661...0.7573210.2600930.9161100.6894730.0876440.1993680.5707180.7414450.3076601.645900
          149姜維0.8663820.3365340.4598160.1212750.1175270.0862000.7246720.0410770.792852...0.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.1785783.124456
          65趙云0.9267760.1587410.2122680.9993130.0601300.5939340.2962810.4257220.665509...0.5543690.1470620.1219570.3784830.8683640.6709420.7004970.7191900.0354262.615905
          20關(guān)羽0.1556750.4404370.5543690.1470620.1219570.3784830.8683640.6709420.700497...0.2122680.9993130.0601300.5939340.2962810.4257220.6655090.9105550.8247882.615905
          135張角0.4983400.5630200.0199410.6869560.4389630.9422090.5352980.0683990.169205...0.8810690.3766620.8014660.7552650.2917980.9380500.4858260.3464370.0664562.136879
          246魯肅0.4123430.4261800.4440810.2384450.2004760.6656520.8481650.5107070.965271...0.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.4393971.155614
          115曹操0.2182460.7074250.8810690.3766620.8014660.7552650.2917980.9380500.485826...0.8752180.0145690.0410400.4063390.6960760.3068230.1782750.0063400.1328991.806937
          166司馬昭0.3872800.9111110.3610180.0247110.3521680.8743070.5609310.0664540.303852...0.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.4393972.016103
          194魏延0.0572410.2010210.4917020.5807540.1861230.8072210.3247360.7297370.920088...0.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.4241321.701496
          86司馬懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.305031...0.9530050.9736680.4839470.9044040.8032890.5223700.9496730.0964650.4393971.376950
          61諸葛亮0.0458780.2956900.8752180.0145690.0410400.4063390.6960760.3068230.178275...0.9539270.3833840.4943150.4168610.0893450.6407950.5134790.4120780.7597551.505521

          15 rows × 26 columns

          然后,在自表連接的時(shí)候肯定有“自己連自己的情況”

          也就是name_x字段的值等于name_y

          這些是無(wú)意義的數(shù)據(jù),剔除

          new_data_merge?=?new_data_merge[new_data_merge["name_x"]?!=?new_data_merge["name_y"]].copy()

          隨機(jī)選5條清理后的數(shù)據(jù)看看

          new_data_merge.sample(5)

          name_xJava_xPython_xAS400_xITID_xOracle_xBigdata_xSQL_xLeadership_xManagement_x...AS400_yITID_yOracle_yBigdata_ySQL_yLeadership_yManagement_yCreativity_yCommunication_ysim
          114曹操0.2182460.7074250.8810690.3766620.8014660.7552650.2917980.9380500.485826...0.6884920.9863740.5154090.9101560.1709670.7998650.0206610.8022780.4241321.327640
          37張飛0.1384460.2375090.6884920.9863740.5154090.9101560.1709670.7998650.020661...0.9894810.3281450.8437830.8940610.3140480.2929650.3050310.2369820.1785781.572089
          156姜維0.8663820.3365340.4598160.1212750.1175270.0862000.7246720.0410770.792852...0.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.5379762.417639
          92司馬懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.305031...0.4917020.5807540.1861230.8072210.3247360.7297370.9200880.0806130.5379761.720346
          90司馬懿0.1581740.6865190.9894810.3281450.8437830.8940610.3140480.2929650.305031...0.3610180.0247110.3521680.8743070.5609310.0664540.3038520.2678940.5249101.065205

          5 rows × 26 columns

          找每個(gè)員工與他最相似的10個(gè)員工

          def?get_top_student(df_sub):
          ????#?傳入的是清洗后的數(shù)據(jù)結(jié)果groupby后的子表,對(duì)應(yīng)單個(gè)員工的數(shù)據(jù)
          ????#?例如劉備與其他人連接的所有行,曹操與其他人連接的所有行
          ????
          ????#?對(duì)這個(gè)子表按sim值,也就是相似度進(jìn)行升序排序,取出前10條數(shù)據(jù)
          ????df_sort?=?df_sub.sort_values(by="sim",?ascending=True).head(10)
          ????#?將前十個(gè)人的名字取出
          ????names?=?",".join(list(df_sort["name_y"]))
          ????#?將前十個(gè)人的值相似度值取出
          ????sims?=?",".join([str(x)?for?x?in?list(df_sort["sim"])])
          ????#?打包成Series返回給調(diào)用它的地方
          ????return?pd.Series({"names":?names,?"sims":?sims})

          對(duì)清洗后的表先按name_x進(jìn)行g(shù)roupby分組,對(duì)每一個(gè)分組,調(diào)用上述函數(shù)

          result?=?new_data_merge.groupby("name_x").apply(get_top_student)

          結(jié)果如下

          例如,與關(guān)羽最相似的前十位依次是魯肅,司馬師,諸葛亮,劉備,曹操,魏延,姜維,徐庶,司馬昭,陸遜

          相似值依次是 0.7735408527877448,1.0708763430514607,1.113466...

          result

          namessims
          name_x

          關(guān)羽魯肅,司馬師,諸葛亮,劉備,曹操,魏延,姜維,徐庶,司馬昭,陸遜0.7735408527877448,1.0708763430514607,1.113466...
          劉備孫權(quán),司馬懿,陸遜,魯肅,曹操,關(guān)羽,魏延,徐庶,司馬師,司馬昭0.9632560988057848,0.9990413893821443,1.099269...
          司馬師司馬昭,張角,魯肅,關(guān)羽,諸葛亮,魏延,姜維,司馬懿,陸遜,徐庶0.5730340940360173,0.7408606894082557,0.994871...
          司馬懿曹操,陸遜,劉備,司馬昭,徐庶,諸葛亮,孫權(quán),張飛,司馬師,張角0.5130746529222359,0.75366469445295,0.99904138...
          司馬昭司馬師,張角,司馬懿,魯肅,諸葛亮,陸遜,關(guān)羽,姜維,曹操,魏延0.5730340940360173,0.9338616306533777,1.065204...
          姜維魯肅,司馬師,趙云,關(guān)羽,司馬昭,陸遜,徐庶,諸葛亮,張角,魏延1.1997951508166318,1.522733479990684,1.5613112...
          孫權(quán)劉備,魯肅,魏延,曹操,司馬懿,徐庶,司馬昭,關(guān)羽,司馬師,陸遜0.9632560988057848,1.1556135829895424,1.206446...
          張角司馬師,司馬昭,陸遜,張飛,趙云,司馬懿,魯肅,關(guān)羽,劉備,魏延0.7408606894082557,0.9338616306533777,1.439048...
          張飛徐庶,曹操,張角,司馬懿,陸遜,魏延,趙云,劉備,司馬師,孫權(quán)1.2290138375053403,1.3276402597791603,1.527254...
          徐庶魏延,曹操,陸遜,司馬懿,張飛,劉備,諸葛亮,魯肅,關(guān)羽,司馬師0.8881419714636151,1.1138655046817798,1.144920...
          曹操司馬懿,陸遜,徐庶,劉備,張飛,孫權(quán),魏延,關(guān)羽,魯肅,司馬昭0.5130746529222359,1.067403409629831,1.1138655...
          諸葛亮關(guān)羽,司馬師,魯肅,司馬懿,司馬昭,徐庶,魏延,曹操,劉備,姜維1.1134665626313842,1.143403490075634,1.2731639...
          趙云姜維,張角,張飛,魏延,魯肅,陸遜,司馬師,徐庶,關(guān)羽,司馬昭1.561311245891365,1.5859586070097609,1.9056652...
          陸遜司馬懿,曹操,劉備,徐庶,司馬昭,張角,張飛,司馬師,魯肅,關(guān)羽0.75366469445295,1.067403409629831,1.099269594...
          魏延魯肅,徐庶,孫權(quán),劉備,曹操,司馬師,關(guān)羽,諸葛亮,張飛,司馬懿0.6536739741495201,0.8881419714636151,1.206446...
          魯肅魏延,關(guān)羽,司馬師,司馬昭,劉備,孫權(quán),姜維,諸葛亮,徐庶,曹操0.6536739741495201,0.7735408527877448,0.994871...

          生成excel文件,方便觀看

          result.to_excel("相似計(jì)算結(jié)果.xlsx",?index=True)

          最后,推薦螞蟻老師的Pandas數(shù)據(jù)分析課程

          課程名:《Python使用Pandas入門數(shù)據(jù)分析》

          部分大綱:

          掃碼購(gòu)買:

          購(gòu)買課程后,加我vx:ant_learn_python,拉付費(fèi)VIP群

          點(diǎn)擊《閱讀原文》,也可以到達(dá)課程頁(yè)面。

          瀏覽 43
          點(diǎn)贊
          評(píng)論
          收藏
          分享

          手機(jī)掃一掃分享

          分享
          舉報(bào)
          評(píng)論
          圖片
          表情
          推薦
          點(diǎn)贊
          評(píng)論
          收藏
          分享

          手機(jī)掃一掃分享

          分享
          舉報(bào)
          <kbd id="afajh"><form id="afajh"></form></kbd>
          <strong id="afajh"><dl id="afajh"></dl></strong>
            <del id="afajh"><form id="afajh"></form></del>
                1. <th id="afajh"><progress id="afajh"></progress></th>
                  <b id="afajh"><abbr id="afajh"></abbr></b>
                  <th id="afajh"><progress id="afajh"></progress></th>
                  伊人三级 | 男人天堂影院 | 久久午夜成人视频 | 韩日色网 | 伊人久久成人免费视频 |