
A 150-yuan Python freelance job: filling missing values with a random forest

2022-01-11 19:14

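As is well known, the first step in machine learning modeling is data preprocessing, and handling missing values is a key part of it. When a dataset is small, rashly deleting the rows that contain missing values makes already scarce data even scarcer, so the missing values need to be filled in instead.

Common ways to fill them include using 0, the mean, the median, or the mode.

These methods are simple, but they ignore the relationships between columns, so they restore the original data poorly. This article therefore shows how to predict and fill missing values with a random forest.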
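For comparison, the simple baselines mentioned above can be sketched on a toy column. The values below are made up for illustration and are not taken from the article's dataset.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (illustrative values only).
s = pd.Series([10.0, 12.0, np.nan, 12.0, np.nan, 20.0], name="toy")

filled_zero = s.fillna(0)                 # constant 0
filled_mean = s.fillna(s.mean())          # mean of the observed values
filled_median = s.fillna(s.median())      # median of the observed values
filled_mode = s.fillna(s.mode().iloc[0])  # most frequent observed value

print(pd.DataFrame({
    "zero": filled_zero,
    "mean": filled_mean,
    "median": filled_median,
    "mode": filled_mode,
}))
```

Every missing cell in a column receives the same value under each of these strategies, which is why they cannot reflect differences between rows.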

Let's first look at the dataset. It has a number of missing values, so we start by exploring it.

import numpy as np
import pandas as pd
data = pd.read_csv("test-2.csv")

Inspect the data

          data

    year  selling_price  km_driven    fuel  seller_type  transmission         owner  mileage  engine  seats
0   2014         450000     145500  Diesel   Individual        Manual   First Owner    23.40  1248.0    5.0
1   2014         370000     120000  Diesel   Individual        Manual  Second Owner    21.14  1498.0    5.0
2   2006         158000     140000  Petrol   Individual        Manual   Third Owner    17.70  1497.0    5.0
3   2010         225000     127000  Diesel   Individual        Manual   First Owner    23.00  1396.0    5.0
4   2007         130000     120000  Petrol   Individual        Manual   First Owner    16.10  1298.0    5.0
..   ...            ...        ...     ...          ...           ...           ...      ...     ...    ...
94  2009         175000      55500  Petrol       Dealer        Manual   First Owner    18.20   998.0    5.0
95  2013         525000      61500  Petrol       Dealer        Manual   First Owner    18.50  1197.0    5.0
96  2016         600000     150000  Diesel   Individual        Manual   First Owner    26.59  1248.0    5.0
97  2016         565000      72000  Petrol       Dealer     Automatic   First Owner    19.10  1197.0    5.0
98  2008         120000      68000  Petrol       Dealer        Manual   Third Owner    19.70   796.0    5.0

99 rows × 10 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   year           99 non-null     int64
 1   selling_price  99 non-null     int64
 2   km_driven      99 non-null     int64
 3   fuel           99 non-null     object
 4   seller_type    99 non-null     object
 5   transmission   99 non-null     object
 6   owner          99 non-null     object
 7   mileage        94 non-null     float64
 8   engine         96 non-null     float64
 9   seats          94 non-null     float64
dtypes: float64(3), int64(3), object(4)
memory usage: 7.9+ KB

Only mileage, engine, and seats have missing values; all other columns are complete.

Next, look closely at the text columns.

data["fuel"].value_counts()
Petrol    48
Diesel    48
LPG        2
CNG        1
Name: fuel, dtype: int64
data["seller_type"].value_counts()
Individual    65
Dealer        34
Name: seller_type, dtype: int64
data["transmission"].value_counts()
Manual       87
Automatic    12
Name: transmission, dtype: int64
data["owner"].value_counts()
First Owner     69
Second Owner    26
Third Owner      4
Name: owner, dtype: int64

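Text columns are usually encoded with either one-hot encoding or label encoding.

Semantically, the values of owner have a clear order (first, second, third), so one-hot encoding is not a good fit for it. The other text columns are plain categories with only a few distinct values each, which should not hurt the random forest much, so one-hot encoding is fine for them.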

def func(x: str) -> int:
    if x == "First Owner":
        return 1
    elif x == "Second Owner":
        return 2
    elif x == "Third Owner":
        return 3

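Label-encode owner

# Note: apply returns a new Series. The result is not assigned back here
# (data["owner"] = data["owner"].apply(func)), so the column is unchanged,
# as the counts below show.
data["owner"].apply(func)
data["owner"].value_counts()
First Owner     69
Second Owner    26
Third Owner      4
Name: owner, dtype: int64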
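Since the counts above still show the original strings, it may help to see a version of the label encoding that stores its result. The sketch below uses a toy frame rather than the article's data, and the `owner_order` mapping is a hypothetical name that mirrors `func`.

```python
import pandas as pd

# Toy data; the values are illustrative, not taken from test-2.csv.
toy = pd.DataFrame(
    {"owner": ["First Owner", "Third Owner", "Second Owner", "First Owner"]}
)

# Hypothetical mapping that mirrors func: the order of ownership becomes 1, 2, 3.
owner_order = {"First Owner": 1, "Second Owner": 2, "Third Owner": 3}

# Assign the result back so the column actually changes.
toy["owner"] = toy["owner"].map(owner_order)
print(toy["owner"].tolist())
```

With the column stored as integers, a later `pd.get_dummies` call would leave it as a single numeric column instead of splitting it into one column per category.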

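Apply one-hot encoding to the remaining text columns with get_dummies. Because owner is still a text column at this point, get_dummies one-hot encodes it as well, as the owner_* columns in the output show.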

data = pd.get_dummies(data)
data

    year  selling_price  km_driven  mileage  engine  seats  fuel_CNG  fuel_Diesel  fuel_LPG  fuel_Petrol  \
0   2014         450000     145500    23.40  1248.0    5.0         0            1         0            0
1   2014         370000     120000    21.14  1498.0    5.0         0            1         0            0
2   2006         158000     140000    17.70  1497.0    5.0         0            0         0            1
3   2010         225000     127000    23.00  1396.0    5.0         0            1         0            0
4   2007         130000     120000    16.10  1298.0    5.0         0            0         0            1
..   ...            ...        ...      ...     ...    ...       ...          ...       ...          ...
94  2009         175000      55500    18.20   998.0    5.0         0            0         0            1
95  2013         525000      61500    18.50  1197.0    5.0         0            0         0            1
96  2016         600000     150000    26.59  1248.0    5.0         0            1         0            0
97  2016         565000      72000    19.10  1197.0    5.0         0            0         0            1
98  2008         120000      68000    19.70   796.0    5.0         0            0         0            1

    seller_type_Dealer  seller_type_Individual  transmission_Automatic  transmission_Manual  owner_First Owner  owner_Second Owner  owner_Third Owner
0                    0                       1                       0                    1                  1                   0                  0
1                    0                       1                       0                    1                  0                   1                  0
2                    0                       1                       0                    1                  0                   0                  1
3                    0                       1                       0                    1                  1                   0                  0
4                    0                       1                       0                    1                  1                   0                  0
..                 ...                     ...                     ...                  ...                ...                 ...                ...
94                   1                       0                       0                    1                  1                   0                  0
95                   1                       0                       0                    1                  1                   0                  0
96                   0                       1                       0                    1                  1                   0                  0
97                   1                       0                       1                    0                  1                   0                  0
98                   1                       0                       0                    1                  0                   0                  1

99 rows × 17 columns
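By default `pd.get_dummies` encodes every text column it finds. As a minimal sketch on toy data (not the article's dataset), the `columns` parameter can restrict the encoding to chosen columns, so an already label-encoded column passes through untouched.

```python
import pandas as pd

# Toy data; the values are illustrative only.
toy = pd.DataFrame({
    "year": [2014, 2006, 2016],
    "fuel": ["Diesel", "Petrol", "Petrol"],
    "owner": [1, 3, 1],  # assumed to be label-encoded already
})

# Only "fuel" is one-hot encoded; "year" and "owner" are left as they are.
encoded = pd.get_dummies(toy, columns=["fuel"])
print(encoded.columns.tolist())
```

The dtype of the new indicator columns depends on the pandas version (uint8 in the output above, bool in newer releases), so casting with `astype(int)` is a safe way to get 0/1 values when needed.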

Inspect the encoded data

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   year                    99 non-null     int64
 1   selling_price           99 non-null     int64
 2   km_driven               99 non-null     int64
 3   mileage                 94 non-null     float64
 4   engine                  96 non-null     float64
 5   seats                   94 non-null     float64
 6   fuel_CNG                99 non-null     uint8
 7   fuel_Diesel             99 non-null     uint8
 8   fuel_LPG                99 non-null     uint8
 9   fuel_Petrol             99 non-null     uint8
 10  seller_type_Dealer      99 non-null     uint8
 11  seller_type_Individual  99 non-null     uint8
 12  transmission_Automatic  99 non-null     uint8
 13  transmission_Manual     99 non-null     uint8
 14  owner_First Owner       99 non-null     uint8
 15  owner_Second Owner      99 non-null     uint8
 16  owner_Third Owner       99 non-null     uint8
dtypes: float64(3), int64(3), uint8(11)
memory usage: 5.8 KB

Count the missing values in each column

data.isnull().sum(axis=0)
year                      0
selling_price             0
km_driven                 0
mileage                   5
engine                    3
seats                     5
fuel_CNG                  0
fuel_Diesel               0
fuel_LPG                  0
fuel_Petrol               0
seller_type_Dealer        0
seller_type_Individual    0
transmission_Automatic    0
transmission_Manual       0
owner_First Owner         0
owner_Second Owner        0
owner_Third Owner         0
dtype: int64

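Filling missing values with a random forest

Fill the columns with fewer missing values first, then the columns with more.

Principle: treat the column to be filled as the target and the remaining columns as features, then use a random forest to predict the target values and fill them in.

Rows where the target is present form the training set, and rows where it is missing form the test set. If the feature columns themselves contain missing values, fill those with 0 first.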
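The "fewest missing values first" order does not have to be written by hand. A minimal sketch on a toy frame (the column names `a`, `b`, `c` are placeholders) derives it from the missing-value counts:

```python
import numpy as np
import pandas as pd

# Toy frame: "a" has two missing values, "b" has one, "c" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

missing = toy.isnull().sum()
# Keep only columns that need filling, ordered from fewest to most missing.
fill_order = missing[missing > 0].sort_values().index.tolist()
print(fill_order)
```

Columns with the same number of missing values can come out in either order, so fix the order explicitly if it matters.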

# Import the random forest model and the imputer used for missing values
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

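First, drop the chosen column to obtain the feature columns, denoted X, and take that column as the target, denoted Y.

In the feature columns, fill missing values with 0.

In the target column, the index of the non-null rows selects the training set and the index of the null rows selects the test set.

This yields the training and test sets from X and Y.

Train the random forest on the training set, then use the trained model to predict the target for the test set.

Finally, write the predicted values into the corresponding positions.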

for name in ["engine", "mileage", "seats"]:
    # Features: every column except the one being filled; target: that column.
    X = data.drop(columns=name)
    Y = data.loc[:, name]
    # Fill missing feature values with 0.
    X_0 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0).fit_transform(X)
    y_train = Y[Y.notnull()]
    y_test = Y[Y.isnull()]
    # data has a default RangeIndex, so index labels double as row positions.
    x_train = X_0[y_train.index, :]
    x_test = X_0[y_test.index, :]

    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(x_train, y_train)
    y_predict = rfc.predict(x_test)

    # Write the predictions into the rows where the target was missing.
    data.loc[Y.isnull(), name] = y_predict
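One way to check whether this kind of filling beats a simple baseline is to hide values that are actually known and compare the error. The sketch below is not the article's code: `rf_fill` is a hypothetical helper that follows the same steps, and the data is synthetic, with `y` constructed to depend strongly on `x`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rf_fill(df: pd.DataFrame, columns: list[str], random_state: int = 0) -> pd.DataFrame:
    """Fill NaNs in `columns`, one column at a time, with random forest predictions."""
    out = df.copy()
    for name in columns:
        missing = out[name].isnull()
        if not missing.any():
            continue
        # The other columns act as features; their own NaNs are filled with 0.
        features = out.drop(columns=name).fillna(0)
        model = RandomForestRegressor(n_estimators=100, random_state=random_state)
        model.fit(features[~missing], out.loc[~missing, name])
        out.loc[missing, name] = model.predict(features[missing])
    return out


# Synthetic data (not the article's dataset): y is roughly 3 * x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
truth = pd.DataFrame({"x": x, "y": 3 * x + rng.normal(0, 0.5, size=200)})

# Hide 20 known values of y so the filling error can be measured.
masked = truth.copy()
hidden = rng.choice(200, size=20, replace=False)
masked.loc[hidden, "y"] = np.nan

filled = rf_fill(masked, ["y"])
rf_mae = (filled.loc[hidden, "y"] - truth.loc[hidden, "y"]).abs().mean()
mean_mae = (masked["y"].mean() - truth.loc[hidden, "y"]).abs().mean()
print(f"random forest MAE: {rf_mae:.2f}, mean-fill MAE: {mean_mae:.2f}")
```

On real data the gap depends on how well the other columns predict the target, so a check like this is worth running before trusting the filled values.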

Inspect the filled data

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   year                    99 non-null     int64
 1   selling_price           99 non-null     int64
 2   km_driven               99 non-null     int64
 3   mileage                 99 non-null     float64
 4   engine                  99 non-null     float64
 5   seats                   99 non-null     float64
 6   fuel_CNG                99 non-null     uint8
 7   fuel_Diesel             99 non-null     uint8
 8   fuel_LPG                99 non-null     uint8
 9   fuel_Petrol             99 non-null     uint8
 10  seller_type_Dealer      99 non-null     uint8
 11  seller_type_Individual  99 non-null     uint8
 12  transmission_Automatic  99 non-null     uint8
 13  transmission_Manual     99 non-null     uint8
 14  owner_First Owner       99 non-null     uint8
 15  owner_Second Owner      99 non-null     uint8
 16  owner_Third Owner       99 non-null     uint8
dtypes: float64(3), int64(3), uint8(11)
memory usage: 5.8 KB

The original missing values have now all been filled.

Finally, export the result to an Excel file.

# The file name suffix 填补后 means "after filling".
data.to_excel("test-2(填补后).xlsx")

