A 150-Yuan Python Freelance Job: Filling Missing Values with a Random Forest
As is well known, the first step in building a machine learning model is data preprocessing, and handling missing values is a key part of it. When the dataset is small, rashly dropping rows with missing values leaves even less data to work with, so the missing values need to be filled in some way.
Common filling methods include filling with 0, the mean, the median, or the mode.
These methods are simple, but they tend to reconstruct the original data poorly, so today we look at how to use a random forest to predict and fill missing values.
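For comparison, the simple strategies listed above can be sketched with pandas alone. The toy values below are made up for illustration and are not taken from the article's dataset.

```python
import numpy as np
import pandas as pd

# Toy column with two gaps (illustrative values only)
mileage = pd.Series([23.4, 21.14, np.nan, 23.0, np.nan, 23.0], name="mileage")

filled_zero = mileage.fillna(0)                        # constant 0
filled_mean = mileage.fillna(mileage.mean())           # column mean
filled_median = mileage.fillna(mileage.median())       # column median
filled_mode = mileage.fillna(mileage.mode().iloc[0])   # most frequent value

# Put the four results side by side for inspection
comparison = pd.DataFrame({
    "zero": filled_zero,
    "mean": filled_mean,
    "median": filled_median,
    "mode": filled_mode,
})
print(comparison)
```

Each strategy writes the same value into every gap of a column, which is why these methods cannot reflect differences between rows.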
Let's first take a look at the dataset.

It has quite a few missing values, so we start by exploring it.
import numpy as np
import pandas as pd
data = pd.read_csv("test-2.csv")
Inspect the data
data
| | year | selling_price | km_driven | fuel | seller_type | transmission | owner | mileage | engine | seats |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 450000 | 145500 | Diesel | Individual | Manual | First Owner | 23.40 | 1248.0 | 5.0 |
| 1 | 2014 | 370000 | 120000 | Diesel | Individual | Manual | Second Owner | 21.14 | 1498.0 | 5.0 |
| 2 | 2006 | 158000 | 140000 | Petrol | Individual | Manual | Third Owner | 17.70 | 1497.0 | 5.0 |
| 3 | 2010 | 225000 | 127000 | Diesel | Individual | Manual | First Owner | 23.00 | 1396.0 | 5.0 |
| 4 | 2007 | 130000 | 120000 | Petrol | Individual | Manual | First Owner | 16.10 | 1298.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 2009 | 175000 | 55500 | Petrol | Dealer | Manual | First Owner | 18.20 | 998.0 | 5.0 |
| 95 | 2013 | 525000 | 61500 | Petrol | Dealer | Manual | First Owner | 18.50 | 1197.0 | 5.0 |
| 96 | 2016 | 600000 | 150000 | Diesel | Individual | Manual | First Owner | 26.59 | 1248.0 | 5.0 |
| 97 | 2016 | 565000 | 72000 | Petrol | Dealer | Automatic | First Owner | 19.10 | 1197.0 | 5.0 |
| 98 | 2008 | 120000 | 68000 | Petrol | Dealer | Manual | Third Owner | 19.70 | 796.0 | 5.0 |
99 rows × 10 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   year           99 non-null     int64
 1   selling_price  99 non-null     int64
 2   km_driven      99 non-null     int64
 3   fuel           99 non-null     object
 4   seller_type    99 non-null     object
 5   transmission   99 non-null     object
 6   owner          99 non-null     object
 7   mileage        94 non-null     float64
 8   engine         96 non-null     float64
 9   seats          94 non-null     float64
dtypes: float64(3), int64(3), object(4)
memory usage: 7.9+ KB
Only mileage, engine, and seats have missing values; all other columns are complete.
Next, look more closely at the text columns
data["fuel"].value_counts()
Petrol    48
Diesel    48
LPG        2
CNG        1
Name: fuel, dtype: int64
data["seller_type"].value_counts()
Individual    65
Dealer        34
Name: seller_type, dtype: int64
data["transmission"].value_counts()
Manual       87
Automatic    12
Name: transmission, dtype: int64
data["owner"].value_counts()
First Owner     69
Second Owner    26
Third Owner      4
Name: owner, dtype: int64
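Text data is usually encoded with either one-hot encoding or label encoding.
Semantically, the values of owner have a clear ordinal structure, so one-hot encoding is not a good fit for that column. The other text columns are nominal categories with only a few distinct values each,
so one-hot encoding them should not hurt the random forest much and can be used there.
def func(x: str) -> int:
    # Map each ownership level to its rank; any other value falls through and returns None
    if x == "First Owner":
        return 1
    elif x == "Second Owner":
        return 2
    elif x == "Third Owner":
        return 3
Label-encode owner. Note that apply returns a new Series and the call below does not assign it back, so data["owner"] itself is left unchanged, as the value counts that follow confirm. Writing data["owner"] = data["owner"].apply(func) instead would keep the encoding.
data["owner"].apply(func)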
data["owner"].value_counts()
First Owner     69
Second Owner    26
Third Owner      4
Name: owner, dtype: int64
For the remaining text columns, apply one-hot encoding in one go with get_dummies (owner still holds text at this point, so it is expanded as well)
data = pd.get_dummies(data)
data
| | year | selling_price | km_driven | mileage | engine | seats | fuel_CNG | fuel_Diesel | fuel_LPG | fuel_Petrol | seller_type_Dealer | seller_type_Individual | transmission_Automatic | transmission_Manual | owner_First Owner | owner_Second Owner | owner_Third Owner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 450000 | 145500 | 23.40 | 1248.0 | 5.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 2014 | 370000 | 120000 | 21.14 | 1498.0 | 5.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 2006 | 158000 | 140000 | 17.70 | 1497.0 | 5.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | 2010 | 225000 | 127000 | 23.00 | 1396.0 | 5.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 2007 | 130000 | 120000 | 16.10 | 1298.0 | 5.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 2009 | 175000 | 55500 | 18.20 | 998.0 | 5.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 95 | 2013 | 525000 | 61500 | 18.50 | 1197.0 | 5.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 96 | 2016 | 600000 | 150000 | 26.59 | 1248.0 | 5.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 97 | 2016 | 565000 | 72000 | 19.10 | 1197.0 | 5.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 98 | 2008 | 120000 | 68000 | 19.70 | 796.0 | 5.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
99 rows × 17 columns
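In the output above, owner was expanded into three owner_* columns, because the result of apply was never written back to the column. If the ordinal encoding is wanted, one possible approach is sketched below on a toy frame. The `OWNER_RANK` dictionary and the `toy` data are hypothetical names introduced here; the ranks 1 to 3 simply mirror func.

```python
import pandas as pd

# Hypothetical mapping; the ranks mirror the article's func
OWNER_RANK = {"First Owner": 1, "Second Owner": 2, "Third Owner": 3}

# Toy frame with one ordinal and one nominal text column
toy = pd.DataFrame({
    "owner": ["First Owner", "Third Owner", "Second Owner"],
    "fuel": ["Diesel", "Petrol", "Diesel"],
})

# Series.map returns a new Series, so the result must be assigned back
toy["owner"] = toy["owner"].map(OWNER_RANK)

# owner is numeric now, so get_dummies only expands the remaining text column
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())  # ['owner', 'fuel_Diesel', 'fuel_Petrol']
```

With map, any value missing from the dictionary becomes NaN, so it is worth checking for unexpected categories before encoding.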
Check the encoded data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   year                    99 non-null     int64
 1   selling_price           99 non-null     int64
 2   km_driven               99 non-null     int64
 3   mileage                 94 non-null     float64
 4   engine                  96 non-null     float64
 5   seats                   94 non-null     float64
 6   fuel_CNG                99 non-null     uint8
 7   fuel_Diesel             99 non-null     uint8
 8   fuel_LPG                99 non-null     uint8
 9   fuel_Petrol             99 non-null     uint8
 10  seller_type_Dealer      99 non-null     uint8
 11  seller_type_Individual  99 non-null     uint8
 12  transmission_Automatic  99 non-null     uint8
 13  transmission_Manual     99 non-null     uint8
 14  owner_First Owner       99 non-null     uint8
 15  owner_Second Owner      99 non-null     uint8
 16  owner_Third Owner       99 non-null     uint8
dtypes: float64(3), int64(3), uint8(11)
memory usage: 5.8 KB
Count the missing values in each column
data.isnull().sum(axis=0)
year                      0
selling_price             0
km_driven                 0
mileage                   5
engine                    3
seats                     5
fuel_CNG                  0
fuel_Diesel               0
fuel_LPG                  0
fuel_Petrol               0
seller_type_Dealer        0
seller_type_Individual    0
transmission_Automatic    0
transmission_Manual       0
owner_First Owner         0
owner_Second Owner        0
owner_Third Owner         0
dtype: int64
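Filling missing values with a random forest
Fill the columns with fewer missing values first, then the ones with more.
Principle: treat the column to be filled as the target and the remaining columns as features, then use a random forest to predict the missing target values.
Rows where the target is present form the training set, and rows where it is missing form the test set. Missing values that remain in the feature columns are filled with 0 first.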
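The "fewest missing values first" order does not have to be written out by hand. As a small sketch on made-up data (the `toy` frame and the `fill_order` name are illustrative assumptions), it can be derived from the per-column missing counts:

```python
import numpy as np
import pandas as pd

# Toy frame: mileage has 2 gaps, engine has 1, year has none
toy = pd.DataFrame({
    "mileage": [23.4, np.nan, 17.7, np.nan],
    "engine": [1248.0, 1498.0, np.nan, 1396.0],
    "year": [2014, 2014, 2006, 2010],
})

missing = toy.isnull().sum()

# Keep only columns that have gaps, sorted from fewest to most missing.
# kind="stable" keeps the original column order when counts are tied.
fill_order = missing[missing > 0].sort_values(kind="stable").index.tolist()
print(fill_order)  # ['engine', 'mileage']
```

Filling the sparsest column first means that later models see features with fewer zero placeholders.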
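# Import the random forest model and the imputer used to fill missing values
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
First, drop the column being filled to obtain the feature columns, denoted X; the column being filled becomes the target, denoted Y.
In the feature columns, fill missing values with 0.
In the target column, the index of the non-null rows selects the training set, and the index of the null rows selects the test set.
This yields training and test sets for both X and Y.
The training set is then fed to a random forest, and the trained model predicts the target values for the test set.
Finally, the predicted values are written back into the corresponding positions.
for name in ["engine", "mileage", "seats"]:
    # Split into features (every other column) and target (the column being filled)
    X = data.drop(columns=name)
    Y = data.loc[:, name]
    # Temporarily fill gaps in the features with 0
    X_0 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0).fit_transform(X)
    y_train = Y[Y.notnull()]
    y_test = Y[Y.isnull()]
    # data has a default RangeIndex, so the index labels double as row positions in X_0
    x_train = X_0[y_train.index, :]
    x_test = X_0[y_test.index, :]
    # Regression forest, since all three target columns are numeric
    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(x_train, y_train)
    y_predict = rfc.predict(x_test)
    data.loc[Y.isnull(), name] = y_predict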
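The same procedure can also be wrapped in a reusable function. The version below is a sketch under a few assumptions, not the article's exact code: `rf_fill`, its `seed` parameter, and the toy data are hypothetical, gaps in the features are filled with `fillna(0)` instead of SimpleImputer, and boolean masks replace index lookups so that it also works when the index is not a RangeIndex.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rf_fill(df: pd.DataFrame, columns: list[str], seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with NaNs in `columns` predicted by a random forest."""
    out = df.copy()
    for name in columns:
        missing = out[name].isnull()
        if not missing.any():
            continue  # nothing to fill in this column
        # Features: every other column, with remaining gaps temporarily set to 0
        X = out.drop(columns=name).fillna(0)
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        # Train on rows where the target is known
        model.fit(X[~missing], out.loc[~missing, name])
        # Predict the rows where the target is missing and write them back
        out.loc[missing, name] = model.predict(X[missing])
    return out


# Toy data: b and c both depend on a, and each has a few gaps
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=30)})
toy["b"] = 2 * toy["a"] + 1
toy["c"] = -toy["a"]
toy.loc[[3, 7], "b"] = np.nan
toy.loc[[5, 7, 11], "c"] = np.nan

# Fill the column with fewer gaps first, as in the article
filled = rf_fill(toy, ["b", "c"])
print(filled.isnull().sum())
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the filled values against other strategies. Fixing `random_state` only makes the result repeatable; it does not change the method.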
Check the data after filling
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   year                    99 non-null     int64
 1   selling_price           99 non-null     int64
 2   km_driven               99 non-null     int64
 3   mileage                 99 non-null     float64
 4   engine                  99 non-null     float64
 5   seats                   99 non-null     float64
 6   fuel_CNG                99 non-null     uint8
 7   fuel_Diesel             99 non-null     uint8
 8   fuel_LPG                99 non-null     uint8
 9   fuel_Petrol             99 non-null     uint8
 10  seller_type_Dealer      99 non-null     uint8
 11  seller_type_Individual  99 non-null     uint8
 12  transmission_Automatic  99 non-null     uint8
 13  transmission_Manual     99 non-null     uint8
 14  owner_First Owner       99 non-null     uint8
 15  owner_Second Owner      99 non-null     uint8
 16  owner_Third Owner       99 non-null     uint8
dtypes: float64(3), int64(3), uint8(11)
memory usage: 5.8 KB
All of the original missing values have now been filled.
Finally, export the result to an Excel file
data.to_excel("test-2(填補后).xlsx")
