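A Hands-On Guide to Handling Missing Values with pandas
Author: Wes McKinney. Translator: 徐敬一 (Xu Jingyi). Source: 大数据DT (ID: hzdashuju)
Overview: In data analysis and modeling, a large share of the time goes into data preparation: loading, cleaning, transforming, and rearranging. This article covers the tools pandas provides for handling missing values.
Missing data occurs in many data analysis applications. One of the goals of pandas is to make working with missing values as painless as possible.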
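pandas marks missing data with the floating-point value NaN (Not a Number), as in this Series of strings:

import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object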
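One reason pandas offers dedicated detection methods is that NaN never compares equal to anything, including itself, so an ordinary == test cannot find it. The following is a minimal sketch of that behavior; the variable names are illustrative and not part of the original text.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# NaN is a float that is not equal to anything, not even to itself.
nan_equals_itself = (np.nan == np.nan)

# An element-wise == comparison therefore matches nothing, even at position 2.
eq_mask = (string_data == np.nan)

# isnull() is the reliable way to locate the missing entry.
null_mask = string_data.isnull()

print(nan_equals_itself)   # False
print(eq_mask.tolist())    # [False, False, False, False]
print(null_mask.tolist())  # [False, False, True, False]
```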
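The isnull method returns a boolean mask that flags the missing entries:

string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

Python's built-in None is also treated as missing in object arrays:

string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Methods for handling missing data:

dropna: filters axis labels based on whether the values for each label contain missing data, with a threshold for how much missing data to tolerate
fillna: fills in missing data with a given value or with an interpolation method such as 'ffill' or 'bfill'
isnull: returns boolean values indicating which values are missing
notnull: the negation of isnull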
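The four methods listed above can be tried side by side on one small Series. This is only a sketch: the sample values and the 'unknown' placeholder are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mixing both kinds of missing marker.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

missing = s.isnull()          # True where the value is None or NaN
present = s.notnull()         # element-wise negation of isnull
dropped = s.dropna()          # keeps only the labels with real values
filled = s.fillna('unknown')  # replaces both None and NaN

print(missing.tolist())        # [False, True, True, False]
print(dropped.index.tolist())  # [0, 3]
print(filled.tolist())         # ['aardvark', 'unknown', 'unknown', 'avocado']
```

Note that dropna keeps the original index labels (0 and 3 here) rather than renumbering the result.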
On a Series, dropna returns only the non-null values together with their index labels:

from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to boolean indexing with notnull:

data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

With a DataFrame, dropna by default drops every row that contains at least one missing value:

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned

     0    1    2
0  1.0  6.5  3.0

Passing how='all' drops only the rows in which every value is NA:

data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

To drop columns in the same way, pass axis=1:

data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

The next examples use a larger frame of random values in which some entries are set to NA:

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

          0         1         2
0 -0.204708       NaN       NaN
1 -0.555730       NaN       NaN
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
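Before dropping anything from a frame like the one above, it can help to count how many values are missing per column and how many are present per row. The sketch below rebuilds a frame with the same NA pattern; the seeded generator and the variable names are assumptions, so the random values differ from the ones printed above.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so the sketch is repeatable. The values differ
# from the article's output, but the NA pattern is identical.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

na_per_column = df.isnull().sum()          # NA count for each column
non_na_per_row = df.notnull().sum(axis=1)  # observed values in each row

print(na_per_column.tolist())   # [0, 4, 2]
print(non_na_per_row.tolist())  # [1, 1, 2, 2, 3, 3, 3]
```

Rows whose count equals the number of columns (3 here) are the ones with no missing values at all.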
df.dropna()

          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

To keep only the rows that have at least a given number of non-NA values, use the thresh argument:

df.dropna(thresh=2)

          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

Rather than filtering out missing data, you may want to fill in the holes. Calling fillna with a constant replaces every missing value with that constant:

df.fillna(0)

          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

Calling fillna with a dict sets a different fill value for each column:

df.fillna({1: 0.5, 2: 0})

          0         1         2
0 -0.204708  0.500000  0.000000
1 -0.555730  0.500000  0.000000
2  0.092908  0.500000  0.769023
3  1.246435  0.500000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

fillna returns a new object, but you can also modify the existing object in place:

_ = df.fillna(0, inplace=True)
df

          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

fillna can also propagate existing values into the gaps. The example uses a new frame:

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN

With method='ffill', each gap is filled with the last valid value above it:

df.fillna(method='ffill')

          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761  0.124121 -2.370232
5 -1.265934  0.124121 -2.370232

The limit argument caps the number of consecutive values that are filled:

df.fillna(method='ffill', limit=2)

          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232

With a little creativity, fillna can do more. For example, you can fill a Series with its own mean:

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

fillna function arguments:

value: scalar value or dict-like object used to fill missing values
method: interpolation method, such as 'ffill' or 'bfill'. The source text gives 'ffill' as the default when no other argument is passed; in recent pandas versions the default is None, so either value or method has to be supplied
axis: axis to fill along, default axis=0
inplace: modify the calling object instead of producing a copy
limit: for forward and backward filling, the maximum number of consecutive values to fill
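Two of the fill strategies above can be sketched on a small frame: forward filling with a limit, and filling each column with its own mean. The sketch uses the ffill method rather than fillna(method='ffill'), since to my knowledge newer pandas releases (2.1 and later) deprecate the method keyword; the frame, its column names, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with trailing gaps, similar to the six-row example above.
df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, np.nan, np.nan, np.nan],
    'b': [10.0, 20.0, 30.0, 40.0, np.nan, np.nan],
})

# Forward-fill without the method= keyword; limit caps consecutive fills.
filled_all = df.ffill()
filled_two = df.ffill(limit=2)

# Fill each column with its own mean; mean() skips NaN by default, and
# fillna matches the resulting Series to the frame by column label.
filled_mean = df.fillna(df.mean())

print(filled_two['a'].tolist())   # [1.0, 2.0, 2.0, 2.0, nan, nan]
print(filled_mean['b'].tolist())  # [10.0, 20.0, 30.0, 40.0, 25.0, 25.0]
```

Passing a Series to fillna works like passing a dict: each column is filled with the value stored under its label.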
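About the author: Wes McKinney is the creator of pandas, the popular open-source Python data analysis library. He is an active speaker and a Python/C++ open-source developer in the Python data community and at the Apache Software Foundation. He currently works as a software architect in New York.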
