Pandas Essentials: apply, the Row- and Column-Level Batch Processing Function, Explained
Let's start with an example:
# coding=utf-8
import pandas as pd
df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6], 'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
index=['A', 'B', 'C'])
print(df)
df_new = df.apply(lambda x: x-1)
print('-' * 30, '\n', df_new, sep='')
Col-1 Col-2 Col-3 Col-4
A 1 2 9 3
B 3 4 8 6
C 5 6 7 9
------------------------------
Col-1 Col-2 Col-3 Col-4
A 0 1 8 2
B 2 3 7 5
C 4 5 6 8
apply usage and parameters
apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwds):
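func: The function applied to each column or each row. It can be a Python built-in function, a function from Pandas or another library, a user-defined function, or an anonymous (lambda) function.
axis: Whether the function is applied by column or by row. 0 or 'index' applies the function to each column; 1 or 'columns' applies it to each row. The default is 0.
raw: Whether each column/row is passed to the function as a Series or as an ndarray. raw is a bool and defaults to False, which passes a Series.
result_type: When axis=1, controls the type and shape of the returned result. Supported values are {'expand', 'reduce', 'broadcast', None}; the default is None.
args: Positional arguments passed to func. args expects a tuple, so remember the trailing comma when there is only one positional argument.
**kwds: Keyword arguments accepted by func can be passed through **kwds.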
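The effect of raw and result_type is easiest to see side by side. The following is a minimal sketch, not part of the original examples: the helper names min_max and record_type and the small two-column DataFrame are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6]},
                  index=['A', 'B', 'C'])


def min_max(row):
    # Hypothetical helper: returns two values per row as a plain list.
    return [int(row.min()), int(row.max())]


# result_type=None (default): list results are kept as a Series of lists.
as_lists = df.apply(min_max, axis=1)

# result_type='expand': each list element becomes its own column.
expanded = df.apply(min_max, axis=1, result_type='expand')

received_types = []


def record_type(values):
    # Hypothetical helper: remembers what kind of object apply handed over.
    received_types.append(type(values).__name__)
    return values.sum()


# raw=True: func receives a NumPy ndarray instead of a Series.
totals = df.apply(record_type, raw=True)

print(as_lists)
print(expanded)
print(set(received_types), totals.tolist())
```

With raw=True the function cannot rely on Series attributes such as name or index, so it suits functions that only need the underlying values.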
Passing different kinds of functions
import numpy as np
df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6], 'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
index=['A', 'B', 'C'])
print(df)
df1 = df.apply(max)  # Python built-in function
print('-' * 30, '\n', df1, sep='')
df2 = df.apply(np.mean)  # NumPy function
print('-' * 30, '\n', df2, sep='')
df3 = df.apply(pd.DataFrame.min)  # pandas method
print('-' * 30, '\n', df3, sep='')
Col-1 Col-2 Col-3 Col-4
A 1 2 9 3
B 3 4 8 6
C 5 6 7 9
------------------------------
Col-1 5
Col-2 6
Col-3 9
Col-4 9
dtype: int64
------------------------------
Col-1 3.0
Col-2 4.0
Col-3 8.0
Col-4 6.0
dtype: float64
------------------------------
Col-1 1
Col-2 2
Col-3 7
Col-4 3
dtype: int64
def make_ok(s):
    return pd.Series(['{}ok'.format(d) for d in s])
df4 = df.apply(make_ok)  # user-defined function
print('-' * 30, '\n', df4, sep='')
------------------------------
Col-1 Col-2 Col-3 Col-4
0 1ok 2ok 9ok 3ok
1 3ok 4ok 8ok 6ok
2 5ok 6ok 7ok 9ok
Applying by row or by column
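As a quick orientation, here is a minimal sketch of what axis changes, using a plain sum on the same data. The variable names col_totals and row_totals are illustrative only. With axis=0 the function sees one column at a time and the result is indexed by column name; with axis=1 it sees one row at a time and the result is indexed by row label.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6],
                   'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])

# axis=0 (default): func receives each column as a Series.
col_totals = df.apply(lambda col: col.sum(), axis=0)

# axis=1: func receives each row as a Series.
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(col_totals.to_dict())  # {'Col-1': 9, 'Col-2': 12, 'Col-3': 24, 'Col-4': 18}
print(row_totals.to_dict())  # {'A': 15, 'B': 21, 'C': 27}
```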
def make_ok(s):
    if isinstance(s, pd.Series):
        # s.name is a column label when applied by column, a row label when applied by row
        if s.name in df.columns:
            return pd.Series(['{}ok-列'.format(d) for d in s])  # 列 = column
        else:
            return pd.Series(['{}ok-行'.format(d) for d in s])  # 行 = row
    else:
        return '{}ok'.format(s)
df5 = df.apply(make_ok, axis=0)  # process column by column
print('-' * 30, '\n', df5, sep='')
df6 = df.apply(make_ok, axis=1)  # process row by row
print('-' * 30, '\n', df6, sep='')
------------------------------
Col-1 Col-2 Col-3 Col-4
0 1ok-列 2ok-列 9ok-列 3ok-列
1 3ok-列 4ok-列 8ok-列 6ok-列
2 5ok-列 6ok-列 7ok-列 9ok-列
------------------------------
0 1 2 3
A 1ok-行 2ok-行 9ok-行 3ok-行
B 3ok-行 4ok-行 8ok-行 6ok-行
C 5ok-行 6ok-行 9ok-行 3ok-行
Arguments to func
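Extra arguments can reach func either positionally through args or by keyword through **kwds. The sketch below assumes a made-up helper, scale_and_shift, with one positional and one keyword parameter; it is only meant to show the two passing styles on numeric data.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6]},
                  index=['A', 'B', 'C'])


def scale_and_shift(s, factor, offset=0):
    # Hypothetical helper: s is the column passed in by apply.
    return s * factor + offset


# One positional argument: note the trailing comma that makes args a tuple.
by_position = df.apply(scale_and_shift, args=(2,))

# Positional via args, keyword via **kwds.
mixed = df.apply(scale_and_shift, args=(2,), offset=1)

# Everything by keyword.
all_keywords = df.apply(scale_and_shift, factor=2, offset=1)

print(by_position['Col-1'].tolist())   # [2, 6, 10]
print(mixed['Col-1'].tolist())         # [3, 7, 11]
print(all_keywords.equals(mixed))      # True
```

Keyword names that collide with apply's own parameters (for example axis or raw) would be consumed by apply itself, so such arguments are safer to pass through args.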
def yes_or_no(s, answer):
    # Any value other than 'yes' or 'no' falls back to 'yes'
    if answer != 'yes' and answer != 'no':
        answer = 'yes'
    if isinstance(s, pd.Series):
        return pd.Series(['{}-{}'.format(d, answer) for d in s])
    else:
        return '{}-{}'.format(s, answer)
df7 = df.apply(yes_or_no, args=('yes',))
df7.index = ['A', 'B', 'C']
print('-' * 30, '\n', df7, sep='')
df8 = df.apply(yes_or_no, args=('no',))
print('-' * 30, '\n', df8, sep='')
df9 = df.apply(yes_or_no, args=(0,))
print('-' * 30, '\n', df9, sep='')
------------------------------
Col-1 Col-2 Col-3 Col-4
A 1-yes 2-yes 9-yes 3-yes
B 3-yes 4-yes 8-yes 6-yes
C 5-yes 6-yes 7-yes 9-yes
------------------------------
Col-1 Col-2 Col-3 Col-4
0 1-no 2-no 9-no 3-no
1 3-no 4-no 8-no 6-no
2 5-no 6-no 7-no 9-no
------------------------------
Col-1 Col-2 Col-3 Col-4
0 1-yes 2-yes 9-yes 3-yes
1 3-yes 4-yes 8-yes 6-yes
2 5-yes 6-yes 9-yes 3-yes
Passing multiple functions for aggregation
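When func is a list or a dict, apply returns an aggregation-style result, one row per function. As far as I can tell this matches what DataFrame.agg returns for the same input. The sketch below checks that on a small frame and uses string names instead of NumPy functions, on the assumption that row labels derived from NumPy function names (such as amax and amin) can vary between library versions.

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6],
                   'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])

# A list of function names yields one result row per function.
via_apply = df.apply(['max', 'min'])

# The same request expressed with agg, for comparison.
via_agg = df.agg(['max', 'min'])

print(via_apply)
print((via_apply.values == via_agg.values).all())
```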
df10 = df.apply([np.max, np.min])
print('-' * 40, '\n', df10, sep='')
df11 = df.apply({'Col-1': np.mean, 'Col-2': np.min})
print('-' * 40, '\n', df11, sep='')
df12 = df.apply({'Col-1': [np.mean, np.median], 'Col-2': [np.min, np.mean]})
print('-' * 40, '\n', df12, sep='')
----------------------------------------
Col-1 Col-2 Col-3 Col-4
amax 5 6 9 9
amin 1 2 7 3
----------------------------------------
Col-1 3.0
Col-2 2.0
dtype: float64
----------------------------------------
Col-1 Col-2
mean 3.0 4.0
median 3.0 NaN
amin NaN 2.0
Calling functions by name string
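A string passed as func is looked up as a method name on the object, so 'mean' with axis=1 should give the same result as calling df.mean(axis=1) directly. A minimal sketch of that equivalence (the variable names by_name and direct are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6],
                   'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])

# The string 'mean' resolves to the DataFrame.mean method.
by_name = df.apply('mean', axis=1)

# Calling the method directly, for comparison.
direct = df.mean(axis=1)

print(by_name.to_dict())        # {'A': 3.75, 'B': 5.25, 'C': 6.75}
print(by_name.equals(direct))   # True
```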
df13 = df.apply('mean', axis=1)
print('-' * 30, '\n', df13, sep='')
df14 = df.apply(['mean', 'min'], axis=1)
print('-' * 30, '\n', df14, sep='')
------------------------------
A 3.75
B 5.25
C 6.75
dtype: float64
------------------------------
mean min
A 3.75 1.0
B 5.25 3.0
C 6.75 5.0
Modifying the DataFrame itself
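apply returns a new object and leaves its input untouched, so changing a DataFrame means assigning the result back to it. A minimal sketch, assuming an illustrative new column name 'Row-max':

```python
import pandas as pd

df = pd.DataFrame({'Col-1': [1, 3, 5], 'Col-2': [2, 4, 6],
                   'Col-3': [9, 8, 7], 'Col-4': [3, 6, 9]},
                  index=['A', 'B', 'C'])

# Work on a copy so the original frame stays available for comparison.
result = df.copy()

# apply computes one value per row; the assignment stores it as a new column.
result['Row-max'] = result.apply(lambda row: row.max(), axis=1)

print(result['Row-max'].to_dict())   # {'A': 9, 'B': 8, 'C': 9}
print(list(df.columns))              # the original frame has no 'Row-max' column
```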
df15 = df.copy()
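# Read one column of df and add the processed result to the original df as a new column
df15['Col-x'] = df15['Col-1'].apply(make_ok)
print('-' * 40, '\n', df15, sep='')
# Read one row of df and add the processed result to the original df as a new row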
df15.loc['Z'] = df15.loc['A'].apply(yes_or_no, args=('yes',))
print('-' * 40, '\n', df15, sep='')
----------------------------------------
Col-1 Col-2 Col-3 Col-4 Col-x
A 1 2 9 3 1ok
B 3 4 8 6 3ok
C 5 6 7 9 5ok
----------------------------------------
Col-1 Col-2 Col-3 Col-4 Col-x
A 1 2 9 3 1ok
B 3 4 8 6 3ok
C 5 6 7 9 5ok
Z 1-yes 2-yes 9-yes 3-yes 1ok-yes
Using apply on a Series
s0 = df['Col-2'].apply(make_ok)
print('-' * 20, '\n', s0, sep='')
s = pd.Series(range(5), index=[alpha for alpha in 'abcde'])
print('-' * 20, '\n', s, sep='')
s1 = s.apply(make_ok)
print('-' * 20, '\n', s1, sep='')
--------------------
A 2ok
B 4ok
C 6ok
Name: Col-2, dtype: object
--------------------
a 0
b 1
c 2
d 3
e 4
dtype: int64
--------------------
a 0ok
b 1ok
c 2ok
d 3ok
e 4ok
dtype: object
s2 = s.apply(np.mean)
print('-' * 20, '\n', s2, sep='')
s3 = np.mean(s)
print('-' * 20, '\n', s3, sep='')
--------------------
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
--------------------
2.0
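The two outputs for np.mean differ because Series.apply calls the function once per element, whereas np.mean(s) reduces the whole Series to a single value. A minimal sketch of that contrast, rebuilding the same Series so the block runs on its own (the names per_element and overall are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5), index=list('abcde'))

# Series.apply: np.mean is called on each scalar, so every value is returned
# unchanged apart from being converted to float.
per_element = s.apply(np.mean)

# Calling np.mean on the Series itself reduces it to one number.
overall = np.mean(s)

print(per_element.tolist())   # [0.0, 1.0, 2.0, 3.0, 4.0]
print(float(overall))         # 2.0
```

For a reduction over a Series, calling the function or method directly (np.mean(s) or s.mean()) is therefore the appropriate choice; apply on a Series is for element-by-element transformations.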
