Pandas知識點-詳解分組函數groupby
groupby用法和參數介紹
groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=no_default, observed=False, dropna=True):
by: 指定根據哪個/哪些字段分組,默認值是None,按多個字段分組時傳入列表。by參數可以按位置參數的方式傳入。
axis: 設置按列分組還是按行分組,0或index表示按列分組,1或columns表示按行分組,默認值為0。
level: 當DataFrame的索引為多重索引時,level參數指定用于分組的索引,可以傳入多重索引中索引的下標(0,1...)或索引名,多個用列表傳入。
as_index: 分組結果默認將分組列的值作為索引,如果按單列分組,結果默認是單索引,如果按多列分組,結果默認是多重索引。將as_index設置為False可以重置索引(0,1...)。
sort: 結果按分組列的值升序排列,將sort設置為False則不排序,可以提升性能。
dropna: 默認情況下,分組列的NaN在分組結果中不保留,將dropna設置為False,可以保留NaN分組。
分組對象的內部結構
# coding=utf-8
import pandas as pd
import numpy as np
vip_df = pd.DataFrame(
{'isVip': ['vip', 'svip', 'member', 'vip', 'member', 'vip', 'svip'],
'gender': ['male', 'female', 'female', 'female', 'male', 'female', 'male'],
'age': [25, 30, 40, 25, 40, 18, 30],
'vipLevel': ['LV2', 'LV5', np.nan, 'LV3', 'LV2', 'LV2', 'LV3'],
'growValue': [180, 425, np.nan, 288, 190, 110, 240]}
)
print(vip_df)
grouped = vip_df.groupby('isVip')
print(grouped)
isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
1 svip female 30 LV5 425.0
2 member female 40 NaN NaN
3 vip female 25 LV3 288.0
4 member male 40 LV2 190.0
5 vip female 18 LV2 110.0
6 svip male 30 LV3 240.0
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C3E81D1370>for name, group in grouped:
print(name)
print(group)
member
isVip gender age vipLevel growValue
2 member female 40 NaN NaN
4 member male 40 LV2 190.0
svip
isVip gender age vipLevel growValue
1 svip female 30 LV5 425.0
6 svip male 30 LV3 240.0
vip
isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
3 vip female 25 LV3 288.0
5 vip female 18 LV2 110.0for group in grouped:
print(group)
print(type(group), type(group[0]), type(group[1]))
('member', isVip gender age vipLevel growValue
2 member female 40 NaN NaN
4 member male 40 LV2 190.0)
<class 'tuple'> <class 'str'> <class 'pandas.core.frame.DataFrame'>
('svip', isVip gender age vipLevel growValue
1 svip female 30 LV5 425.0
6 svip male 30 LV3 240.0)
<class 'tuple'> <class 'str'> <class 'pandas.core.frame.DataFrame'>
('vip', isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
3 vip female 25 LV3 288.0
5 vip female 18 LV2 110.0)
<class 'tuple'> <class 'str'> <class 'pandas.core.frame.DataFrame'>print(grouped.groups)
group_name = [gn for gn in grouped.groups.keys()]
print(group_name)
group = grouped.get_group(group_name[2])
print('-'*40, '\n', group, sep='')
{'member': [2, 4], 'svip': [1, 6], 'vip': [0, 3, 5]}
['member', 'svip', 'vip']
----------------------------------------
isVip gender age vipLevel growValue
0 vip male 25 LV2 180.0
3 vip female 25 LV3 288.0
5 vip female 18 LV2 110.0按多重索引分組
vip_multi_df = vip_df.set_index(['isVip', 'gender'])
print('-'*40, '\n', vip_multi_df, sep='')
----------------------------------------
age vipLevel growValue
isVip gender
vip male 25 LV2 180.0
svip female 30 LV5 425.0
member female 40 NaN NaN
vip female 25 LV3 288.0
member male 40 LV2 190.0
vip female 18 LV2 110.0
svip male 30 LV3 240.0# 按多重索引中的指定索引進行分組
grouped = vip_multi_df.groupby(level='isVip')
print('-'*40, '\n', grouped.mean(), sep='')
# 按多重索引中除指定索引之外的索引分組
grouped = vip_multi_df.groupby(level=vip_multi_df.index.names.difference(['gender']))
print('-'*40, '\n', grouped.mean(), sep='')
# 按多重索引中的多個索引分組
grouped = vip_multi_df.groupby(level=[0, 1])
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
isVip
member 40.000000 190.000000
svip 30.000000 332.500000
vip 22.666667 192.666667
----------------------------------------
age growValue
isVip
member 40.000000 190.000000
svip 30.000000 332.500000
vip 22.666667 192.666667
----------------------------------------
age growValue
isVip gender
member female 40.0 NaN
male 40.0 190.0
svip female 30.0 425.0
male 30.0 240.0
vip female 21.5 199.0
male 25.0 180.0grouped = vip_multi_df.groupby('gender')
print('-'*40, '\n', grouped.mean(), sep='')
grouped = vip_multi_df.groupby(['gender', 'age'])
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
gender
female 28.250000 274.333333
male 31.666667 203.333333
----------------------------------------
growValue
gender age
female 18 110.0
25 288.0
30 425.0
40 NaN
male 25 180.0
30 240.0
40 190.0重置結果的索引
grouped = vip_multi_df.groupby(['isVip', 'gender'])
print('-'*40, '\n', grouped.mean(), sep='')
# 重設索引
grouped = vip_multi_df.groupby(['isVip', 'gender'], as_index=False)
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
isVip gender
member female 40.0 NaN
male 40.0 190.0
svip female 30.0 425.0
male 30.0 240.0
vip female 21.5 199.0
male 25.0 180.0
----------------------------------------
age growValue
0 40.0 NaN
1 40.0 190.0
2 30.0 425.0
3 30.0 240.0
4 21.5 199.0
5 25.0 180.0結果是否排序
grouped = vip_df.groupby('isVip')
print('-'*40, '\n', grouped.mean(), sep='')
grouped = vip_df.groupby('isVip', sort=False)
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
isVip
member 40.000000 190.000000
svip 30.000000 332.500000
vip 22.666667 192.666667
----------------------------------------
age growValue
isVip
vip 22.666667 192.666667
svip 30.000000 332.500000
member 40.000000 190.000000是否保留空值
grouped = vip_df.groupby('vipLevel')
print('-'*40, '\n', grouped.mean(), sep='')
grouped = vip_df.groupby('vipLevel', dropna=False)
print('-'*40, '\n', grouped.mean(), sep='')
----------------------------------------
age growValue
vipLevel
LV2 27.666667 160.0
LV3 27.500000 264.0
LV5 30.000000 425.0
----------------------------------------
age growValue
vipLevel
LV2 27.666667 160.0
LV3 27.500000 264.0
LV5 30.000000 425.0
NaN 40.000000 NaN提取分組結果的指定列
grouped = vip_df.groupby('isVip', dropna=False)
print('-'*40, '\n', grouped['gender'], sep='')
for name, group in grouped['gender']:
print(name)
print(group)
----------------------------------------
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000285CA1B0790>
member
2 female
4 male
Name: gender, dtype: object
svip
1 female
6 male
Name: gender, dtype: object
vip
0 male
3 female
5 female
Name: gender, dtype: objectgrouped = vip_df['gender'].groupby(vip_df['isVip'])
for name, group in grouped:
print(name)
print(group)
member
2 female
4 male
Name: gender, dtype: object
svip
1 female
6 male
Name: gender, dtype: object
vip
0 male
3 female
5 female
Name: gender, dtype: object參考文檔:
[1] pandas中文網:https://www.pypandas.cn/docs/
評論
圖片
表情
