在对数据进行处理的时候,分组与聚合是非常常用的操作。在Pandas中此类操作主要是通过groupby函数来完成的。
先看一个实际的例子:
# 生成一个原始的DataFrameIn [70]: raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawk ...: s', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scou ...: ts', 'Scouts'], ...: 'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1 ...: st', '1st', '2nd', '2nd'], ...: 'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ry ...: aner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], ...: 'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3], ...: 'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]} ...: In [71]: df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTes ...: tScore', 'postTestScore']) In [72]: df Out[72]: regiment company name preTestScore postTestScore 0 Nighthawks 1st Miller 4 25 1 Nighthawks 1st Jacobson 24 94 2 Nighthawks 2nd Ali 31 57 3 Nighthawks 2nd Milner 2 62 4 Dragoons 1st Cooze 3 70 5 Dragoons 1st Jacon 4 25 6 Dragoons 2nd Ryaner 24 94 7 Dragoons 2nd Sone 31 57 8 Scouts 1st Sloan 2 62 9 Scouts 1st Piger 3 70 10 Scouts 2nd Riani 2 62 11 Scouts 2nd Ali 3 70
通过groupby函数生成一个groupby对象,如下:
# 当针对特定列(此例是'preTestScore')进行分组时,需要通过df['colume_name'](此例是df['regiment'])来指定键名In [73]: groupby_regiment = df['preTestScore'].groupby(df['regiment'])# 生成的groupby对象没有做任何计算,只是将数据按键进行分组In [74]: groupby_regiment Out[74]: <pandas.core.groupby.SeriesGroupBy object at 0x11112cef0># 分组的聚合统计In [75]: groupby_regiment.describe() Out[75]: count mean std min 25% 50% 75% max regiment Dragoons 4.0 15.50 14.153916 3.0 3.75 14.0 25.75 31.0 Nighthawks 4.0 15.25 14.453950 2.0 3.50 14.0 25.75 31.0 Scouts 4.0 2.50 0.577350 2.0 2.00 2.5 3.00 3.0# 也可以针对特定统计单独计算In [76]: groupby_regiment.mean() Out[76]: regiment Dragoons 15.50 Nighthawks 15.25 Scouts 2.50 Name: preTestScore, dtype: float64
整个分组统计的过程,可以通过下图更清晰地展示:
1.1-group and aggregate process
聚合函数
聚合的时候,既可以使用Pandas内置的函数进行聚合计算,也可以使用自定义的函数进行聚合计算,我们先来看下内置的函数:
1.2-built-in aggregate functions
另外,我们也可以自定义聚合函数:
In [81]: def my_agg(pre_test_score_group): ...: return np.sum(np.power(pre_test_score_group, 2)) ...:In [82]: df['preTestScore'].groupby(df['regiment']).apply(my_agg)Out[82]:regimentDragoons 1562Nighthawks 1557Scouts 26Name: preTestScore, dtype: int64
通过上面的例子我们可以看到,通过apply函数也可以完成类似for循环的迭代,在pandas中尽可能使用apply函数来代替for循环迭代,以提高性能。
根据多个键进行分组和聚合
# 如果有多个键,将多个键放到一个list当中,作为groupby的参数In [77]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean()Out[77]:regiment companyDragoons 1st 3.5 2nd 27.5Nighthawks 1st 14.0 2nd 16.5Scouts 1st 2.5 2nd 2.5Name: preTestScore, dtype: float64# unstack之后变成表格模式,更加清晰In [78]: df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()Out[78]:company 1st 2ndregimentDragoons 3.5 27.5Nighthawks 14.0 16.5Scouts 2.5 2.5
作者:geekpy
链接:https://www.jianshu.com/p/a213384a0dbf