How can I use pandas groupby to compute a row-selection criterion for each unique id?

Is there an alternative to looping over the DataFrame groups?

I have a dataset with 13 million rows covering 1,214 stations (unique IDs):


# copy the data to the clipboard, and read in with
df = pd.read_clipboard(sep=',', index_col=[0])


,tmc_code,measurement_tstamp,travel_time_minutes
0,133-04199,2019-01-01 18:15:00,2.01
1,133-04199,2019-01-01 18:20:00,2.01
2,133-04198,2019-01-01 18:25:00,9.23
3,133-04191,2019-01-01 20:35:00,2.88
4,133-04191,2019-01-01 20:40:00,2.62
5,133-04190,2019-01-01 20:40:00,1.3
6,133-04193,2019-01-01 20:20:00,4.96
7,133-04193,2019-01-01 20:25:00,4.96
8,133-04192,2019-01-01 20:30:00,5.05
9,133-04192,2019-01-01 20:35:00,5.14
10,133-04195,2019-01-01 19:45:00,9.52
11,133-04195,2019-01-01 19:50:00,10.69
12,133-04195,2019-01-01 19:55:00,9.37
13,133-04194,2019-01-01 20:10:00,5.96
14,133-04194,2019-01-01 20:15:00,5.96
15,133-04194,2019-01-01 20:20:00,5.96
16,133P04359,2019-01-01 22:25:00,0.66
17,133P04359,2019-01-01 22:30:00,0.78
18,133P04359,2019-01-01 23:25:00,0.8
19,133P04126,2019-01-01 23:10:00,0.01
20,133P04125,2019-01-01 23:10:00,0.71

Some of the extreme maximum values are physically impossible, so to trim them I tried to build a threshold from the 95th percentile plus the mode, and then filter out the extremes.

Stations produce different travel_time values (because of segment length and traffic patterns), so the percentile and the mode have to be computed per station.

This approach is feasible, but very slow:

df_clean_tmc = df.groupby(['tmc_code'], as_index=False)['travel_time_seconds'].apply(
    lambda x: x[x['travel_time_seconds']
                < (x['travel_time_seconds'].quantile(.95)
                   + x['travel_time_seconds'].apply(lambda x: stats.mode(x)[0]))])
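For reference, here is a minimal sketch (my reconstruction, not the code above) of a version of this filter that runs: after the ['travel_time_seconds'] selection the lambda receives a Series, so x['travel_time_seconds'] would normally raise a KeyError; computing the per-group statistics with transform instead broadcasts them back to row level. It still makes two Python-level passes per group, so it remains slow:

# sketch: broadcast per-group stats to rows with transform, then mask once
grp = df.groupby('tmc_code')['travel_time_seconds']
q95 = grp.transform(lambda s: s.quantile(.95))    # 95th percentile per station
mode = grp.transform(lambda s: s.mode().iat[0])   # per-station mode (first value on ties)
df_clean_tmc = df[df['travel_time_seconds'] < q95 + mode]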

This was also slow, and the result didn't actually perform any calculation: it was the same size as the original dataframe.


I suspect the second apply is wrong, but the groupby object has no "mode" method, and stats.mode worked correctly in individual groupby tests.
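One plausible reason the result came back unchanged: Series.apply calls its function once per element, so in x['travel_time_seconds'].apply(lambda x: stats.mode(x)[0]) the "mode" of each single value is just the value itself, and every row trivially satisfies x < quantile + x, so nothing gets filtered. A per-group mode is available without a dedicated groupby method, for example via agg with pandas' own Series.mode (a sketch, sidestepping scipy's version-dependent return shape):

# sketch: one mode per station via agg; Series.mode can return several
# values on ties, so take the first (smallest) one
modes = df.groupby('tmc_code')['travel_time_seconds'].agg(lambda s: s.mode().iat[0])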


I also tried this:


df_clean_tmc = df.groupby(['tmc_code'], as_index=False)
np.where(df_clean_tmc['travel_time_seconds']
         < (df_clean_tmc['travel_time_seconds'].quantile(.95)
            + df_clean_tmc['travel_time_seconds'].apply(lambda x: stats.mode(x)[0]),
            df['travel_time_seconds']))

but it raised a TypeError:


TypeError: '<' not supported between instances of 'DataFrameGroupBy' and 'tuple'


What would be a more efficient and appropriate way to achieve this?


1 Answer

qq_笑_17

Based on the test results, an improvement of several orders of magnitude is unlikely without reaching for low-level tools such as numba or even Cython. You can see this from how much of the runtime the aggregation computation itself takes. However, two key optimizations are still possible:

1. Reduce the number of explicit passes over the data, mainly df[df['col'] == val]-style filtering. In my implementation, your per-group loop is replaced by (1) aggregating everything at once with .groupby().agg() and (2) checking the thresholds against a lookup table (a dict). I am not sure a more efficient scheme exists; it would still involve one full pass over the data, so at most a few more seconds could be saved.

2. Access df["col"].values instead of df["col"] wherever possible. (Note that this does not copy the data, which is easy to verify by running with the tracemalloc module enabled.)

Benchmark code (15M records generated from your sample):

import pandas as pd
import numpy as np
from datetime import datetime

# check memory footprint
# import tracemalloc
# tracemalloc.start()

# data
df = pd.read_csv("/mnt/ramdisk/in.csv", index_col="idx")
del df['measurement_tstamp']
df.reset_index(drop=True, inplace=True)
df["travel_time_minutes"] = df["travel_time_minutes"].astype(np.float64)

# repeat the sample to get a large test set
cols = df.columns
df = pd.DataFrame(np.repeat(df.values, 500000, axis=0))
df.columns = cols

# Aggregation starts
t0 = datetime.now()
print(f"Program begins....")

# 1. aggregate everything at once
df_agg = df.groupby("tmc_code").agg(
    mode=("travel_time_minutes", pd.Series.mode),
    q95=("travel_time_minutes", lambda x: np.quantile(x, .95)),
)
t1 = datetime.now()
print(f"  Aggregation: {(t1 - t0).total_seconds():.2f}s")

# 2. construct a lookup table for the thresholds
threshold = {}
for tmc_code, row in df_agg.iterrows():  # slow, but only ~1.2k rows
    threshold[tmc_code] = np.max(row["mode"]) + row["q95"]
t2 = datetime.now()
print(f"  Computing Threshold: {(t2 - t1).total_seconds():.2f}s")

# 3. filtering
def f(tmc_code, travel_time_minutes):
    return travel_time_minutes <= threshold[tmc_code]

df = df[list(map(f, df["tmc_code"].values, df["travel_time_minutes"].values))]
t3 = datetime.now()
print(f"  Filter: {(t3 - t2).total_seconds():.2f}s...")
print(f"Program ends in {(datetime.now() - t0).total_seconds():.2f}s")

# memory footprint
# current, peak = tracemalloc.get_traced_memory()
# print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
# tracemalloc.stop()

Results (3 runs, times in seconds):

| No. | old   | new   | new (aggr) | new (filter) |
|-----|-------|-------|------------|--------------|
| 1   | 24.55 | 14.04 | 9.87       | 4.16         |
| 2   | 23.84 | 13.58 | 9.66       | 3.92         |
| 3   | 24.81 | 14.37 | 10.02      | 4.34         |
| avg | 24.40 | 14.00 |            |              |

=> ~74% faster

Tested with Python 3.7 and pandas 1.1.2.
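A possible further refinement of step 3 (my suggestion, not benchmarked above): the per-row dict lookup can be vectorized with Series.map, which accepts a dict, replacing the Python-level map(f, ...) with a single boolean-mask pass:

# hypothetical variant of step 3: vectorized threshold lookup
thresh = df["tmc_code"].map(threshold).values          # per-row threshold array
df = df[df["travel_time_minutes"].values <= thresh]    # one vectorized comparison

Whether this actually beats the map()-based filter would need to be measured on the full dataset.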