Efficiently counting values per time bin for large DataFrames with pandas

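I have multiple large DataFrames (CSV files of roughly 3 GB, about 150 million rows each) that contain Unix-style timestamps and randomly generated observation IDs. Each observation can and will occur multiple times at different times. They look like this: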

       time_utc    obs_id
    0  1564617600  aabthssv
    1  1564617601  vvvx7ths
    2  1564618501  optnhfsa
    3  1564619678  aabthssv
    4  1564619998  abtzsnwe
    ...

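To analyze how the observations develop over time, I now want a DataFrame with one column per observation ID and one row per time bin, where the bin width can be changed (for example, 1 hour), like this: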

    time_bin  aabthssv  vvvx7ths  optnhfsa  ...
    1         1         1         1
    2         1         0         0
    ...

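I tried to do this by creating a NumPy array of bin start timestamps and then, for each bin, selecting all rows that fall into it and appending their value_counts to a new, initially empty DataFrame. This runs into a MemoryError. I have tried more pre-cleaning, but even after reducing the raw data by a third (so 2 GB, 100 million rows) the memory error still occurs.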

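    import numpy as np
    import pandas as pd

    SLICE_SIZE = 3600  # example value of 1h

    # Start timestamp of every bin
    slice_startpoints = np.arange(START_TIME, END_TIME+1-SLICE_SIZE, SLICE_SIZE)
    agg_df = pd.DataFrame()

    for timeslice in slice_startpoints:
        # Rows in the current bin (between() includes both endpoints)
        temp_slice = raw_data[raw_data['time_utc'].between(timeslice, timeslice + SLICE_SIZE)]
        temp_counts = temp_slice['obs_id'].value_counts()
        # Note: DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        agg_df = agg_df.append(temp_counts)
        # Drop the rows that were just counted from the raw data
        temp_index = raw_data[raw_data['time_utc'].between(timeslice, timeslice + SLICE_SIZE)].index
        raw_data.drop(temp_index, inplace=True)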

Is there a way to do this more efficiently, or rather, to make it work at all?

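Edit: Based on the suggestion to use crosstab, I found an efficient way to do this. The file size did not need to be reduced. The following code produces exactly the result I was looking for.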

    df['binned'] = pd.cut(df['time_utc'], bins=slice_startpoints, include_lowest=True, labels=slice_startpoints[1:])
    df.groupby('binned')['obs_id'].value_counts().unstack().fillna(0)

ibeautiful


2 Answers

猛跑小猪

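You can try using cut together with crosstab:

    slice_startpoints = np.arange(START_TIME, END_TIME+SLICE_SIZE, SLICE_SIZE)
    print(slice_startpoints)

    # Assign each timestamp to a bin, labelled by the bin's right edge
    df['binned'] = pd.cut(df['time_utc'],
                          bins=slice_startpoints,
                          include_lowest=True,
                          labels=slice_startpoints[1:])

    # Count occurrences of every obs_id per bin
    df = pd.crosstab(df['binned'], df['obs_id'])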
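To see what the cut plus crosstab combination produces, here is a minimal self-contained sketch on toy data. The IDs are borrowed from the question, but the timestamps are made up so that they span three one-hour bins, and START_TIME and END_TIME are assumed to be the first and last timestamp in the data.

```python
import numpy as np
import pandas as pd

# Toy data: IDs from the question, timestamps invented to span three hours.
df = pd.DataFrame({
    "time_utc": [1564617600, 1564617601, 1564618501, 1564621300, 1564625000],
    "obs_id": ["aabthssv", "vvvx7ths", "optnhfsa", "aabthssv", "abtzsnwe"],
})

SLICE_SIZE = 3600                    # bin width in seconds (1 h)
START_TIME = df["time_utc"].min()    # assumption: first timestamp in the data
END_TIME = df["time_utc"].max()      # assumption: last timestamp in the data

# Bin edges; the stop value is padded so the last edge is >= END_TIME.
edges = np.arange(START_TIME, END_TIME + SLICE_SIZE, SLICE_SIZE)

# Bins are right-closed; include_lowest=True keeps the very first timestamp.
# Each bin is labelled with its right edge.
df["binned"] = pd.cut(df["time_utc"], bins=edges,
                      include_lowest=True, labels=edges[1:])

# One row per time bin, one column per obs_id, cells hold the counts.
counts = pd.crosstab(df["binned"], df["obs_id"])
print(counts)
```

Because the bins are right-closed, a timestamp that falls exactly on an inner edge is counted in the earlier bin, which is worth keeping in mind when comparing against other binning schemes.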

尚方宝剑之说

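You can read the large .csv with a "chunk" iterator and then perform the calculation on the chunks instead of on the whole .csv file. The chunksize defines the number of rows in a single chunk. This gives you a good handle for controlling memory usage. The downside is that you will have to add some logic to merge the results of the chunks.

    import pandas as pd

    # With chunksize set, read_csv returns an iterator of DataFrames
    df_chunk = pd.read_csv('file.csv', chunksize=1000)
    for chunk in df_chunk:
        print(chunk)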
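The merge logic mentioned above could look roughly like the following sketch. It is one possible approach, not the only one: instead of pd.cut it bins by integer floor division, so bins are left-closed, labelled by their start, and aligned to multiples of SLICE_SIZE since the epoch rather than to a START_TIME. The in-memory CSV and the tiny chunksize are stand-ins for a real file and a realistic chunk size.

```python
import io
import pandas as pd

SLICE_SIZE = 3600  # bin width in seconds (1 h)

# Stand-in for a large file; in practice pass the CSV path to read_csv.
csv_data = io.StringIO(
    "time_utc,obs_id\n"
    "1564617600,aabthssv\n"
    "1564617601,vvvx7ths\n"
    "1564618501,aabthssv\n"
    "1564621300,aabthssv\n"
    "1564625000,abtzsnwe\n"
)

partial_counts = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Floor every timestamp to the start of its bin.
    time_bin = (chunk["time_utc"] // SLICE_SIZE * SLICE_SIZE).rename("time_bin")
    # Count rows per (time_bin, obs_id) pair within this chunk only.
    partial_counts.append(chunk.groupby([time_bin, "obs_id"]).size())

# The same (time_bin, obs_id) pair can occur in several chunks: sum them up.
totals = pd.concat(partial_counts).groupby(level=["time_bin", "obs_id"]).sum()

# Pivot to one row per time bin and one column per obs_id.
result = totals.unstack(fill_value=0)
print(result)
```

Only the per-chunk counts are kept in memory, which are usually far smaller than the raw rows. If even those grow too large, they can be summed into a running total after every few chunks.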
