How to automatically slice a dataframe into batches to avoid a MemoryError in Python

I have a dataframe with 2.7 million rows, shown below:


df

Out[10]: 
         ClaimId  ServiceSubCodeKey  ClaimRowNumber  SscRowNumber
0        1902659                183               1             1
1        1902659               2088               1             2
2        1902663               3274               2             1
3        1902674                 12               3             1
4        1902674                 23               3             2
         ...                ...             ...           ...
2793010  2563847               3109          603037             4
2793011  2563883               3109          603038             1
2793012  2564007               3626          603039             1
2793013  2564007               3628          603039             2
2793014  2564363               3109          603040             1

[2793015 rows x 4 columns]

I am trying to one-hot encode it in Python as shown below, but I end up with a memory error:


import pandas as pd

columns = (
    pd.get_dummies(df["ServiceSubCodeKey"])
    .reindex(range(df.ServiceSubCodeKey.min(),
        df.ServiceSubCodeKey.max()+1), axis=1, fill_value=0)
    # now it has all digits
    .astype(str)
    )

# this will create codes
codes_values = [int(''.join(r)) for r in columns.itertuples(index=False)]
codes = pd.Series({'test': codes_values}).explode()
codes.index = df.index

# groupby and aggregate the values into lists
dfg = codes.groupby(df.ClaimId).agg(list).reset_index()

# sum the lists; doing this with a pandas function also does not work, so no .sum or .apply
summed_lists = list()
for r, v in dfg.iterrows():
    summed_lists.append(str(sum(v[0])))

# assign the list of strings to a column
dfg['sums'] = summed_lists

# perform the remainder of the functions on the sums column
dfg['final'] = dfg.sums.str.pad(width=columns.shape[1], fillchar='0').str.rstrip('0')

# merge df and dfg.final
dfm = pd.merge(df, dfg[['ClaimId', 'final']], on='ClaimId')
dfm

  File "pandas/_libs/lib.pyx", line 574, in pandas._libs.lib.astype_str


MemoryError

How can I do this in automatic batches so that I don't run into the memory error?


森林海
1 answer

Qyouu

onehot = []
for groupi, group in df.groupby(df.index//1e5):
    # encode each group separately, producing group_onehot
    onehot.extend(group_onehot)
df = df.assign(onehot=onehot)

This gives you 28 groups to work on separately.

However, looking at your code, the line

codes_values = [int(''.join(r)) for r in columns.itertuples(index=False)]

builds a string that may be up to 4k digits long and then tries to create an integer in the 10e4000 range, which will cause an overflow (see https://numpy.org/devdocs/user/basics.types.html).

Edit

Here is another way to do the encoding. Starting from this df:

df = pd.DataFrame({
    'ClaimId': [1902659, 1902659, 1902663, 1902674, 1902674, 2563847, 2563883,
        2564007, 2564007, 2564363],
    'ServiceSubCodeKey': [183, 2088, 3274, 12, 23, 3109, 3109, 3626, 3628, 3109]
    })

Code:

scale = df.ServiceSubCodeKey.max() + 1
onehot = []
for claimid, ssc in df.groupby('ClaimId').ServiceSubCodeKey:
    ssc_list = ssc.to_list()
    # position i-1 of the string is '1' if code i occurs in this claim
    onehot.append([claimid,
        ''.join(['1' if i in ssc_list else '0' for i in range(1, scale)])])
onehot = pd.DataFrame(onehot, columns=['ClaimId', 'onehot'])
print(onehot)

Output:

   ClaimId                                             onehot
0  1902659  0000000000000000000000000000000000000000000000...
1  1902663  0000000000000000000000000000000000000000000000...
2  1902674  0000000000010000000000100000000000000000000000...
3  2563847  0000000000000000000000000000000000000000000000...
4  2563883  0000000000000000000000000000000000000000000000...
5  2564007  0000000000000000000000000000000000000000000000...
6  2564363  0000000000000000000000000000000000000000000000...

This fixes the overflow problem in your approach and avoids calling pd.get_dummies() to create a 600K x 4K dummy dataframe. The price is that you iterate over a grouped series and build a list comprehension for each group, and neither of those takes advantage of pandas' built-in C implementations.

From here you can either:

Recommended: keep working with the summary that has one one-hot encoding per ClaimId, or

What you asked for: merge it into df as needed, which duplicates the same encoding as many times as the ClaimId is duplicated in df:

df = df.merge(onehot, on='ClaimId')

Output:

   ClaimId  ServiceSubCodeKey                                             onehot
0  1902659                183  0000000000000000000000000000000000000000000000...
1  1902659               2088  0000000000000000000000000000000000000000000000...
2  1902663               3274  0000000000000000000000000000000000000000000000...
3  1902674                 12  0000000000010000000000100000000000000000000000...
4  1902674                 23  0000000000010000000000100000000000000000000000...
5  2563847               3109  0000000000000000000000000000000000000000000000...
6  2563883               3109  0000000000000000000000000000000000000000000000...
7  2564007               3626  0000000000000000000000000000000000000000000000...
8  2564007               3628  0000000000000000000000000000000000000000000000...
9  2564363               3109  0000000000000000000000000000000000000000000000...
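If you still want explicit batches, the per-claim encoding can be combined with batching. The following is only a rough sketch, not part of the original answer: the helper names encode_claims and encode_in_batches and the batch_size parameter are made up for illustration. It assumes the codes are positive integers, and it batches over unique ClaimId values instead of row positions so that the rows of one claim never end up in two different batches.

```python
import pandas as pd


def encode_claims(frame, n_codes):
    """Hypothetical helper: one row per ClaimId with a '0'/'1' string of length n_codes."""
    rows = []
    for claim_id, ssc in frame.groupby("ClaimId")["ServiceSubCodeKey"]:
        bits = bytearray(b"0" * n_codes)
        for code in ssc:
            bits[code - 1] = ord("1")  # code k sets position k - 1 (assumes codes >= 1)
        rows.append((claim_id, bits.decode("ascii")))
    return pd.DataFrame(rows, columns=["ClaimId", "onehot"])


def encode_in_batches(df, batch_size=100_000):
    """Hypothetical helper: encode batch_size claims at a time and concatenate the parts."""
    n_codes = int(df["ServiceSubCodeKey"].max())
    claim_ids = df["ClaimId"].unique()
    parts = []
    for start in range(0, len(claim_ids), batch_size):
        batch_ids = claim_ids[start:start + batch_size]
        # select every row that belongs to the claims in this batch
        batch = df[df["ClaimId"].isin(batch_ids)]
        parts.append(encode_claims(batch, n_codes))
    return pd.concat(parts, ignore_index=True)


# the small example frame from the answer above
df = pd.DataFrame({
    "ClaimId": [1902659, 1902659, 1902663, 1902674, 1902674, 2563847, 2563883,
                2564007, 2564007, 2564363],
    "ServiceSubCodeKey": [183, 2088, 3274, 12, 23, 3109, 3109, 3626, 3628, 3109],
})

result = encode_in_batches(df, batch_size=3)  # tiny batch size just to exercise the loop
print(result)
```

Note that the combined result still holds one long string per claim, so if that is too large for memory, each part could be written to disk inside the loop instead of being collected in a list.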