I have a DataFrame with 2.7 million rows, as shown below:
df
Out[10]:
ClaimId ServiceSubCodeKey ClaimRowNumber SscRowNumber
0 1902659 183 1 1
1 1902659 2088 1 2
2 1902663 3274 2 1
3 1902674 12 3 1
4 1902674 23 3 2
... ... ... ...
2793010 2563847 3109 603037 4
2793011 2563883 3109 603038 1
2793012 2564007 3626 603039 1
2793013 2564007 3628 603039 2
2793014 2564363 3109 603040 1
[2793015 rows x 4 columns]
I am trying to one-hot encode it in Python with the code below, but I end up with a memory error:
import pandas as pd
columns = (
    pd.get_dummies(df["ServiceSubCodeKey"])
    .reindex(range(df.ServiceSubCodeKey.min(),
                   df.ServiceSubCodeKey.max() + 1), axis=1, fill_value=0)
    # now it has all digits
    .astype(str)
)
# this will create codes
codes_values = [int(''.join(r)) for r in columns.itertuples(index=False)]
codes = pd.Series({'test': codes_values}).explode()
codes.index = df.index
# groupby and aggregate the values into lists
dfg = codes.groupby(df.ClaimId).agg(list).reset_index()
# sum the lists; doing this with a pandas function also does not work, so no .sum or .apply
summed_lists = list()
for r, v in dfg.iterrows():
    summed_lists.append(str(sum(v[0])))
# assign the list of strings to a column
dfg['sums'] = summed_lists
# perform the remainder of the functions on the sums column
dfg['final'] = dfg.sums.str.pad(width=columns.shape[1], fillchar='0').str.rstrip('0')
# merge df and dfg.final
dfm = pd.merge(df, dfg[['ClaimId', 'final']], on='ClaimId')
dfm
File "pandas/_libs/lib.pyx", line 574, in pandas._libs.lib.astype_str
MemoryError
How can I do this in automatic batches so that I don't run into the memory error?
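
For reference, here is a rough sketch of one direction that sidesteps the dense one-hot matrix instead of batching it. It builds the digit string for each ClaimId directly from that claim's keys. The function name `encode_claims` and its parameters are hypothetical, not part of pandas. It assumes integer keys and that each ServiceSubCodeKey occurs at most once per ClaimId (with repeats, the summing code above would produce digits greater than 1, which this sketch does not reproduce).

```python
import pandas as pd


def encode_claims(df, key_col="ServiceSubCodeKey", id_col="ClaimId"):
    """Return df with a 'final' digit string per claim, built without get_dummies."""
    # The leftmost character corresponds to the smallest key in the whole frame.
    offset = int(df[key_col].min())

    def to_bits(keys):
        # Positions of the keys present in this claim, relative to the smallest key.
        positions = sorted({int(k) - offset for k in keys})
        # Stop at the largest key of the claim, so no trailing zeros are created.
        bits = ["0"] * (positions[-1] + 1)
        for p in positions:
            bits[p] = "1"
        return "".join(bits)

    # One string per ClaimId, indexed by ClaimId.
    final = df.groupby(id_col)[key_col].agg(to_bits)
    # Broadcast the per-claim string back onto every row of that claim.
    return df.assign(final=df[id_col].map(final))


# Tiny made-up example (not the real data).
small = pd.DataFrame({"ClaimId": [1, 1, 2], "ServiceSubCodeKey": [3, 5, 4]})
result = encode_claims(small)
```

Memory here grows with the number of claims times the length of each string, not with rows times the full key range, but the strings can still be long when the key range is wide.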