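I have a very long data frame that I need to flatten. It looks like the table below. I want to reshape it so that referenceDate and companyId form the index, and the columns have two levels: data_item as the first level and N as the second. As far as I know, pd.pivot should handle this.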
+---------------+-----------+-----------+---+-------+
| referenceDate | CompanyId | data_item | N | value |
+---------------+-----------+-----------+---+-------+
| 2020-01-31 | 1 | A | 1 | 0.1 |
| 2020-01-31 | 2 | A | 2 | 0.2 |
| 2020-01-31 | 3 | A | 3 | 0.3 |
+---------------+-----------+-----------+---+-------+
However,
df = pd.pivot(df, values='value', index=['referenceDate', 'companyId'], columns=['data_item', 'N'])
always raises a ValueError:
Traceback (most recent call last):
File "C:\Users\\PycharmProjects\venvs\\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-56-3738f20d42ed>", line 1, in <module>
df = pd.pivot(df, values='value', index=['referenceDate', 'companyId'], columns=['data_item', 'N'])
File "C:\Users\\PycharmProjects\venvs\\lib\site-packages\pandas\core\reshape\pivot.py", line 429, in pivot
indexed = data._constructor_sliced(data[values].values, index=index)
File "C:\Users\\PycharmProjects\venvs\\lib\site-packages\pandas\core\series.py", line 302, in __init__
"index implies {ind}".format(val=len(data), ind=len(index))
ValueError: Length of passed values is 239689, index implies 2
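pd.pivot_table works, but I don't need any aggregation here, and I am also worried about performance when the data frame is large (billions of rows). In fact I already hit a memory error with it: it reports that it is unable to allocate 1.xxGB for a numpy array when I run this: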
df = pd.pivot_table(df, values='value', index=['referenceDate', 'companyId'],
columns=['data_item', 'N'], aggfunc='first')
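I would like to know why pd.pivot fails here, and whether, besides pd.pivot and pd.pivot_table, there is a better solution to my problem (one that needs the least memory).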
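For reference, here is a minimal sketch of the reshape I am after, written with set_index and unstack instead of pd.pivot. The sample frame is hypothetical and only mirrors the three rows in the table above. The pandas documentation says that list-valued index and columns arguments to pivot were accepted starting with version 1.1.0, so the traceback may simply come from an older release, but that is an assumption on my part.

```python
import pandas as pd

# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    "referenceDate": ["2020-01-31"] * 3,
    "companyId": [1, 2, 3],
    "data_item": ["A", "A", "A"],
    "N": [1, 2, 3],
    "value": [0.1, 0.2, 0.3],
})

# Put all four key columns in the index, then move the last two
# levels (data_item, N) from the row index into the columns.
# No aggregation happens; unstack raises if a key combination repeats.
wide = (
    df.set_index(["referenceDate", "companyId", "data_item", "N"])["value"]
      .unstack(["data_item", "N"])
)

print(wide)
```

Like pivot, unstack builds a dense result: every (referenceDate, companyId) row gets a cell for every (data_item, N) column, with NaN where no value exists. The memory needed therefore grows with rows times columns of the wide table, not with the length of the long one.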