Best way to flatten a long DataFrame, with or without pivot

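I have a long DataFrame that I need to flatten. It looks like the table below. I want to flatten it so that referenceDate and companyId form the index and the columns have two levels: data_item as the first level and N as the second. I know pd.pivot should handle this.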


+---------------+-----------+-----------+---+-------+
| referenceDate | CompanyId | data_item | N | value |
+---------------+-----------+-----------+---+-------+
| 2020-01-31    |         1 | A         | 1 | 0.1   |
| 2020-01-31    |         2 | A         | 2 | 0.2   |
| 2020-01-31    |         3 | A         | 3 | 0.3   |
+---------------+-----------+-----------+---+-------+

However,


df = pd.pivot(df, values='value', index=['referenceDate', 'companyId'], columns=['data_item', 'N'])

always raises a ValueError:


Traceback (most recent call last):
  File "C:\Users\\PycharmProjects\venvs\\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-56-3738f20d42ed>", line 1, in <module>
    df = pd.pivot(df, values='value', index=['referenceDate', 'companyId'], columns=['data_item', 'N'])
  File "C:\Users\\PycharmProjects\venvs\\lib\site-packages\pandas\core\reshape\pivot.py", line 429, in pivot
    indexed = data._constructor_sliced(data[values].values, index=index)
  File "C:\Users\\PycharmProjects\venvs\\lib\site-packages\pandas\core\series.py", line 302, in __init__
    "index implies {ind}".format(val=len(data), ind=len(index))
ValueError: Length of passed values is 239689, index implies 2

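pd.pivot_table works, but I don't need any aggregation in this case, and I'm also worried about performance when the DataFrame is large (billions of rows). In fact I do hit a memory error here: it says it cannot allocate 1.xxGB for a numpy array when I run this: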


df = pd.pivot_table(df, values='value', index=['referenceDate', 'companyId'],
                    columns=['data_item', 'N'], aggfunc='first')
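As a rough illustration of why memory becomes a concern, the size of the wide result can be estimated before reshaping. This is only a sketch under assumptions: the sample frame is hypothetical and mirrors the table above (using the table's CompanyId spelling), and the estimate counts only the dense value block, ignoring index overhead and any intermediate copies pandas makes.

```python
import pandas as pd

# Hypothetical sample mirroring the table in the question.
df = pd.DataFrame({
    "referenceDate": ["2020-01-31"] * 3,
    "CompanyId": [1, 2, 3],
    "data_item": ["A", "A", "A"],
    "N": [1, 2, 3],
    "value": [0.1, 0.2, 0.3],
})

# Distinct row keys and distinct column keys of the wide result.
n_rows = len(df[["referenceDate", "CompanyId"]].drop_duplicates())
n_cols = len(df[["data_item", "N"]].drop_duplicates())

# The wide table is dense: every (row key, column key) cell is stored,
# including NaN cells for combinations absent from the long data.
est_bytes = n_rows * n_cols * df["value"].dtype.itemsize
print(n_rows, n_cols, est_bytes)
```

If the estimate is far larger than the long frame itself, most cells of the wide table would be NaN, and any dense reshape (pivot, pivot_table, or unstack) will need at least that much memory.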

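I'd like to know why pd.pivot fails here, and whether, apart from pd.pivot and pd.pivot_table, there is a best solution to my problem (one that needs the least memory).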


三国纷争
1 answer

富国沪深

If you can upgrade to a newer version of pandas, temporarily or permanently, try that, because pivot has a bug in earlier versions of pandas when index is given as a list (list-valued index and columns arguments are supported from pandas 1.1.0 onward). For example, you can run pip install pandas==1.1.3 to move to a specific version in which pivot is fixed.

pip install pandas==1.1.3
# then restart the kernel
import pandas as pd
# df = ....
df = pd.pivot(df, values='value', index=['referenceDate', 'CompanyId'], columns=['data_item', 'N'])
df

Out[1]:
data_item                  A
N                          1    2    3
referenceDate CompanyId
2020-01-31    1          0.1  NaN  NaN
              2          NaN  0.2  NaN
              3          NaN  NaN  0.3

Afterwards you can always go back with pip install pandas==0.25.3. You can do all of this from a Jupyter notebook. Make sure to restart the kernel every time you switch versions.

My current pandas version is 1.0.1, so I get the same error as well:

pip install pandas==1.0.1
# restart kernel
import pandas as pd
# df = ...
df = pd.pivot(df, values='value', index=['referenceDate', 'CompanyId'], columns=['data_item', 'N'])
df

Error:

ValueError                                Traceback (most recent call last)
<ipython-input-2-11248dbe0eba> in <module>
      1 df = d.copy()
----> 2 df = pd.pivot(df, values='value', index=['referenceDate', 'CompanyId'], columns=['data_item', 'N'])
      3 df

C:\Users\david.erickson\Anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in pivot(data, index, columns, values)
    445             )
    446         else:
--> 447             indexed = data._constructor_sliced(data[values].values, index=index)
    448     return indexed.unstack(columns)
    449

C:\Users\david.erickson\Anaconda3\lib\site-packages\pandas\core\series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    290                 if len(index) != len(data):
    291                     raise ValueError(
--> 292                         f"Length of passed values is {len(data)}, "
    293                         f"index implies {len(index)}."
    294                     )

ValueError: Length of passed values is 3, index implies 2.
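If upgrading is not an option, one possible workaround is to do by hand what the traceback shows pivot doing internally: build a Series indexed by all four key columns and call unstack on the column levels. The following is a minimal sketch, not a tested fix for the original data: the sample frame is hypothetical and mirrors the question's table, and the duplicate check is an added safeguard, since unstack cannot reshape repeated keys without aggregation.

```python
import pandas as pd

# Hypothetical sample mirroring the table in the question.
df = pd.DataFrame({
    "referenceDate": ["2020-01-31"] * 3,
    "CompanyId": [1, 2, 3],
    "data_item": ["A", "A", "A"],
    "N": [1, 2, 3],
    "value": [0.1, 0.2, 0.3],
})

keys = ["referenceDate", "CompanyId", "data_item", "N"]

# unstack cannot reshape duplicated keys, so check before reshaping.
if df.duplicated(subset=keys).any():
    raise ValueError("duplicate (index, column) keys; aggregation would be needed")

# Index by all key columns, then move data_item and N into the columns.
wide = df.set_index(keys)["value"].unstack(["data_item", "N"])
print(wide)
```

This avoids the aggregation step of pivot_table, but the result is still a dense table, so it does not by itself remove the memory cost of the NaN cells.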