Merging two DataFrames: UPSERT in Python



Insert or update in a pandas DataFrame

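I want to merge storage_df and processed_df as shown below. Assume Phone is the primary key:
1. If the value exists, update the fields (and create the remaining columns, such as Gender in the example below).
2. If the value does not exist, insert it into the final DataFrame, as with 382837371 in the example.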

Note that the number of columns keeps growing as we process more information, but there is a limit of 32 columns up to which processed_df/storage_df will grow.

storage_df
________________________
Phone       Name
918348483   Sumit
874647474   Saurabh
238362633   NA

processed_df
_______________________________
Phone       Name      Gender
874647474   Saurabh   Male
238362633   NA        Female
382837371   NA        Male

final_df
_______________________________
Phone       Name      Gender
918348483   Sumit     NA
874647474   Saurabh   Male
238362633   NA        Female
382837371   NA        Male

For this I used pandas' combine_first:

    final_df = processed_df.set_index('Phone').combine_first(storage_df.set_index('Phone'))

But as the DataFrames grow, the system runs out of memory (16 GB RAM) and cannot combine shape (88488, 6) with shape (7307, 8).

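I could store both DataFrames in SQL using sqlite and then use UPSERT. Can you guide me on the syntax for that? That said, I would really prefer to do this in memory rather than in a database.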

  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 5364, in combine_first
    return self.combine(other, combiner, overwrite=False)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 5229, in combine
    this, other = self.align(other, copy=False)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3792, in align
    broadcast_axis=broadcast_axis)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 8423, in align
    fill_axis=fill_axis)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 8459, in _align_frame
    allow_dups=True)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 4490, in _reindex_with_indexers
    copy=copy)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1220, in reindex_indexer
    self._consolidate_inplace()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
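
The traceback is cut off, but every frame shown belongs to the index-alignment step that combine_first runs before combining values. One thing that may be worth ruling out, as an assumption the post does not confirm, is repeated Phone values: aligning two non-unique indexes pairs every repeat of a key on one side with every repeat on the other, so the aligned frames can become far larger than the input shapes suggest. A minimal sketch of such a check follows; the toy data and the helper name duplicate_key_report are made up for illustration.

```python
import pandas as pd


def duplicate_key_report(df: pd.DataFrame, key: str = "Phone") -> pd.Series:
    """Count how often each repeated key value occurs (empty if the key is unique)."""
    counts = df[key].value_counts(dropna=False)
    return counts[counts > 1]


# Toy frame (made-up data) in which one phone number appears twice.
storage_df = pd.DataFrame(
    {
        "Phone": [918348483, 874647474, 874647474],
        "Name": ["Sumit", "Saurabh", "Saurabh"],
    }
)

report = duplicate_key_report(storage_df)
print(report)

# Keeping only the last row per key makes the key unique,
# which is what set_index('Phone') followed by combine_first implicitly expects.
deduped = storage_df.drop_duplicates(subset="Phone", keep="last")
```

If the report is empty for both frames, duplicates are not the cause and the memory problem lies elsewhere.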

月关宝盒

Views: 245 | Answers: 3

3 Answers

芜湖不芜

You can try a pandas outer join.

    # join the DataFrames
    final_df = storage_df.merge(processed_df, on='Phone', how='outer', suffixes=('', '_y'))
    # remove the extra columns left over from the merge
    final_df.drop(list(final_df.filter(regex=r'.*_y$').columns), axis=1, inplace=True)
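
One caveat with dropping the `_y` columns outright: a row that exists only in processed_df keeps nothing from its own Name column, and an updated value in processed_df never replaces the stored one. That makes no difference in the sample data, where the only new row has no name. The sketch below is a hedged variant that coalesces each shared column instead, preferring the value from processed_df and falling back to storage_df. The function name upsert and the `_new` suffix are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Sample data from the question; None marks the NA cells.
storage_df = pd.DataFrame(
    {"Phone": [918348483, 874647474, 238362633], "Name": ["Sumit", "Saurabh", None]}
)
processed_df = pd.DataFrame(
    {
        "Phone": [874647474, 238362633, 382837371],
        "Name": ["Saurabh", None, None],
        "Gender": ["Male", "Female", "Male"],
    }
)


def upsert(storage: pd.DataFrame, processed: pd.DataFrame, key: str = "Phone") -> pd.DataFrame:
    """Outer-join on `key`; for shared columns prefer `processed`, fall back to `storage`."""
    merged = storage.merge(processed, on=key, how="outer", suffixes=("", "_new"))
    for col in processed.columns:
        new_col = f"{col}_new"
        if new_col in merged.columns:
            # Take the processed value where present, otherwise keep the stored one.
            merged[col] = merged[new_col].combine_first(merged[col])
            merged = merged.drop(columns=new_col)
    return merged


final_df = upsert(storage_df, processed_df)
print(final_df)
```

Because merge works on columns rather than on an aligned index, this also avoids the set_index step. It still assumes Phone is unique within each frame.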


PIPIONE

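Set Phone as the index of both DataFrames, since it is the primary key as you said, and then use pandas.concat. While doing so, drop the common columns from the other DataFrame; otherwise they will be duplicated in the result.

    >>> df1.set_index('Phone', inplace=True)
    >>> df2.set_index('Phone', inplace=True)
    >>> # columns that exist only in df2 (an Index, since recent pandas rejects a set as a column selector)
    >>> other_cols = df2.columns.difference(df1.columns)
    >>> df = pd.concat([df1, df2[other_cols]], axis=1)
    >>> df
                  Name  Gender
    Phone
    238362633      NaN  Female
    382837371      NaN    Male
    874647474  Saurabh    Male
    918348483    Sumit     NaN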


泛舟湖上清波郎朗

All you need to do is drop the duplicate column first and then do an outer join.

    # as mentioned you don't need this.
    processed_df.drop('Name', axis=1, inplace=True)
    # now do an outer join
    storage_df.merge(processed_df, on='Phone', how='outer')
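
The question also asks what the SQLite route would look like. Below is a minimal sketch using Python's built-in sqlite3 module with an in-memory database, so nothing is written to disk. It assumes SQLite 3.24 or newer (the first version with `ON CONFLICT ... DO UPDATE`). The table name people, the fixed three-column schema, and the helper to_rows are illustrative choices; a real table would need a column for each of the up to 32 fields.

```python
import sqlite3

import pandas as pd

# Sample data from the question; None marks the NA cells.
storage_df = pd.DataFrame(
    {"Phone": [918348483, 874647474, 238362633], "Name": ["Sumit", "Saurabh", None]}
)
processed_df = pd.DataFrame(
    {
        "Phone": [874647474, 238362633, 382837371],
        "Name": ["Saurabh", None, None],
        "Gender": ["Male", "Female", "Male"],
    }
)


def to_rows(df, columns):
    """Plain tuples for sqlite3; missing values (NaN/None) become None, i.e. SQL NULL."""
    return [
        tuple(None if pd.isna(v) else v for v in row)
        for row in df[columns].itertuples(index=False, name=None)
    ]


conn = sqlite3.connect(":memory:")  # in-memory database
conn.execute("CREATE TABLE people (Phone INTEGER PRIMARY KEY, Name TEXT, Gender TEXT)")

# Load the stored rows first; Gender is not supplied, so it stays NULL.
conn.executemany(
    "INSERT INTO people (Phone, Name) VALUES (?, ?)",
    to_rows(storage_df, ["Phone", "Name"]),
)

# UPSERT the processed rows: insert new phones, update existing ones.
# COALESCE keeps the stored value when the incoming value is NULL.
conn.executemany(
    """
    INSERT INTO people (Phone, Name, Gender) VALUES (?, ?, ?)
    ON CONFLICT(Phone) DO UPDATE SET
        Name = COALESCE(excluded.Name, people.Name),
        Gender = COALESCE(excluded.Gender, people.Gender)
    """,
    to_rows(processed_df, ["Phone", "Name", "Gender"]),
)
conn.commit()

final_df = pd.read_sql_query(
    "SELECT Phone, Name, Gender FROM people ORDER BY Phone", conn
)
conn.close()
print(final_df)
```

Whether this uses less memory than a pandas merge depends on the data. Its main benefit is that the primary key constraint rejects duplicate Phone values at load time instead of letting them multiply during a join.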

