Filtering a large DataFrame based on information from a small DataFrame


I have a large DataFrame with roughly 1 billion rows and about 15 columns.

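+--------+-------+-----------+----+
| country|   city|       date| ...|
+--------+-------+-----------+----+
|  France|  Paris| 2018-07-01| ...|
|   Spain| Madrid| 2017-06-01| ...|
+--------+-------+-----------+----+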

I also have a smaller DataFrame, about 50 rows, that holds the date to filter on for each (country, city) combination.

+--------+-------+-------------+
| country|   city|  filter_date|
+--------+-------+-------------+
|  France|  Paris|   2018-07-01|
|   Spain| Madrid|   2017-06-01|
+--------+-------+-------------+

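I want to filter the large DataFrame by date, using the filter_date stored in the small DataFrame for each (country, city) combination. For example, drop every row that is (France, Paris) and dated before 2018-07-01, and so on.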

The first solution I came up with was simply a left join followed by a filter, for example:

from pyspark.sql import functions as f

df = df_large.join(df_small, on=['country', 'city'], how='left').filter(f.col('date') >= f.col('filter_date'))

However, this solution is not ideal: the left join is very expensive because my DataFrame is so large, and any action I run after this operation takes a long time.

慕码人8056858


1 Answer

慕斯709654

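Try a left semi join and broadcast the smaller DataFrame. Also combine all of the filters with and (&&) inside the join condition, as follows (Scala syntax):

import org.apache.spark.sql.functions.broadcast

df_large.join(broadcast(df_small), df_large("country") === df_small("country") && df_large("city") === df_small("city") && df_large("date") >= df_small("filter_date"), "leftsemi")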
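To make the semantics of that join concrete, here is a minimal plain-Python sketch (not Spark code) of what a broadcast left semi join with this condition computes. The small table becomes an in-memory lookup, and each large row is kept only if its (country, city) key is present and its date is on or after the stored filter_date. The sample rows and the helper names `build_lookup` and `left_semi_filter` are hypothetical, and the sketch assumes each (country, city) pair appears at most once in the small table.

```python
from datetime import date

# Hypothetical sample data mirroring the tables in the question.
small_rows = [
    ("France", "Paris", date(2018, 7, 1)),
    ("Spain", "Madrid", date(2017, 6, 1)),
]
large_rows = [
    ("France", "Paris", date(2018, 7, 1)),   # on the cutoff: kept
    ("France", "Paris", date(2018, 6, 30)),  # before the cutoff: dropped
    ("Spain", "Madrid", date(2017, 6, 1)),   # on the cutoff: kept
    ("Spain", "Madrid", date(2016, 1, 1)),   # before the cutoff: dropped
    ("Italy", "Rome", date(2019, 1, 1)),     # no entry in the small table: dropped
]


def build_lookup(small):
    """Turn the small table into a dict keyed by (country, city).

    Assumes each (country, city) pair occurs at most once.
    """
    return {(country, city): filter_date for country, city, filter_date in small}


def left_semi_filter(large, lookup):
    """Yield the large rows that have a matching key and date >= filter_date.

    Only columns of the large rows are emitted, as in a left semi join.
    """
    for row in large:
        country, city, row_date = row
        cutoff = lookup.get((country, city))
        if cutoff is not None and row_date >= cutoff:
            yield row


lookup = build_lookup(small_rows)
kept = list(left_semi_filter(large_rows, lookup))
```

Note that rows whose (country, city) pair is missing from the small table are dropped. The left join followed by the filter in the question behaves the same way, because comparing date with a null filter_date does not evaluate to true.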
