Joining pandas DataFrames based on distance minimization

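I have a dataset of stores with 2D locations and daily timestamps. I am trying to match each row with a weather measurement taken at a station at some other location, also with daily timestamps, so that the Cartesian distance between each store and its matched station is minimized. Weather measurements are not taken every day, and station locations may vary, so the problem is to find the closest station for each specific store on each specific date.

I realize I could build nested loops to do the matching, but I am wondering whether anyone here can think of a neat way to do this with pandas DataFrame operations. A toy example dataset is shown below. For simplicity, it has static weather station locations.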

import pandas as pd

store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})

The data below is the desired result. I have included station_id only for clarity.

   store_id  date  station_id  weather
0         1     1           1       20
1         1     2           1       21
2         1     3           1       19
3         2     1           2       17
4         2     2           3       19
5         2     3           2       16
6         3     1           3       18
7         3     2           3       19
8         3     3           3       17

12345678_0001


2 Answers

忽然笑

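The idea of the solution is to build a table of all store/station combinations for each date:

df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))

compute the distance (the squared distance is enough, since it has the same minimum):

df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2

and select the row with the minimum in each group:

df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()

If you have a lot of dates, you can do the join group by group.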
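The "group by group" remark above could be read as processing one date at a time, so that the table of all combinations never holds more than a single day. A minimal sketch under that reading follows; the helper name `nearest_station_for_date` and the loop structure are illustrative assumptions, not part of the original answer, and dates with no station measurements are simply skipped.

```python
import pandas as pd

store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})


def nearest_station_for_date(stores, stations):
    """Hypothetical helper: both frames are assumed to hold a single date."""
    # All store/station pairs for this date
    pairs = stores.merge(stations, on='date', suffixes=('_store', '_station'))
    # Squared distance is enough to find the minimum
    pairs['dist'] = ((pairs.x_store - pairs.x_station) ** 2
                     + (pairs.y_store - pairs.y_station) ** 2)
    # Row label of the closest station for each store
    nearest = pairs.loc[pairs.groupby('store_id')['dist'].idxmin()]
    return nearest[['store_id', 'date', 'station_id', 'weather']]


# Split the stations by date once, then join one date at a time
stations_by_date = {date: group for date, group in weather_station_df.groupby('date')}

parts = []
for date, stores in store_df.groupby('date'):
    stations = stations_by_date.get(date)
    if stations is None:
        continue  # no measurements on this date, so nothing to match
    parts.append(nearest_station_for_date(stores, stations))

result = (pd.concat(parts)
          .sort_values(['store_id', 'date'])
          .reset_index(drop=True))
print(result)
```

On the toy data this should reproduce the table from the question. Whether it is actually faster or lighter than a single merge depends on how many stores and stations share each date.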

拉风的咖菲猫

import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# Join on date to get all combinations of stores and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])

# Apply the distance formula to each combination
# (after the merge, the _x columns come from store_df and the _y columns from weather_station_df)
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])

# Get the minimum distance for each day per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()

# Use the resulting minimums to get the station_id matching the minimum distances
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')

# Filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])

Edit: now uses a vectorized distance formula.
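One thing to watch in the approach above: merging the minimums back on the `distances` column returns more than one row for a store and date when two stations are exactly equally close. A possible variant, sketched here as an assumption rather than as part of the original answer, sorts by distance and keeps the first row per store and date. Breaking ties by the lowest `station_id` is an arbitrary choice made for this sketch.

```python
import numpy as np
import pandas as pd

store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})

# All store/station combinations per day, with explicit suffixes
df_all = store_df.merge(weather_station_df, on='date',
                        suffixes=('_store', '_station'))

# Vectorized Euclidean distance between each store and each station
df_all['distance'] = np.hypot(df_all['x_store'] - df_all['x_station'],
                              df_all['y_store'] - df_all['y_station'])

# Sort so the closest station comes first within each (store_id, date);
# station_id is the assumed tie-breaker. Then keep one row per pair.
result_df = (df_all
             .sort_values(['store_id', 'date', 'distance', 'station_id'])
             .drop_duplicates(['store_id', 'date'], keep='first')
             [['store_id', 'date', 'station_id', 'weather', 'distance']]
             .reset_index(drop=True))
print(result_df)
```

This always yields exactly one row per store and date that has at least one station measurement, and it avoids matching rows on floating-point distance values.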
