猿问

根据多个条件合并两个数据框

我正在寻找比较两个数据框(df-a 和 df-b)并搜索 1 个数据框(df-b)中给定 ID 和日期在另一个数据框(df-a)中 ID 匹配的日期范围内的位置). 然后我想剥离 df-a 中的所有列并将它们连接到它们匹配的 df-b 中。例如


如果我有一个数据框 df-a,格式如下 df-a:


    ID       Start_Date    End_Date     A   B   C   D   E 

0   cd2      2020-06-01    2020-06-24   'a' 'b' 'c' 10  20

1   cd2      2020-06-24    2020-07-21

2   cd56     2020-06-10    2020-07-03

3   cd915    2020-04-28    2020-07-21

4   cd103    2020-04-13    2020-04-24

和 df-b 在


    ID      Date

0   cd2     2020-05-12

1   cd2     2020-04-12

2   cd2     2020-06-10

3   cd15    2020-04-28

4   cd193   2020-04-13

我想要一个像这样的输出 df df-c=


    ID      Date        Start_Date  End_Date    A   B   C   D   E 

0   cd2     2020-05-12      -           -       -   -   -   -   -

1   cd2     2020-04-12      -           -       -   -   -   -   -

2   cd2     2020-06-10 2020-06-01 2020-06-11    'a' 'b' 'c' 10  20

3   cd15    2020-04-28      -           -       -   -   -   -   -

4   cd193   2020-04-13      -           -       -   -   -   -   -

在上一篇文章中,我得到了一个很好的答案,它允许比较数据帧并在满足此条件的任何地方丢弃,但我正在努力弄清楚如何从 df-a 中适当地提取信息。目前的尝试如下!


df_c=df_b.copy()


ar=[]

for i in range(df_c.shape[0]):

    currentID = df_c.stafnum[i]

    currentDate = df_c.Date[i]

    df_a_entriesForCurrentID = df_a.loc[df_a.stafnum == currentID]


    for j in range(df_a_entriesForCurrentID.shape[0]):

        startDate = df_a_entriesForCurrentID.iloc[j,:].Leave_Start_Date

        endDate = df_a_entriesForCurrentID.iloc[j,:].Leave_End_Date


        if (startDate <= currentDate <= endDate):

            print(df_c.loc[i])

            print(df_a_entriesForCurrentID.iloc[j,:])

            

            #df_d=pd.concat([df_c.loc[i], df_a_entriesForCurrentID.iloc[j,:]], axis=0)

            

            #df_fin_2=df_fin.append(df_d, ignore_index=True)

            #ar.append(df_d)


茅侃侃
浏览 97回答 1
1回答

慕尼黑5688855

所以你想做一种“软”匹配。这是一个尝试矢量化日期范围匹配的解决方案。# notice working with dates as strings, inequalities will only work if dates in format y-m-d# otherwise it is safer to parse all date columns like `df_a.Date = pd.to_datetime(df_a)`# create a groupby object once so we can efficiently filter df_b inside the loop# good idea if df_b is considerably large and has many different IDsgdf_b = df_b.groupby('ID')b_IDs = gdf_b.indices # returns a dictionary with grouped rows {ID: arr(integer-indices)}matched = [] # so we can collect matched rows from df_b# iterate over rows with `.itertuples()`, more efficient than iterating range(len(df_a))for i, ID, date in df_a.itertuples():&nbsp; &nbsp; if ID in b_IDs:&nbsp; &nbsp; &nbsp; &nbsp; gID = gdf_b.get_group(ID) # get the filtered df_b&nbsp; &nbsp; &nbsp; &nbsp; inrange = gID.Start_Date.le(date) & gID.End_Date.ge(date)&nbsp; &nbsp; &nbsp; &nbsp; if any(inrange):&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; matched.append(&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; gID.loc[inrange.idxmax()] # get the first row with date inrange&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; .values[1:] # use the array without column indices and slice `ID` out&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; )&nbsp; &nbsp; &nbsp; &nbsp; else:&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; matched.append([np.nan] * (df_b.shape[1] - 1)) # no date inrange, fill with NaNs&nbsp; &nbsp; else:&nbsp; &nbsp; &nbsp; &nbsp; matched.append([np.nan] * (df_b.shape[1] - 1)) # no ID match, fill with NaNsdf_c = df_a.join(pd.DataFrame(matched, columns=df_b.columns[1:]))print(df_c)输出&nbsp; &nbsp; &nbsp; ID&nbsp; &nbsp; &nbsp; &nbsp; Date&nbsp; Start_Date&nbsp; &nbsp; End_Date&nbsp; &nbsp; A&nbsp; &nbsp; B&nbsp; &nbsp; C&nbsp; &nbsp; &nbsp;D&nbsp; &nbsp; &nbsp;E0&nbsp; &nbsp; cd2&nbsp; 2020-05-12&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; NaN&nbsp; NaN&nbsp; NaN&nbsp; &nbsp;NaN&nbsp; &nbsp;NaN1&nbsp; &nbsp; cd2&nbsp; 2020-04-12&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; NaN&nbsp; NaN&nbsp; NaN&nbsp; &nbsp;NaN&nbsp; &nbsp;NaN2&nbsp; &nbsp; cd2&nbsp; 2020-06-10&nbsp; 2020-06-01&nbsp; 2020-06-24&nbsp; &nbsp; a&nbsp; &nbsp; b&nbsp; &nbsp; c&nbsp; 10.0&nbsp; 20.03&nbsp; &nbsp;cd15&nbsp; 2020-04-28&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; NaN&nbsp; NaN&nbsp; NaN&nbsp; &nbsp;NaN&nbsp; &nbsp;NaN4&nbsp; cd193&nbsp; 2020-04-13&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; NaN&nbsp; NaN&nbsp; NaN&nbsp; &nbsp;NaN&nbsp; &nbsp;NaN
随时随地看视频慕课网APP

相关分类

Python
我要回答