猿问

使用 pandas merge_asof() 识别范围关系

给定以下两个表示范围的数据框:


df1 =


  start   end

0   200   300

1   600   900

2   950  1050

df2 =


  start   end

0   350   550

1   650   800

2   900  1100

它们可以这样表示:


df1  [200 300]            [600    900] [950 1050]

df2            [350  550]   [650 800] [900   1100]

我的任务是确定范围df1df2范围之间的四种不同类型的关系:

  1. df2的子集df1

    • df2 [650 800]的子集df1 [600    900]

  2. df2的超集df1

    • df2 [900   1100]的超集df1 [950 1050]

  3. df2之后df1(最近邻,不包括子集/超集)

    • df2 [350  550]df1 [200 300]

    • df2 [900   1100]df1 [600    900]

  4. df2之前df1(最近的邻居,不包括子集/超集)

    • df2 [350  550]df1 [600    900]

    • df2 [650 800]df1 [950 1050]

我正在尝试使用merge_asof()这个答案中学到的东西,但由于超集/子集关系添加的复杂性,它不起作用,例如:

# Create "before" condition

df_before = pd.merge_asof(

    df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),

    df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),

    on='before_end',

    direction='forward'

).query('end > before_end')


print(df_before)

输出:


  before_start  before_end  start    end

0          350         550  600.0  900.0

1          650         800  600.0  900.0

目标输出:


  before_start  before_end  start     end

0          350         550  600.0   900.0

1          650         800  950.0  1050.0

问题是


pd.merge_asof(

    df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),

    df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),

    on='before_end',

    direction='forward'

)

df1.end在 800 之后找到最接近的df2 [650 800],即df1 [600    900]:


  before_start  before_end  start    end

0          350         550  600.0  900.0

1          650         800  600.0  900.0

2          900        1100    NaN    NaN

是否可以merge_asof()根据特定条件查找最接近的值,例如“df1.end仅当df1.start该范围大于 800(在本例中为 950)时才查找最近的值”?有了这种复杂程度,也许还有另一个更适合这项任务的功能?

笔记:

  • 中的范围df1可以相互重叠,但绝不相同。

  • 中的范围df2可以相互重叠,但绝不相同。

  • df1并且df2每行有超过 200k 行。

  • df1并且df2有不同的行数。



回首忆惘然
浏览 186回答 1
1回答

阿晨1998

可以pd.merge_asof用来查找之前和之后的选项。before_df = pd.merge_asof(df1, df2, left_on='start', right_on='end', suffixes=['', '_before'])before_df#&nbsp; &nbsp; start&nbsp; &nbsp;end&nbsp; start_before&nbsp; end_before# 0&nbsp; &nbsp; 200&nbsp; &nbsp;300&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN# 1&nbsp; &nbsp; 600&nbsp; &nbsp;900&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;350.0&nbsp; &nbsp; &nbsp; &nbsp;550.0# 2&nbsp; &nbsp; 950&nbsp; 1050&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;650.0&nbsp; &nbsp; &nbsp; &nbsp;800.0after_df = pd.merge_asof(df2, df1, left_on='start', right_on='end', suffixes=['_after', ''])#&nbsp; &nbsp; start_after&nbsp; end_after&nbsp; start&nbsp; end# 0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 350&nbsp; &nbsp; &nbsp; &nbsp; 550&nbsp; &nbsp; 200&nbsp; 300# 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 650&nbsp; &nbsp; &nbsp; &nbsp; 800&nbsp; &nbsp; 200&nbsp; 300# 2&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 900&nbsp; &nbsp; &nbsp; &nbsp;1100&nbsp; &nbsp; 600&nbsp; 900但是要使其工作或子集和超集计算并不容易。对于那些人,我会争取这种可以一次性工作的算法。def range_intersect(lh_ranges, rh_ranges):&nbsp;&nbsp; &nbsp; all_ranges = sorted(&nbsp; &nbsp; &nbsp; &nbsp; [(b, e, 'lh') for b, e in lh_ranges] +&nbsp; &nbsp; &nbsp; &nbsp; [(b, e, 'rh') for b, e in rh_ranges]&nbsp; &nbsp; )&nbsp;&nbsp; &nbsp; res = []&nbsp;&nbsp; &nbsp; max_b, max_e = None, None&nbsp;&nbsp; &nbsp; for b, e, which in all_ranges:&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; if which == 'rh':&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if max_e is None or e > max_e:&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; max_b, max_e = b, e&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; elif max_e is not None and e <= max_e:&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; res.append((b, e, max_b, max_e))&nbsp;&nbsp; &nbsp; return res这会发现 的元素是lh中元素的子集rh。要查找超集,可以反向运行。为简单起见,它采用范围列表而不是DataFrames。转换很简单。lh = df1.to_dict('split')['data']rh = df2.to_dict('split')['data']lh# [[200, 300], [600, 900], [950, 1050]]rh&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;# [[350, 550], [650, 800], [900, 1100]]在那之后,DataFrame你想要的结果只是几个合并。# Compute supersets, then run in reverse to get the subsets.superset_df = pd.DataFrame(range_intersect(lh, rh), columns=['start', 'end', 'start_superset', 'end_superset'])subset_df = pd.DataFrame(range_intersect(rh, lh), columns=['start_subset', 'end_subset', 'start', 'end'])# Merge all the results together.result = df1.merge(subset_df, how='left').merge(superset_df, how='left').merge(before_df, how='left').merge(after_df, how='left')# The reversed operations, after and subset, can have many matches in df1.result.drop_duplicates(['start', 'end'])#&nbsp; &nbsp; start&nbsp; &nbsp;end&nbsp; start_subset&nbsp; end_subset&nbsp; start_superset&nbsp; end_superset&nbsp; start_before&nbsp; end_before&nbsp; start_after&nbsp; end_after# 0&nbsp; &nbsp; 200&nbsp; &nbsp;300&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; 350.0&nbsp; &nbsp; &nbsp; 550.0# 2&nbsp; &nbsp; 600&nbsp; &nbsp;900&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;650.0&nbsp; &nbsp; &nbsp; &nbsp;800.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;350.0&nbsp; &nbsp; &nbsp; &nbsp;550.0&nbsp; &nbsp; &nbsp; &nbsp; 900.0&nbsp; &nbsp; &nbsp;1100.0# 3&nbsp; &nbsp; 950&nbsp; 1050&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;900.0&nbsp; &nbsp; &nbsp; &nbsp; 1100.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;650.0&nbsp; &nbsp; &nbsp; &nbsp;800.0&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; NaN&nbsp; &nbsp; &nbsp; &nbsp; NaN
随时随地看视频慕课网APP

相关分类

Python
我要回答