Python:搜索多列并识别包含列表中任何元素的行

我最近发布了一个关于如何使用 R 进行子集化的问题:Subset dataframe in R based on a list specified in a vector (using a 'starts with' expression or equivalent)

用户@Ric S 非常友好地提供了这个问题的 R 答案。R 解决方案是:


library(dplyr)

df %>% 

  filter_at(vars(-ID), any_vars(grepl(paste(medications, collapse = "|"), .)))

但是,我的印象是 Python 可能更省时。总而言之,我试图在超过 100 万行的数据集中识别任何参与者规定的他汀类药物和基于此的子集。我有一个列表,其中包含这些药物的所有代码(为了演示目的,我刚刚编造了一些代码),接下来我想搜索数据框并将其子集给任何拥有“以”开头的药物代码的人"列表中的任何元素。示例中的循环似乎是多余的,但我将为一些药物类别运行此代码。


示例数据集,称为meds_df:


     ID readcode_1 readcode_2 generic_name

1  1001       bxd1 1146785342  Simvastatin

2  1002       <NA>       <NA>         <NA>

3  1003       <NA>       <NA>  Pravastatin

4  1004       <NA>       <NA>         <NA>

5  1005       bxd4   45432344         <NA>

6  1006       <NA>       <NA>         <NA>

7  1007       <NA>       <NA>         <NA>

8  1008       <NA>       <NA>         <NA>

9  1009       <NA>       <NA>         <NA>

10 1010       bxde       <NA>         <NA>

11 1011       <NA>       <NA>         <NA>

预期输出为:


     ID readcode_1 readcode_2 generic_name

1  1001       bxd1 1146785342  Simvastatin

3  1003       <NA>       <NA>  Pravastatin

5  1005       bxd4   45432344         <NA>

10 1010       bxde       <NA>         <NA>

到目前为止,我的代码基于其他 stackoverflow 问题,我还需要在此处嵌入 any() 或等效项:


list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]

variable_list = ['statins']

for m in variable_list:

    print('extracting individuals prescribed %s' %m)

test = meds_df.loc[meds_df['readcode_1', 'readcode_2','generic_name'].str.startswith(list_to_extract, na=False)]

    print(test)

收到错误:


KeyError: ('readcode_1', 'readcode_2','generic_name')

任何帮助将不胜感激!


梦里花落0921
浏览 105回答 3
3回答

温温酱

首先,正确的语法是meds_df[['readcode_1', 'readcode_2','generic_name']](list索引切片中的列名)。这就是为什么你得到一个KeyError.要回答您的问题,这是一种实现方法:# Updated to use tuple per David's suggestionidx = pd.concat((med_df[col].astype(str).str.startswith(tuple(list_to_extract)) for col in ['readcode_1', 'readcode_2','generic_name']), axis=1).any(axis=1)med_df.loc[idx]结果:&nbsp; &nbsp; &nbsp; ID readcode_1&nbsp; &nbsp; readcode_2 generic_name1&nbsp; &nbsp;1001&nbsp; &nbsp; &nbsp; &nbsp;bxd1&nbsp; 1.146785e+09&nbsp; Simvastatin3&nbsp; &nbsp;1003&nbsp; &nbsp; &nbsp; &nbsp; NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; Pravastatin5&nbsp; &nbsp;1005&nbsp; &nbsp; &nbsp; &nbsp;bxd4&nbsp; 4.543234e+07&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; NaN10&nbsp; 1010&nbsp; &nbsp; &nbsp; &nbsp;bxde&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; NaN

繁花如伊

您可以通过这种方式进行申请:list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]bool_df = df[['readcode_1', 'readcode_2','generic_name']].apply(lambda x: x.str.startswith(tuple(list_to_extract), na=False), axis=1)df.loc[bool_df[bool_df.any(axis=1)].index]输出:&nbsp; &nbsp; ID&nbsp; readcode_1&nbsp; readcode_2&nbsp; generic_name1&nbsp; &nbsp;1001&nbsp; &nbsp; bxd1&nbsp; &nbsp; 1.146785e+09&nbsp; &nbsp; Simvastatin3&nbsp; &nbsp;1003&nbsp; &nbsp; NaN&nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Pravastatin5&nbsp; &nbsp;1005&nbsp; &nbsp; bxd4&nbsp; &nbsp; 4.543234e+07&nbsp; &nbsp; NaN10&nbsp; 1010&nbsp; &nbsp; bxde&nbsp; &nbsp; NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;NaN感谢 r.ook 发现了一个小错误

叮当猫咪

另一种解决方案,在重新创建数据帧之前,字符串处理发生在 vanilla python 中:list_to_extract = ["bxd", "Simvastatin", "1146785342", "45432344", "Pravastatin"]cols_to_search = ['readcode_1', 'readcode_2','generic_name']output = [(ID, *searchbox)&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; for ID, searchbox in zip(df.ID,df.filter(cols_to_search).to_numpy())&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; if any([str(box).startswith(tuple(list_to_extract)) for box in searchbox])]pd.DataFrame(output, columns = df.columns)&nbsp; &nbsp; &nbsp;ID readcode_1&nbsp; readcode_2&nbsp; generic_name0&nbsp; &nbsp;1001&nbsp; &nbsp; bxd1&nbsp; &nbsp; &nbsp;1.146785e+09&nbsp; &nbsp;Simvastatin1&nbsp; &nbsp;1003&nbsp; &nbsp; NaN&nbsp; &nbsp; &nbsp; NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Pravastatin2&nbsp; &nbsp;1005&nbsp; &nbsp; bxd4&nbsp; &nbsp; &nbsp;4.543234e+07&nbsp; &nbsp;NaN3&nbsp; &nbsp;1010&nbsp; &nbsp; bxde&nbsp; &nbsp; &nbsp;NaN&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; NaN
打开App,查看更多内容
随时随地看视频慕课网APP

相关分类

Python