Pandas: split a string column into component columns based on a list of (overlapping) start/end split-point strings

Here is one approach; I'm not sure how efficient it is on large datasets:

    import pandas as pd

    # first, build one big regex pattern from the list of keywords
    pat = '|'.join(listStrings)

    # find all keywords in the series
    new_df = testdf.A.str.findall(pat)
    # 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
    # 1                    [\nResults, \nConclusion]
    # 2                [BACKGROUND, METHODS, RESULT]
    # Name: A, dtype: object

    # find all the chunks by splitting each text on the keywords found in it
    chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                        for i in range(len(testdf))]).stack()

    # stack the keywords
    keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

    # build the output dataframe
    # note that we shift the chunks to match the keywords
    pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

Output:

                    D                                                  E
    0 0    BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...
      1       METHODS  \nData from 75 ALS patients and 75 healthy con...
      2        RESULT  S\nFollowing predictor variable selection, a c...
      3    DISCUSSION  \nThis study evaluates disease-associated imag...
      4           NaN                                                NaN
    1 0     \nResults  : The findings show ICT innovation was effecti...
      1  \nConclusion  : By evaluating the ICT innovation, empirical ...
      2           NaN                                                NaN
    2 0    BACKGROUND   AND PURPOSE\nRotator cuff tears are associate...
      1       METHODS  \nSupraspinatus muscle biopsies were obtained ...
      2        RESULT  S\nDegenerative changes were present in both p...
      3           NaN                                                NaN

Edit: here is a version of the solution that gives the exact output specified in the question:

    import numpy as np
    import pandas as pd

    # first, build one big regex pattern from the list of keywords
    pat = '|'.join(listStrings)

    # find all keywords in the series
    new_df = testdf.A.str.findall(pat)
    # 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
    # 1                    [\nResults, \nConclusion]
    # 2                [BACKGROUND, METHODS, RESULT]
    # Name: A, dtype: object

    # find all the chunks by splitting each text on the keywords found in it
    chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                        for i in range(len(testdf))]).stack()

    # flatten the per-row keyword lists into a single array
    keys = np.concatenate(new_df.values)

    # shift the chunks to match the keywords, then drop the trailing NaN of each row
    values = chunks.groupby(level=0).shift(-1).dropna().values

    # repeat each row's label from column B once per keyword found in that row
    labels = np.concatenate([len(val) * [testdf['B'][ind]]
                             for ind, val in enumerate(new_df.values)])

    # build the output dataframe
    pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output:

       C             D                                                  E
    0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...  Entry1
    1  METHODS       \nData from 75 ALS patients and 75 healthy con...  Entry1
    2  RESULTS       \nFollowing predictor variable selection, a cl...  Entry1
    3  DISCUSSION    \nThis study evaluates disease-associated imag...  Entry1
    4  \nResult      s: The findings show ICT innovation was effect...  Entry2
    5  \nConclusion  : By evaluating the ICT innovation, empirical ...  Entry2
    6  BACKGROUND    AND PURPOSE\nRotator cuff tears are associate...   Entry3
    7  METHODS       \nSupraspinatus muscle biopsies were obtained ...  Entry3
    8  RESULTS       \nDegenerative changes were present in both pa...  Entry3
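Since the efficiency of the per-row `pd.concat` is in doubt, a possible alternative is to split each text once with a single compiled regex that keeps the keywords (a capture group in `re.split`) and then pair every keyword with the chunk that follows it. This is only a sketch and not the method used above: the sample `testdf`, the `list_strings` values, and the helper `split_sections` are made up for illustration, and sorting the keywords longest-first is an assumption about how overlapping keywords (for example RESULT and RESULTS) should be resolved.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the question's `listStrings` and `testdf`.
list_strings = ["BACKGROUND", "METHODS", "RESULT", "RESULTS", "DISCUSSION"]
testdf = pd.DataFrame({
    "A": [
        "BACKGROUND\nfoo text METHODS\nbar text RESULTS\nbaz text",
        "METHODS\nonly one section",
    ],
    "B": ["Entry1", "Entry2"],
})

# Escape each keyword so regex metacharacters are matched literally, and try
# longer keywords first so RESULTS wins over RESULT (an assumed preference).
alternatives = "|".join(
    re.escape(s) for s in sorted(list_strings, key=len, reverse=True)
)
pattern = re.compile("(" + alternatives + ")")


def split_sections(text):
    """Return (keyword, following_chunk) pairs found in `text`."""
    # With a capture group, re.split keeps the separators:
    # [prefix, keyword1, chunk1, keyword2, chunk2, ...]
    parts = pattern.split(text)
    return list(zip(parts[1::2], parts[2::2]))


# One output row per keyword, labelled with the value of column B.
rows = [
    {"C": key, "D": chunk, "E": label}
    for text, label in zip(testdf["A"], testdf["B"])
    for key, chunk in split_sections(text)
]
result = pd.DataFrame(rows, columns=["C", "D", "E"])
print(result)
```

Any text before the first keyword is discarded here, which mirrors the `shift(-1)` step in the answer; whether this is actually faster on a large dataset would need to be measured.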

