
Pandas: Split a string column into component columns based on a list of (overlapping) start/end delimiter strings

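In my Pandas DataFrame of strings, one column holds a large string that I want to split into separate strings, each on its own row of a new DataFrame. The second column is a label, and the same label should appear next to every component string.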


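The start and end split points should be determined by a set of strings. Each component string begins where one of the strings in that set is encountered. The string that marks the start should go in its own column on that row, and should not be part of the split-off string.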


Here is an example.


I have this set of strings:


listStrings = {
'\nIntroduction', '\nCase',
'\nLiterature', '\nBackground', '\nRelated',
'\nMethods', '\nMethod',
'\nTechniques', '\nMethodology',
'\nResults', '\nResult', '\nExperimental',
'\nExperiments', '\nExperiment',
'\nDiscussion', '\nLimitations',
'\nConclusion', '\nConclusions',
'\nConcluding',
'Introduction\n', 'Case\n',
'Literature\n', 'Background\n', 'Related\n',
'Methods\n', 'Method\n',
'Techniques\n', 'Methodology\n',
'Results\n', 'Result\n', 'Experimental\n',
'Experiments\n', 'Experiment\n',
'Discussion\n', 'Limitations\n',
'Conclusion\n', 'Conclusions\n',
'Concluding\n',
'INTRODUCTION', 'CASE',
'LITERATURE', 'BACKGROUND', 'RELATED',
'METHODS', 'METHOD',
'TECHNIQUES', 'METHODOLOGY',
'RESULTS', 'RESULT', 'EXPERIMENTAL',
'EXPERIMENTS', 'EXPERIMENT',
'DISCUSSION', 'LIMITATIONS',
'CONCLUSION', 'CONCLUSIONS',
'CONCLUDING',
'Introduction:', 'Case:',
'Literature:', 'Background:', 'Related:',
'Methods:', 'Method:',
'Techniques:', 'Methodology:',
'Results:', 'Result:', 'Experimental:',
'Experiments:', 'Experiment:',
'Discussion:', 'Limitations:',
'Conclusion:', 'Conclusions:',
'Concluding:',
}
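The set above applies the same four patterns to each heading word: a leading newline, a trailing newline, all caps, and a trailing colon. As a minimal sketch, it could be generated instead of typed out by hand. The helper name `build_keywords` and the `words` list are illustrative assumptions, not part of the original question.

```python
# Heading words taken from the set in the question.
words = [
    "Introduction", "Case", "Literature", "Background", "Related",
    "Methods", "Method", "Techniques", "Methodology",
    "Results", "Result", "Experimental", "Experiments", "Experiment",
    "Discussion", "Limitations", "Conclusion", "Conclusions", "Concluding",
]

def build_keywords(words):
    """Hypothetical helper: expand each word into the four variants used in the question."""
    keywords = set()
    for w in words:
        keywords.update({"\n" + w, w + "\n", w.upper(), w + ":"})
    return keywords

generated = build_keywords(words)
print(len(generated))  # 19 words x 4 variants = 76 strings
```

Generating the variants this way keeps the word list in one place, so adding a new heading word only requires one change.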

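Until the string in column A reaches one of the strings in listStrings, do not save anything. Once it reaches one of the strings in listStrings, put that listStrings string in its own separate column in a row of the new DataFrame. Then put everything after that listStrings string in a new row, until the segment reaches another string in listStrings. Then repeat the process: put that string in a new column, create a new row for the new segment, and so on.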
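The procedure described above can be sketched for a single string in plain Python. This is only an illustration under stated assumptions: the function name `split_on_keywords`, the shortened keyword set, and the sample text are all hypothetical, and the sketch pairs each keyword with the text that follows it rather than building the DataFrame.

```python
import re

# Hypothetical, shortened keyword set for illustration only.
keywords = {"BACKGROUND", "METHODS", "RESULTS", "RESULT"}

def split_on_keywords(text, keywords):
    """Return (keyword, segment) pairs; text before the first keyword is dropped."""
    # Try longer alternatives first so "RESULTS" is preferred over "RESULT".
    alternatives = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in alternatives))
    matches = list(pattern.finditer(text))
    pairs = []
    for i, m in enumerate(matches):
        # A segment runs from the end of this keyword to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pairs.append((m.group(), text[m.end():end]))
    return pairs

pairs = split_on_keywords("Title BACKGROUND\nfoo METHODS\nbar RESULTS\nbaz", keywords)
print(pairs)
# [('BACKGROUND', '\nfoo '), ('METHODS', '\nbar '), ('RESULTS', '\nbaz')]
```

The leading "Title " is discarded because nothing is saved until the first keyword is reached.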



森栏
1 answer

大话西游666

Here is one approach; I'm not sure how efficient it is on large datasets:

import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# stack the keywords:
keys = new_df.str.join(' ').str.split(' ', expand=True).stack()

# our returned dataframe
# note that we shift the chunks to match the keywords
pd.DataFrame({'D': keys, 'E': chunks.groupby(level=0).shift(-1)})

Output:

                D                                                  E
0 0    BACKGROUND  \nDiagnostic uncertainty in ALS has serious ma...
  1       METHODS  \nData from 75 ALS patients and 75 healthy con...
  2        RESULT  S\nFollowing predictor variable selection, a c...
  3    DISCUSSION  \nThis study evaluates disease-associated imag...
  4           NaN                                                NaN
1 0     \nResults  : The findings show ICT innovation was effecti...
  1  \nConclusion  : By evaluating the ICT innovation, empirical ...
  2           NaN                                                NaN
2 0    BACKGROUND   AND PURPOSE\nRotator cuff tears are associate...
  1       METHODS  \nSupraspinatus muscle biopsies were obtained ...
  2        RESULT  S\nDegenerative changes were present in both p...
  3           NaN                                                NaN

Edit: here is a version of the solution that gives the exact output specified in the question:

import numpy as np
import pandas as pd

# first we build a big regex pattern
pat = '|'.join(listStrings)

# find all keywords in the series
new_df = testdf.A.str.findall(pat)
# 0    [BACKGROUND, METHODS, RESULT, DISCUSSION]
# 1                    [\nResults, \nConclusion]
# 2                [BACKGROUND, METHODS, RESULT]
# Name: A, dtype: object

# find all the chunks by splitting the text with the found keywords
chunks = pd.concat([testdf.A.iloc[[i]].str.split('|'.join(new_df.iloc[i]), expand=True)
                    for i in range(len(testdf))]).stack()

# flatten the keywords array
keys = np.concatenate(new_df.values)

# shift the chunks to match the keywords and drop the trailing NaN rows
values = chunks.groupby(level=0).shift(-1).dropna().values

# repeat each row's label from column B once per keyword found in that row
labels = np.concatenate([len(val) * [testdf['B'][ind]] for ind, val in enumerate(new_df.values)])

# our returned dataframe
pd.DataFrame({'C': keys, 'D': values, 'E': labels})

Output:

   C             D                                                   E
0  BACKGROUND    \nDiagnostic uncertainty in ALS has serious ma...   Entry1
1  METHODS       \nData from 75 ALS patients and 75 healthy con...   Entry1
2  RESULTS       \nFollowing predictor variable selection, a cl...   Entry1
3  DISCUSSION    \nThis study evaluates disease-associated imag...   Entry1
4  \nResult      s: The findings show ICT innovation was effect...   Entry2
5  \nConclusion  : By evaluating the ICT innovation, empirical ...   Entry2
6  BACKGROUND    AND PURPOSE\nRotator cuff tears are associate...    Entry3
7  METHODS       \nSupraspinatus muscle biopsies were obtained ...   Entry3
8  RESULTS       \nDegenerative changes were present in both pa...   Entry3
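In some of the output rows above, the keyword comes out as RESULT and the chunk starts with a stray "S". A likely cause is that regex alternation takes the first alternative that matches at a position, and the order of a joined set is arbitrary, so 'RESULT' can be tried before 'RESULTS'. One possible variation, sketched here under assumptions, sorts the keywords longest first, escapes them with `re.escape`, and uses `DataFrame.explode` on two columns (this assumes pandas 1.3 or later). The two-row `testdf` and the shortened `listStrings` below are hypothetical stand-ins, not the data from the question.

```python
import re
import pandas as pd

# Hypothetical stand-in for testdf: column A holds the text, column B the label.
testdf = pd.DataFrame({
    "A": ["Intro text BACKGROUND\nfoo METHODS\nbar RESULTS\nbaz",
          "Abstract\nResults: good\nConclusion: fine"],
    "B": ["Entry1", "Entry2"],
})

# Hypothetical shortened keyword set.
listStrings = {"BACKGROUND", "METHODS", "RESULTS", "RESULT", "\nResults", "\nConclusion"}

# Longest alternatives first, each escaped so it is matched literally.
alternatives = sorted(listStrings, key=len, reverse=True)
pat = "|".join(re.escape(k) for k in alternatives)

keys = testdf["A"].str.findall(pat)        # list of keywords found in each row
chunks = testdf["A"].str.split(pat).str[1:]  # text after each keyword; the preamble is dropped

# Both list columns have the same length per row, so they can be exploded together.
out = (pd.DataFrame({"C": keys, "D": chunks, "E": testdf["B"]})
         .explode(["C", "D"])
         .reset_index(drop=True))
print(out)
```

Because `explode` repeats the other columns for every list element, the label in column E is carried along to each keyword/chunk pair without building it separately.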