
How do I use extract to pull the uppercase part and certain substrings out of a pandas DataFrame column?

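This question is a follow-up to my previous question, How to extract only uppercase substring from pandas series?

I decided to ask a new question rather than change the old one.

My goal is to extract the aggregation method agg and the feature name feat from a column named item.

Here is the problem: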

import numpy as np
import pandas as pd

df = pd.DataFrame({'item': ['num', 'bool', 'cat', 'cat.COUNT(example)',
                            'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})

# Placeholder pattern: both named groups are still empty.
regexp = (r'(?P<agg>) '    # agg is the word in uppercase (every other substring is lowercase)
          r'(?P<feat>), '  # 1. if there is no uppercase, the whole string is feat
                           # 2. if there is uppercase, the substring after "example." is feat
                           #    e.g. cat ==> cat
                           #         cat.N_MOST_COMMON(example.ord)[2] ==> ord
          )

# The column is named 'item', so index it by that name.
df[['agg', 'feat']] = df['item'].str.extract(regexp, expand=True)

# I am not sure how to build up regexp here.

print(df)


"""

Required output

                                item   agg             feat
0                                num                   num
1                               bool                   bool
2                                cat                   cat
3                 cat.COUNT(example)   COUNT                   # note: here feat is empty
4  cat.N_MOST_COMMON(example.ord)[2]   N_MOST_COMMON   ord
5             cat.FIRST(example.ord)   FIRST           ord
6             cat.FIRST(example.num)   FIRST           num

""";


慕雪6442864
1 Answer

冉冉说

For feat, since you already got an answer for agg in your other StackOverflow question, I think you can use the following: extract two different series from two different patterns separated by |, then fillna() one series with the other.

1. ^([^A-Z]*$) returns the full string only if the full string contains no uppercase letters.
2. [^a-z].*example\.([a-z]+)\).*$ should return the substring after example. and before ) only if there is uppercase in the string before example.

df = pd.DataFrame({'item': ['num', 'bool', 'cat', 'cat.COUNT(example)',
                            'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})
# Raw string so that \. and \) reach the regex engine unchanged.
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
# Column 0 holds the all-lowercase matches, column 1 the "example.xxx" matches.
df['feat'] = s[0].fillna(s[1]).fillna('')
df

Out[1]:
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num

The above gives the output you are looking for on your sample data and meets your conditions. However:

1. What if there is uppercase after example.? The current pattern would return ''.

See example #2 below, where some of the data has been changed to illustrate that point:

df = pd.DataFrame({'item': ['num', 'cat.count(example.AAA)', 'cat.count(example.aaa)',
                            'cat.count(example)', 'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)', 'cat.FIRST(example.num)']})
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df

Out[2]:
                                item                    feat
0                                num                     num
1             cat.count(example.AAA)
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
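One possible way to fill both columns in a single pass, and to tolerate uppercase after example., is sketched below. It is not part of the answer above and rests on assumptions of my own: an aggregation name starts with an uppercase letter, consists only of uppercase letters and underscores, and sits directly before "("; the feature name is any run of word characters between "(example." and ")". The last row of the sample data is changed to cat.FIRST(example.AAA) purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'item': ['num', 'bool', 'cat', 'cat.COUNT(example)',
                            'cat.N_MOST_COMMON(example.ord)[2]',
                            'cat.FIRST(example.ord)',
                            'cat.FIRST(example.AAA)']})  # illustrative row

# Assumption: agg is an uppercase/underscore word immediately before "(".
# With a single group and expand=False, str.extract returns a Series.
agg = df['item'].str.extract(r'([A-Z][A-Z_]*)\(', expand=False)

# Assumption: feat is the word between "(example." and ")".
# \w+ accepts both lowercase and uppercase feature names.
feat_inner = df['item'].str.extract(r'\(example\.(\w+)\)', expand=False)

df['agg'] = agg.fillna('')
# Where an aggregation was found, use the inner name (empty if absent);
# otherwise the whole item is the feature.
df['feat'] = feat_inner.where(agg.notna(), df['item']).fillna('')

print(df)
```

Splitting the work into two small patterns keeps each one easy to read, and Series.where decides per row which rule applies. Note that a row such as cat.count(example.AAA) has no uppercase word before "(", so under these assumptions the whole string would be treated as feat.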