这个问题是上一个问题How to extract only uppercase substring from pandas series? 的后续问题 。
我决定提出新问题,而不是改变旧问题。
我的目标是从名为 item 的列中提取聚合方法agg
和特征名称。feat
这是问题:
import numpy as np
import pandas as pd
df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})
regexp = (r'(?P<agg>) ' # agg is the word in uppercase (all other substring is lowercased)
r'(?P<feat>), ' # 1. if there is no uppercase, whole string is feat
# 2. if there is uppercase the substring after example. is feat
# e.g. cat ==> cat
# cat.N_MOST_COMMON(example.ord)[2] ==> ord
)
df[['agg','feat']] = df.col.str.extract(regexp,expand=True)
# I am not sure how to build up regexp here.
print(df)
"""
Required output
item agg feat
0 num num
1 bool bool
2 cat cat
3 cat.COUNT(example) COUNT # note: here feat is empty
4 cat.N_MOST_COMMON(example.ord)[2] N_MOST_COMMON ord
5 cat.FIRST(example.ord) FIRST ord
6 cat.FIRST(example.num) FIRST num
""";
冉冉说
相关分类