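Text manipulation in a DataFrame: extracting words

I would like to check the words that appear next to numbers. For example, my DataFrame has this column: Recipes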


Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.

Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.

2 heaped teaspoons Chinese five-spice 

100 ml Marsala

1 litre organic chicken stock

I would like to get a new column into which I extract them:


New Column

[1 hour, 20 minutes]

15 minutes

2 heaped

100 ml

1 litre

because I need to compare them against a list of values:


to_compare= ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

to see how many elements each row has in common with it. Thank you for your help.


跃然一笑
2 Answers

动漫人物

We can use Series.str.extractall with the pattern "digits, whitespace, word characters". Then we check which matches are in to_compare, and finally use GroupBy.sum to count how many matches each row has:

    matches = df['Col'].str.extractall(r'(\d+\s\w+)')
    df['matches'] = matches[0].isin(to_compare).groupby(level=0).sum()

                                                     Col  matches
    0  Halve the clementine and place into the cavity...      2.0
    1  Add the stock, then bring to the boil and redu...      1.0
    2              2 heaped teaspoons Chinese five-spice      0.0
    3                                     100 ml Marsala      1.0
    4                      1 litre organic chicken stock      0.0

Depending on the pandas version, the counts may be shown as integers rather than floats.

In addition, matches returns:

                      0
      match
    0 0          1 hour
      1      20 minutes
    1 0      15 minutes
    2 0        2 heaped
    3 0          100 ml
    4 0         1 litre

To collect them into lists, use:

    matches.groupby(level=0).agg(list)

                          0
    0  [1 hour, 20 minutes]
    1          [15 minutes]
    2            [2 heaped]
    3              [100 ml]
    4             [1 litre]
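The steps above can be put together into one self-contained script. This is only a sketch: the column name Col follows the answer, while the extracted column name and the reindex step (which gives rows without any "number word" pair a count of 0 instead of NaN) are my own assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data from the question; the column name "Col" follows the answer above.
df = pd.DataFrame({"Col": [
    "Halve the clementine and place into the cavity along with the bay leaves. "
    "Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
    "1 litre organic chicken stock",
]})
to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]

# One row per match, indexed by (original row, match number).
matches = df["Col"].str.extractall(r"(\d+\s\w+)")

# Count, per original row, how many extracted values appear in to_compare.
counts = matches[0].isin(to_compare).groupby(level=0).sum()

# Assumption: rows with no match at all should get 0 rather than NaN.
df["matches"] = counts.reindex(df.index, fill_value=0).astype(int)

# Hypothetical extra column holding the extracted values as lists.
df["extracted"] = matches[0].groupby(level=0).agg(list)

print(df[["extracted", "matches"]])
```

The reindex only matters when some row contains no number followed by a word; with the sample data every row has at least one match.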

慕森卡

You can use a regular expression to build a pattern that extracts a number and the word that follows it, and then apply that function to the whole column of the DataFrame:

    import pandas as pd
    import re

    df = pd.DataFrame({'text': ["Halve the clementine and place into the cavity along with the bay leaves. Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
                                "Add the stock, then bring to the boil and reduce to a simmer for around 15 minutes.",
                                "2 heaped teaspoons Chinese five-spice",
                                "100 ml Marsala",
                                "1 litre organic chicken stock"]})

    def extract_qty(txt):
        # Return every "number word" pair found in the text
        return re.findall(r'\d+ \w+', txt)

    df['extracted_qty'] = df['text'].apply(extract_qty)
    df
    #   text                                                extracted_qty
    #0  Halve the clementine and place into the cavity...   [1 hour, 20 minutes]
    #1  Add the stock, then bring to the boil and redu...   [15 minutes]
    #2  2 heaped teaspoons Chinese five-spice               [2 heaped]
    #3  100 ml Marsala                                      [100 ml]
    #4  1 litre organic chicken stock                       [1 litre]

Then use a list comprehension to extract the values in common with to_compare:

    to_compare = ["1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"]
    df['common'] = df['extracted_qty'].apply(lambda x: [el for el in x if el in to_compare])
    #   text                        extracted_qty           common
    #0  Halve the clementine ...    [1 hour, 20 minutes]    [1 hour, 20 minutes]
    #1  Add the stock, then  ...    [15 minutes]            [15 minutes]
    #2  2 heaped teaspoons ...      [2 heaped]              []
    #3  100 ml Marsala              [100 ml]                [100 ml]
    #4  1 litre organic chicken...  [1 litre]               []
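Since the question asks how many elements each row has in common, the same idea can return a count directly. A minimal sketch, assuming the regular expression from this answer; the names count_common and TO_COMPARE are hypothetical helpers introduced here for illustration:

```python
import re

# Stored as a set so that membership tests are fast; values come from the question.
TO_COMPARE = {"1 hour", "20 litres", "100 ml", "2", "15 minutes", "20 minutes"}

def count_common(text, allowed=TO_COMPARE):
    """Count the extracted 'number word' pairs that also appear in `allowed`."""
    found = re.findall(r"\d+ \w+", text)
    return sum(1 for item in found if item in allowed)

samples = [
    "Transfer the duck to a medium roasting tray and roast for around 1 hour 20 minutes.",
    "2 heaped teaspoons Chinese five-spice",
    "100 ml Marsala",
]
for s in samples:
    print(count_common(s), s)
```

With a DataFrame like the one above, this could be applied as df['text'].apply(count_common). Note that the bare "2" in to_compare can never match, because the pattern always extracts a number together with the word that follows it.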