
How to check for, and extract, years and percentages from a CSV in Python

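I have a CSV file, news.csv, that contains a lot of data. I want to check whether each row contains any year: 1 if it does, 0 otherwise. The same goes for percentages: return 1 if the row contains a percentage, 0 otherwise. I also want to extract the years and percentages themselves.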


Below is my code so far. I get an error (ValueError: Wrong number of items passed 2, placement implies 1) when I try to extract the percentage.


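import pandas as pd

news = pd.read_csv("news.csv")

# Extract the first number in each story (meant to be the year)
news['year'] = news['STORY'].str.extract(r'(?!\()\b(\d+){1}')
# Count the numbers in each story, then turn the count into a 0/1 flag
news["howmanyyear"] = news["STORY"].str.count(r'(?!\()\b(\d+){1}')
news["existyear"] = news["howmanyyear"] != 0
news["existyear"] = news["existyear"].astype(int)
# This is the line that raises the ValueError
news['percentage'] = news['STORY'].str.extract(r'(\s100|\s\d{1})(\.\d+)+%')

news.to_csv('news.csv')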

The code that extracts the year seems to work; however, it also picks up ordinary numbers, and it extracts only one year per row.
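A likely cause of the ValueError, sketched on a hypothetical two-row Series rather than the real file: `str.extract` returns one column per capture group, so a pattern with two groups yields a two-column DataFrame that cannot be assigned to a single column. The simplified percentage patterns below are assumptions for illustration, not the pattern from the question.

```python
import pandas as pd

# Hypothetical sample standing in for the STORY column.
s = pd.Series(["costs rose 2.5% in 2007", "no numbers here"])

# Two capture groups -> a DataFrame with two columns.
two_groups = s.str.extract(r'(\d+)(\.\d+)?%')

# One capture group (the inner group is non-capturing) -> one column.
one_group = s.str.extract(r'(\d+(?:\.\d+)?%)')

# With a single group, expand=False returns a Series instead of a DataFrame.
as_series = s.str.extract(r'(\d+(?:\.\d+)?%)', expand=False)

print(two_groups.shape, one_group.shape)
print(as_series)
```

Note that `str.extract` keeps only the first match in each row, which is why rows with two years yield just one; `str.findall` or `str.extractall` return every match.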


A sample of my CSV file:


ID  STORY                                                            

1   There are a total of 2,070 people died in 2001 due to the virus                         

2   20% of people in the village have diabetes in 2007                        

3   About 70 percent of them still believe the rumor                            

4  In 2003 and 2020, the pneumonia pandemic spread in the world

Here is the output I want:


ID  STORY                                                            existyear  year    existpercentage  percentage

1   There are a total of 2,070 people died in 2001 due to the virus    1        2001      0              -

2   20% of people in the village have diabetes in 2007                 1        2007      1              20%

3   About 70 percent of them still believe the rumor                   0         -        1              70

4  In 2003 and 2020, the pneumonia pandemic spread in the world        1       2003,2020  0              -



明月笑刀无情
1 Answer

MYYA

Create a sample DataFrame:

import re
import pandas as pd

c = [1, 2, 3, 4]
d = ["There are a total of 2,070 people died in 2001 due to the virus",
     "20% of people in the village have diabetes in 2007",
     "About 70 percent of them still believe the rumor",
     "In 2003 and 2020, the pneumonia pandemic spread in the world"]
f = ['2001', '2007', '-', '2003,2020']
g = ['-', '20%', '70', '-']

# Each list is one row here, so transpose to turn the lists into columns
df = pd.DataFrame([c, d, f, g]).T
df.rename(columns={0: 'ID', 1: 'STORY', 2: 'year', 3: 'percentage'}, inplace=True)

Output:

ID  STORY                                                            year       percentage
1   There are a total of 2,070 people died in 2001 due to the virus  2001       -
2   20% of people in the village have diabetes in 2007               2007       20%
3   About 70 percent of them still believe the rumor                 -          70
4   In 2003 and 2020, the pneumonia pandemic spread in the world     2003,2020  -

Code:

def year_exits_or_not(row):
    # 1 if the value contains a 4-digit number starting with 1, 2 or 3
    if re.match(r'.*([1-3][0-9]{3})', row):
        return 1
    else:
        return 0

def perc_or_not(row):
    # 1 if the value contains any digits
    if re.match(r'.*\d+', row):
        return 1
    else:
        return 0

df['existyear'] = df.year.apply(year_exits_or_not)
df['existpercentage'] = df.percentage.apply(perc_or_not)

Output:

ID  STORY                                                            existyear  year       existpercentage  percentage
1   There are a total of 2,070 people died in 2001 due to the virus  1          2001       0                -
2   20% of people in the village have diabetes in 2007               1          2007       1                20%
3   About 70 percent of them still believe the rumor                 0          -          1                70
4   In 2003 and 2020, the pneumonia pandemic spread in the world     1          2003,2020  0                -

Edit: to extract the years and percentages directly from the STORY column:

# findall returns every match; str(...)[1:-1] strips the list's square brackets
df.year = df.STORY.apply(lambda row: str(re.findall(r'.*?([1-3][0-9]{3})', row))[1:-1])
df.percentage = df.STORY.apply(lambda row: str(re.findall(r"(\d+)(?:%| percent)", row))[1:-1])

Output:

   ID  STORY                                               year            percentage
0  1   There are a total of 2,070 people died in 2001...   '2001'
1  2   20% of people in the village have diabetes in ...   '2007'          '20'
2  3   About 70 percent of them still believe the rumor                    '70'
3  4   In 2003 and 2020, the pneumonia pandemic sprea...   '2003', '2020'
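As an alternative sketch (not the method used in this answer), all four columns can be built straight from STORY with pandas' vectorized string methods. The `YEAR` and `PERCENT` patterns and the "-" placeholder are assumptions chosen to fit the four sample rows: a year is taken to be a standalone 4-digit number starting with 1 or 2, and a percentage is a number followed by "%" or the word "percent".

```python
import pandas as pd

# Sample rows copied from the question; real data would come from pd.read_csv("news.csv").
news = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "STORY": [
        "There are a total of 2,070 people died in 2001 due to the virus",
        "20% of people in the village have diabetes in 2007",
        "About 70 percent of them still believe the rumor",
        "In 2003 and 2020, the pneumonia pandemic spread in the world",
    ],
})

# Assumed patterns (illustrative, not exhaustive).
YEAR = r"\b[12]\d{3}\b"                          # standalone 4-digit number, 1000-2999
PERCENT = r"(\d+(?:\.\d+)?)\s*(?:%|percent\b)"   # number followed by % or "percent"

# str.findall returns a list of all matches for each row.
years = news["STORY"].str.findall(YEAR)
percents = news["STORY"].str.findall(PERCENT)

# Flag rows with at least one match, then join the matches into one string.
news["existyear"] = (years.str.len() > 0).astype(int)
news["year"] = years.str.join(",").replace("", "-")
news["existpercentage"] = (percents.str.len() > 0).astype(int)
news["percentage"] = percents.str.join(",").replace("", "-")

print(news[["ID", "existyear", "year", "existpercentage", "percentage"]])
```

The word boundaries keep "2,070" from being read as a year, since no four digits in it are contiguous. A value such as "2000%" would still count as both a year and a percentage, so the patterns may need tightening for real data.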