Counting unique words in a pandas column



I'm having some trouble with the following data (from a pandas DataFrame):

Text

0 Selected moments from Fifa game t...

1 What I learned is that I am ...

3 Bill Gates kept telling us it was comi...

5 scenario created a month before the...

... ...

1899 Events for May 19 – October 7 - October CTOvision.com

1900 Office of Event Services and Campus Center Ope...

1901 How the CARES Act May Affect Gift Planning in ...

1902 City of Rohnert Park: Home

1903 iHeartMedia, Inc.

I need to extract the number of unique words in each row (after removing punctuation). For example:

Unique

0 6

1 6

3 8

5 6

... ...

1899 8

1900 8

1901 9

1902 5

1903 2

I tried to do it as follows:

df["Unique"]=df['Text'].str.lower()

df["Unique"]==Counter(word_tokenize('\n'.join( file["Unique"])))

But I don't get any counts, only a list of words (without their frequencies in that row).

Can you tell me what went wrong?

慕少森


3 Answers

饮歌长啸

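If you need counts, remove all punctuation first, then leverage sets. str.split().map(set) gives you one set of words per row, and a set keeps only one copy of each element, so the number of elements in it is the number of unique words.

Chained:

df['Text'].str.replace(r'[^\w\s]+', '', regex=True).str.split().map(set).str.len()

Stepwise:

df['Text'] = df['Text'].str.replace(r'[^\w\s]+', '', regex=True)
df['New Text'] = df['Text'].str.split().map(set).str.len()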
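A minimal, self-contained sketch of the set-based approach above, assuming a recent pandas version in which `str.replace` needs `regex=True` for the pattern to be treated as a regular expression. The sample sentences are made up for illustration, and lowercasing is included only because the question lowercases the text first.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real DataFrame.
df = pd.DataFrame({"Text": [
    "Bill Gates, again. Bill Gates and Larry Ellison",
    "City of Rohnert Park: Home",
    "iHeartMedia, Inc.",
]})

df["Unique"] = (
    df["Text"]
    .str.lower()                               # treat "Bill" and "bill" as one word
    .str.replace(r"[^\w\s]+", "", regex=True)  # drop punctuation
    .str.split()                               # split on any run of whitespace
    .map(set)                                  # keep one copy of each word
    .str.len()                                 # size of each set
)
print(df["Unique"].tolist())
```

Rows containing missing values would need to be filled or dropped before the `.map(set)` step, since a set cannot be built from NaN.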


GCT1015

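So, I'm just updating this based on the comments. This solution also takes punctuation into account.

import string
df['Unique'] = df['Text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation))).str.split().apply(lambda words: len(set(words)))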
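The same idea can also be written as a plain Python helper and applied row by row. This is only a sketch: the name `count_unique_words` is hypothetical, and it assumes that only ASCII punctuation (the characters in `string.punctuation`) has to be removed, so a character such as an en dash would survive as its own token.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def count_unique_words(text: str) -> int:
    """Return the number of distinct lowercase words in `text`."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    # split() without arguments collapses runs of whitespace,
    # so no empty strings end up in the set.
    return len(set(cleaned.split()))

print(count_unique_words("Bill ,Gates Started Microsoft at 18 Bill"))
```

With a DataFrame this would be used as `df['Unique'] = df['Text'].map(count_unique_words)`.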


有只小跳蛙

Try this:

import pandas as pd
from collections import Counter

data = {'A': {0: 'John', 1: 'Bob'},
        'Desc': {0: 'Bill ,Gates Started Microsoft at 18 Bill', 1: 'Bill Gates, Again .Bill Gates  and Larry Ellison'}}
df = pd.DataFrame(data)
df['Desc'] = df['Desc'].str.replace(r'[^\w\s]+', '', regex=True)
print(df.loc[:, "Desc"])
# Word frequencies for the first row, then the number of distinct words in it
print(Counter(" ".join(df.loc[0:0, "Desc"]).split()).items())
print(len(Counter(" ".join(df.loc[0:0, "Desc"]).split())))
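If per-row word frequencies are wanted as well as the number of distinct words, a `Counter` can be built for each row rather than for the whole column joined into one string. A rough sketch, where `word_counts` is a hypothetical helper name and the regular expression is the one used in the answers above:

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    """Count how often each lowercase word occurs in one piece of text."""
    cleaned = re.sub(r"[^\w\s]+", "", text.lower())  # strip punctuation
    return Counter(cleaned.split())

counts = word_counts("Bill Gates, Again .Bill Gates  and Larry Ellison")
print(counts["bill"], counts["gates"])  # frequency of individual words
print(len(counts))                      # number of distinct words
```

Applied to a column, `df['Text'].map(word_counts)` would give one `Counter` per row, and calling `.map(len)` on that result gives the unique-word count for each row.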

