Counting hashtag frequencies in a DataFrame



I'm trying to count the frequency of hashtag words in the "text" column of my DataFrame.

index text

1 ello ello ello ello #hello #ello

2 red green blue black #colours

3 Season greetings #hello #goodbye

4 morning #goodMorning #hello

5 my favourite animal #dog

word_freq = df.text.str.split(expand=True).stack().value_counts()

The code above runs a frequency count over every string in the text column, but I only want the hashtag frequencies.
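For anyone reproducing this, here is a minimal sketch of the setup. It assumes the sample rows are loaded into a DataFrame named `df` with a single `text` column and an index of 1 to 5 (the index values are an assumption based on the table above). It shows that `word_freq` counts every whitespace-separated token, not only the hashtags.

```python
import pandas as pd

# Sample rows copied from the question; the 1-5 index is assumed.
df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    },
    index=[1, 2, 3, 4, 5],
)

# Split each row on whitespace into one column per token, stack the columns
# into one long Series, then count every token. value_counts skips missing
# values by default, so the padding from shorter rows is not counted.
word_freq = df.text.str.split(expand=True).stack().value_counts()

# Plain words such as "ello" are counted alongside the hashtags.
print(word_freq.head(3))
```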

For example, after running the code on the DataFrame above, it should return:

#hello 3

#goodbye 1

#goodMorning 1

#ello 1

#colours 1

#dog 1

Is there a way to adjust my word_freq code slightly so that it counts only the hashtag words and returns them as shown above? Thanks in advance.

SMILET


3 Answers

慕妹3146593

Use Series.str.findall on the text column to find all hashtag words, then use Series.explode + Series.value_counts:

counts = df['text'].str.findall(r'(#\w+)').explode().value_counts()

Another idea, using Series.str.split + DataFrame.stack:

s = df['text'].str.split(expand=True).stack()
counts = s[lambda x: x.str.startswith('#')].value_counts()

Result:

print(counts)
#hello 3
#dog 1
#colours 1
#ello 1
#goodMorning 1
#goodbye 1
Name: text, dtype: int64
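The two approaches above can be compared side by side. This is a rough sketch, not the answerer's exact code: the last row ("no tags here") is a hypothetical addition to show what happens when a row has no hashtag, and `na=False` is added to `startswith` as a precaution in case missing values remain after `stack`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
            "no tags here",  # hypothetical extra row, not in the question
        ]
    }
)

# Approach 1: findall returns a list of matches per row, explode turns each
# list element into its own row. A row with no match explodes to NaN, which
# value_counts skips by default.
counts = df["text"].str.findall(r"#\w+").explode().value_counts()

# Approach 2: split into tokens, stack into one Series, keep tokens that
# start with '#'. na=False treats any missing value as "not a hashtag".
tokens = df["text"].str.split(expand=True).stack()
alt = tokens[tokens.str.startswith("#", na=False)].value_counts()

print(counts.to_dict() == alt.to_dict())
```

The two differ on punctuation: the regex stops at the first non-word character, so `#hello!` would be counted as `#hello`, while the split-based filter would keep `#hello!` as its own token.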


aluckdog

One way is to use str.extractall, which drops the # from the result, followed by value_counts:

s = df['text'].str.extractall(r'(?<=#)(\w*)')[0].value_counts()
print(s)
hello 3
colours 1
goodbye 1
ello 1
goodMorning 1
dog 1
Name: 0, dtype: int64
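A possible variation on the extractall idea, shown as a sketch: placing the `#` outside the capture group has the same effect as the lookbehind, and `\w+` (instead of `\w*`) skips a bare `#` that has no word after it. The names `tags` and `counts` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text": [
            "ello ello ello ello #hello #ello",
            "red green blue black #colours",
            "Season greetings #hello #goodbye",
            "morning #goodMorning #hello",
            "my favourite animal #dog",
        ]
    }
)

# extractall returns a DataFrame with one row per match and one column per
# capture group; column 0 holds the text captured after '#'.
tags = df["text"].str.extractall(r"#(\w+)")[0]

# Count the captured words (without the leading '#').
counts = tags.value_counts()

print(counts["hello"])  # 3
```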


守候你守候我

A slightly more verbose solution, but it does the job.

# data_100 and TicketDescription are this answer's own DataFrame and column
dictionary_count=data_100.TicketDescription.str.split(expand=True).stack().value_counts().to_dict()
dictionary_count={'accessgtgtjust': 1,'sent': 1,'investigate': 1,'edit': 1,'#prd': 1,'getting': 1}

ert=[i for i in list(dictionary_count.keys()) if '#' in i]
ert
Out[238]: ['#prd']

unwanted = set(dictionary_count.keys()) - set(ert)
for unwanted_key in unwanted:
    del dictionary_count[unwanted_key]

dictionary_count
Out[241]: {'#prd': 1}
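The same dictionary filtering can probably be written more compactly. The sketch below uses only the standard library and is an alternative, not the answerer's method: a dict comprehension filters the sample dictionary shown above, and `collections.Counter` counts hashtags directly from a list of strings (the `texts` list reuses the question's sample rows).

```python
from collections import Counter

# Sample dictionary copied from the answer above.
dictionary_count = {
    "accessgtgtjust": 1, "sent": 1, "investigate": 1,
    "edit": 1, "#prd": 1, "getting": 1,
}

# Keep only the keys that contain '#', in one pass and without deleting
# entries from the original dictionary.
hashtags_only = {k: v for k, v in dictionary_count.items() if "#" in k}
print(hashtags_only)  # {'#prd': 1}

# Counting from raw strings: split each line on whitespace and count the
# tokens that start with '#'.
texts = [
    "ello ello ello ello #hello #ello",
    "red green blue black #colours",
    "Season greetings #hello #goodbye",
    "morning #goodMorning #hello",
    "my favourite animal #dog",
]
tag_counts = Counter(
    word for line in texts for word in line.split() if word.startswith("#")
)
print(tag_counts.most_common(1))  # [('#hello', 3)]
```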

