I have the following code:
import sklearn.feature_extraction.text as skln  # TfidfVectorizer lives in sklearn.feature_extraction.text

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'This document is the fourth document.',
    'And this is the fifth one.',
    'This document is the sixth.',
    'And this is the seventh one document.',
    'This document is the eighth.',
    'And this is the ninth one document.',
    'This document is the second.',
    'And this is the tenth one document.',
]
vectorizer = skln.TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
accumulated = [0] * len(feature_names)
for i in range(tfidf_matrix.shape[0]):
    for j in range(len(feature_names)):
        accumulated[j] += tfidf_matrix[i][j]
accumulated = sorted(accumulated)[-CENTRAL_TERMS:]
print(accumulated)
Here I print the CENTRAL_TERMS highest accumulated tf-idf scores across all documents of the corpus.

However, I would also like to get the MOST_REPEATED_TERMS words across all documents of the corpus, i.e. the words with the highest tf scores. I know I could get them by simply using a CountVectorizer, but I want to use only the TfidfVectorizer, so that I don't have to run vectorizer.fit_transform(corpus) once for the TfidfVectorizer and then again for the CountVectorizer. I also know I could first use a CountVectorizer (to get the tf scores) and then a TfidfTransformer (to get the tf-idf scores). However, I think there must be a way to do this with only the TfidfVectorizer.

Let me know if there is a way to do this (any information is welcome).