Computing TF-IDF for variable n-grams with sklearn in Python

Problem: use scikit-learn to find the number of hits of variable n-grams for a specific vocabulary.


Explanation: I took the example from here.


Imagine I have a corpus and I want to find how many hits (counts) there are for a vocabulary like the following:


myvocabulary = [dict(window=4, words=['tin', 'tan']),
                dict(window=3, words=['electrical', 'car']),
                dict(window=3, words=['elephant', 'banana'])]

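By window I mean the length of the span of words within which the two words may appear, as follows:

'tin tan' is a hit (within 4 words)

'tin dog tan' is a hit (within 4 words)

'tin dog cat tan' is a hit (within 4 words)

'tin car sun eclipse tan' is not a hit: tin and tan are more than 4 words apart.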
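The window rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `count_window_hits` is hypothetical, word order is assumed not to matter, and a window of 4 is read as "both words fit inside a run of 4 consecutive words", which is what the four examples imply.

```python
import string


def count_window_hits(text, word1, word2, window):
    """Count (word1, word2) pairs that fit inside a span of `window` words."""
    # Lowercase and strip punctuation so that "Tan." still matches "tan".
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    # A span of `window` words covers positions i .. i + window - 1,
    # so the two positions may differ by at most window - 1.
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) < window)


examples = ["tin tan", "tin dog tan", "tin dog cat tan", "tin car sun eclipse tan"]
hits = [count_window_hits(s, "tin", "tan", window=4) for s in examples]
print(hits)  # [1, 1, 1, 0]
```

The first three sentences each produce one hit and the last produces none, matching the examples listed above.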


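I just want to count how many times the entry with window=4 and words=['tin', 'tan'] appears in a text, do the same for all the other entries, and then add the results to pandas in order to compute TF-IDF. I could only find something like this: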


from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')

tfs = tfidf.fit_transform(corpus.values())

where the vocabulary is a simple list of strings, each of which can be a single word or several words.


The following, from the scikit-learn documentation:


class sklearn.feature_extraction.text.CountVectorizer

ngram_range : tuple (min_n, max_n)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.


does not help either.
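To see why ngram_range does not cover this case: as the quoted description says, it only selects runs of n adjacent words. The sketch below is a hand-rolled illustration of contiguous n-gram extraction, not scikit-learn's own code, and the helper name `contiguous_ngrams` is hypothetical.

```python
def contiguous_ngrams(text, min_n, max_n):
    """Return every run of min_n..max_n adjacent words in `text`."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


grams = contiguous_ngrams("tin dog tan", 1, 3)
print(grams)
# ['tin', 'dog', 'tan', 'tin dog', 'dog tan', 'tin dog tan']
print("tin tan" in grams)  # False
```

Because 'tin' and 'tan' are separated by another word, no contiguous n-gram equals 'tin tan', whatever range of n is chosen.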


Any ideas? Thanks.


呼如林
1 Answer

一只斗牛犬

I'm not sure this can be done with CountVectorizer or TfidfVectorizer. I wrote my own function for it, as follows:

import string

import pandas as pd


def contained_within_window(token, word1, word2, threshold):
    # Count (word1, word2) pairs whose positions in `token` differ by at most `threshold`.
    word1 = word1.lower()
    word2 = word2.lower()
    # Strip punctuation and lowercase so that "document." matches "document".
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    # Split on any whitespace so repeated spaces do not produce empty tokens.
    word_list = token.split()
    word1_index = [i for i, x in enumerate(word_list) if x == word1]
    word2_index = [i for i, x in enumerate(word_list) if x == word2]
    count = 0
    for i in word1_index:
        for j in word2_index:
            if abs(i - j) <= threshold:
                count += 1
    return count

Sample:

corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question']
df = pd.DataFrame(corpus, columns=["Test"])

Your df will look like this:

                                                Test
0  This is the first document. And this is what I...
1              This document is the second document.
2                         And this is the third one.
3                        Is this the first document?
4                           I like coding in sklearn
5                       This is a very good question

Now you can apply contained_within_window as follows:

sum(df.Test.apply(lambda x: contained_within_window(x, word1="this", word2="document", threshold=2)))

And you get:

2

You can run a for loop to check the other entries. Use the resulting counts to build your pandas df and apply TF-IDF to it, which is straightforward. Hope this helps!
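The last step of the answer, looping over every vocabulary entry and turning the counts into TF-IDF weights, could look roughly like the sketch below. It is written in plain Python so it runs without extra packages. The helper names, the three-document corpus, and the dict layout of the vocabulary are all assumptions made for illustration. The weighting uses the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which to my understanding is what scikit-learn's TfidfTransformer documents as its default. Treat that as an assumption and check it against the library if exact agreement matters.

```python
import math
import string


def count_window_hits(text, word1, word2, window):
    """Count (word1, word2) pairs that fit inside a span of `window` words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) < window)


def tfidf_rows(counts):
    """Turn a documents x features count matrix into TF-IDF weights."""
    n_docs = len(counts)
    n_feats = len(counts[0])
    # Document frequency: number of documents in which each feature occurs.
    doc_freq = [sum(1 for row in counts if row[k] > 0) for k in range(n_feats)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        # L2-normalize each row; all-zero rows are left as they are.
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted] if norm else weighted)
    return result


# Hypothetical vocabulary and corpus, for illustration only.
vocabulary = [
    {"window": 4, "words": ("tin", "tan")},
    {"window": 3, "words": ("electrical", "car")},
    {"window": 3, "words": ("elephant", "banana")},
]
corpus = [
    "tin dog tan",
    "the electrical car hit a tin tan",
    "the elephant ate banana bread",
]

counts = []
for doc in corpus:
    row = []
    for entry in vocabulary:
        first, second = entry["words"]
        row.append(count_window_hits(doc, first, second, entry["window"]))
    counts.append(row)

print(counts)  # [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

The `counts` list of lists can also be passed to `pd.DataFrame` with one column per vocabulary entry if a pandas table is preferred.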