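tf-idf vectorizer: learning "word" separately from word

I am working on a text classification problem in which a word that appears in quotes, as "word", should carry a different importance from the same word appearing without quotes, as word. So I tried this code: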

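import re
from sklearn.feature_extraction.text import CountVectorizer

sent1 = "The cat sat on my \"face\" face"
sent2 = "The dog sat on my bed"
content = [sent1, sent2]

# Tokens: words of two or more characters, plus ! ? " ' as single-character tokens
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
vectorizer.fit(content)

# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
print(vectorizer.get_feature_names())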

The result is

['"', 'bed', 'cat', 'dog', 'face', 'my', 'on', 'sat', 'the']
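One way to see where the lone '"' feature comes from is to apply the same pattern directly with `re.findall`, which is roughly what the vectorizer does after lowercasing the text. This is only a minimal sketch; the variable names are illustrative and not part of the original code.

```python
import re

# Same pattern as in the question. CountVectorizer lowercases the text
# by default, so the sentence is lowercased here as well.
pattern = re.compile(r"(?u)\b\w\w+\b|!|\?|\"|\'")
text = 'The cat sat on my "face" face'.lower()

# \b\w\w+\b can only match word characters, so each quote mark is picked
# up by the separate \" alternative and becomes a token of its own.
tokens = pattern.findall(text)
print(tokens)
# ['the', 'cat', 'sat', 'on', 'my', '"', 'face', '"', 'face']
```

Both occurrences of face end up as the same token, and the quotes are counted separately, which matches the feature list above.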

whereas I would like it to be

['bed', 'cat', 'dog', 'face', '"face"', 'my', 'on', 'sat', 'the']
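A possible way to get there, offered as a sketch rather than a definitive answer: put an alternative for the quoted form in front of the plain-word alternative, so that the regex tries "word" before word. The pattern name `quoted_first` is made up for this example, and the sketch assumes only double quotes need special handling.

```python
from sklearn.feature_extraction.text import CountVectorizer

content = ['The cat sat on my "face" face', "The dog sat on my bed"]

# Assumption: a quoted token is a double quote, two or more word
# characters, and a closing double quote. It is listed first so it wins
# over the plain-word alternative at the position of the opening quote.
quoted_first = r'(?u)"\w\w+"|\b\w\w+\b'

vectorizer = CountVectorizer(token_pattern=quoted_first)
vectorizer.fit(content)

# vocabulary_ maps each token to its column index; sorting the keys
# reproduces the feature order without depending on the sklearn version.
features = sorted(vectorizer.vocabulary_)
print(features)
```

With this pattern '"face"' and 'face' should appear as two distinct features. Because '"' sorts before letters, '"face"' is listed first rather than next to 'face'. The single-character alternatives from the original pattern (!, ?, ') are left out here and could be appended after the two shown. TfidfVectorizer accepts the same `token_pattern` argument, so the idea should carry over to tf-idf features.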

哆啦的时光机
