A test of my cleaned and tokenized dataframe:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
test['tokenize'] = test['tweet'].apply(tt.tokenize)
print(test)
Output:
0 congratulations dear friend ... [congratulations, dear, friend]
1 happy anniversary be happy ... [happy, anniversary, be, happy]
2 make some sandwich ... [make, some, sandwich]
I want to build a bag of words for my data. The following gives me the error AttributeError: 'list' object has no attribute 'lower':
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
BOW = vectorizer.fit_transform(test['tokenize'])
print(BOW.toarray())
print(vectorizer.get_feature_names())
My second attempt raises AttributeError: 'list' object has no attribute 'split':
from collections import Counter
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))
print(test['BOW'])
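Here the .split(" ") call fails for the same reason: each cell is already a list of tokens, not a string, so it can be passed straight to Counter. A minimal sketch, again using a small dataframe reconstructed from the output shown above:

```python
from collections import Counter
import pandas as pd

# Reconstructed sample of the tokenized column shown above
test = pd.DataFrame({'tokenize': [
    ['congratulations', 'dear', 'friend'],
    ['happy', 'anniversary', 'be', 'happy'],
    ['make', 'some', 'sandwich'],
]})

# Each row is already a token list, so Counter can consume it directly
test['BOW'] = test['tokenize'].apply(Counter)
print(test['BOW'])
```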
Could you help me get either method (or both) working? Thanks!