I know my explanation is long, but I feel it's necessary. Hoping someone has the patience and a helpful soul :) I'm working on a sentiment analysis project atm and I'm stuck on the preprocessing part. I imported the csv file, converted it into a dataframe, and converted the variables/columns to the correct data types. Then I performed tokenization like this, selecting the variable to tokenize (Tweet Content) in the dataframe (df_tweet1):
# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)
The output is a list of lists containing the words (tokens).
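(As an aside, TweetTokenizer also takes options that can help with tweet noise; a small variant I've seen, untested on my data:)

# Optional variant: strip @handles and shorten runs of repeated characters
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)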
Then I performed stop word removal:
# Stop word removal
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

# add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)
The output is the same, but without the stop words.
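(Side note: instead of listing punctuation by hand, I guess one could union in Python's built-in string.punctuation; a sketch, though it wouldn't cover the curly apostrophe ’ so I keep the manual list too:)

import string

# Sketch: extend the stop list with all ASCII punctuation characters
new_stopwords_list = stop_words.union(new_stopwords, string.punctuation)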
The next two steps are confusing me (POS tagging and lemmatization). I tried two things:
1) Converting the previous output into a list of strings
new_test = [' '.join(x) for x in clean_sents]
because I thought this would let me do both steps at once with this code:
from pywsd.utils import lemmatize_sentence
text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)
I got this error: TypeError: expected string or bytes-like object
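My guess is that lemmatize_sentence wants a single string rather than a list of strings, so maybe something like this would work instead (just a sketch of what I mean):

from pywsd.utils import lemmatize_sentence

# Sketch: call lemmatize_sentence once per tweet string instead of on the whole list
lemm_text = [lemmatize_sentence(sent, keepWordPOS=True) for sent in new_test]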
2) Performing POS tagging and lemmatization separately. First the POS tagging, using clean_sents as the input:
# PART-OF-SPEECH
import nltk

def process_content(clean_sents):
    try:
        tagged_list = []
        for lst in clean_sents[:500]:
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list
    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)
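(Side thought: since clean_sents is already tokenized, nltk.pos_tag can probably be called on each token list directly instead of re-tokenizing token by token; a rough sketch:)

# Sketch: tag each already-tokenized tweet in one call
tagged_sents = [nltk.pos_tag(sent) for sent in clean_sents]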
Either way, the output is a list of lists where each word has its tag attached. I then want to lemmatize this output, but how? I tried two modules, but both gave me errors:
from pywsd.utils import lemmatize_sentence
lemmatized = [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]
# AND
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)
The errors are, respectively:
TypeError: expected string or bytes-like object
AttributeError: 'tuple' object has no attribute 'endswith'
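I suspect the second error comes from passing the (word, tag) tuples straight into lemmatize(). If that's right, something like this might be the way: unpack the tuples and map the Penn Treebank tags to the WordNet POS constants the lemmatizer expects (penn_to_wordnet is just a helper I made up for illustration):

from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map Penn Treebank tags (e.g. 'VBD', 'JJ') to WordNet POS constants
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word, penn_to_wordnet(tag)) for word, tag in s]
              for s in output_POS_clean_sents]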