How to lemmatize using NLTK or pywsd

I know my explanation is long, but I feel it's necessary. Hopefully someone out there has the patience and a helpful soul :) I'm working on a sentiment analysis project at the moment and I'm stuck on the preprocessing part. I imported the csv file, converted it into a dataframe and converted the variables/columns to the correct data types. Then I performed tokenization like this, selecting the variable to tokenize (the tweet content) in the dataframe (df_tweet1):


# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]

for i in tokenized_sents:
    print(i)

The output is a list of lists containing the words (tokens).
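For example, for a single made-up tweet (not from my actual data), the tokenizer defined above produces something like:

# e.g. one made-up tweet, using the tknzr defined above
print(tknzr.tokenize("Loving this product! :) #happy"))
# ['Loving', 'this', 'product', '!', ':)', '#happy']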


Then I performed stop word removal:


# Stop word removal

from nltk.corpus import stopwords


stop_words = set(stopwords.words("english"))

#add words that aren't in the NLTK stopwords list

new_stopwords = ['!', ',', ':', '&', '%', '.', '’']

new_stopwords_list = stop_words.union(new_stopwords)


clean_sents = []

for m in tokenized_sents:

    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]

    clean_sents.append(stop_m)

The output is the same, but without the stop words.
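For example, applying the filter to the sample tokens from above drops 'this' (an NLTK stop word) and '!' (one of the added tokens):

# e.g. filtering the sample tokens from above with new_stopwords_list
sample = ['Loving', 'this', 'product', '!', ':)', '#happy']
print([t for t in sample if t.lower() not in new_stopwords_list])
# ['Loving', 'product', ':)', '#happy']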


The next two steps (POS tagging and lemmatization) are where I get confused. I tried two things:


1) Converting the previous output into a list of strings


new_test = [' '.join(x) for x in clean_sents]

because I thought this would let me perform both steps at once with this code:


from pywsd.utils import lemmatize_sentence


text = new_test

lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object


2) Doing POS tagging and lemmatization separately. First the POS tagging, using clean_sents as input:


# PART-OF-SPEECH
import nltk

def process_content(clean_sents):

    try:

        tagged_list = []  

        for lst in clean_sents[:500]: 

            for item in lst:

                words = nltk.word_tokenize(item)

                tagged = nltk.pos_tag(words)

                tagged_list.append(tagged)

        return tagged_list


    except Exception as e:

        print(str(e))


output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists containing the words paired with their tags. I then want to lemmatize this output, but how? I tried two modules, but both gave me errors:


from pywsd.utils import lemmatize_sentence


lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]

              for s in output_POS_clean_sents]


# AND


from nltk.stem.wordnet import WordNetLemmatizer


lmtzr = WordNetLemmatizer()

lemmatized = [[lmtzr.lemmatize(word) for word in s]

              for s in output_POS_clean_sents]

print(lemmatized)

The errors are, respectively:


TypeError: expected string or bytes-like object


AttributeError: 'tuple' object has no attribute 'endswith'


三国纷争
2 Answers

撒科打诨

Since you are working with a dataframe, I suggest you store the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always build a list of lists to feed to a model afterwards with a single line of code. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add extra steps when needed, without getting confused.

Regarding your code, it can be optimised (for example, you can perform stop word removal and tokenization at the same time), and the steps you perform are a bit mixed up. For instance, you lemmatize multiple times, using different libraries as well, and there is no point in doing that. In my opinion nltk works perfectly fine; personally, I use other libraries to preprocess tweets only to handle emojis, urls and hashtags, all the things specific to tweets.

# I won't write all the imports, you get them from your code

# define new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():

    # get initial tweet: 'This is the initial tweet'
    tweet = row['Tweet Content']

    # tokenization, stop word removal and lemmatization all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

    # pos tag, no need to lemmatize again afterwards
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)

    # save processed tweet into the new column
    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet

So overall you only need 4 lines: one to fetch the tweet string, two to preprocess the text and another one to store the tweet. You can add extra processing steps, paying attention to the output of each step (e.g. tokenization returns a list of strings, pos tagging returns a list of tuples, which is the reason you were running into trouble). If you want, you can then create a list of lists containing all the tweets in the dataframe:

# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
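Regarding the AttributeError from the second attempt: WordNetLemmatizer.lemmatize expects a word string (plus an optional POS letter), not a (word, tag) tuple, which is why it complains about 'endswith'. If you do want the lemmatizer to take the POS tags into account (by default it treats every word as a noun), a minimal sketch of mapping the Penn Treebank tags returned by nltk.pos_tag to WordNet tags could look like this:

# Sketch: lemmatize (word, tag) tuples by mapping Penn Treebank tags to WordNet tags
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def penn_to_wordnet(tag):
    # default to noun when the tag is not adjective/verb/adverb
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lmtzr = WordNetLemmatizer()
tagged = [('running', 'VBG'), ('dogs', 'NNS')]   # e.g. output of nltk.pos_tag
print([lmtzr.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged])
# ['run', 'dog']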

烙印99

In the first part, new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from each lemmatized string. So:

text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatized sentences.

Actually, I once did a project that looks similar to what you are working on. I wrote the following function to lemmatize strings:

import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        # read the stopwords file
        stopwords = sw.read().split('\n')
        return [word for word in lst if not word in stopwords]

def lemmatize_strings(body_text, language='da', remove_stopwords_=True):
    """Function to lemmatize a string or a list of strings. Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """

    if isinstance(body_text, str):
        body_text = [body_text]  # convert whatever is passed to a list, to support passing a single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language)  # load lemmatizing dictionary

    lemma_list = []  # list to store each lemmatized string

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')  # all characters and digits, i.e. all possible words

    for string in body_text:
        # remove punctuation and split words
        matches = word_regex.findall(string)

        # lowercase words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        # remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        # lemmatize each word and choose the shortest of the suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        # remove stopwords again after lemmatization
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0]  # return a list if a list was passed, else a string

You are welcome to look at it if you want, but don't feel obliged. I would be very happy if it helps you get any ideas, I spent a lot of time trying to figure it out myself! Let me know :-)
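As a quick sketch of how the function above could be called on a single Danish string (with stop word removal switched off so no stopwords.txt file is needed; this assumes the lemmy package and its Danish data are installed):

# e.g. one Danish string; remove_stopwords_=False avoids needing a stopwords.txt file
print(lemmatize_strings("Jeg mødte to hunde i parken", language='da', remove_stopwords_=False))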