How to lemmatize using NLTK or pywsd

I know my explanation is long, but I feel it's necessary. Hopefully someone out there has the patience and a helpful soul :) I'm working on a sentiment analysis project at the moment and I'm stuck on the preprocessing part. I imported the csv file, converted it into a dataframe and converted the variables/columns to the correct data types. Then I performed tokenization like this, selecting the variable to tokenize (the tweet content) in the dataframe (df_tweet1):


# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]

for i in tokenized_sents:
    print(i)

The output is a list of lists containing the words (tokens).
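For example, for a single made-up tweet (not from my actual data), the tokenizer defined above produces something like:

# e.g. one made-up tweet, using the tknzr defined above
print(tknzr.tokenize("Loving this product! :) #happy"))
# ['Loving', 'this', 'product', '!', ':)', '#happy']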


Then I performed stop word removal:


# Stop word removal

from nltk.corpus import stopwords


stop_words = set(stopwords.words("english"))

#add words that aren't in the NLTK stopwords list

new_stopwords = ['!', ',', ':', '&', '%', '.', '’']

new_stopwords_list = stop_words.union(new_stopwords)


clean_sents = []

for m in tokenized_sents:

    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]

    clean_sents.append(stop_m)

The output is the same, but without the stop words.
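For example, applying the filter to the sample tokens from above drops 'this' (an NLTK stop word) and '!' (one of the added tokens):

# e.g. filtering the sample tokens from above with new_stopwords_list
sample = ['Loving', 'this', 'product', '!', ':)', '#happy']
print([t for t in sample if t.lower() not in new_stopwords_list])
# ['Loving', 'product', ':)', '#happy']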


The next two steps (POS tagging and lemmatization) are where I get confused. I tried two things:


1) Converting the previous output into a list of strings


new_test = [' '.join(x) for x in clean_sents]

because I thought this would let me perform both steps at once with this code:


from pywsd.utils import lemmatize_sentence


text = new_test

lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object


2) Doing POS tagging and lemmatization separately. First the POS tagging, using clean_sents as input:


# PART-OF-SPEECH
import nltk

def process_content(clean_sents):

    try:

        tagged_list = []  

        for lst in clean_sents[:500]: 

            for item in lst:

                words = nltk.word_tokenize(item)

                tagged = nltk.pos_tag(words)

                tagged_list.append(tagged)

        return tagged_list


    except Exception as e:

        print(str(e))


output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists containing the words paired with their tags. I then want to lemmatize this output, but how? I tried two modules, but both gave me errors:


from pywsd.utils import lemmatize_sentence


lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]

              for s in output_POS_clean_sents]


# AND


from nltk.stem.wordnet import WordNetLemmatizer


lmtzr = WordNetLemmatizer()

lemmatized = [[lmtzr.lemmatize(word) for word in s]

              for s in output_POS_clean_sents]

print(lemmatized)

The errors are, respectively:


TypeError: expected string or bytes-like object


AttributeError: 'tuple' object has no attribute 'endswith'


三国纷争
2 Answers

撒科打诨

Since you are working with a dataframe, I suggest you store the result of each preprocessing step in a new column. That way you can always inspect the output, and you can always build a list of lists to feed to a model afterwards with a single line of code. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add extra steps when needed, without getting confused.

Regarding your code, it can be optimised (for example, you can perform stop word removal and tokenization at the same time), and the steps you perform are a bit mixed up. For instance, you lemmatize multiple times, using different libraries as well, and there is no point in doing that. In my opinion nltk works perfectly fine; personally, I use other libraries to preprocess tweets only to handle emojis, urls and hashtags, all the things specific to tweets.

# I won't write all the imports, you get them from your code

# define new column to store the processed tweets
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():

    # get initial tweet: 'This is the initial tweet'
    tweet = row['Tweet Content']

    # tokenization, stop word removal and lemmatization all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

    # pos tag, no need to lemmatize again afterwards
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)

    # save processed tweet into the new column
    df_tweet1.loc[ind, 'Tweet Content Clean'] = tweet

So overall you only need 4 lines: one to fetch the tweet string, two to preprocess the text and another one to store the tweet. You can add extra processing steps, paying attention to the output of each step (e.g. tokenization returns a list of strings, pos tagging returns a list of tuples, which is the reason you were running into trouble). If you want, you can then create a list of lists containing all the tweets in the dataframe:

# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
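Regarding the AttributeError from the second attempt: WordNetLemmatizer.lemmatize expects a word string (plus an optional POS letter), not a (word, tag) tuple, which is why it complains about 'endswith'. If you do want the lemmatizer to take the POS tags into account (by default it treats every word as a noun), a minimal sketch of mapping the Penn Treebank tags returned by nltk.pos_tag to WordNet tags could look like this:

# Sketch: lemmatize (word, tag) tuples by mapping Penn Treebank tags to WordNet tags
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer

def penn_to_wordnet(tag):
    # default to noun when the tag is not adjective/verb/adverb
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lmtzr = WordNetLemmatizer()
tagged = [('running', 'VBG'), ('dogs', 'NNS')]   # e.g. output of nltk.pos_tag
print([lmtzr.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged])
# ['run', 'dog']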

烙印99

In the first part, new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from each lemmatized string. So:

text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatized sentences.

Actually, I once did a project that looks similar to what you are working on. I wrote the following function to lemmatize strings:

import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        # read the stopwords file
        stopwords = sw.read().split('\n')
        return [word for word in lst if not word in stopwords]

def lemmatize_strings(body_text, language='da', remove_stopwords_=True):
    """Function to lemmatize a string or a list of strings. Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """

    if isinstance(body_text, str):
        body_text = [body_text]  # convert whatever is passed to a list, to support passing a single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language)  # load lemmatizing dictionary

    lemma_list = []  # list to store each lemmatized string

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+')  # all characters and digits, i.e. all possible words

    for string in body_text:
        # remove punctuation and split words
        matches = word_regex.findall(string)

        # lowercase words unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        # remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        # lemmatize each word and choose the shortest of the suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        # remove stopwords again after lemmatization
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0]  # return a list if a list was passed, else a string

You are welcome to look at it if you want, but don't feel obliged. I would be very happy if it helps you get any ideas, I spent a lot of time trying to figure it out myself! Let me know :-)
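As a quick sketch of how the function above could be called on a single Danish string (with stop word removal switched off so no stopwords.txt file is needed; this assumes the lemmy package and its Danish data are installed):

# e.g. one Danish string; remove_stopwords_=False avoids needing a stopwords.txt file
print(lemmatize_strings("Jeg mødte to hunde i parken", language='da', remove_stopwords_=False))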