忽然笑
Update 10-21-2020: I decided to build a Python module to handle the tasks outlined in this answer. The module is called wordhoard and can be downloaded from pypi.

I have attempted to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g., healthcare) and that keyword's synonyms (e.g., wellness program, preventive medicine). I found that most NLP libraries didn't produce the results I needed, so I decided to build my own dictionary with custom keywords and synonyms. This approach has worked for analyzing and classifying text in multiple projects.

I'm sure that someone versed in NLP technology might have a more robust solution, but the one below has worked for me time and again.

I coded my answer to match the word-frequency data you had in your question, but it can be modified to use any keyword and synonym dataset.

```python
import string

# Python dictionary
# I manually created these word relationships - primary_word: synonyms
word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
             'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
             'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
             'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
             'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
             'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# convert the input text to lowercase and split the words on whitespace
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation) for s in wordlist]

# list for word frequencies
wordfreq = []

# count the frequency of each word
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = dict(zip(remove_punctuation, wordfreq))

word_matches = []

# loop through the dictionaries
for word, frequency in word_frequencies.items():
    for keyword, synonym in word_relationship.items():
        match = [x for x in synonym if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keyword (mother), synonym (mom) and frequency to a list
            word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and output the total frequency count for a specific keyword
for item in synonym_matches:
    if item[0] not in final_results.keys():
        frequency_count = 0
        frequency_count = frequency_count + item[1]
        final_results[item[0]] = frequency_count
    else:
        frequency_count = frequency_count + item[1]
        final_results[item[0]] = frequency_count

print(final_results)
# output
{'mother': 3, 'father': 2}
```

Other approaches

Below are some other approaches and their out-of-the-box output.

NLTK WordNet

In this example, I looked up the synonyms for the word "mother." Note that WordNet does not link the synonyms "mom" or "mum" to the word "mother," yet both of those words are in my sample text above. Also note that the word "father" is listed as a synonym for "mother."

```python
from nltk.corpus import wordnet

synonyms = []
word = 'mother'
for synonym in wordnet.synsets(word):
    for item in synonym.lemmas():
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            synonyms.append(item.name())

print(synonyms)
# output
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'mother', 'sire', 'generate', 'bring_forth']
```
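Several of the stray entries above ('beget', 'father', 'sire') come from the verb senses of "mother." A minimal sketch of one way to trim them, assuming only noun usage matters for this task, is to restrict the lookup to noun synsets and, since WordNet files informal terms such as "mom" and "mum" under a hyponym synset (momma.n.01) rather than under mother.n.01 itself, to walk one level of hyponyms as well:

```python
from nltk.corpus import wordnet

# collect lemmas from the noun synsets only, so verb senses such as
# "to mother (beget)" do not pollute the list
noun_synonyms = set()
for synset in wordnet.synsets('mother', pos=wordnet.NOUN):
    for lemma in synset.lemmas():
        noun_synonyms.add(lemma.name().lower())
    # informal terms like 'mom' and 'mum' live one level down,
    # in hyponym synsets such as momma.n.01
    for hyponym in synset.hyponyms():
        for lemma in hyponym.lemmas():
            noun_synonyms.add(lemma.name().lower())

print(noun_synonyms)
```

This casts a wider net than `synsets()` alone, so it may also pull in more clinical hyponyms (e.g., 'primipara') that you would need to filter for your use case.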
PyDictionary

In this example, I used PyDictionary, which queries synonym.com, to look up the synonyms for the word "mother." The synonyms in this example include the words "mom" and "mum," as well as additional synonyms that WordNet did not generate. However, PyDictionary also produced a synonym list for "mum" that has nothing to do with the word "mother." It seems that PyDictionary pulled this list from the adjective section of the page instead of the noun section, and it is hard for a computer to distinguish between the adjective mum (keeping silent) and the noun mum (mother).

```python
from PyDictionary import PyDictionary

dictionary_mother = PyDictionary('mother')
print(dictionary_mother.getSynonyms())
# output
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]

dictionary_mum = PyDictionary('mum')
print(dictionary_mum.getSynonyms())
# output
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]
```

Some other possible approaches would be to use the Oxford Dictionary API or to query thesaurus.com. Both of these methods also have pitfalls. For instance, the Oxford Dictionary API requires an API key and a paid subscription based on query volume, and thesaurus.com is missing potential synonyms that could be useful in grouping words.

https://www.thesaurus.com/browse/mother
synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator

Update

Producing an exact synonym list for every potential word in your corpus is hard and requires a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined approach also leads to some over-counting of word frequencies. I've been trying to reduce this over-counting by combining key and value pairs within my final dictionary of synonyms. The latter problem is much harder than I anticipated and might require me to open my own question to solve. In the end, I think that based on your use case you need to determine which approach works best and will likely need to combine several approaches.

Thanks for posting this question, because it allowed me to look at other methods for solving a complex problem.

```python
from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of
her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She
has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the
greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend,
This to me you have always been. Through the good times and the bad, Your understanding I have had."""


def normalize_textual_information(text):
    # split the text into tokens on whitespace
    token = text.split()

    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    token = [word.translate(table) for word in token]

    # remove any tokens that are not alphabetic
    token = [word.lower() for word in token if word.isalpha()]

    # filter out English stop words
    stop_words = set(stopwords.words('english'))

    # you could add additional stops like this
    stop_words.add('cannot')
    stop_words.add('could')
    stop_words.add('would')

    token = [word for word in token if word not in stop_words]

    # filter out any short tokens
    token = [word for word in token if len(word) > 1]
    return token


def generate_word_frequencies(words):
    # list to hold word frequencies
    word_frequencies = []

    # loop through the tokens and generate a word count for each token
    for word in words:
        word_frequencies.append(words.count(word))

    # aggregate the words and word_frequencies into tuples and convert them into a dictionary
    word_frequencies = dict(zip(words, word_frequencies))

    # sort the frequency of the words from low to high
    sorted_frequencies = {key: value for key, value in
                          sorted(word_frequencies.items(), key=lambda item: item[1])}

    return sorted_frequencies


def get_synonyms_internet(word):
    dictionary = PyDictionary(word)
    synonym = dictionary.getSynonyms()
    return synonym


words = normalize_textual_information(input_text)

all_synsets_1 = {}
for word in words:
    for synonym in wordnet.synsets(word):
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            for item in synonym.lemmas():
                if word != item.name():
                    all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}
for word in words:
    word_synonyms = get_synonyms_internet(word)
    for synonym in word_synonyms:
        if word != synonym and synonym is not None:
            all_synsets_2.update(synonym)
word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)
word_matches = []
word_set = {}
duplication_check = set()

for word, frequency in frequencies.items():
    for keyword, synonym in word_relationship.items():
        match = [x for x in synonym if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            if match not in word_set or match not in duplication_check or word not in duplication_check:
                duplication_check.add(word)
                duplication_check.add(match)
                word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keyword and its frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and output the total frequency count for a specific keyword
for item in synonym_matches:
    if item[0] not in final_results.keys():
        frequency_count = 0
        frequency_count = frequency_count + item[1]
        final_results[item[0]] = frequency_count
    else:
        frequency_count = frequency_count + item[1]
        final_results[item[0]] = frequency_count

# do something with the final results
```
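On the over-counting problem described in the update: much of it comes from overlapping synonym groups, where two keys in `word_relationship` (say, 'mother' and 'mum') share values and therefore both absorb the same frequencies. The following is only a hypothetical sketch of one possible direction, not part of the answer's code: a small union-find that collapses any keys sharing a synonym into a single group.

```python
def merge_synonym_groups(synonym_dict):
    """Collapse keys whose synonym lists overlap into a single group.

    synonym_dict: {keyword: [synonyms]}, such as the word_relationship
    dictionary built above. Returns {representative_word: set_of_words}.
    """
    parent = {}

    def find(x):
        # walk up to the root representative of x's group
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # union every keyword with each of its synonyms
    for keyword, synonyms in synonym_dict.items():
        for synonym in synonyms:
            union(keyword, synonym)

    # group all words by their root representative
    groups = {}
    for word in parent:
        groups.setdefault(find(word), set()).add(word)
    return groups


# example: 'mother' and 'mum' share 'mom', so they collapse into one group
groups = merge_synonym_groups({'mother': ['mom', 'mum'], 'mum': ['mom', 'ma']})
print(groups)  # one merged group containing mother, mum, mom, ma
```

Whether merging on any shared synonym is the right rule depends on the corpus; aggressive merging can fuse unrelated words through a single ambiguous synonym, which is exactly the dictionary-combination problem the update calls harder than expected.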
慕容708150
You could generate word embedding vectors and use a clustering algorithm. At the end, you would need to tune the algorithm's hyperparameters to achieve high-accuracy results.

```python
import numpy as np
import spacy
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Load the large English model
nlp = spacy.load("en_core_web_lg")
tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors
vectors = np.array([token.vector for token in tokens])
vectors.shape
# (12, 300)
```

Let's use the principal component analysis algorithm to visualize the embeddings in 3-dimensional space:

```python
pca_vecs = PCA(n_components=3).fit_transform(vectors)
pca_vecs.shape
# (12, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]
_ = ax.scatter(xs, ys, zs)

for x, y, z, label in zip(xs, ys, zs, tokens):
    ax.text(x + 0.3, y, z, str(label))
```

Let's cluster the words using the DBSCAN algorithm:

```python
model = DBSCAN(eps=5, min_samples=1)
model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):
    print(word, '->', cluster)
```

Output:

```
dog -> 0
cat -> 0
banana -> 1
apple -> 2
teaching -> 3
teacher -> 3
mom -> 4
mother -> 4
mama -> 4
mommy -> 4
berlin -> 5
paris -> 6
```
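To connect the clusters back to the original frequency-counting task, you could sum the counts of every word that lands in the same cluster. A minimal sketch, reusing `tokens` and `model.labels_` from the snippet above and a hypothetical `corpus` list standing in for your real text:

```python
from collections import Counter

# hypothetical corpus tokens whose frequencies we want to aggregate
corpus = "mom mother mommy dog cat mom mother".split()
word_counts = Counter(corpus)

# map each clustered vocabulary word to its DBSCAN cluster label
word_to_cluster = {str(word): cluster for word, cluster in zip(tokens, model.labels_)}

# sum the frequencies of all corpus words that share a cluster
cluster_totals = Counter()
for word, count in word_counts.items():
    if word in word_to_cluster:
        cluster_totals[word_to_cluster[word]] += count

print(cluster_totals)  # the mom/mother/mommy cluster accumulates one combined count
```

Like the synonym-dictionary approaches above, the quality of the grouping depends entirely on the embeddings and the hyperparameters (here `eps` and `min_samples`), so the cluster assignments should be inspected before trusting the aggregated counts.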