
Specifying the vocabulary input in LDA

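I am trying to understand how to use LDA for my case. I have a corpus with many documents, and I want to see how a very specific set of words and n-grams is distributed across topics. Is there a way to specify a list of particular words as the vocabulary for topic modeling?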

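I have been using the gensim implementation, and I believe the id2word argument handles this, but the documentation is not clear to me. Is my understanding correct?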


蓝山帝景
2 answers

呼如林

You can use scikit-learn's CountVectorizer for this:

from sklearn.feature_extraction.text import CountVectorizer
from gensim import matutils
from gensim.models.ldamodel import LdaModel

text = ['computer time graph', 'survey response eps', 'human system computer',
        'machinelearning is very hot topic',
        'python win the race for simplicity as compared to other programming language']

# Suppose these are the words you want to use as your vocabulary
vocabulary = ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time']

# Only the terms listed in `vocabulary` are counted
vect = CountVectorizer(vocabulary=vocabulary)
x = vect.fit_transform(text)
# On older scikit-learn versions this method is get_feature_names()
feature_names = vect.get_feature_names_out()

# Convert the sparse matrix with gensim's matutils helper.
# CountVectorizer puts documents in rows, so documents_columns must be False.
corpus = matutils.Sparse2Corpus(x, documents_columns=False)
model = LdaModel(corpus, num_topics=3, id2word=dict(enumerate(feature_names)))

# Print the topics
print(model.show_topics())

# Check the vocabulary that is being used
print(list(feature_names))
# ['machine', 'python', 'learning', 'human', 'system', 'hot', 'time']
# Only the features you asked for are included
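Since the question also mentions n-grams, here is a rough pure-Python sketch of what counting against a fixed vocabulary amounts to. It is only an illustration, not how scikit-learn does it internally: the helper names `ngrams` and `count_fixed_vocabulary`, the whitespace tokenizer, and the sample documents are all assumptions made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_fixed_vocabulary(docs, vocabulary, max_n=2):
    """Count only the terms listed in `vocabulary`; one row per document."""
    rows = []
    for doc in docs:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        counts = Counter()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
        # Terms outside the vocabulary are simply never looked up
        rows.append([counts[term] for term in vocabulary])
    return rows


# Hypothetical sample data for illustration
docs = [
    "machine learning is hot",
    "machinelearning is hot",
    "python beats machine code",
]
vocabulary = ["machine", "machine learning", "hot", "python"]

matrix = count_fixed_vocabulary(docs, vocabulary)
print(matrix)
# [[1, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 1]]
```

Note the second row: "machinelearning" written as a single token matches neither "machine" nor "machine learning", so a fixed vocabulary only picks up terms that the tokenizer actually produces.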

守着一只汪

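LDA's approach to topic modeling treats each document as a mixture of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions. Once you give the algorithm the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to arrive at a good composition of the topic-keyword distribution.

The two main inputs to the LDA topic model are the dictionary or vocabulary (id2word) and the corpus. You can use something like this:

import gensim.corpora as corpora

# data_lemmatized is assumed to be a list of tokenized documents (lists of strings)

# Create Dictionary/Vocabulary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]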
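To tie this back to the original question, the sketch below shows, in plain Python, what the two inputs look like when the vocabulary is restricted to a chosen word list: an id2word mapping from integer ids to terms, and a corpus where each document is a list of (token_id, count) pairs. The helper `build_restricted_bow` and the sample data are hypothetical and only mimic the shape of these inputs. In gensim itself, something like `Dictionary.filter_tokens(good_ids=...)` could probably be used to trim a dictionary instead.

```python
from collections import Counter


def build_restricted_bow(tokenized_docs, allowed_terms):
    """Build an id2word mapping and a bag-of-words corpus limited to allowed_terms."""
    id2word = dict(enumerate(allowed_terms))
    word2id = {term: idx for idx, term in id2word.items()}
    corpus = []
    for tokens in tokenized_docs:
        # Tokens outside the allowed list are dropped before counting
        counts = Counter(word2id[t] for t in tokens if t in word2id)
        corpus.append(sorted(counts.items()))
    return id2word, corpus


# Hypothetical tokenized documents, standing in for data_lemmatized
data_lemmatized = [
    ["human", "system", "computer"],
    ["python", "system", "python"],
    ["survey", "response"],
]
allowed_terms = ["human", "system", "python"]

id2word, corpus = build_restricted_bow(data_lemmatized, allowed_terms)
print(id2word)  # {0: 'human', 1: 'system', 2: 'python'}
print(corpus)   # [[(0, 1), (1, 1)], [(1, 1), (2, 2)], []]
```

A document that contains none of the allowed terms ends up as an empty list, so it contributes nothing to the topics.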