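I am working on a vector space model. The data set consists of 50 text files. I iterate through them, split each one into words, and save the words in a dictionary. Now I want to use a nested dictionary like: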
dictionary = {
    someword: {Doc1: 23, Doc21: 2, Doc34: 3},
    someword: {Doc1: 23, Doc21: 2, Doc34: 3},
    someword: {Doc1: 23, Doc21: 2, Doc34: 3}
}
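But when I run my program, it not only replaces the document entry, it also fails to compute the frequency: it does not add up the number of times "someword" occurs in a particular document.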
for iterator in range(1, 51):
    f = open(directory + str(iterator) + ext, "r")
    for line in f.read().lower().split():
        line = getwords(line)
        for word in line:
            if check(word, stopwords) == 0:
                if existence(word, terms, iterator) != 1:
                    terms[word] = {}
                    terms[word]["Doc"+str(iterator)] = 1
                else:
                    terms[word]["Doc"+str(iterator)] = int(terms[word]["Doc"+str(iterator)]) + 1
    f.close()
The existence function is:
def existence(tok, diction, iteration):
    if tok in diction:
        temp = "Doc"+str(iteration)
        if temp in diction:
            return 1
        else:
            return 0
    else:
        return 0
The result looks something like this:
{'blunder': {'Doc1': 1}, 'by': {'Doc50': 1}, 'anton': {'Doc27': 1}, 'chekhov': {'Doc27': 1}, 'an': {'Doc50': 1}, 'illustration': {'Doc48': 1}, 'story': {'Doc48': 1}, 'author': {'Doc48': 1}, 'portrait'...
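Judging from the code above, two things seem to produce this output. First, `existence` tests `temp in diction`, but the keys of the outer dictionary are words, so a "DocN" key is never found there; the test would need to look inside `diction[tok]`. Second, because the function therefore always returns 0, `terms[word] = {}` runs on every occurrence and discards whatever was already stored for that word, which leaves only the last document with a count of 1. A minimal sketch of one way to keep the counts is below. It assumes the documents are already in memory as a mapping from document name to text; `count_terms` and `docs` are hypothetical names, and plain `split()` stands in for `getwords` and the stopword check.

```python
from collections import defaultdict

def count_terms(docs):
    """Build {word: {doc_name: count}} from a {doc_name: text} mapping.

    `count_terms` and `docs` are illustrative names, not part of the
    original program; str.split() replaces getwords() and check().
    """
    terms = defaultdict(dict)
    for doc_name, text in docs.items():
        for word in text.lower().split():
            per_doc = terms[word]  # inner dict is created once per word, never reset
            per_doc[doc_name] = per_doc.get(doc_name, 0) + 1
    return dict(terms)

docs = {"Doc1": "the story of the author", "Doc2": "The author"}
terms = count_terms(docs)
print(terms["the"])     # {'Doc1': 2, 'Doc2': 1}
print(terms["author"])  # {'Doc1': 1, 'Doc2': 1}
```

Using `get(doc_name, 0)` removes the need for a separate existence check, since a missing document entry simply starts from zero.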