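I want to display the matching keys for words in a dictionary in my project. My code currently outputs keys, but it returns the same keys for any word you type. For example, if I enter 'england played well', the keys returned are [737, 736, 735, 734, 733, 732, 731, 730, 729, 728]. If I enter 'Hello', the same keys are returned. Please see the code below and let me know if I have done something wrong.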
import re
import os
import math
import heapq

def readfile(path, docid):
    files = sorted(os.listdir(path))
    f = open(os.path.join(path, files[docid]), 'r',encoding='latin-1')
    s = f.read()
    f.close()
    return s

DELIM = '[ \n\t0123456789;:.,/\(\)\"\'-]+'

def tokenize(text):
    return re.split(DELIM, text.lower())

N = len(sorted(os.listdir('docs')))

def indextextfiles_RR(path):
    postings={}
    docLength = {}
    term_in_document = {}
    for docID in range(N):
        s = readfile(path, docID)
        words = tokenize(s)
        length = 0
        for w in words:
            if w!='':
                length += (math.log10(words.count(w)))**2
        docLength[docID] = math.sqrt(length)
        for w in words:
            if w!='':
                doc_length = math.log10(words.count(w))/docLength[docID]
                term_in_document.setdefault(doc_length, set()).add(docID)
                postings[w] = term_in_document
    return postings

def query_RR(postings, qtext):
    words = tokenize(qtext)
    doc_scores = {}
    for docID in range(N):
        score = 0
        for w in words:
            tf = words.count(w)
            df = len(postings[w])
            idf = math.log10(N / (df+1))
            query_weights = tf * idf
        for w in words:
            if w in postings:
                score = score + query_weights
        doc_scores[docID] = score
    res = heapq.nlargest(10, doc_scores)
    return res

postings = indextextfiles_RR('docs')
print(query_RR(postings, 'hello'))
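
One likely reason the output never changes: `heapq.nlargest(10, doc_scores)` iterates over the dictionary's keys, so it returns the ten largest docIDs rather than the ten highest-scoring documents. With 738 or more files in `docs`, that would explain a result like [737, ..., 728] for every query. A minimal sketch of the difference, using made-up scores (the values in `doc_scores` here are hypothetical, not taken from the real index):

```python
import heapq

# Hypothetical scores: docID -> relevance score (made-up values for illustration).
doc_scores = {0: 0.10, 1: 0.75, 2: 0.00, 3: 0.40, 4: 0.90}

# Iterating a dict yields its keys, so this ranks the docIDs themselves.
by_key = heapq.nlargest(3, doc_scores)

# key= maps each docID to its stored score, so this ranks by score.
by_score = heapq.nlargest(3, doc_scores, key=doc_scores.get)

print(by_key)    # [4, 3, 2]
print(by_score)  # [4, 1, 3]
```

Passing `key=doc_scores.get` only fixes the ranking step. In `query_RR` the score is also computed without looking at `docID`, so every document appears to receive the same score and the ranking would still carry no information until the scoring itself depends on the document.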
When I run it, postings should return 'hello' and the list of keys associated with it.
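
For `postings['hello']` to hold only the documents associated with 'hello', each term probably needs its own inner dictionary. In `indextextfiles_RR` a single `term_in_document` dict is created once and assigned to every word, so all terms appear to share the same entries. The following is a rough sketch of one possible layout, not a drop-in replacement. It makes several assumptions that are not in the original code: the names `build_index` and `query` are hypothetical, the documents are in-memory strings instead of files under `docs`, the term weight is `1 + log10(tf)` (so a word that occurs once is not weighted 0), and idf is `log10(N / df)`.

```python
import heapq
import math
import re
from collections import Counter

DELIM = r'[ \n\t0123456789;:.,/()"\'-]+'

def tokenize(text):
    # Drop the empty strings re.split leaves at the edges of the text.
    return [w for w in re.split(DELIM, text.lower()) if w]

def build_index(docs):
    """Map each term to its own {docID: normalized weight} dict."""
    postings = {}
    for doc_id, text in enumerate(docs):
        counts = Counter(tokenize(text))
        # Assumption: 1 + log10(tf) weighting instead of log10(tf).
        weights = {w: 1 + math.log10(c) for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        for w, v in weights.items():
            postings.setdefault(w, {})[doc_id] = v / norm
    return postings

def query(postings, n_docs, qtext, k=10):
    scores = {}
    for w, tf in Counter(tokenize(qtext)).items():
        if w not in postings:
            continue  # unknown terms contribute nothing
        idf = math.log10(n_docs / len(postings[w]))
        # Only documents that actually contain the term receive a score.
        for doc_id, weight in postings[w].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf * weight
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical in-memory documents standing in for the files under 'docs'.
docs = ["Hello world", "England played well", "hello again, hello", "well played"]
postings = build_index(docs)
print(sorted(postings["hello"]))                          # [0, 2]
print(query(postings, len(docs), "hello"))                # [2, 0]
print(query(postings, len(docs), "england played well"))  # [1, 3]
```

With this layout, different queries touch different inner dictionaries, so they can return different docIDs, and a word that is missing from the index is skipped instead of raising a `KeyError`.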