Writing an AND query to find matching documents in a dataset (Python)

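I am trying to build a function called "and_query" that takes a single string of one or more words as input and returns the list of documents whose abstracts contain all of those words.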


First, I put all the words into an inverted index, where id is the id of a document and abstract is its plain text.


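from collections import defaultdict

# Maps each term to the set of ids of the documents that contain it
inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)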
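
To make the shape of the index concrete, here is a minimal sketch on made-up data. The `abstracts` dictionary and the `simple_terms` helper are assumptions for illustration only; the question does not show the real `Abstracts`, `tokenize`, or `preprocess`.

```python
from collections import defaultdict

# Hypothetical sample data; the real Abstracts dict is not shown in the question.
abstracts = {
    1: "Vaccine trial in the Netherlands",
    2: "Vaccine safety review",
    3: "Netherlands housing market",
}

def simple_terms(text):
    # Stand-in for preprocess(tokenize(text)): lowercase and split on whitespace.
    return text.lower().split()

inverted_index = defaultdict(set)
for doc_id, abstract in abstracts.items():
    for term in simple_terms(abstract):
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["vaccine"]))      # [1, 2]
print(sorted(inverted_index["netherlands"]))  # [1, 3]
```

Each term maps to the set of document ids in which it occurs, which is what makes set operations a natural fit for the query step.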

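Then I wrote a query function, where finals is the list of all matching documents.

Because it should only return documents that match every word of the function argument, I used the set operation "intersection".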


def and_query(tokens):
    documents = set()
    finals = []
    terms = preprocess(tokenize(tokens))

    for term in terms:
        for i in inverted_index[term]:
            documents.add(i)

    for term in terms:
        temporary_set = set()
        for i in inverted_index[term]:
            temporary_set.add(i)
        finals.extend(documents.intersection(temporary_set))
    return finals


def finals_print(finals):
    for final in finals:
        display_summary(final)


finals_print(and_query("netherlands vaccine trial"))

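However, the function still seems to return documents whose abstract contains only one of the words.

Does anyone know what I am doing wrong with the set operations?

(I think the mistake is somewhere in this part of the code):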


for term in terms:
    temporary_set = set()
    for i in inverted_index[term]:
        temporary_set.add(i)
    finals.extend(documents.intersection(temporary_set))
return finals

Thanks in advance.

In short, this is basically what I want to do:


for word in words:
    id_set_for_one_word = set()
    for i in get_id_of_that_word[word]:
        id_set_for_one_word.add(i)
    # pseudo:
    id_set_for_one_word intersection (id_set_of_other_words)

finals.extend(set of all intersections for all words)

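So I need the intersection of the id sets of all these words: a single set containing only the ids that are present for every word in the query.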
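
The behaviour described here can be sketched with a single call to `set.intersection`. This is only an illustration under assumptions: `and_query_sketch` and the small `index` dictionary are hypothetical, and `query.lower().split()` stands in for `preprocess(tokenize(...))`.

```python
def and_query_sketch(query, index):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()  # stand-in for preprocess(tokenize(query))
    if not terms:
        return set()
    # index.get(term, set()) avoids inserting new keys into a defaultdict
    posting_sets = [index.get(term, set()) for term in terms]
    return set.intersection(*posting_sets)

# Hypothetical index: term -> set of document ids
index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1, 4},
}

print(and_query_sketch("netherlands vaccine trial", index))  # {1}
print(and_query_sketch("vaccine housing", index))            # set()
```

A term that is missing from the index contributes an empty set, so the whole query correctly matches nothing.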


明月笑刀无情
145 views · 3 answers

哔哔one

To elaborate on my code-smell comment, here is a rough draft of what I have done before to solve this kind of problem.

def tokenize(abstract):
    # return <set of words in abstract>
    set_ = ...  # left unspecified in this draft
    return set_

candidates = [(id, abstract, tokenize(abstract)) for (id, abstract) in Abstracts.items()]
all_criterias = "netherlands vaccine trial".split()

def searcher(candidates, criteria, match_on_found=True):
    search_results = []
    for cand in candidates:
        # cand[2] holds the set of tokens (or some such) for the abstract
        found = criteria in cand[2]
        if found == match_on_found:
            # match_on_found=False keeps the non-matches: that's an AND NOT if you wanted that
            search_results.append(cand)
    return search_results

for criteria in all_criterias:
    # pass in the current list every time; it gets progressively shrunk
    candidates = searcher(candidates, criteria)

# what's left is what you want
answer = [(cand[0], cand[1]) for cand in candidates]
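
The progressive-narrowing idea in this answer can be tried end to end with a small sketch. The sample `abstracts`, the `word_set` tokenizer, and the `keep_matching` helper are all assumptions made for illustration; the answer itself leaves the tokenizer unspecified.

```python
# Hypothetical sample data: document id -> abstract text
abstracts = {
    1: "vaccine trial in the netherlands",
    2: "vaccine safety review",
    3: "netherlands housing market",
}

def word_set(text):
    # Assumed tokenizer: lowercase, split on whitespace, deduplicate.
    return set(text.lower().split())

# One (id, text, words) tuple per document
candidates = [(doc_id, text, word_set(text)) for doc_id, text in abstracts.items()]

def keep_matching(candidates, term):
    # Keep only the candidates whose word set contains the term.
    return [cand for cand in candidates if term in cand[2]]

# Each pass filters the survivors of the previous pass, which gives AND semantics.
for term in "netherlands vaccine trial".split():
    candidates = keep_matching(candidates, term)

print([doc_id for doc_id, _, _ in candidates])  # [1]
```

This scans the remaining candidates once per term rather than using an index, so it trades lookup speed for simplicity.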

幕布斯6054654

I eventually found the solution myself. Replacing

    finals.extend(documents.intersection(id_set_for_one_word))
return finals

with

    documents = documents.intersection(id_set_for_one_word)
return documents

seems to work here. Thanks for the effort, everyone, though.
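
A small trace may help show why this change matters. The `index` and `terms` below are hypothetical stand-ins for `inverted_index` and the preprocessed query; the two loops mirror the original and the corrected versions.

```python
# Hypothetical index: term -> set of document ids
index = {"netherlands": {1, 3}, "vaccine": {1, 2}, "trial": {1, 4}}
terms = ["netherlands", "vaccine", "trial"]

# Union of all posting sets, as built by the first loop in the question.
documents = set().union(*(index[t] for t in terms))

# Original approach: intersecting the union with one term's set just gives
# that term's set back, so extend() collects every document of every term.
finals = []
for t in terms:
    finals.extend(documents.intersection(index[t]))
print(sorted(set(finals)))  # [1, 2, 3, 4]

# Corrected approach: reassigning narrows the same set on every pass,
# so only ids present for all terms survive.
narrowed = set(documents)
for t in terms:
    narrowed = narrowed.intersection(index[t])
print(narrowed)  # {1}
```

The key difference is that the corrected loop carries its result from one term to the next, while the original starts again from the full union each time.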

慕哥9229398

Question: return a list of matching documents for the words in the document abstracts.

The term with the min number of documents always contains the result, so start from its document set and intersect it with the sets of the other terms. If a term does not exist in inverted_index, nothing matches at all.

For simplicity, predefined data:

Abstracts = {1: 'Lorem ipsum dolor sit amet,',
             2: 'consetetur sadipscing elitr,',
             3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
             4: 'sed diam voluptua.',
             5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
             6: 'Stet clita kasd gubergren,',
             7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
            }

inverted_index = {'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5},
                  'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7},
                  'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5},
                  'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3},
                  'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7},
                  'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7},
                  'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5},
                  'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3}}

def and_query(tokens):
    print("tokens:{}".format(tokens))
    #terms = preprocess(tokenize(tokens))
    terms = tokens.split()

    term_min = None
    for term in terms:
        if term in inverted_index:
            # Find min
            if not term_min or term_min[0] > len(inverted_index[term]):
                term_min = (len(inverted_index[term]), term)
        else:
            # Break early, if a term is not in inverted_index
            return set()

    if term_min is None:
        # Empty query: nothing to match
        return set()

    # Copy the smallest set so the index itself is not modified
    finals = set(inverted_index[term_min[1]])
    print("term_min:{} inverted_index:{}".format(term_min, finals))
    # Keep only the documents that also contain every other term
    for term in terms:
        finals &= inverted_index[term]
    return finals

def finals_print(finals):
    if finals:
        for final in finals:
            print("Document [{}]:{}".format(final, Abstracts[final]))
    else:
        print("No matching Document found")

if __name__ == "__main__":
    for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
        finals_print(and_query(tokens))
        print()

Output:

tokens:sed diam voluptua.
term_min:(1, 'voluptua.') inverted_index:{4}
Document [4]:sed diam voluptua.

tokens:Lorem ipsum dolor
term_min:(2, 'Lorem') inverted_index:{1, 7}
Document [1]:Lorem ipsum dolor sit amet,
Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

tokens:Lorem ipsum dolor test
No matching Document found

Tested with Python: 3.4.2
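
A query such as "sed et" is a useful check for this kind of function, because both terms have two documents but share only one. The following is a minimal sketch, assuming a cut-down copy of the index above and a hypothetical helper name `and_query_smallest_first`; it starts from the smallest document set and intersects the rest.

```python
# Cut-down copy of the index above: term -> set of document ids
index = {"sed": {3, 4}, "et": {3, 5}, "diam": {3, 4}, "voluptua.": {4}}

def and_query_smallest_first(query, index):
    terms = query.split()
    # An empty query or an unknown term means nothing can match.
    if not terms or any(term not in index for term in terms):
        return set()
    # Smallest set first keeps the intermediate result as small as possible.
    ordered = sorted((index[term] for term in terms), key=len)
    result = set(ordered[0])  # copy, so the index is not modified
    for doc_ids in ordered[1:]:
        result &= doc_ids
        if not result:
            break  # already empty, further intersections cannot add ids
    return result

print(and_query_smallest_first("sed et", index))              # {3}
print(and_query_smallest_first("sed diam voluptua.", index))  # {4}
print(and_query_smallest_first("sed test", index))            # set()
```

Returning the smallest set on its own would give {3, 4} for "sed et", so the intersection step is what enforces the AND.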