Comparing n-grams and grouping duplicates

I'm writing a script that treats two lines as duplicates if they share three consecutive words.


Suppose my current dataset is:


1 A Course of Pure Mathematics by G. H. Hardy

2 Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin

3 Advanced Programming in the UNIX Environment, 3rd Edition

4 Advanced Selling Strategies: Brian Tracy

5 Advanced Programming in the UNIX(R) Environment

6 Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley

7 Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising

8 Agile Software Development, Principles, Patterns, and Practices

9 A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy 

10 Alex’s Adventures in Numberland

11 Advertising Secrets of the Written Word

12 Alex's Adventures in Numberland Paperback by Alex Bellos

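Here, 1 and 9 are duplicates because "course pure mathematics" matches (stop words such as "of" are ignored). 2 and 8 are duplicates because "agile software development" matches. 3 and 5 are duplicates because "advanced programming unix" matches. And so on ...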
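To make the matching rule concrete, here is a minimal sketch of the pairwise check. It is only an illustration under assumptions: the stop-word set is a small placeholder, and the helper names `trigrams` and `shared_trigrams` are hypothetical, not part of the question.

```python
import re

# Assumed stop-word list (the question does not give one); kept lowercase
# because titles are lowercased before filtering.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "on", "the", "to"}


def trigrams(title):
    """Return the set of 3-word sequences left after normalising a title."""
    # Replace punctuation with spaces, lowercase, split on whitespace.
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    words = [w for w in words if w not in STOP_WORDS]
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def shared_trigrams(title_a, title_b):
    """Return the trigrams two titles have in common (empty set if none)."""
    return trigrams(title_a) & trigrams(title_b)


first = "A Course of Pure Mathematics by G. H. Hardy"
ninth = ("A Course of Pure Mathematics (Cambridge Mathematical Library) "
         "10th Edition by G. H. Hardy")
print(shared_trigrams(first, ninth))
```

Two lines count as duplicates under this rule whenever `shared_trigrams` returns a non-empty set.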


偶然的你
1 Answer

宝慕林4294392

OP here. The solution seems to be:

    import re
    from nltk.util import ngrams

    # Lowercase only: every line is lowercased before the stop words are filtered out.
    stop_words = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by',
                  'com', 'for', 'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or',
                  'that', 'the', 'this', 'to', 'was'}

    books_with_ngrams = []  # (original line, set of 3-grams) pairs

    with open('UnifiedBookList.txt') as fin:
        for line in fin:
            # Replace punctuation with spaces and convert to lower case
            cleaned = re.sub(r'[^\w\s]', ' ', line).lower()
            # Skip empty lines and lines with fewer than 3 words
            if len(cleaned.split()) <= 2:
                continue
            tokens = [w for w in cleaned.split() if w not in stop_words]
            # Store the original line next to its n-grams so the two
            # cannot get out of step when a line is skipped
            books_with_ngrams.append((line.rstrip('\n'), set(ngrams(tokens, 3))))

    while books_with_ngrams:
        first_line, first_grams = books_with_ngrams.pop(0)
        # Split the remaining books into matches and non-matches instead of
        # removing items from the list while iterating over it
        matches = [line for line, grams in books_with_ngrams if first_grams & grams]
        books_with_ngrams = [(line, grams) for line, grams in books_with_ngrams
                             if not (first_grams & grams)]
        if matches:
            print(first_line)
            for line in matches:
                print(line)
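As an alternative to comparing every line against every other line, the grouping could also be sketched with an index from each trigram to the first title that contains it, merged with a small union-find. This is a different technique from the loop in the answer, not the answerer's method: it groups transitively (if A matches B and B matches C, all three land in one group). The stop-word set is an assumed placeholder, and `trigrams` and `group_duplicates` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed stop-word list; lowercase because titles are lowercased first.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "on", "the", "to"}


def trigrams(title):
    """Return the set of 3-word sequences left after normalising a title."""
    words = re.sub(r"[^\w\s]", " ", title).lower().split()
    words = [w for w in words if w not in STOP_WORDS]
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def group_duplicates(titles):
    """Group titles that share a trigram, directly or through a chain."""
    parent = list(range(len(titles)))  # union-find forest over title indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # trigram -> index of the first title containing it
    for idx, title in enumerate(titles):
        for gram in trigrams(title):
            if gram in first_seen:
                # Merge this title's group with the earlier title's group.
                parent[find(idx)] = find(first_seen[gram])
            else:
                first_seen[gram] = idx

    groups = defaultdict(list)
    for idx, title in enumerate(titles):
        groups[find(idx)].append(title)
    # Only groups with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


titles = [
    "Advanced Programming in the UNIX Environment, 3rd Edition",
    "Advanced Selling Strategies: Brian Tracy",
    "Advanced Programming in the UNIX(R) Environment",
    "Alex’s Adventures in Numberland",
    "Alex's Adventures in Numberland Paperback by Alex Bellos",
]
for group in group_duplicates(titles):
    print(group)
```

With these five titles the two UNIX books form one group and the two Numberland books another, while the sales title is left out because it shares no trigram with anything else. Each trigram is looked up once, so the work grows with the total number of trigrams and not with the number of title pairs.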