How can I use difflib to highlight (only) word errors?

3 Answers

慕斯王

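I would also suggest a solution based on difflib, but I prefer to detect the words with a regular expression, because it is more precise and more tolerant of odd characters and other noise. I added some stray punctuation to your original strings to show what I mean:

    import re
    import difflib

    truth = 'The quick! brown - fox jumps, over the lazy dog.'
    speech = 'the quick... brown box jumps. over the dog'

    # Lowercase, then keep only runs of word characters and apostrophes.
    truth = re.findall(r"[\w']+", truth.lower())
    speech = re.findall(r"[\w']+", speech.lower())

    for d in difflib.ndiff(truth, speech):
        print(d)

Output:

      the
      quick
      brown
    - fox
    + box
      jumps
      over
      the
    - lazy
      dog

Another possible output format:

    diff = difflib.unified_diff(truth, speech)
    print(''.join(diff))

Output (the word tokens carry no trailing newlines, so the hunk body comes out on a single line):

    ---
    +++
    @@ -1,9 +1,8 @@
     the quick brown-fox+box jumps over the-lazy dog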
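If only the mismatched words are of interest, the `ndiff` output can be filtered by its two-character prefix. The following is a minimal sketch under that assumption, using the same tokenization regex as above; the helper name `word_errors` is hypothetical and not part of difflib.

```python
import difflib
import re


def word_errors(truth, speech):
    """Return only the ndiff entries for removed ('- ') or added ('+ ') words.

    Hypothetical helper for illustration; not part of difflib.
    """
    def tokenize(text):
        # Lowercase, then keep runs of word characters and apostrophes.
        return re.findall(r"[\w']+", text.lower())

    diff = difflib.ndiff(tokenize(truth), tokenize(speech))
    # Unchanged words start with two spaces and hint lines with '? ';
    # both are dropped here.
    return [entry for entry in diff if entry[:2] in ('- ', '+ ')]


errors = word_errors(
    'The quick! brown - fox jumps, over the lazy dog.',
    'the quick... brown box jumps. over the dog',
)
print(errors)  # ['- fox', '+ box', '- lazy']
```

Entries starting with `- ` are words present only in the reference text, and entries starting with `+ ` are words present only in the transcript.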


HUX布斯

Why not split the sentences into words and then run difflib on those words?

    import difflib

    truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
    speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()

    for d in difflib.ndiff(truth, speech):
        print(d)
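One caveat with this approach: `strip('.')` only removes periods at the two ends of the whole string, so punctuation inside the sentence (for example the comma in "jumps,") stays attached to its word and shows up as a difference. A possible variation, sketched here as an assumption rather than as the answer's own method, strips punctuation from each token after splitting; the helper name `words` is hypothetical.

```python
import difflib
import string


def words(sentence):
    """Lowercase, split on whitespace, and strip punctuation from each token.

    Hypothetical helper for illustration.
    """
    stripped = (token.strip(string.punctuation) for token in sentence.lower().split())
    # Tokens that consisted only of punctuation (such as a lone '-') become
    # empty strings and are dropped.
    return [token for token in stripped if token]


truth = words('The quick brown fox jumps, over the lazy dog.')
speech = words('the quick brown box jumps over the dog')

for d in difflib.ndiff(truth, speech):
    print(d)
```

Note that `str.strip` only touches the ends of each token, so an apostrophe inside a word such as "don't" is preserved.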


神不在的星期二

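So I think I have solved this. I realized that difflib's `context_diff` reports the indices of the lines that contain changes. To get the indices for the "ground truth" text, I remove capitalization and punctuation, split the text into individual words, and then do the following:

    import difflib

    altered_word_indices = []
    diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
    for line in diff:
        # Hunk headers for the first sequence look like '*** 4 ****' or '*** 4,6 ****'.
        if line.startswith('*** ') and line.endswith(' ****\n'):
            line = line.replace(' ', '').replace('\n', '').replace('*', '')
            if ',' in line:
                # A 1-based inclusive range 'start,end': record every index in it.
                split_line = line.split(',')
                for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
                    altered_word_indices.append((int(split_line[0]) + i) - 1)
            else:
                # A single 1-based index.
                altered_word_indices.append(int(line) - 1)

After that, I print the text with the altered words in upper case:

    split_ground_truth = ground_truth.split(' ')
    for i in range(0, len(split_ground_truth)):
        if i in altered_word_indices:
            print(split_ground_truth[i].upper(), end=' ')
        else:
            print(split_ground_truth[i], end=' ')

This lets me print "The quick brown FOX jumps over the LAZY dog" (with the original capitalization and punctuation intact) rather than "the quick brown FOX jumps over the LAZY dog".

It is not a particularly elegant solution, and it still needs testing, cleanup, error handling and so on. But it seems like a decent start, and it may be useful to others who run into the same problem. I will leave this question open for a few days in case someone comes up with a less hacky way to get the same result.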
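As one possible less hacky route, the altered indices can be read directly from `difflib.SequenceMatcher.get_opcodes()` instead of being parsed out of the `context_diff` text. This is a sketch of a different technique from the one above, not the answer's own code; the names `normalize` and `highlight_errors` are hypothetical, and it assumes words are separated by whitespace.

```python
import difflib
import re


def normalize(word):
    """Lowercase a word and drop everything except word characters and apostrophes."""
    return re.sub(r"[^\w']", "", word.lower())


def highlight_errors(ground_truth, hypothesis):
    """Upper-case the ground-truth words that were replaced or dropped in the hypothesis.

    Hypothetical helper for illustration. Words that appear only in the
    hypothesis (insertions) have no ground-truth word to highlight and are ignored.
    """
    original_words = ground_truth.split()
    truth_tokens = [normalize(w) for w in original_words]
    hyp_tokens = [normalize(w) for w in hypothesis.split()]

    matcher = difflib.SequenceMatcher(None, truth_tokens, hyp_tokens, autojunk=False)
    altered = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # i1:i2 is the affected slice of the ground-truth token list.
        if tag in ('replace', 'delete'):
            altered.update(range(i1, i2))

    return ' '.join(
        word.upper() if i in altered else word
        for i, word in enumerate(original_words)
    )


print(highlight_errors(
    'The quick brown fox jumps over the lazy dog.',
    'the quick brown box jumps over the dog',
))
# The quick brown FOX jumps over the LAZY dog.
```

Because the comparison runs on normalized tokens while the output is built from the untouched original words, capitalization and punctuation survive in the printed sentence.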
