How to filter overlapping lines in a large file in Python

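The tricky part is that you have to modify the list you are iterating over while still keeping track of two indices. One way to do that is to walk backwards: deleting an item at an index equal to or greater than the ones you are tracking does not shift anything below it, so both indices stay valid. This code is untested, but you should get the idea:

    with open("file.txt") as fileobj:
        sets = [set(line.split()) for line in fileobj]

    # Walk both indices from the end so deletions never shift unvisited items.
    for first_index in range(len(sets) - 2, -1, -1):
        for second_index in range(len(sets) - 1, first_index, -1):
            union = sets[first_index] | sets[second_index]
            intersection = sets[first_index] & sets[second_index]
            # Skip pairs of empty lines (empty union) to avoid dividing by zero.
            if union and len(intersection) / float(len(union)) > 0.25:
                del sets[second_index]

    with open("output.txt", "w") as fileobj:
        for set_ in sets:
            # order of the set is undefined, so we need to sort each set
            # (the key assumes each token is one character followed by an integer)
            output = " ".join(sorted(set_, key=lambda x: int(x[1:])))
            fileobj.write("{0}\n".format(output))

Since it is obvious how the elements of each line are ordered, we can simply re-sort them on output. If the order were custom in some way, you would have to keep the line as read alongside each set, so that you can write back exactly the line that was read instead of regenerating it.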
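The last point, keeping each line as read next to its set, could look roughly like the sketch below. It assumes the same 0.25 overlap threshold and the same backwards traversal as the answer; the function name `filter_overlapping` and its `threshold` parameter are made up for illustration and are not part of the original code.

```python
def filter_overlapping(lines, threshold=0.25):
    """Return the lines that survive the backwards pairwise overlap check.

    Hypothetical helper: the name and the threshold parameter are assumptions.
    Surviving lines are returned exactly as they were passed in.
    """
    # Pair every original line with the set of its whitespace-separated tokens.
    entries = [(line, set(line.split())) for line in lines]

    # Iterate backwards so deleting entries[second_index] never shifts
    # an entry that still has to be visited.
    for first_index in range(len(entries) - 2, -1, -1):
        first_set = entries[first_index][1]
        for second_index in range(len(entries) - 1, first_index, -1):
            second_set = entries[second_index][1]
            union = first_set | second_set
            # An empty union means both lines are blank; leave them alone.
            if union and len(first_set & second_set) / float(len(union)) > threshold:
                del entries[second_index]

    # Only the untouched original lines are returned, never regenerated ones.
    return [line for line, _ in entries]
```

Because a file object yields its lines when iterated, it can be passed in directly, e.g. `kept = filter_overlapping(open_file)`, and the result written out with `writelines(kept)`; no sorting key is needed, since the original token order is preserved.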
