How to filter overlapping pairs of rows in a large file in Python

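I am trying to filter overlapping rows out of a large file in Python.

Two pairs of rows count as overlapping when their overlap degree exceeds 25%. The overlap degree is a*b/(c+d-a*b), where a is the number of elements common to the 1st and 3rd rows, b is the number of elements common to the 2nd and 4th rows, c is the number of elements in the 1st row multiplied by the number of elements in the 2nd row, and d is the number of elements in the 3rd row multiplied by the number of elements in the 4th row. If the overlap degree is greater than 0.25, the 3rd and 4th rows are deleted. So, if I have a large file with 1,000,000 rows in total, the first 6 rows look like this: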

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
c6 c32 c24 c63 c67 c67 c75 c75
k6 k12 k33 k63
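For reference, the definition above can be written in set notation. This is only a restatement under the assumption that each row is treated as a set of distinct elements; the symbols R_1 to R_4 (rows 1 to 4) are introduced here for illustration and do not appear in the original post.

```latex
\[
a = |R_1 \cap R_3|, \qquad
b = |R_2 \cap R_4|, \qquad
c = |R_1|\,|R_2|, \qquad
d = |R_3|\,|R_4|
\]
\[
\text{overlap} = \frac{a\,b}{c + d - a\,b}
\]
```

Under that assumption, a*b is the number of (row 1 element, row 2 element) combinations shared by both pairs, because $(R_1 \times R_2) \cap (R_3 \times R_4) = (R_1 \cap R_3) \times (R_2 \cap R_4)$. The ratio can therefore be read as a Jaccard index over those combinations, and it always lies between 0 and 1.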

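For the first pair of rows and the second pair of rows, a=3 (c6, c24, c32), b=3 (k6, k12, k63), c=25 and d=24, so a*b/(c+d-a*b)=9/40<0.25 and the 3rd and 4th rows are not deleted. Next, the overlap degree of the first pair and the third pair is 5*4/(25+28-5*4)=0.61>0.25, so the third pair of rows is deleted.
The final result is the first and second pairs of rows: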

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63

The pseudocode is as follows:

for i=1:(n-1) # n is half the number of rows in the big file
    for j=(i+1):n
        if overlap degree of the ith two rows and the jth two rows is more than 0.25
            delete the jth two rows from the big file
        end
    end
end
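One way to read the pseudocode above is as a greedy pass: a pair survives only if it does not overlap any earlier pair that has itself survived. The following is a minimal in-memory sketch of that reading, not a definitive implementation. The function names `overlap_degree` and `filter_pairs` are hypothetical, each row is assumed to be non-empty, and duplicates within a row are assumed to count once (rows are turned into sets).

```python
def overlap_degree(pair_a, pair_b):
    """Return a*b / (c + d - a*b) for two pairs of rows given as sets."""
    a = len(pair_a[0] & pair_b[0])        # elements shared by the first rows
    b = len(pair_a[1] & pair_b[1])        # elements shared by the second rows
    c = len(pair_a[0]) * len(pair_a[1])   # size product of the earlier pair
    d = len(pair_b[0]) * len(pair_b[1])   # size product of the later pair
    return a * b / (c + d - a * b)        # assumes no row is empty


def filter_pairs(pairs, threshold=0.25):
    """Keep a pair only if it overlaps no earlier kept pair above threshold."""
    kept = []
    for pair in pairs:
        if all(overlap_degree(earlier, pair) <= threshold for earlier in kept):
            kept.append(pair)
    return kept


# The six sample rows from the question.
lines = [
    "c6 c24 c32 c54 c67",
    "k6 k12 k33 k63 k62",
    "c6 c24 c32 c51 c68 c78",
    "k6 k12 k24 k63",
    "c6 c32 c24 c63 c67 c67 c75 c75",
    "k6 k12 k33 k63",
]
rows = [set(line.split()) for line in lines]
pairs = list(zip(rows[0::2], rows[1::2]))   # (row 1, row 2), (row 3, row 4), ...
kept = filter_pairs(pairs)
print(len(kept))  # 2: the first and second pairs survive
```

Iterating forward and comparing only against pairs already kept matches the "delete the jth two rows" step without mutating a list while looping over it.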

My Python code is below, but it is wrong. How can I fix it?

with open("iuputfile.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]

for first_index in range(len(sets) - 4, -2, -2):
    c = len(sets[first_index]) * len(sets[first_index + 1])
    for second_index in range(len(sets) - 2, first_index, -2):
        d = len(sets[second_index]) * len(sets[second_index + 1])
        ab = len(sets[first_index] | sets[second_index]) * len(sets[first_index + 1] | sets[second_index + 1])
        if (ab / (c + d - ab)) > 0.25:
            del sets[second_index]
            del sets[second_index + 1]

with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(set_)
        fileobj.write("{0}\n".format(output))

A similar question can be found at https://stackoverflow.com/questions/17321275/

How can I modify this code to solve this problem in Python? Thanks!
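A possible file-based sketch is shown here, offered as one reading of the requirements rather than a definitive fix. Compared with the attempt above, it uses set intersection (`&`) where the attempt uses union (`|`), it never deletes from a list by index (in the attempt, the second `del` runs after the indices have already shifted, so it removes the wrong row or raises `IndexError`), it walks the pairs forward as the pseudocode does, and it writes the original line text so element order is preserved. The names `overlap_degree` and `filter_file` are hypothetical, and treating a comparison between two pairs that each contain an empty row as "no overlap" is an assumption.

```python
def overlap_degree(first, second):
    """Return a*b / (c + d - a*b) for two pairs of rows given as sets."""
    a = len(first[0] & second[0])
    b = len(first[1] & second[1])
    c = len(first[0]) * len(first[1])
    d = len(second[0]) * len(second[1])
    denominator = c + d - a * b
    # Assumption: if both pairs contain an empty row, report no overlap.
    return a * b / denominator if denominator else 0.0


def filter_file(input_path, output_path, threshold=0.25):
    """Copy non-overlapping pairs of lines to output_path; return pairs kept."""
    kept = []  # sets for every pair already written to the output
    with open(input_path) as src, open(output_path, "w") as dst:
        # zip(src, src) pulls two consecutive lines per step;
        # a trailing unpaired line is silently dropped.
        for line1, line2 in zip(src, src):
            pair = (set(line1.split()), set(line2.split()))
            if all(overlap_degree(k, pair) <= threshold for k in kept):
                kept.append(pair)
                dst.write(line1)
                dst.write(line2)
    return len(kept)
```

It could be called as `filter_file("iuputfile.txt", "outputfile.txt")`. Only the kept pairs are held in memory, but every new pair is still compared against all kept pairs, so the worst case remains quadratic in the number of pairs for a file of 1,000,000 rows.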

MYYA


2 Answers