chrX 7970000 8670000 3 2 7 7 RPS6KA6 4
chrX 7970000 8670000 3 2 7 7 SATL1 3
chrX 7970000 8670000 3 2 7 7 SH3BGRL 4
chrX 7970000 8670000 3 2 7 7 VCX2 1
chrX 86580000 86980000 1 1 1 5 KLHL4 2
chrX 87370000 88620000 4 4 11 11 CPXCR1 2
chrX 87370000 88620000 4 4 11 11 FAM9A 2
chrX 89050000 91020000 11 6 10 13 FAM9B 3
chrX 89050000 91020000 11 6 10 13 PABPC5 2
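I want to count how many times each row is repeated, considering only the 1st, 2nd and 3rd columns. The output will have 5 columns. The first 3 columns will be the same as in the input (with each row appearing only once), but the 4th column will hold several values on the same line (these values come from the 8th column of the original file). The 5th column is the number of times the first 3 columns are repeated in the original file.
In short: columns 4, 5, 6, 7 and 9 of the input file are useless for the output file. We should count the rows in which the first 3 columns are the same, so in the output file the first 3 columns are the same as in the input file (but repeated only once). The 5th column is the number of times the row is repeated. The 4th column of the output holds all the values from the 8th column of those repeated rows. In the expected output, this row is repeated 4 times: chrX 7970000 8670000. So the 5th column is 4 and the 4th column is: RPS6KA6,SATL1,SH3BGRL,VCX2. As you can see, the values in the 4th column are comma separated.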
chrX 7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 86580000 86980000 KLHL4 1
chrX 87370000 88620000 CPXCR1,FAM9A 2
chrX 89050000 91020000 FAM9B,PABPC5 2
I tried to do this in Python and wrote the following code:
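# Assumes rows that share the first 3 columns are adjacent, as in the sample.
final = []  # each entry: [chrom, start, end, [names], count]
with open("myfile.txt") as file:
    for line in file:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key = fields[:3]
        if final and final[-1][:3] == key:
            final[-1][3].append(fields[7])  # same region: collect the 8th column
            final[-1][4] += 1
        else:
            final.append(key + [[fields[7]], 1])  # new region: start a new entry
for chrom, start, end, names, count in final:
    print(chrom, start, end, ",".join(names), count)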
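The grouping described above can also be sketched with a dictionary keyed on the first three columns, which does not depend on repeated rows being adjacent in the file. This is only a minimal sketch: the function name `collapse_regions` and the inline `sample` list are made up for illustration, whitespace-separated columns are assumed, and lines with fewer than 8 columns are simply skipped. It relies on dictionaries keeping insertion order (Python 3.7+), so the regions come out in the order they first appear.

```python
def collapse_regions(lines):
    """Group lines by columns 1-3 and collect column 8 for each group."""
    groups = {}  # (chrom, start, end) -> list of values from the 8th column
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # assumption: ignore blank or malformed lines
        key = tuple(fields[:3])
        groups.setdefault(key, []).append(fields[7])
    # One output line per region: key columns, comma-joined names, row count
    return [
        " ".join(key + (",".join(names), str(len(names))))
        for key, names in groups.items()
    ]


# Hypothetical usage with a few rows taken from the sample input
sample = [
    "chrX 7970000 8670000 3 2 7 7 RPS6KA6 4",
    "chrX 7970000 8670000 3 2 7 7 SATL1 3",
    "chrX 86580000 86980000 1 1 1 5 KLHL4 2",
]
for row in collapse_regions(sample):
    print(row)
```

To run it on the real file, an open file object can be passed in place of `sample`, for example `with open("myfile.txt") as fh: rows = collapse_regions(fh)`.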