守候你守候我
如果我明白你在做什么;在每一列中找到最常见的字符(?),你可以这样做:def most_common(col, exclude_char='N'): col = list(filter((exclude_char).__ne__, col)) return max(set(col), key=col.count)sequences = []with open('DNAinput.txt', 'r') as file: for line in file: if line[0] == '>': continue else: sequences.append(line.strip())m = max([len(v) for v in sequences])matrix = [list(v) for v in sequences]for seq in matrix: seq.extend(list('N' * (m - len(seq))))transposed_matrix = [[matrix[j][i] for j in range(len(matrix))] for i in range(m)] for column in transposed_matrix: print(most_common(column))这通过以下方式工作:打开您的文件并将其读入list如下所示:# This is the `sequences` list['GATCA', 'AATC', 'AATA', 'ACTA']获取最长DNA序列的长度:# m = max([len(v) for v in sequences])5从这些序列创建一个矩阵(列表列表):# matrix = [list(v) for v in sequences][['G', 'A', 'T', 'C', 'A'], ['A', 'A', 'T', 'C'], ['A', 'A', 'T', 'A'], ['A', 'C', 'T', 'A']]填充矩阵,使所有序列的长度相同:# for seq in matrix:# seq.extend(list('N' * (m - len(seq))))[['G', 'A', 'T', 'C', 'A'], ['A', 'A', 'T', 'C', 'N'], ['A', 'A', 'T', 'A', 'N'], ['A', 'C', 'T', 'A', 'N']]转置矩阵,使列移动top -> bottom(不是left -> right)。这会将来自相同位置的所有字符一起放入一个列表中。# [[matrix[j][i] for j in range(len(matrix))] for i in range(m)][['G', 'A', 'A', 'A'], ['A', 'A', 'A', 'C'], ['T', 'T', 'T', 'T'], ['C', 'C', 'A', 'A'], ['A', 'N', 'N', 'N']]最后,迭代转置矩阵中的每个列表,并most_common以子列表作为输入调用:# for column in transposed_matrix:# print(most_common(column))AATCA这种方法有一些注意事项;首先,如果most_common单个位置中有相同数量的核苷酸,我所包含的函数将返回第一个值(参见位置四,这可能是A或C)。此外,该most_common功能可能需要成倍的时间要比使用Counter从集合。由于这些原因,我强烈建议使用以下脚本代替collections安装时包含在 python 中的脚本。from collections import Countersequences = []with open('DNAinput.txt', 'r') as file: for line in file: if line[0] == '>': continue else: sequences.append(line.strip())m = max([len(v) for v in sequences])matrix = [list(v) for v in sequences]for seq in matrix: seq.extend(list('N' * (m - len(seq))))transposed_matrix = [[matrix[j][i] for j in range(len(matrix))] for i in range(m)] for column in transposed_matrix: print(Counter(column).most_common(5))