How do I group lines by ID in a file where each ID line is followed by a value line?

The sequence file I generated looks like this:


>rpl-7

ATGGCTCCAAC

>rpl-7

AAGAAAGTGCCACAGGTTCCAGAAAC

>rpl-8

AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC

>rpl-8

GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA

>rpl-8

AAGTTCAACATCATCTGTCTTGAGGA

I want to merge the sequences that share the same ID, like this:


>rpl-7

ATGGCTCCAAC

AAGAAAGTGCCACAGGTTCCAGAAAC

>rpl-8

AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC

GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA

AAGTTCAACATCATCTGTCTTGAGGA

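I used Python to check whether consecutive lines starting with '>' are the same and, if they are, kept appending the sequence. However, this approach fails to output the first ID. I also think awk would make this easier, but unfortunately I am not familiar with awk. Do you know how to do this? Thanks.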


缥缈止盈
3 Answers

红颜莎娜

Loop over the input file, using each `rpl-id` line as the key to group the values into a dictionary of lists:

    rpl_dict = {}

    with open('rpl_input.txt') as rpl_input_file:
        for line in rpl_input_file:
            line = line.strip()
            if not line:
                # Skip blank lines between records
                continue
            if line.startswith('>rpl'):
                # Fetching current `rpl-id`
                rpl_key = line
            else:
                # Appending current `rpl-value` to the list for its ID
                if rpl_key not in rpl_dict:
                    rpl_dict[rpl_key] = []
                rpl_dict[rpl_key].append(line)

    # {'>rpl-7': ['ATGGCTCCAAC', 'AAGAAAGTGCCACAGGTTCCAGAAAC'], '>rpl-8': ['AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC', 'GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA', 'AAGTTCAACATCATCTGTCTTGAGGA']}
    print(rpl_dict)

    with open('rpl_output.txt', 'w') as rpl_output_file:
        for rpl_id, rpl_values in rpl_dict.items():
            # Write the ID of this group (rpl_id, not the leftover rpl_key)
            rpl_output_file.write(f'{rpl_id}\n')
            for v in rpl_values:
                rpl_output_file.write(f'{v}\n')

Output file:

    >rpl-7
    ATGGCTCCAAC
    AAGAAAGTGCCACAGGTTCCAGAAAC
    >rpl-8
    AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC
    GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA
    AAGTTCAACATCATCTGTCTTGAGGA
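If the real file has blank lines between records (as the listing in the question suggests), or if you want to reuse the grouping logic, the same dictionary-of-lists idea can be wrapped in small functions. This is only a sketch: the names `group_by_id` and `format_groups` are made up for illustration, and it assumes every ID line starts with `>`.

```python
from collections import defaultdict


def group_by_id(lines):
    """Group value lines under the most recent '>' header line."""
    groups = defaultdict(list)
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith('>'):
            current_id = line  # remember which ID the next values belong to
        elif current_id is not None:
            groups[current_id].append(line)
    # Plain dicts keep insertion order (Python 3.7+), so IDs stay in file order
    return dict(groups)


def format_groups(groups):
    """Render the grouped records back into 'ID line, then value lines' text."""
    out = []
    for record_id, values in groups.items():
        out.append(record_id)
        out.extend(values)
    return '\n'.join(out) + '\n'
```

Because `group_by_id` accepts any iterable of lines, it can be called directly on an open file object, for example `group_by_id(open('rpl_input.txt'))`, and the string from `format_groups` can then be written to the output file.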

慕勒3428872

Here is another solution:

    input_ = """>rpl-7
    ATGGCTCCAAC
    >rpl-7
    AAGAAAGTGCCACAGGTTCCAGAAAC
    >rpl-8
    AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC
    >rpl-8
    GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA
    >rpl-8
    AAGTTCAACATCATCTGTCTTGAGGA"""

    results = {}
    lines = input_.splitlines()
    # Pair each ID line (even index) with the value line that follows it (odd index)
    for i, j in zip(lines[::2], lines[1::2]):
        results.setdefault(i, []).append(j)

    for i, j in results.items():
        print(i)
        print("\n".join(j))

Output:

    >rpl-7
    ATGGCTCCAAC
    AAGAAAGTGCCACAGGTTCCAGAAAC
    >rpl-8
    AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC
    GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA
    AAGTTCAACATCATCTGTCTTGAGGA
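Note that the slicing approach relies on the lines strictly alternating between ID and value. A slightly more defensive sketch is shown below; the function name `pair_and_group` is hypothetical, and it assumes blank lines carry no meaning and that every ID line starts with `>`.

```python
def pair_and_group(text):
    """Pair each ID line with the value line after it, then group by ID."""
    # Drop blank lines so stray empty lines do not shift the pairing
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-blank lines")

    results = {}
    for header, value in zip(lines[::2], lines[1::2]):
        if not header.startswith('>'):
            raise ValueError(f"expected an ID line, got: {header!r}")
        results.setdefault(header, []).append(value)
    return results
```

Raising an error on malformed input is a design choice here: it makes a misaligned file fail loudly instead of silently pairing a sequence with the wrong ID.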

撒科打诨

You can do this with a regular expression. Since you mentioned a file, I added newline characters to the test string; you can replace it with the contents of your file.

    import re

    # An ID such as rpl-7 (single digit) followed by the line after it
    regex = r'rpl-\d\n.*(?:$|\n)'
    dic = {}

    test_str = (">rpl-7\n"
        "ATGGCTCCAAC\n"
        ">rpl-7\n"
        "AAGAAAGTGCCACAGGTTCCAGAAAC\n"
        ">rpl-8\n"
        "AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGC\n"
        ">rpl-8\n"
        "GCTCTCCAGATCCTCCGTCTTCGTCAGATCAA\n"
        ">rpl-8\n"
        "AAGTTCAACATCATCTGTCTTGAGGA\n")

    matches = re.finditer(regex, test_str, re.MULTILINE)

    for match in matches:
        # Each match is "ID\nsequence"; concatenate sequences that share an ID
        rpl, pro = match.group().split('\n')
        if rpl in dic:
            dic[rpl] = dic[rpl] + pro
        else:
            dic[rpl] = pro

    print(dic)

Output:

    {'rpl-7': 'ATGGCTCCAACAAGAAAGTGCCACAGGTTCCAGAAAC', 'rpl-8': 'AAGAACAAGGAGAAGAAGACCCAATACTTCAAGCGTGCGCTCTCCAGATCCTCCGTCTTCGTCAGATCAAAAGTTCAACATCATCTGTCTTGAGGA'}
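The pattern above only matches IDs ending in a single digit (`\d`). If your IDs can be longer (for example `rpl-10`) or blank lines may separate an ID from its sequence, a variant along these lines might work. It is a sketch under those assumptions: the helper name `merge_with_regex` is made up, and sequences are assumed to consist of letters only.

```python
import re

# '>' at the start of a line, the ID, then the first run of letters that follows
RECORD = re.compile(r'^>(\S+)[ \t]*\n\s*([A-Za-z]+)', re.MULTILINE)


def merge_with_regex(text):
    """Concatenate the sequences of records that share the same ID."""
    merged = {}
    for record_id, seq in RECORD.findall(text):
        # Append to the existing string, or start a new one for a new ID
        merged[record_id] = merged.get(record_id, '') + seq
    return merged
```

As in the answer above, the sequences for one ID are joined into a single string and the keys do not include the leading `>`. To run it on a file, pass the whole contents, for example `merge_with_regex(open('rpl_input.txt').read())`.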