Need help parsing a complex text file

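You can collect the data in a dictionary, using each PMID as a key and its AUTHORs as the values. Suppose you start with a file like this:

    from io import StringIO

    import pandas as pd

    # StringIO stands in for a real file here;
    # for a file on disk use: with open(filename, 'r') as fo:
    fo = StringIO('''PMID- 12345678
    xyz - text (might be multiple lines)
    xyz- text (might be multiple lines)
    AUTHOR- author1
    AUTHOR- author2
    PMID- 12345679
    xyz - text (might be multiple lines)
    xyz- text (might be multiple lines)
    AUTHOR- author3
    AUTHOR- author4
    ''')

Then iterate over the lines and fill the dictionary:

    records = dict()
    pmid = None
    for line in fo:
        line = line.strip()
        if line.startswith('PMID-'):
            # A new record starts: remember its ID and open an empty author list.
            # Split on the first hyphen only, so hyphens inside the value survive.
            pmid = line.split('-', 1)[1].strip()
            records[pmid] = []
        elif line.startswith('AUTHOR') and pmid is not None:
            # Attach the author to the most recently seen PMID.
            records[pmid].append(line.split('-', 1)[1].strip())

When creating the DataFrame, you can either pass the dictionary directly, which gives one column per PMID with the authors listed down the rows (this requires every record to have the same number of authors):

    df = pd.DataFrame(records)

or join each author list into a single string before passing it to the DataFrame constructor:

    df = pd.DataFrame(
        [', '.join(r) for r in records.values()],
        index=list(records)
    )

Output:

                             0
    12345678  author1, author2
    12345679  author3, author4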
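The parsing loop can also be wrapped in a small reusable function. The following is only a sketch under stated assumptions: `parse_authors` is a hypothetical helper name (not part of the original answer), every tagged line is assumed to have the form `TAG- value`, and the hyphenated author name in the sample is invented to show why splitting on the first hyphen only can matter.

```python
from io import StringIO


def parse_authors(lines):
    """Map each PMID to the list of AUTHOR values that follow it.

    Hypothetical helper: assumes tagged lines look like 'TAG- value'.
    """
    records = {}
    pmid = None
    for line in lines:
        # partition() splits on the first hyphen only, so a value such as
        # 'Smith-Jones' is kept whole.
        tag, sep, value = line.partition('-')
        if not sep:
            continue  # no hyphen at all: a continuation line, ignored here
        tag, value = tag.strip(), value.strip()
        if tag == 'PMID':
            pmid = value
            records[pmid] = []
        elif tag == 'AUTHOR' and pmid is not None:
            records[pmid].append(value)
        # Any other tag (for example 'xyz') is skipped.
    return records


# Invented sample data, used only to exercise the function.
sample = StringIO('''PMID- 12345678
xyz - text (might be multiple lines)
AUTHOR- author1
AUTHOR- Smith-Jones
PMID- 12345679
AUTHOR- author3
''')

records = parse_authors(sample)
print(records)
```

Because the function accepts any iterable of lines, the same call should work with an open file object in place of the `StringIO` sample, and the resulting dictionary can be passed to the DataFrame constructor as described above.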

