Parsing nested and complex XML in Python

3 Answers

蓝山帝景

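Your question is a little unclear, since in some cases you want to parse tag attributes and in others tag values. My understanding is that you need the following values:

- The value of the category attribute of the cell-line tag.
- The value of the created attribute of the cell-line tag.
- The value of the last_updated attribute of the cell-line tag.
- The value of the type attribute of the accession tag.
- The text of the name tag whose type attribute is identifier.
- The text of the name tags whose type attribute is synonym.

These values can be extracted from the XML file with the module xml.etree.ElementTree. In particular, note the findall and iter methods of the Element class.

Assuming the XML is in a file named cellosaurus.xml, the following snippet should do the job.

    import xml.etree.ElementTree as et


    def main():
        tree = et.parse('cellosaurus.xml')
        root = tree.getroot()

        results = []
        for element in root.findall('.//cell-line'):
            key_values = {}
            # attrib.get returns None instead of raising KeyError when an
            # attribute is missing.
            for key in ['category', 'created', 'last_updated']:
                key_values[key] = element.attrib.get(key)
            for child in element.iter():
                if child.tag == 'accession':
                    key_values['accession type'] = child.attrib.get('type')
                elif child.tag == 'name' and child.attrib.get('type') == 'identifier':
                    key_values['name type identifier'] = child.text
                elif child.tag == 'name' and child.attrib.get('type') == 'synonym':
                    # An entry can carry several synonyms, so collect them all
                    # rather than keeping only the last one.
                    key_values.setdefault('name type synonym', []).append(child.text)

            results.append([
                # Using the get method of the dict object in case any particular
                # entry does not have all the required attributes.
                key_values.get('category'),
                key_values.get('created'),
                key_values.get('last_updated'),
                key_values.get('accession type'),
                key_values.get('name type identifier'),
                key_values.get('name type synonym'),
            ])

        print(results)


    if __name__ == '__main__':
        main()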
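As a variation on the same idea, ElementTree's limited XPath support lets `find` and `findall` filter on attributes directly, which avoids the manual tag checks inside an `iter` loop. The following is only a sketch: the sample document, the accession placeholder `CVCL_XXXX`, the element nesting, and the `extract` helper are assumptions for illustration, with attribute values borrowed from the outputs shown elsewhere in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the nesting is an assumption about the real file,
# and CVCL_XXXX is a placeholder, not a real accession.
SAMPLE = """
<Cellosaurus>
  <cell-line-list>
    <cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12">
      <accession-list>
        <accession type="primary">CVCL_XXXX</accession>
      </accession-list>
      <name-list>
        <name type="identifier">#490</name>
        <name type="synonym">490</name>
        <name type="synonym">Mab 7</name>
        <name type="synonym">Mab7</name>
      </name-list>
    </cell-line>
  </cell-line-list>
</Cellosaurus>
"""


def extract(xml_text):
    """Return one dict per cell-line element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for cell in root.findall(".//cell-line"):
        # find returns the first match or None, so guard before using it.
        accession = cell.find(".//accession")
        identifier = cell.find(".//name[@type='identifier']")
        rows.append({
            "category": cell.get("category"),
            "created": cell.get("created"),
            "last_updated": cell.get("last_updated"),
            "accession_type": accession.get("type") if accession is not None else None,
            "identifier": identifier.text if identifier is not None else None,
            # findall returns every match, so all synonyms are kept.
            "synonyms": [n.text for n in cell.findall(".//name[@type='synonym']")],
        })
    return rows


rows = extract(SAMPLE)
print(rows)
```

Returning dicts rather than positional lists keeps each value labelled, which may be easier to work with if the set of fields changes later.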


狐的传说

In my opinion, the simplest way to parse XML is to use lxml.

    from lxml import etree

    data = """[your xml above]"""

    doc = etree.XML(data)
    for att in doc.xpath('//cell-line'):
        print(att.attrib['category'])
        print(att.attrib['last_updated'])
        print(att.xpath('.//accession/@type')[0])
        print(att.xpath('.//name[@type="identifier"]/text()')[0])
        print(att.xpath('.//name[@type="synonym"]/text()'))

Output:

    Hybridoma
    2020-03-12
    primary
    #490
    ['490', 'Mab 7', 'Mab7']

You can then assign the output to variables, append it to a list, and so on.
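One way to go from printed values to something reusable is to collect each cell-line into a dict and write the rows out as CSV. The sketch below swaps in the standard-library xml.etree.ElementTree for lxml so that it runs without a third-party install; ElementTree supports attribute predicates such as `[@type='synonym']` but not the `/text()` or `/@type` steps used above. The sample XML, the placeholder `CVCL_XXXX`, and the helper names `collect_records` and `to_csv` are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, flattened sample; CVCL_XXXX is a placeholder accession.
SAMPLE = """<cell-line-list>
  <cell-line category="Hybridoma" last_updated="2020-03-12">
    <accession type="primary">CVCL_XXXX</accession>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </cell-line>
</cell-line-list>"""

FIELDS = ["category", "last_updated", "accession_type", "identifier", "synonyms"]


def collect_records(xml_text):
    """Build one dict per cell-line element."""
    records = []
    for cell in ET.fromstring(xml_text).iter("cell-line"):
        accession = cell.find(".//accession")
        records.append({
            "category": cell.get("category"),
            "last_updated": cell.get("last_updated"),
            "accession_type": accession.get("type") if accession is not None else None,
            # findtext returns the text of the first match, or None.
            "identifier": cell.findtext(".//name[@type='identifier']"),
            # Join multiple synonyms into a single CSV cell.
            "synonyms": "; ".join(
                n.text for n in cell.findall(".//name[@type='synonym']")
            ),
        })
    return records


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = collect_records(SAMPLE)
csv_text = to_csv(records)
print(csv_text)
```

Writing to an `io.StringIO` buffer keeps the example free of file-system side effects; passing an open file object to `csv.DictWriter` instead would write the same rows to disk.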


呼唤远方

Here is another approach. I recently compared several XML parsing libraries and found this one easy to use, so I recommend it.

    from simplified_scrapy import SimplifiedDoc, utils

    xml = '''your xml above'''
    # xml = utils.getFileContent('your file name.xml')

    results = []
    doc = SimplifiedDoc(xml)
    for ele in doc.selects('cell-line'):
        key_values = {}
        # Copy the element's attributes, skipping the library's own keys.
        for k in ele:
            if k not in ['tag', 'html']:
                key_values[k] = ele[k]
        key_values['name type identifier'] = ele.select('name@type="identifier">text()')
        key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
        results.append(key_values)
    print(results)

Result:

    [{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]
