Parsing nested and complex XML in Python

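I am trying to parse a fairly complex XML file and store its contents in a data frame. I tried xml.etree.ElementTree and managed to retrieve some of the elements, but I somehow retrieved them multiple times, as if there were more objects. I am trying to extract the following: category, created, last_updated, accession type, name type identifier, and name type synonym as a list.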


<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
  <comment-list>
    <comment category="Monoclonal antibody target"> Cronartium ribicola antigens </comment>
    <comment category="Monoclonal antibody isotype"> IgM, kappa </comment>
  </comment-list>
  <species-list>
    <cv-term terminology="NCBI-Taxonomy" accession="10090">Mus musculus</cv-term>
  </species-list>
  <derived-from>
    <cv-term terminology="Cellosaurus" accession="CVCL_4032">P3X63Ag8.653</cv-term>
  </derived-from>
  <reference-list>
    <reference resource-internal-ref="Patent=US5616470"/>
  </reference-list>
  <xref-list>
    <xref database="CLO" category="Ontologies" accession="CLO_0001018">
      <url><![CDATA[https://www.ebi.ac.uk/ols/ontologies/clo/terms?iri=http://purl.obolibrary.org/obo/CLO_0001018]]></url>
    </xref>
    <xref database="ATCC" category="Cell line collections" accession="HB-12029">
      <url><![CDATA[https://www.atcc.org/Products/All/HB-12029.aspx]]></url>
    </xref>
    <xref database="Wikidata" category="Other" accession="Q54422073">
      <url><![CDATA[https://www.wikidata.org/wiki/Q54422073]]></url>
    </xref>
  </xref-list>
</cell-line>
</cellosaurus>


尚方宝剑之说
3 Answers

蓝山帝景

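Your question is a little unclear, because in some cases you want a tag's attributes and in others you want its text. My understanding is that you need the following values:

1. The value of the category attribute of the cell-line tag.
2. The value of the created attribute of the cell-line tag.
3. The value of the last_updated attribute of the cell-line tag.
4. The value of the type attribute of the accession tag.
5. The text of the name tag whose type attribute is identifier.
6. The text of every name tag whose type attribute is synonym.

These values can be extracted from the XML file with the xml.etree.ElementTree module. In particular, note the use of the findall and iter methods of the Element class. Assuming the XML is in a file named cellosaurus.xml, the following snippet should do the trick.

import xml.etree.ElementTree as et


def main():
    tree = et.parse('cellosaurus.xml')
    root = tree.getroot()

    results = []
    for element in root.findall('.//cell-line'):
        # Synonyms start as an empty list so that every match is kept,
        # not just the last one.
        key_values = {'name type synonym': []}
        for key in ['category', 'created', 'last_updated']:
            # get() returns None instead of raising if an attribute is missing.
            key_values[key] = element.attrib.get(key)
        for child in element.iter():
            if child.tag == 'accession':
                key_values['accession type'] = child.attrib.get('type')
            elif child.tag == 'name' and child.attrib.get('type') == 'identifier':
                key_values['name type identifier'] = child.text
            elif child.tag == 'name' and child.attrib.get('type') == 'synonym':
                key_values['name type synonym'].append(child.text)

        results.append([
            # Using the get method of the dict object in case any particular
            # entry does not have all the required attributes.
            key_values.get('category'),
            key_values.get('created'),
            key_values.get('last_updated'),
            key_values.get('accession type'),
            key_values.get('name type identifier'),
            key_values.get('name type synonym'),
        ])

    print(results)


if __name__ == '__main__':
    main()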
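Since the question asks for a data frame, rows shaped like the ones collected above can be handed to pandas. This is only a minimal sketch: it assumes pandas is installed and that each row holds the six values in the order listed. `COLUMNS`, `rows_to_records`, and `to_dataframe` are names made up for this illustration, and the sample row is copied from the XML in the question.

```python
# Column names are an assumption; they mirror the fields listed in the question.
COLUMNS = [
    "category", "created", "last_updated",
    "accession type", "name type identifier", "name type synonym",
]


def rows_to_records(rows):
    """Pair each positional row with COLUMNS to get one dict per cell line."""
    return [dict(zip(COLUMNS, row)) for row in rows]


def to_dataframe(rows):
    """Build a DataFrame; pandas is imported lazily so the rest runs without it."""
    import pandas as pd  # assumption: pandas is installed
    return pd.DataFrame(rows_to_records(rows))


# Sample row taken from the <cell-line> entry in the question.
sample_rows = [
    ["Hybridoma", "2012-06-06", "2020-03-12", "primary", "#490",
     ["490", "Mab 7", "Mab7"]],
]
records = rows_to_records(sample_rows)
print(records[0]["name type synonym"])
```

Calling `to_dataframe(sample_rows)` should then give one row per cell line, with the synonym list stored as a single cell value.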

狐的传说

IMHO, the simplest way to parse XML is to use lxml.

from lxml import etree

data = """[your xml above]"""
doc = etree.XML(data)
for att in doc.xpath('//cell-line'):
    print(att.attrib['category'])
    print(att.attrib['last_updated'])
    print(att.xpath('.//accession/@type')[0])
    print(att.xpath('.//name[@type="identifier"]/text()')[0])
    print(att.xpath('.//name[@type="synonym"]/text()'))

Output:

Hybridoma
2020-03-12
primary
#490
['490', 'Mab 7', 'Mab7']

You can then assign the output to variables, append it to a list, and so on.
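If lxml is not available, similar lookups can be approximated with the standard library. `xml.etree.ElementTree` supports only a subset of XPath (predicates such as `[@type='synonym']` work, but `/@type` and `text()` steps do not), so attributes and text are read from the matched elements instead. A minimal sketch: the inline `XML` string is a trimmed copy of the entry from the question, and `extract` is a name chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the entry from the question, inlined so the sketch is self-contained.
XML = """<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
</cell-line>
</cellosaurus>"""


def extract(root):
    """Return one dict per <cell-line>, with synonyms collected into a list."""
    rows = []
    for cell in root.iter("cell-line"):
        accession = cell.find(".//accession")
        rows.append({
            "category": cell.get("category"),
            "created": cell.get("created"),
            "last_updated": cell.get("last_updated"),
            # Guard against entries that have no <accession> element.
            "accession type": accession.get("type") if accession is not None else None,
            # findtext returns the text of the first match, or None.
            "name type identifier": cell.findtext(".//name[@type='identifier']"),
            "name type synonym": [n.text for n in cell.findall(".//name[@type='synonym']")],
        })
    return rows


rows = extract(ET.fromstring(XML))
print(rows)
```

Because each `<cell-line>` is visited once and the lookups are scoped to that element, every entry yields exactly one row.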

呼唤远方

Another approach. I recently compared several XML parsing libraries and found this one easy to use. I recommend it.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''your xml above'''
# xml = utils.getFileContent('your file name.xml')
results = []
doc = SimplifiedDoc(xml)
for ele in doc.selects('cell-line'):
  key_values = {}
  for k in ele:
    if k not in ['tag','html']:
      key_values[k]=ele[k]
  key_values['name type identifier'] = ele.select('name@type="identifier">text()')
  key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
  results.append(key_values)
print (results)

Result:

[{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]