Efficiently extracting nested elements from XML in Python

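I am trying to parse a large number of XML files that contain many nested elements, in order to collect specific information for later use. Because there are so many files, I want to do this as efficiently as possible to reduce processing time. I can extract the information I need with xpath, but my approach seems inefficient, in particular having to run a second for loop with another xpath search to pull out the result values. Is there a more efficient way to produce the desired output below? Can I collect the information I need with a single xpath query?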

Desired parsed format:

Id              Object    Type              Result
Packages        total     totalPackages     1200
DeliveryMethod  priority  packagesSent      100
DeliveryMethod  express   packagesSent      200
DeliveryMethod  ground    packagesSent      300
DeliveryMethod  priority  packagesReceived  100
DeliveryMethod  express   packagesReceived  200
DeliveryMethod  ground    packagesReceived  300

Sample XML:

    <?xml version="1.0" encoding="utf-8"?>
    <Data>
        <Location localDn="Chicago"/>
        <Info Id="Packages">
            <job jobId="1"/>
            <Type pos="1">totalPackages</Type>
            <Value Object="total">
                <result pos="1">1200</result>
            </Value>
        </Info>
        <Info Id="DeliveryMethod">
            <job jobId="1"/>
            <Type pos="1">packagesSent</Type>
            <Type pos="2">packagesReceived</Type>
            <Value Object="priority">
                <result pos="1">100</result>
                <result pos="2">100</result>
            </Value>
            <Value Object="express">
                <result pos="1">200</result>
                <result pos="2">200</result>
            </Value>
            <Value Object="ground">
                <result pos="1">300</result>
                <result pos="2">300</result>
            </Value>
        </Info>
    </Data>

Is it possible to get all of this information by iterating over tree.xpath('//*')?


慕斯709654
2 Answers

慕村225694

One optimization is to stop iterating over every tag with tree.xpath('//*') and checking each one with if statements, as you do now. That query can be replaced with tree.xpath('//Type').

The next thing to optimize is the iteration over Value. Instead of running tree.xpath('//Value') over and over, you can fetch only the Value siblings of each Type tag with elem.xpath('./following-sibling::Value').

    from lxml import etree

    # etree.parse accepts a file name directly
    tree = etree.parse('stack_sample.xml')

    for elem in tree.xpath('//Type'):
        _id = elem.getparent().attrib["Id"]
        _type = elem.text
        _position = elem.attrib["pos"]
        # Only the Value elements that follow this Type inside the same Info
        values = elem.xpath('./following-sibling::Value')
        for value in values:
            _object = value.attrib['Object']
            # Pick the result whose pos matches the pos of the current Type
            _result = value.xpath(f'./result[@pos={_position}]/text()')[0]
            print(_id, _type, _object, _result)

This prints:

    Packages totalPackages total 1200
    DeliveryMethod packagesSent priority 100
    DeliveryMethod packagesSent express 200
    DeliveryMethod packagesSent ground 300
    DeliveryMethod packagesReceived priority 100
    DeliveryMethod packagesReceived express 200
    DeliveryMethod packagesReceived ground 300

Edit

The variant below replaces the inner xpath call per Value with two queries per Type and pairs their results with zip. It is a solution for a specific case: it assumes every Value that follows a Type contains exactly one result carrying the same pos as that Type, so the list of Object attributes and the list of result texts line up one to one. Without the pos filter, every Value contributes one result per Type and zip pairs the wrong values (for example, express with 100). Keep in mind that this is a very specific solution, not a general one.

    from lxml import etree

    tree = etree.parse('stack_sample.xml')

    for elem in tree.xpath('//Type'):
        _id = elem.getparent().attrib["Id"]
        _type = elem.text
        _objects = elem.xpath('./following-sibling::Value/@Object')
        # $pos is an XPath variable supplied through the pos keyword argument
        _results = elem.xpath(
            './following-sibling::Value/result[@pos=$pos]/text()',
            pos=elem.attrib["pos"],
        )
        for _object, _result in zip(_objects, _results):
            print(_id, _type, _object, _result)

Output:

    Packages totalPackages total 1200
    DeliveryMethod packagesSent priority 100
    DeliveryMethod packagesSent express 200
    DeliveryMethod packagesSent ground 300
    DeliveryMethod packagesReceived priority 100
    DeliveryMethod packagesReceived express 200
    DeliveryMethod packagesReceived ground 300
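The lookup by pos indexes the query result with [0], which raises IndexError when a Value has no result for a given Type. The following is a minimal sketch of a more tolerant variant, not the answer's own code: it swaps lxml for the standard library's xml.etree.ElementTree so it runs without third-party packages, and the function name rows_by_type and the trimmed SAMPLE string (where express deliberately lacks a pos="2" result) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed input: "express" has no result for pos="2".
SAMPLE = """<Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
        <result pos="1">100</result>
        <result pos="2">100</result>
    </Value>
    <Value Object="express">
        <result pos="1">200</result>
    </Value>
</Info>"""


def rows_by_type(info):
    """Return (Id, Type, Object, Result) tuples for one <Info> element.

    Values that have no <result> for a given pos are skipped instead of
    raising an error.
    """
    rows = []
    for type_elem in info.findall("Type"):
        pos = type_elem.get("pos")
        # All <Value> children of the Info (not only following siblings).
        for value in info.findall("Value"):
            result = value.find(f"result[@pos='{pos}']")
            if result is None:
                continue  # no result recorded for this Type
            rows.append(
                (info.get("Id"), type_elem.text, value.get("Object"), result.text)
            )
    return rows


info = ET.fromstring(SAMPLE)
rows = rows_by_type(info)
for row in rows:
    print(*row)
```

Whether skipping is the right behaviour depends on the data; if a missing result indicates a broken file, raising or logging there may be the better choice.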

茅侃侃

Performance may be better if you don't iterate over all tags (//*) and iterate only over the <Value> elements instead:

    from lxml import etree

    tree = etree.parse('stack_sample.xml')

    for val in tree.xpath('//Value'):
        # Map each pos to the Type name declared in the parent Info
        types = {t.get('pos'): t.text for t in val.getparent().xpath('./Type')}
        for r in val.xpath('./result'):
            print(val.getparent().get('Id'), val.get('Object'), types[r.get('pos')], r.text)

Prints:

    Packages total totalPackages 1200
    DeliveryMethod priority packagesSent 100
    DeliveryMethod priority packagesReceived 100
    DeliveryMethod express packagesSent 200
    DeliveryMethod express packagesReceived 200
    DeliveryMethod ground packagesSent 300
    DeliveryMethod ground packagesReceived 300
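The loop over Value rebuilds the pos-to-Type lookup for every Value, although it only changes from one Info to the next. A possible refinement, sketched here under assumptions rather than taken from the answer, is to work one Info block at a time, build the lookup once per block, and stream the file so that processed blocks can be released. The sketch uses iterparse from the standard library's xml.etree.ElementTree instead of lxml; the name iter_rows and the in-memory SAMPLE are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

# The sample document from the question, held in memory for the demo.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<Data>
    <Location localDn="Chicago"/>
    <Info Id="Packages">
        <job jobId="1"/>
        <Type pos="1">totalPackages</Type>
        <Value Object="total">
            <result pos="1">1200</result>
        </Value>
    </Info>
    <Info Id="DeliveryMethod">
        <job jobId="1"/>
        <Type pos="1">packagesSent</Type>
        <Type pos="2">packagesReceived</Type>
        <Value Object="priority">
            <result pos="1">100</result>
            <result pos="2">100</result>
        </Value>
        <Value Object="express">
            <result pos="1">200</result>
            <result pos="2">200</result>
        </Value>
        <Value Object="ground">
            <result pos="1">300</result>
            <result pos="2">300</result>
        </Value>
    </Info>
</Data>
"""


def iter_rows(source):
    """Yield (Id, Object, Type, Result) tuples, one <Info> block at a time.

    `source` is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Info":
            continue  # wait until a whole <Info> block has been read
        # Build the pos -> Type name lookup once per <Info>.
        types = {t.get("pos"): t.text for t in elem.findall("Type")}
        for value in elem.findall("Value"):
            for result in value.findall("result"):
                yield (
                    elem.get("Id"),
                    value.get("Object"),
                    types[result.get("pos")],
                    result.text,
                )
        elem.clear()  # drop the children of the block that was just processed


rows = list(iter_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(*row)
```

For real files, pass the path instead of the BytesIO object, for example iter_rows('stack_sample.xml'). Whether this is faster than the xpath versions depends on file size and structure, so it is worth timing on representative data.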