
How to iteratively parse a huge XML file with Python and an XSLT file and write it to CSV

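I'm having trouble using XSLT to flatten a large XML file and convert it to CSV.

Currently, I use lxml to parse a nested XML file, apply an XSL file to flatten it, and then write the output to a CSV file.

My XML looks like this: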


<root>
    <level1>
        <level2>
            <topid>1</topid>
            <level3>
                <subtopid>1</subtopid>
                <level4>
                    <subid>1</subid>
                    <descr>test</descr>
                </level4>
                <level4>
                    <subid>2</subid>
                    <descr>test2</descr>
                </level4>
                ...
            </level3>
            ...
        </level2>
    </level1>
</root>

I want to end up with the following CSV file:


topid,subtopid,subid,descr
1,1,1,test
1,1,2,test2
....

My XSLT:


<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" use-character-maps="map"/>
<xsl:character-map name="map">
    <xsl:output-character character="," string=" "/>
</xsl:character-map>

<xsl:strip-space elements="*"/>
<xsl:variable name="delimiter" select="','"/>
<xsl:variable name="newline" select="'&#xd;'" />

<xsl:template match="/root">
    <xsl:text>topid,subtopid,subid,descr</xsl:text>
    <xsl:value-of select="$newline" />

    <xsl:for-each select="level1/level2/level3/level4">
        <xsl:value-of select="ancestor::root/level1/level2/topid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="ancestor::root/level1/level2/level3/subtopid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="subid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="descr" />
        <xsl:value-of select="$newline" />
    </xsl:for-each>
</xsl:template>
</xsl:stylesheet>


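This works well for small files, but now I want to do the same for an XML file of roughly 2.5 GB. Using etree.parse loads the whole file into memory, which obviously doesn't work for larger files.

I'd like to iterate somehow, so that I don't load the XML file into memory and instead write the CSV rows one at a time, while still using the XSLT for the transformation. I'm using an XSLT file because it's the only way I (currently) know to flatten a nested XML file.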


陪伴而非守候
3 answers

至尊宝的传说

I'd rather use XSLT 3.0 (or even 2.0!) from Python, but I haven't had time to figure out how to use Saxon/C.

Another option is to use iterparse().

Example...

XML input (fixed to be well-formed, with a second level3 added for testing)

<root>
    <level1>
        <level2>
            <topid>1</topid>
            <level3>
                <subtopid>1</subtopid>
                <level4>
                    <subid>1</subid>
                    <descr>test</descr>
                </level4>
                <level4>
                    <subid>2</subid>
                    <descr>test2</descr>
                </level4>
            </level3>
            <level3>
                <subtopid>2</subtopid>
                <level4>
                    <subid>1</subid>
                    <descr>test</descr>
                </level4>
                <level4>
                    <subid>2</subid>
                    <descr>test2</descr>
                </level4>
            </level3>
        </level2>
    </level1>
</root>

Python

from lxml import etree
import csv

context = etree.iterparse("test.xml", events=("start", "end"))

fields = ("topid", "subtopid", "subid", "descr")

with open("test.csv", "w", newline="", encoding="utf8") as xml_data_to_csv:

    csv_writer = csv.DictWriter(xml_data_to_csv, fieldnames=fields,
                                delimiter=",", quoting=csv.QUOTE_MINIMAL)

    csv_writer.writeheader()

    # Most recent values seen for the enclosing level2 and level3.
    topid = None
    subtopid = None
    values = {}

    for event, elem in context:
        tag = elem.tag
        text = elem.text

        if tag == "topid" and text:
            topid = text

        if tag == "subtopid" and text:
            subtopid = text

        if tag == "subid" and text:
            values["subid"] = text

        if tag == "descr" and text:
            values["descr"] = text

        if event == "start" and tag == "level4":
            # Build a dict containing all of the "fields" with default values of "Unknown".
            values = {key: "Unknown" for key in fields}

        if event == "end" and tag == "level4":
            # The row is complete: add the inherited ids and write it out.
            values["topid"] = topid
            values["subtopid"] = subtopid
            csv_writer.writerow(values)

        # Drop the element's content so memory use stays low.
        elem.clear()

CSV output

topid,subtopid,subid,descr
1,1,1,test
1,1,2,test2
1,2,1,test
1,2,2,test2

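As a variation on the iterparse() idea above, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of lxml (that swap is an assumption, made so the example runs without extra installs). The function name flatten_to_csv is hypothetical. It only acts on "end" events, when an element's text is guaranteed to be complete, and detaches each finished level2/level3/level4 from its parent so the in-memory tree does not grow with the file.

```python
import csv
import xml.etree.ElementTree as ET

FIELDS = ("topid", "subtopid", "subid", "descr")


def flatten_to_csv(source, out):
    """Stream `source` (a path or binary file object) and write one CSV row
    per <level4> element to the text file object `out`."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)

    topid = subtopid = ""
    stack = []  # currently open elements, innermost last

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue

        stack.pop()  # elem is now closed; stack[-1] (if any) is its parent

        if elem.tag == "topid":
            topid = elem.text or ""
        elif elem.tag == "subtopid":
            subtopid = elem.text or ""
        elif elem.tag == "level4":
            writer.writerow([
                topid,
                subtopid,
                elem.findtext("subid", ""),
                elem.findtext("descr", ""),
            ])

        # Detach finished containers so they can be garbage-collected.
        if elem.tag in ("level4", "level3", "level2") and stack:
            stack[-1].remove(elem)
```

For a real file this could be called as flatten_to_csv("test.xml", f), where f was opened with open("test.csv", "w", newline="", encoding="utf8"). The csv module takes care of quoting values that contain commas, so the character map from the stylesheet is not needed here.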
holdtom

One possibility is to use XSLT 3.0 streaming. There are two challenges here:

(a) Making your code streamable. Without seeing the stylesheet code, we can't tell how hard that is.

(b) Installing and running a streaming XSLT 3.0 processor. This depends on how locked in you are to the Python environment. If it has to be done in Python, you could try installing Saxon/C. An alternative is to call out to a different environment, in which case you have more options; for example, you could run Saxon-EE on Java.

Later: looking at the code you posted, it's rather odd:

<xsl:for-each select="level1/level2/level3/level4">
    <xsl:value-of select="ancestor::root/level1/level2/topid" />

I suspect you want to output the topid of the "current" level2 element, but that isn't what this does (in XSLT 1.0 it will print the value of the first level2/topid; in XSLT 2.0+ it will print the values of all the level2/topid elements). I suspect you really want something like this:

    <xsl:for-each select="level1/level2/level3/level4">
        <xsl:value-of select="ancestor::level2/topid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="ancestor::level3/subtopid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="subid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="descr" />
        <xsl:value-of select="$newline" />
    </xsl:for-each>

That is almost streamable, but not quite. Streaming doesn't allow you to go back to the topid and subtopid elements. The easiest way to make it streamable is probably to keep the most recent values of these elements in accumulators:

<xsl:accumulator name="topid" as="xs:string" initial-value="''">
  <xsl:accumulator-rule match="topid/text()" select="string(.)"/>
</xsl:accumulator>

<xsl:accumulator name="subtopid" as="xs:string" initial-value="''">
  <xsl:accumulator-rule match="subtopid/text()" select="string(.)"/>
</xsl:accumulator>

and then access the values:

    <xsl:for-each select="level1/level2/level3/level4">
        <xsl:value-of select="accumulator-before('topid')" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="accumulator-before('subtopid')" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="subid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="descr" />
        <xsl:value-of select="$newline" />
    </xsl:for-each>
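For the "call out to a different environment" route in (b), a rough sketch of driving Saxon on Java from Python could look like the following. It assumes Java is on the PATH, that a licensed Saxon-EE jar is available (the path saxon-ee.jar and the file names are hypothetical), and that the stylesheet has already been made streamable. The function names are made up for illustration; net.sf.saxon.Transform with the -s:, -xsl: and -o: options is Saxon's standard command-line entry point.

```python
import subprocess


def build_saxon_command(jar, source_xml, stylesheet, output_csv):
    """Return the argument list for Saxon's command-line transformer."""
    return [
        "java", "-cp", jar, "net.sf.saxon.Transform",
        f"-s:{source_xml}",     # source document
        f"-xsl:{stylesheet}",   # stylesheet to apply
        f"-o:{output_csv}",     # where the text output is written
    ]


def run_transform(jar, source_xml, stylesheet, output_csv):
    """Run the transformation; raises CalledProcessError if Saxon fails."""
    subprocess.run(
        build_saxon_command(jar, source_xml, stylesheet, output_csv),
        check=True,
    )

# Example call (hypothetical file names):
# run_transform("saxon-ee.jar", "big.xml", "flatten.xsl", "out.csv")
```

Since Saxon writes the output file itself, the Python side never holds the XML or the CSV in memory.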

守着星空守着你

Saxon/C and Python can work together: one user has successfully used Boost.Python to interface with the C++ library. Another user has built the interface a different way: https://github.com/ajelenak/pysaxon