Strip some tags and rename them

Using the lxml library, I have this XML file, doc.xml, and I want to strip some tags and rename them:


<html>

    <body>

        <h5>Fruits</h5>

        <div>This is some <span attr="foo">Text</span>.</div>

        <div>Some <span>more</span> text.</div>

        <h5>Vegetables</h5>

        <div>Yet another line <span attr="bar">of</span> text.</div>

        <div>This span will get <span attr="foo">removed</span> as well.</div>

        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>

        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>

    </body>

</html>

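Instead of html and body, I want everything wrapped in a p tag, and instead of h5 and each div sitting side by side, each h5 should wrap the div elements that follow it, as in the example below. My question is: how can I use lxml to convert the first format into the following one?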


<p>

<h5 title='Fruits'> 

<div>This is some <span attr='foo'>Text</span>.</div>

<div>Some <span>more</span> text.</div>

</h5>

<h5 title='Vegetables'>

<div>Yet another line <span attr='bar'>of</span> text.</div>

....

</h5>

</p>

Using lxml, this is how I strip the tags:


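from lxml import etree

# parse the file directly; etree.tostring() expects an element, not a file name
tree = etree.parse("doc.xml")

root = tree.getroot()

# strip_tags removes the <body> tag itself but keeps its children and text
etree.strip_tags(root, 'body')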

Does anyone have any ideas on this?


慕森卡

皈依舞

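Create a new document that contains only a <p> tag. Iterate over the elements under the <body> tag of the original document. When you encounter an <h5> tag, add a new <h5> tag to the <p> tag; then add each subsequent tag from the original document to the new document as a descendant of that <h5>, until the next <h5> is reached.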
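The steps above can be sketched roughly as follows. This is only one possible reading of the answer, not the answerer's own code: the function name `regroup` and the inline `SOURCE` sample are made up for illustration, it assumes a `<body>` element exists, it walks only the direct children of `<body>`, and it silently drops anything that appears before the first `<h5>`.

```python
import copy

from lxml import etree

# Shortened copy of doc.xml from the question, inlined so the sketch is self-contained.
SOURCE = """<html>
    <body>
        <h5>Fruits</h5>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <h5>Vegetables</h5>
        <div>Yet another line <span attr="bar">of</span> text.</div>
    </body>
</html>"""


def regroup(root):
    """Build a new <p> tree in which each <h5> wraps the elements that follow it.

    Hypothetical helper: assumes root has a <body> child.
    """
    new_root = etree.Element("p")
    current = None  # the new <h5> that is currently collecting elements
    for child in root.find("body"):  # direct children of <body> only
        if child.tag == "h5":
            # Start a new group; the old heading text becomes the title attribute.
            title = (child.text or "").strip()
            current = etree.SubElement(new_root, "h5", title=title)
        elif current is not None:
            # Deep-copy so the original document is left untouched.
            current.append(copy.deepcopy(child))
    return new_root


result = regroup(etree.fromstring(SOURCE))
print(etree.tostring(result, pretty_print=True).decode())
```

Copying instead of moving the elements keeps the source tree intact; if that does not matter, `current.append(child)` would move them instead (in that case iterate over `list(body)` so the loop is not disturbed by the removal).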

至尊宝的传说

Here is an XSLT solution using lxml. It offloads the processing to the underlying C libraries (libxml2 and libxslt). I added comments to the transformation stylesheet:

from lxml import etree

xsl = etree.XML('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
        <p>
            <xsl:apply-templates select="html/body"/>
        </p>
    </xsl:template>

    <!-- match body, but do not add content; this excludes /html/body elements -->
    <xsl:template match="body">
        <xsl:apply-templates />
    </xsl:template>

    <xsl:template match="h5">
        <!-- record the current h5 title -->
        <xsl:variable name="title" select="."/>
        <h5>
            <xsl:attribute name="title">
                <xsl:value-of select="$title" />
            </xsl:attribute>
            <xsl:for-each select="following-sibling::div[preceding-sibling::h5[1] = $title]">
                <!-- deep copy of each consecutive div following the current h5 element -->
                <xsl:copy-of select="." />
            </xsl:for-each>
        </h5>
    </xsl:template>

    <!-- match div, but do not output anything since we are copying it into the new h5 element -->
    <xsl:template match="div" />
</xsl:stylesheet>
''')

transform = etree.XSLT(xsl)

with open("doc.xml") as f:
    print(transform(etree.parse(f)), end='')

Note that the div elements are matched to their heading by comparing the text of the nearest preceding h5 with $title, so this relies on the h5 titles being unique.

If the stylesheet is stored in a file named doc.xsl, the same result can be obtained with the xsltproc utility (shipped with libxslt):

xsltproc doc.xsl doc.xml

Result:

<?xml version="1.0"?>
<p>
  <h5 title="Fruits">
    <div>This is some <span attr="foo">Text</span>.</div>
    <div>Some <span>more</span> text.</div>
  </h5>
  <h5 title="Vegetables">
    <div>Yet another line <span attr="bar">of</span> text.</div>
    <div>This span will get <span attr="foo">removed</span> as well.</div>
    <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
    <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
  </h5>
</p>