Using Python iterparse for large XML files

I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I want to use iterparse in lxml to do it.


My file has the following format:


<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

My solution so far is:


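from lxml import etree

# MYFILE is the path (or file object) of the XML file to parse.
# With tag='item', only events for <item> elements are reported.
context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    # The child element is named <desc> in the sample above.
    print(elem.xpath('desc/text()'))

del context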

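Unfortunately, this solution still consumes a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up the emptied children. Can anyone suggest what I should do after processing each element so that it is cleaned up properly?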


侃侃尔雅
3 answers

隔江千里

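iterparse() lets you do things while the tree is being built, which means that unless you remove the parts of the tree you no longer need, you still end up with the whole tree in memory at the end. For more information, read the article on this by the author of the original ElementTree implementation (it also applies to lxml).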
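To make the point about removing finished parts of the tree concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml. It assumes that every <item> is a direct child of the root element; the function name iter_descriptions and the in-memory sample are hypothetical and exist only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_descriptions(source):
    """Yield the <desc> text of each <item>, discarding items once handled."""
    # Ask for "start" events too, so the root element can be captured first.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            yield elem.findtext("desc")
            # Assumption: items are direct children of the root, so clearing
            # the root drops the finished <item> and keeps the tree small.
            root.clear()


# Hypothetical sample data; a real run would pass a file path instead.
sample = (
    b"<root>"
    b"<item><title>Item 1</title><desc>Description 1</desc></item>"
    b"<item><title>Item 2</title><desc>Description 2</desc></item>"
    b"</root>"
)
descriptions = list(iter_descriptions(io.BytesIO(sample)))
print(descriptions)
```

With lxml the idea is the same: after handling each element, call elem.clear() and also remove the already-processed siblings from the parent, since cleared elements left attached to the root still accumulate.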

万千封印

In my experience, iterparse with or without element.clear (see F. Lundh and L. Daly) cannot always cope with very large XML files: it runs fine for a while, then memory consumption suddenly goes through the roof and a memory error occurs or the system crashes. If you run into the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh, or the following example, which uses the OP's XML snippet (plus two umlauts to check that there are no encoding issues):

import xml.parsers.expat
from collections import deque

def iter_xml(inpath: str, outpath: str) -> None:
    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        # Write character data only when it sits directly inside <desc>.
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            # Escape backslashes and newlines so each value stays on one line.
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        # Pop open tags until the one being closed has been removed.
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()  # stack of currently open element names
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding='utf-8') as outfile:
            parser.ParseFile(infile)

iter_xml('input.xml', 'output.txt')

input.xml:

<root>
    <item>
    <title>Item 1</title>
    <desc>Description 1ä</desc>
    </item>
    <item>
    <title>Item 2</title>
    <desc>Description 2ü</desc>
    </item>
</root>

output.txt:

Description 1ä
Description 2ü