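Problem traversing an XML tree with Python's xml.etree.ElementTree

I have an XML file structured as shown below (simplified for the purposes of this question). For each record, I want to extract the article title and the DOI, i.e. the text of the "ArticleId" element whose "IdType" attribute is "doi" (sometimes this is missing), and then store the article titles in a dictionary keyed by DOI.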


<PubmedArticleSet>
<PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
        <Article PubModel="Print-Electronic">
            <ArticleTitle>Malathion and dithane induce DNA damage in Vicia faba.</ArticleTitle>
        </Article>
    </MedlineCitation>
    <PubmedData>
        <ArticleIdList>
            <ArticleId IdType="pubmed">28950791</ArticleId>
            <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>
</PubmedArticleSet>


To do this, I used xml.etree.ElementTree as follows:


import xml.etree.ElementTree as ET

xmldoc = ET.parse('sample.xml')
root = xmldoc.getroot()
pubs = {}

for elem in xmldoc.iter(tag='ArticleTitle'):
    title = elem.text
    for subelem in xmldoc.iter(tag='ArticleId'):
        if subelem.get("IdType") == "doi":
            doi = subelem.text
            pubs[doi] = title

if len(pubs) == 0:
    print("No articles found")
else:
    for pub in pubs.keys():
        print(pub + ' ' + pubs[pub])

But something is wrong with the loops that walk the document tree, because the code above produces:


10.1177/0748233717726877 [Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].

10.1016/j.crvi.2015.02.001 [Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].

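That is, I get the correct DOIs, but each one is paired with a copy of the last article's title, and that article has no DOI!


The correct output should be: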


10.1177/0748233717726877 Malathion and dithane induce DNA damage in Vicia faba.

10.1016/j.crvi.2015.02.001 Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.

Can anyone give me some hints for solving this annoying problem?
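
One likely cause, offered as a sketch rather than a confirmed fix: both loops call iter() on the whole document, so for every title the inner loop revisits every DOI in the file and overwrites its entry. After the final title has been processed, every DOI maps to that last title. A possible way around this is to iterate over each PubmedArticle element and search only inside it. In the sketch below, the function name titles_by_doi and the inline SAMPLE string are illustrative assumptions. Only the first record is copied from the XML above; the other two reuse the titles and DOI from the outputs shown, with their remaining fields omitted and a placeholder PubMed ID.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for sample.xml. Records 2 and 3 are reduced to the fields
# visible in the outputs above; "00000000" is a placeholder, not a real ID.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
      <Article PubModel="Print-Electronic">
        <ArticleTitle>Malathion and dithane induce DNA damage in Vicia faba.</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">28950791</ArticleId>
        <ArticleId IdType="doi">10.1177/0748233717726877</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>Impact of dual inoculation with Rhizobium and PGPR on growth and antioxidant status of Vicia faba L. under copper stress.</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="doi">10.1016/j.crvi.2015.02.001</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>[Influence of Four Kinds of PPCPs on Micronucleus Rate of the Root-Tip Cells of Vicia-faba and Garlic].</ArticleTitle>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">00000000</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def titles_by_doi(root):
    """Map each DOI to the title found in the same PubmedArticle record."""
    pubs = {}
    for article in root.iter('PubmedArticle'):
        # Search only within this record, not the whole document.
        title = article.findtext('.//ArticleTitle')
        for article_id in article.iter('ArticleId'):
            if article_id.get('IdType') == 'doi':
                pubs[article_id.text] = title
    return pubs


root = ET.fromstring(SAMPLE)
pubs = titles_by_doi(root)

if not pubs:
    print("No articles found")
else:
    for doi, title in pubs.items():
        print(f"{doi} {title}")
```

Records without a DOI are simply skipped here. To read from the file instead of the inline string, ET.parse('sample.xml').getroot() could be passed to the same function.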


aluckdog