Extracting text between tags that contain a given word, using Python

I have some text from an XML document, and I am trying to extract the text of the tags that contain certain words.


For example, the call below:


search('adverse')

should return the text of every tag that contains the word "adverse":


Out: 

  [

    "<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"

  ]

and search('clinical')


should return two results, because two tags contain that word.


Out: 

  [

    "<title>6.1 Clinical Trials Experience</title>", 

    "<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"

  ]

Which tools should I use for this? Regular expressions? BS4? Any suggestions are much appreciated.


Sample text:


 </highlight>

 </excerpt>

 <component>

 <section id="ID40">

 <id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>

 <title>6.1 Clinical Trials Experience</title>

 <text>

 <paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>

 <list id="ID42" listtype="unordered" stylecode="Disc">

 <item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>
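
One possible approach, as a minimal sketch using only the standard library `re` module: match "leaf" elements (an opening tag, text with no nested tags, and the matching closing tag) and keep those whose text contains the word. Several things here are assumptions rather than part of the question: `search` takes the XML text as a second argument, matching is case-insensitive and whole-word (so 'clinical' also finds "Clinical"), and attribute values are assumed not to contain `>`. The regex is used because the sample is a fragment rather than a well-formed document; for nested or messier markup, a real parser such as BeautifulSoup is likely to be more robust.

```python
import re

# A "leaf" element: opening tag, text containing no nested tags, matching closing tag.
# Assumption: attribute values never contain ">".
LEAF_ELEMENT = re.compile(r"<(\w+)\b[^>]*>[^<]*</\1>")


def search(word, text):
    """Return leaf elements (tags included) whose text contains `word`.

    Matching is case-insensitive and whole-word; both are assumptions.
    """
    word_pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    results = []
    for match in LEAF_ELEMENT.finditer(text):
        element = match.group(0)
        # Inspect only the text between the tags, not the attribute values.
        inner = element[element.index(">") + 1 : element.rindex("<")]
        if word_pattern.search(inner):
            results.append(element)
    return results


# Abbreviated, illustrative sample (not the full text from the question).
sample = """
<section id="ID40">
<title>6.1 Clinical Trials Experience</title>
<text>
<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin</paragraph>
<item>The most common adverse reactions reported</item>
"""

print(search("adverse", sample))
# → ['<item>The most common adverse reactions reported</item>']
```

Because only leaf elements are matched, container tags such as `<section>` or `<text>` are never returned, even though their descendants contain the word. That matches the expected output in the question, where only `<title>`, `<paragraph>`, and `<item>` appear.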


跃然一笑