Python BeautifulSoup: extract the text that follows a specific tag

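I am trying to extract information from a web page using BeautifulSoup and Python. I want to extract the information that sits directly below a specific tag. To know whether it is the right tag, I want to compare its text, and then extract the text of the immediately following tag.

For example, suppose the following is part of the HTML page source: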


<div class="row">

    ::before

    <div class="four columns">

        <p class="title">Procurement type</p>

        <p class="data strong">Services</p>

    </div>

  <div class="four columns">

      <p class="title">Reference</p>

      <p class="data strong">ANAJSKJD23423-Commission</p>

  </div>

  <div class="four columns">

      <p class="title">Funding Agency</p>

      <p class="data strong">Health Commission</p>

  </div>

  ::after

</div>

<div class="row">

    ::before

    ::after

</div>

<hr>

<div class="row">

    ::before

    <div class="twelve columns">

        <p class="title">Countries</p>

        <p class="data strong">

            <span class>Belgium</span>

            ", "

            <span class>France</span>

            ", "

            <span class>Luxembourg</span>

        </p>

        <p></p>

    </div>

    ::after

</div>

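I want to check whether a <p class="title"> has the text Procurement type, and if so print Services.

Likewise, if a <p class="title"> has the text Reference, I want to print ANAJSKJD23423-Commission, and if it has the text Countries, I want to print all of the countries, i.e. Belgium, France, Luxembourg.


I know I could extract the text of every <p class="data strong">, append it to a list, and then fetch the values by index. The problem is that the order in which the <p class="title"> elements occur is not fixed: in some places Countries may appear before Procurement type. So I want to check the text value and then extract the text of the immediately following tag. I am still new to BeautifulSoup, so any help is appreciated. Thanks.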


largeQ
3 Answers

慕标5832272

There are many ways to do this. Here is one:

    from bs4 import BeautifulSoup

    htmldata = '''<div class="row">
        ::before
        <div class="four columns">
            <p class="title">Procurement type</p>
            <p class="data strong">Services</p>
        </div>
      <div class="four columns">
          <p class="title">Reference</p>
          <p class="data strong">ANAJSKJD23423-Commission</p>
      </div>
      <div class="four columns">
          <p class="title">Funding Agency</p>
          <p class="data strong">Health Commission</p>
      </div>
      ::after
    </div>
    <div class="row">
        ::before
        ::after
    </div>
    <hr>
    <div class="row">
        ::before
        <div class="twelve columns">
            <p class="title">Countries</p>
            <p class="data strong">
                <span class>Belgium</span>
                ", "
                <span class>France</span>
                ", "
                <span class>Luxembourg</span>
            </p>
            <p></p>
        </div>
        ::after
    </div>'''

    soup = BeautifulSoup(htmldata, 'html.parser')

    # Collect every title paragraph, then look at the <p> that follows it.
    items = soup.find_all('p', class_='title')
    for item in items:
        if ('Procurement type' in item.text) or ('Reference' in item.text):
            # find_next() is the current name of the legacy findNext() alias
            print(item.find_next('p').text)
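Because the order of the title/value pairs is not fixed, one possible extension of the loop above is to collect every pair into a dictionary keyed by the title text and look values up by name. This is only a minimal sketch and not part of the original answer: the helper name title_to_value and the shortened sample HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the question's HTML (an assumption for illustration).
# The pairs are deliberately listed in a different order than in the question.
sample_html = '''
<div class="row">
  <div class="four columns">
    <p class="title">Reference</p>
    <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
    <p class="title">Procurement type</p>
    <p class="data strong">Services</p>
  </div>
</div>
'''


def title_to_value(html):
    """Map each <p class="title"> text to the text of its next <p> sibling."""
    soup = BeautifulSoup(html, 'html.parser')
    result = {}
    for title in soup.find_all('p', class_='title'):
        value = title.find_next_sibling('p')
        if value is not None:  # skip titles that have no following <p>
            result[title.get_text(strip=True)] = value.get_text(' ', strip=True)
    return result


values = title_to_value(sample_html)
print(values['Procurement type'])
print(values['Reference'])
```

Looking values up by title means the code no longer depends on where each block appears in the page.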

Qyouu

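You can also use the :contains pseudo-class, available with bs4 4.7.1. I pass all the conditions as one selector list here, but you could split each condition into its own query.

    from bs4 import BeautifulSoup as bs
    import re

    html = 'yourHTML'
    soup = bs(html, 'lxml')

    # "+ p" selects the <p> element immediately following each matching title.
    items = [
        re.sub(r'\n\s+', '', item.text.strip())
        for item in soup.select(
            'p.title:contains("Procurement type") + p, '
            'p.title:contains(Reference) + p, '
            'p.title:contains(Countries) + p'
        )
    ]
    print(items)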
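Newer Soup Sieve releases (2.1 and later) spell this pseudo-class :-soup-contains and deprecate the bare :contains form. The sketch below shows the same adjacent-sibling selector built from a list of labels. It assumes such a Soup Sieve version is installed; the sample HTML and the names labels, selector and matched are made up for illustration, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Small stand-in for the page in the question (an assumption for illustration).
sample_html = '''
<div>
  <p class="title">Countries</p>
  <p class="data strong">Belgium</p>
</div>
<div>
  <p class="title">Procurement type</p>
  <p class="data strong">Services</p>
</div>
<div>
  <p class="title">Funding Agency</p>
  <p class="data strong">Health Commission</p>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# Build one selector list: for each wanted label, the <p> right after its title.
labels = ('Procurement type', 'Countries')
selector = ', '.join(
    f'p.title:-soup-contains("{label}") + p' for label in labels
)

matched = [p.get_text(strip=True) for p in soup.select(selector)]
print(matched)
```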

江户川乱折腾

You can add a text check as an argument when you call .find() or .find_all(), and then use .next_sibling or find_next() to grab the next tag that holds the content. For example:

    soup.find('p', {'class': 'title'}, string='Procurement type')

Given:

    html = '''<div class="row">
        ::before
        <div class="four columns">
            <p class="title">Procurement type</p>
            <p class="data strong">Services</p>
        </div>
      <div class="four columns">
          <p class="title">Reference</p>
          <p class="data strong">ANAJSKJD23423-Commission</p>
      </div>
      <div class="four columns">
          <p class="title">Funding Agency</p>
          <p class="data strong">Health Commission</p>
      </div>
      ::after
    </div>
    <div class="row">
        ::before
        ::after
    </div>
    <hr>
    <div class="row">
        ::before
        <div class="twelve columns">
            <p class="title">Countries</p>
            <p class="data strong">
                <span class>Belgium</span>
                ", "
                <span class>France</span>
                ", "
                <span class>Luxembourg</span>
            </p>
            <p></p>
        </div>
        ::after
    </div>'''

you can do this:

    from bs4 import BeautifulSoup, Tag

    soup = BeautifulSoup(html, 'html.parser')

    # string= was called text= before Beautiful Soup 4.4.0
    alpha = soup.find('p', {'class': 'title'}, string='Procurement type')
    for sibling in alpha.next_siblings:
        # next_siblings also yields whitespace text nodes; keep only real tags
        if isinstance(sibling, Tag):
            print(sibling.text)

Output:

    Services

Or:

    ref = soup.find('p', {'class': 'title'}, string='Reference')
    for sibling in ref.next_siblings:
        if isinstance(sibling, Tag):
            print(sibling.text)

Output:

    ANAJSKJD23423-Commission

Or:

    countries = soup.find('p', {'class': 'title'}, string='Countries')
    # Drop the literal '", "' separators, then split the remaining text by line
    names = countries.find_next('p', {'class': 'data strong'}).text.replace('", "', '').strip().split('\n')
    # Discard empty and whitespace-only lines
    names = [name.strip() for name in names if name.strip()]
    for country in names:
        print(country)

Output:

    Belgium
    France
    Luxembourg
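The three variants above can be folded into one lookup that works for single values and for the multi-value Countries block alike. The following is a rough sketch rather than the answerer's own method: the helper name values_after and the trimmed sample HTML are assumptions. Instead of stripping the '", "' separators from the text, it reads the span elements directly when there are any.

```python
from bs4 import BeautifulSoup

# Trimmed sample modelled on the question's HTML (an assumption for illustration).
sample_html = '''
<div class="four columns">
    <p class="title">Procurement type</p>
    <p class="data strong">Services</p>
</div>
<div class="twelve columns">
    <p class="title">Countries</p>
    <p class="data strong">
        <span class>Belgium</span>
        ", "
        <span class>France</span>
        ", "
        <span class>Luxembourg</span>
    </p>
    <p></p>
</div>
'''


def values_after(soup, label):
    """Return the value(s) in the <p> that follows the title with this exact text."""
    title = soup.find('p', class_='title', string=label)
    if title is None:
        return []
    data = title.find_next_sibling('p')
    if data is None:
        return []
    spans = data.find_all('span')
    if spans:
        # Multi-value block: one entry per <span>, separators are ignored.
        return [span.get_text(strip=True) for span in spans]
    text = data.get_text(strip=True)
    return [text] if text else []


soup = BeautifulSoup(sample_html, 'html.parser')
countries = values_after(soup, 'Countries')
print(', '.join(countries))
print(values_after(soup, 'Procurement type'))
```

Returning a list in every case keeps the calling code the same whether a title has one value, several, or none.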