Getting only some <p> tags from a website with BeautifulSoup

I'm trying to get the text from only selected tags, for example:


<div class="article-container">
  <p>tekst 1</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 2</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 3</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 4</p> <!-- this tag -->
</div>

I want to get 'tekst 1 tekst 2 tekst 3 tekst 4' (the real text in the tags is completely different; 'tekst 1' and so on are just placeholders).


My simple Python function looks like this:


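import requests
from bs4 import BeautifulSoup

def get_article(url):
    # Download the page and parse it
    page = requests.get(str(url))
    soup = BeautifulSoup(page.text, 'html.parser')

    # .text on the container returns the text of every descendant tag
    article = soup.find(class_='article-container')
    article_only = article.text

    return article_only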

But it returns the entire text. Is there a way to use BS to get only the selected elements, as in the example above?


繁花不似锦
3 Answers

狐的传说

So you only need elements 1, 3, 5 and 7. You can do it like this:

Code:

from bs4 import BeautifulSoup as soup

html = """
<div class="article-intro">
<p>tekst 1</p>
<p>none</p>
<p>tekst 2</p>
<p>none</p>
<p>tekst 3</p>
<p>none</p>
<p>tekst 4</p>
</div>
"""

page = soup(html, 'html.parser')
div = page.find('div', {'class': 'article-intro'})
ps = div.find_all('p')

# Indices 0, 2, 4, 6 are the 1st, 3rd, 5th and 7th <p>
for i in range(len(ps)):
    if i % 2 == 0:
        print(ps[i].text)

Output:

tekst 1
tekst 2
tekst 3
tekst 4
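The index-based idea above can also be written with a list slice and folded into a function that returns the single space-joined string the question asks for. This is only a sketch: the function name `alternate_paragraph_text` and its `container_class` parameter are made up for illustration, and it assumes the wanted paragraphs are always the 1st, 3rd, 5th and so on inside the container.

```python
from bs4 import BeautifulSoup


def alternate_paragraph_text(html, container_class="article-container"):
    """Return the text of every other <p> in the container, joined by spaces.

    Hypothetical helper: assumes the wanted paragraphs sit at
    positions 1, 3, 5, ... inside the container.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(class_=container_class)
    if container is None:
        # No matching container on the page
        return ""
    paragraphs = container.find_all("p")
    # [::2] keeps indices 0, 2, 4, ... (the 1st, 3rd, 5th, ... paragraph)
    return " ".join(p.get_text(strip=True) for p in paragraphs[::2])


sample_html = """
<div class="article-container">
  <p>tekst 1</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 2</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 3</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 4</p> <!-- this tag -->
</div>
"""

result = alternate_paragraph_text(sample_html)
print(result)
```

To use this with a live page, the `page.text` returned by `requests.get` in the question's `get_article` could be passed in as `html`.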

米脂

Use the regular-expression module re and search on the text.

from bs4 import BeautifulSoup
import re

html = '''
<div class="article-intro">
<p>tekst 1</p>
<p>none</p>
<p>tekst 2</p>
<p>none</p>
<p>tekst 3</p>
<p>none</p>
<p>tekst 4</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the argument that older bs4 versions called text=
for item in soup.find('div', class_='article-intro').find_all('p', string=re.compile('tekst')):
    print(item.text)

Output:

tekst 1
tekst 2
tekst 3
tekst 4

Alternatively, you can use a Python lambda function.

from bs4 import BeautifulSoup

html = '''
<div class="article-intro">
<p>tekst 1</p>
<p>none</p>
<p>tekst 2</p>
<p>none</p>
<p>tekst 3</p>
<p>none</p>
<p>tekst 4</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Keep only <p> tags whose text contains 'tekst'
for item in soup.find('div', class_='article-intro').find_all(lambda tag: tag.name == 'p' and 'tekst' in tag.text):
    print(item.text)

Output:

tekst 1
tekst 2
tekst 3
tekst 4

qq_花开花谢_0

Here are a few different options, depending on what you actually want to do. These use bs4 4.7.1.

from bs4 import BeautifulSoup as bs

html = '''
<div class="article-container">
  <p>tekst 1</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 2</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 3</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 4</p> <!-- this tag -->
</div>
'''

soup = bs(html, 'lxml')

# odd indices
items = [item.text for item in soup.select('.article-container p:nth-child(odd)')]
print(items)

# excluding "none"
items = [item.text for item in soup.select('.article-container p:not(:contains("none"))')]
print(items)

# including "tekst"
items = [item.text for item in soup.select('.article-container p:contains("tekst")')]
print(items)

# providing an explicit nth list
items = [item.text for item in soup.select('.article-container p:nth-of-type(1), .article-container p:nth-of-type(3), .article-container p:nth-of-type(5), .article-container p:nth-of-type(7)')]
print(items)
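A note for readers on newer installs: `:contains` is a Soup Sieve extension rather than standard CSS, and Soup Sieve 2.1 and later prefer the spelling `:-soup-contains`, warning about the older alias. The sketch below repeats two of the selectors with the newer spelling. It assumes a recent bs4 with Soup Sieve 2.1 or later, and it uses the built-in `html.parser` only so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = """
<div class="article-container">
  <p>tekst 1</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 2</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 3</p> <!-- this tag -->
  <p>none</p>
  <p>tekst 4</p> <!-- this tag -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Paragraphs whose text contains "tekst" (assumes Soup Sieve >= 2.1)
with_tekst = [
    p.get_text()
    for p in soup.select('.article-container p:-soup-contains("tekst")')
]

# Paragraphs whose text does not contain "none"
without_none = [
    p.get_text()
    for p in soup.select('.article-container p:not(:-soup-contains("none"))')
]

print(with_tekst)
print(without_none)
```

Both lists should hold the same four paragraphs for this sample; on real pages the two selectors can differ, so which one fits depends on whether the wanted or the unwanted paragraphs are easier to describe.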