How do I extract body paragraphs with BeautifulSoup?

I am trying to use BeautifulSoup to extract text from a website, but I am open to other options. At the moment I am trying something like this:


from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

boston_url = 'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(boston_url, headers=hdr)
webpage = urlopen(req)
htmlText = webpage.read().decode('utf-8')
pageText = BeautifulSoup(htmlText, "html.parser")
body = pageText.find_all(text=True)

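The goal is to work out how to extract the text in the red box. You can see the output I get in the CMD screenshot below. It is very messy, and I am not sure how to find the body paragraphs in it. I could loop over the output and look for certain words, but I need to do this for multiple sites, and I do not know in advance what the body paragraphs contain.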

http://img1.mukewang.com/63a13b1f0001916312210834.jpg

http://img2.mukewang.com/63a13b2900011bc014110740.jpg

呼如林
2 answers

HUX布斯

It may be simpler than what you are doing. Let's try to simplify it:

import requests
from bs4 import BeautifulSoup as bs

boston_url = 'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = requests.get(boston_url, headers=hdr)
soup = bs(req.text, 'lxml')  # the 'lxml' parser requires the lxml package
# first <p> that is a direct child of the rich-text block inside the nested <main>
soup.select('main main div.ma__rich-text>p')[0].text

Output:

'PERAC has not reviewed the RFP notices or other related materials posted on this page for compliance with M.G.L. Chapter 32, section 23B. The publication of these notices should not be interpreted as an indication that PERAC has made a determination as to that compliance.'
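To see what the selector in this answer does without touching the network, here is a minimal sketch run against a small made-up HTML snippet. The markup is an assumption that only imitates the nesting the selector implies (two nested `main` elements and a `div.ma__rich-text`); the real page may look different or change over time. The helper name `body_paragraphs` is hypothetical, and the built-in `html.parser` is used so that `lxml` is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration; not copied from mass.gov.
SAMPLE_HTML = """
<html><body>
<nav><p>Site navigation</p></nav>
<main>
  <main>
    <div class="ma__rich-text">
      <p>First body paragraph.</p>
      <p>Second body paragraph.</p>
      <ul><li><p>Nested paragraph, not a direct child.</p></li></ul>
    </div>
  </main>
</main>
<footer><p>Footer text</p></footer>
</body></html>
"""


def body_paragraphs(html, selector="main main div.ma__rich-text>p"):
    """Return the stripped text of every element matched by the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(selector)]


# Only <p> tags that are direct children of div.ma__rich-text are matched,
# so the nav, footer, and list-item paragraphs are left out.
paragraphs = body_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

Because the selector is passed in as a parameter, the same helper could be reused for other sites, but each site would most likely need its own selector found by inspecting its HTML.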

慕姐8265434

You can use bs.find('p', text=re.compile('PERAC')) to extract that paragraph:

from bs4 import BeautifulSoup
import requests
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/83.0.4103.61 Safari/537.36'
}
boston_url = (
    'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
)
resp = requests.get(boston_url, headers=headers)
bs = BeautifulSoup(resp.text)
# first <p> whose text matches the regular expression
bs.find('p', text=re.compile('PERAC'))
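The regex lookup can be tried offline with a rough sketch like the one below. The HTML is invented for illustration, and `paragraphs_matching` is a hypothetical helper. It uses `string=`, the newer spelling of the `text=` argument in recent Beautiful Soup releases. As far as I know, that filter is compared against a tag's `.string`, which is `None` when a paragraph mixes text with child tags, so the helper tests the regex against `get_text()` instead.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration.
SAMPLE_HTML = """
<div>
  <p>Unrelated introduction.</p>
  <p>PERAC has not reviewed the notices.</p>
  <p>Contact <a href="#">PERAC</a> for details.</p>
</div>
"""


def paragraphs_matching(html, pattern):
    """Return the text of every <p> whose full text matches the regex pattern."""
    regex = re.compile(pattern)
    soup = BeautifulSoup(html, "html.parser")
    return [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p")
        if regex.search(p.get_text())
    ]


# All paragraphs mentioning PERAC, including the one that contains a link.
matches = paragraphs_matching(SAMPLE_HTML, "PERAC")
print(matches)

# The direct lookup from the answer, written with string= instead of text=.
first = BeautifulSoup(SAMPLE_HTML, "html.parser").find("p", string=re.compile("PERAC"))
print(first.get_text())
```

Note that this approach still assumes you know a word that appears in the target paragraph, which the question says is not always the case.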