Parsing different elements inside a single tag with BeautifulSoup

Background: I'm fairly experienced with Python, but a complete newbie with BeautifulSoup.


I'm trying to pull 3 values out of a single tag. The page I'm working with consists of a series of elements that look like this:


<blockquote>

<a name="title"><p><B>Title</b> <table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue"><tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font></td></tr></table> Body Text.

<a name="title2".... etc

</blockquote>

At the moment I'm just dumping all of the text into a list like this:


page_html = soup(page, 'html.parser')

text = []
for a in page_html.select('a'):
    text.append(a.text)

For each entry this returns a line like the following:


Title Subtitle: Top Text. Body Text.

What I really want is to parse each <a> into one row of a DataFrame, like this:


col1      col2                    col3

Title     Subtitle: Top Text.     Body Text.

But frankly, I'm in a bit over my head.


RISEBY
2 answers

湖上湖

If all of your <a> tags have the same structure, you can use:

from bs4 import BeautifulSoup
import pandas as pd

page = '''<blockquote><a name="title"><p><B>Title</b> <table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue"><tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font></td></tr></table> Body Text.</blockquote>'''

soup = BeautifulSoup(page, "html.parser")

# Collect one dict per <a> so that every anchor becomes a row.
rows = []
for a in soup.find_all('a'):
    paragraph = a.find('p')
    title = a.find('b').text
    # The second <b> holds "Subtitle: Top Text."
    subtitle = a.find_all('b')[1].text
    # Text nodes that are direct children of <p>, i.e. the body text.
    other = ''.join(paragraph.find_all(string=True, recursive=False))
    rows.append({'col1': title, 'col2': subtitle, 'col3': other})

df = pd.DataFrame(rows)
print(df)

Output:

    col1                 col2          col3
0  Title  Subtitle: Top Text.    Body Text.
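If some entries on the real page are missing the table or the second bold element, indexing with `find_all('b')[1]` raises an IndexError. The following is only a sketch of a more defensive variant, under stated assumptions: the two-entry `page` string and the helper name `parse_anchor` are made up for illustration, and it relies on `html.parser` leaving the unclosed <a> and <p> tags nested inside each other (other parsers may repair the markup differently).

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry page modelled on the snippet in the question.
# The second entry deliberately has no <table>.
page = (
    '<blockquote>'
    '<a name="title"><p><b>Title</b> <table><tr><td><font>'
    '<b><i>Subtitle</i>: Top Text.</b></font></td></tr></table> Body Text.'
    '<a name="title2"><p><b>Title 2</b> Body Text 2.'
    '</blockquote>'
)


def parse_anchor(a):
    """Return a row dict for one <a>, or None if it contains no <p>."""
    paragraph = a.find('p')
    if paragraph is None:
        return None
    # recursive=False limits each search to direct children of this <p>,
    # so elements belonging to a later, nested <a> are not picked up.
    title = paragraph.find('b', recursive=False)
    table = paragraph.find('table', recursive=False)
    body = ''.join(paragraph.find_all(string=True, recursive=False))
    return {
        'col1': title.get_text().strip() if title is not None else None,
        'col2': table.get_text().strip() if table is not None else None,
        'col3': body.strip(),
    }


soup = BeautifulSoup(page, 'html.parser')
rows = [row for row in map(parse_anchor, soup.find_all('a')) if row is not None]
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` in the same way as above; missing values show up as None instead of stopping the loop.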

慕的地6264312

Using only the HTML snippet you shared:

from bs4 import BeautifulSoup

content = '<a name="title"><p><B>Title</b> ' \
          '<table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue">' \
          '<tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font>' \
          '</td></tr></table> Body Text.'

soup = BeautifulSoup(content, 'html.parser')
articles = soup.find_all('a')
for article in articles:
    paragraph = article.find('p')
    print({
        'title': article.find('b').text,
        'subtitle': article.select('table i')[0].text,
        # Text nodes that are direct children of <p>, i.e. the body text.
        'body': ''.join(paragraph.find_all(string=True, recursive=False))
    })

Since the question is mainly about BeautifulSoup rather than pandas, I assume a dictionary is enough and that you can load it into a DataFrame or another data structure yourself?

Result:

{'title': 'Title', 'subtitle': 'Subtitle', 'body': '  Body Text.'}
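For completeness, here is one way the dictionaries from this answer might be loaded into a DataFrame. It is a minimal sketch, not part of the original answer: the list name `records` is introduced for illustration, the table attributes are trimmed from the sample HTML, and the body text is stripped of its leading whitespace.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Same structure as the snippet in the question, with the presentational
# attributes removed to keep the example short.
content = ('<a name="title"><p><B>Title</b> '
           '<table><tr><td><font><b><I>Subtitle</I>: Top Text.</b></font>'
           '</td></tr></table> Body Text.')

soup = BeautifulSoup(content, 'html.parser')

# Build one dict per <a>, then let pandas turn the list into rows.
records = []
for article in soup.find_all('a'):
    paragraph = article.find('p')
    body = ''.join(paragraph.find_all(string=True, recursive=False))
    records.append({
        'title': article.find('b').text,
        'subtitle': article.select('table i')[0].text,
        'body': body.strip(),
    })

df = pd.DataFrame(records)
print(df)
```

Building the DataFrame once from a list of dicts avoids creating and discarding a frame on every loop iteration.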