How do I scrape all the text on a web page up to a specific heading in Python?

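I am trying to print all the text on a web page, from the start of the page up to a specific heading.


I want all the text on that page up to that heading, and nothing after it.


The code I am trying to run (Python 3):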


import requests
import bs4

urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = (bs4.BeautifulSoup(res.text, 'lxml')).get_text()
print(soup1)

This code produces the following output:


Albert Einstein - Wikipedia

document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Albert_Einstein","wgTitle":"Albert Einstein","wgCurRevisionId":920687884,"wgRevisionId":920687884,"wgArticleId":736,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with missing ISBNs","Webarchive template wayback links","CS1 German-language sources (de)","CS1: Julian–Gregorian uncertainty","CS1 French-language sources (fr)","CS1 errors: missing periodical","CS1: long volume value","Wikipedia indefinitely semi-protected pages","Use American English from February 2019","All Wikipedia articles written in American English","Articles with short description","Good articles","Articles containing German-language text","Biography with signature","Articles with hCards","Articles with hAudio microformats","All articles with unsourced statements",



幕布斯6054654
1 answer

喵喔喔

You can try this.

Code:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Albert_Einstein'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
# print(soup.prettify())

# Everything that precedes the matched heading: a list of bs4 Tag objects
until_soup = soup.find('h1', class_='firstHeading', text='Albert Einstein').find_all_previous()[::-1][1:]
# print(type(until_soup[0]))
# print(until_soup)

# After the join, output is a plain string rather than Tag objects
output = ''.join([str(_) for _ in until_soup])
# print(type(output))
# print(output)

I strongly recommend using an API call instead, as follows:

import wikipediaapi

wiki_html = wikipediaapi.Wikipedia(language='en', extract_format=wikipediaapi.ExtractFormat.HTML)
p_html = wiki_html.page('Albert Einstein')
# print(p_html.text)
# p_html.text is a string; you can apply regex matching up to the heading you want
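To stop at a particular section such as Publications, one option is to rely on the URL fragment: `#Publications` refers to the element whose `id` is `Publications`, so the text can be collected node by node until that element is reached. The following is only a minimal sketch of that idea, not a definitive solution. The function name `text_before_anchor` and the `SAMPLE_HTML` string are made up for illustration, the built-in `html.parser` is used instead of `lxml` to avoid an extra dependency, and it assumes the target heading really carries the `id` you pass in. It also drops `<script>` and `<style>` tags first, since their contents are what produced the `RLCONF=...` noise in the question's output.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def text_before_anchor(html, anchor_id):
    """Return the page text that appears before the element with id=anchor_id.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style tags so their contents are not collected as text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # The URL fragment (#Publications) corresponds to an element id in the page.
    stop = soup.find(id=anchor_id)
    if stop is None:
        raise ValueError(f"no element with id={anchor_id!r} found")

    # Walk the tree in document order and keep plain text nodes
    # until the stop element is reached.
    pieces = []
    for node in soup.descendants:
        if node is stop:
            break
        # Exact type check skips comments, doctypes and similar special strings.
        if type(node) is NavigableString:
            pieces.append(str(node))
    return "".join(pieces).strip()


# Small stand-in document (an assumption, not the real Wikipedia markup).
SAMPLE_HTML = """
<html><head><title>Demo page</title>
<script>var config = {"a": 1};</script></head>
<body>
<h1>Demo</h1>
<p>Intro <b>text</b>.</p>
<h2 id="Publications">Publications</h2>
<p>Everything from here on is left out.</p>
</body></html>
"""

result = text_before_anchor(SAMPLE_HTML, "Publications")
print(result)
```

For the real page you would pass `requests.get(url).text` and `'Publications'` instead of the sample string. Note that the fragment part of a URL is never sent to the server, so requesting the URL with `#Publications` returns the same full page as requesting it without.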