Extracting the elements under a specific tag with Beautiful Soup

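I want to extract the elements that belong to a specific tag. For example, a page has four h2 tags. Each h2 has other sibling tags such as p, h3, h4, ul, and so on. I want to view the elements of h2[1] and the elements of h2[2] separately.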

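This is what I have done so far. I know the for loop doesn't make any sense. I also tried appending the text but could not get it to work. Then I tried searching for a specific string, but that returned only the single tag containing the string, not all the other elements.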

import requests
from bs4 import BeautifulSoup

page = "https://www.us-cert.gov/ics/advisories/icsma-20-079-01"
resp = requests.get(page)
soup = BeautifulSoup(resp.content, "html5lib")

content_div = soup.find('div', {"class": "content"})
all_p = content_div.find_all('p')
all_h2 = content_div.find_all('h2')

# Prints the i-th h2 next to the i-th p; this does not group the
# paragraphs under the heading they belong to.
i = 0
for h2 in all_h2:
    print(all_h2[i], '\n\n')
    print(all_p[i], '\n')
    i = i + 1

I also tried using append:

tags = soup.find_all('div', {"class": "content"})
container = []
for tag in tags:
    try:
        container.append(tag.text)
        print(tag.text)
    except:
        print(tag)

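I am new to programming, so please forgive my poor coding skills. I just want to see everything under "Mitigations", so that if I store it in a database, everything related to mitigations is parsed into a single column.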

呼唤远方


1 answer

ITMISS

You can use findNext with a static list of tags, ["p","ul","h2","div"], and recursive=False to stay at the top level:

import requests
from bs4 import BeautifulSoup
import json

resp = requests.get("https://www.us-cert.gov/ics/advisories/icsma-20-079-01")
soup = BeautifulSoup(resp.content, "html.parser")

content_div = soup.find('div', {"class": "content"})
h2_list = [i for i in content_div.find_all("h2")]

result = []
search_tags = ["p", "ul", "h2", "div"]

def getChildren(tag):
    # Collect the text of every p/ul that follows the heading,
    # stopping at the next h2 or div.
    text = []
    while (tag):
        tag = tag.findNext(search_tags, recursive=False)
        if (tag is None):
            break
        elif (tag.name == "div") or (tag.name == "h2"):
            break
        else:
            text.append(tag.text.strip())
    return "".join(text)

# One entry per h2: the heading text plus the text found under it.
for i in h2_list:
    result.append({
        "name": i.text.strip(),
        "children": getChildren(i)
    })

print(json.dumps(result, indent=4, sort_keys=True))
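As a possible variation on the same idea, the siblings after each h2 can be walked with next_siblings, which only visits nodes at the same level as the heading. The sketch below is an illustration under assumptions, not the code from the answer: SAMPLE_HTML is a made-up stand-in for the advisory page so the example runs offline, and sections_by_heading is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the advisory page, so the sketch runs offline.
SAMPLE_HTML = """
<div class="content">
  <h2>Overview</h2>
  <p>Summary text.</p>
  <h2>Mitigations</h2>
  <p>Apply the vendor patch.</p>
  <ul><li>Restrict network access.</li><li>Use a VPN.</li></ul>
  <div class="footer">Contact</div>
</div>
"""


def sections_by_heading(html, heading="h2", stop_tags=("h2", "div")):
    """Map each heading's text to the text of the sibling tags after it."""
    soup = BeautifulSoup(html, "html.parser")
    content_div = soup.find("div", {"class": "content"})
    sections = {}
    for h in content_div.find_all(heading):
        parts = []
        for sibling in h.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip bare strings between tags (mostly whitespace)
            if sibling.name in stop_tags:
                break  # the next section (or a trailing div) starts here
            parts.append(sibling.get_text(" ", strip=True))
        sections[h.get_text(strip=True)] = "\n".join(parts)
    return sections


sections = sections_by_heading(SAMPLE_HTML)
print(sections["Mitigations"])
```

With a mapping like this, sections["Mitigations"] is a single string, which should fit the goal of storing everything under that heading in one database column.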
