Looping over div elements in HTML with Beautiful Soup

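I am trying to use Beautiful Soup to extract some values from a web page (I'm not very experienced here...). The values are the hourly figures from a Weatherbug forecast. In Chrome developer mode I can see that they are nested inside div classes, as in the snippet below: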

https://img4.mukewang.com/650816be000199c517810653.jpg

In Python, I can try to mimic a web browser and find these values:


import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103'

# Browser-like headers so the request resembles one sent by Chrome
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

page = requests.get(url, headers=header)

soup = BeautifulSoup(page.text, 'html.parser')

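With the code below I can find 12 of these hour-card__mobile__cond div classes, which seems right, because the hourly forecast shows 12 future hours of data. I'm not sure why I end up selecting the mobile layout...(?)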


temp_containers = soup.find_all('div', class_='hour-card__mobile__cond')
print(type(temp_containers))
print(len(temp_containers))

Output:

<class 'bs4.element.ResultSet'>
12

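When I try to write code that loops over all of these divs to dig one level deeper, I am doing something wrong below: I get 12 empty lists back. Can anyone give me a tip on what to improve? Ultimately, I'd like to put all 12 future hourly forecast values into a pandas DataFrame.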


for div in temp_containers:
    a = div.find_all('div', class_='temp ng-binding')
    print(a)

Edit: full code based on the answer, with a pandas DataFrame


import requests
from bs4 import BeautifulSoup
import pandas as pd

r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')

stuff = []

for item in soup.select("div.hour-card__mobile__cond"):
    # Take the second child node's text and drop the trailing "°" before converting to int
    temp = int(item.contents[1].get_text(strip=True)[:-1])
    print(temp)
    stuff.append(temp)

df = pd.DataFrame(stuff)
df.columns = ['temp']
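The `[:-1]` slice above relies on the text always ending in exactly one degree sign. As a minimal sketch of a more defensive conversion, the hypothetical helper `parse_temp` below (not part of the original code) pulls the first integer out of the text with a regular expression and returns `None` when there is none. The sample strings are made up and stand in for the scraped text.

```python
import re

# Hypothetical helper, not from the original post: matches an optional
# minus sign followed by one or more digits.
_TEMP_RE = re.compile(r"-?\d+")


def parse_temp(text):
    """Return the first integer found in `text`, or None if there is none."""
    match = _TEMP_RE.search(text)
    return int(match.group()) if match else None


# Made-up sample strings standing in for get_text(strip=True) results
samples = ["51°", "  52° ", "-3°F", "--"]
temps = [parse_temp(s) for s in samples]
print(temps)
```

Rows where the helper returns `None` can then be filtered out before the list is passed to `pd.DataFrame`.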


繁星coding
2 Answers

梦里花落0921

The site loads its content dynamically with JavaScript after the page has loaded, so you can use requests-html or selenium.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

driver.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")

# find_elements_by_css_selector was removed in Selenium 4; use find_elements with By.CSS_SELECTOR
data = driver.find_elements(By.CSS_SELECTOR, "div.temp.ng-binding")

for item in data:
    print(item.text)

driver.quit()

Output:

51°
52°
53°
54°
53°
53°
52°
51°
51°
50°
50°
49°

Update, at the asker's request:

import requests
from bs4 import BeautifulSoup

r = requests.get(
    "https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103")
soup = BeautifulSoup(r.text, 'html.parser')

for item in soup.select("div.hour-card__mobile__cond"):
    item = int(item.contents[1].get_text(strip=True)[:-1])
    print(item, type(item))

Output:

51 <class 'int'>
52 <class 'int'>
53 <class 'int'>
53 <class 'int'>
53 <class 'int'>
53 <class 'int'>
52 <class 'int'>
51 <class 'int'>
51 <class 'int'>
50 <class 'int'>
50 <class 'int'>
50 <class 'int'>

千万里不及你

When you see class="temp ng-binding", it means the div has both the "temp" class and the "ng-binding" class. Passing class_='temp ng-binding' to Beautiful Soup only matches a class attribute that is exactly that string, so it finds nothing here: the HTML that requests receives only has class="temp". When I run your script, the HTML in temp_containers looks like this:

print(temp_containers[0])

<div class="temp">                    51°</div>

So I ran this and got the results:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.weatherbug.com/weather-forecast/hourly/san-francisco-ca-94103'

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

page = requests.get(url, headers=header)

soup = BeautifulSoup(page.text, 'html.parser')

temp_containers = soup.find_all('div', class_='hour-card__mobile__cond')
print(type(temp_containers))
print(len(temp_containers))

for div in temp_containers:
    # Match on the single "temp" class, which is what the served HTML contains
    a = div.find('div', class_='temp')
    print(a.text)
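To see the class-matching behaviour without hitting the network, here is a small sketch that runs Beautiful Soup on an inline HTML string. The markup is an assumption made for illustration: the first card imitates what requests receives, and the second imitates what the browser shows after JavaScript has run. The `texts` helper is hypothetical as well.

```python
from bs4 import BeautifulSoup

# Assumed markup, for illustration only (not copied from the real page)
html = """
<div class="hour-card__mobile__cond"><div class="temp">  51°</div></div>
<div class="hour-card__mobile__cond"><div class="temp ng-binding">52°</div></div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Hypothetical helper: stripped text of each tag in a result list."""
    return [tag.get_text(strip=True) for tag in tags]


# A single class name matches any tag that carries that class
one_class = texts(soup.find_all("div", class_="temp"))

# A space-separated string is compared with the whole class attribute
exact_string = texts(soup.find_all("div", class_="temp ng-binding"))

# The same two classes in a different order do not match
reordered = texts(soup.find_all("div", class_="ng-binding temp"))

# A CSS selector requires both classes, in any order
css_both = texts(soup.select("div.temp.ng-binding"))

print(one_class, exact_string, reordered, css_both)
```

In this sketch only the search on the single `temp` class finds both cards, which is why it is the safer choice when the served HTML and the rendered HTML differ.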