How to scrape data from a flexbox element/container with Python and Beautiful Soup

I am trying to scrape data from a utility website using Python, Beautiful Soup, and Selenium. The data I want includes fields such as time, cause, and status. When I run a typical page request, parse the page, pull out the data I am looking for (the contents of id="OutageListTable"), and print it, the div and its text are nowhere to be found. When I inspect the page elements, the data is there, but it sits inside a flex container.


Here is the code I am using:


from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib3
from selenium import webdriver

my_url = 'https://www.pse.com/outage/outage-map'

browser = webdriver.Firefox()
browser.get(my_url)

html = browser.page_source
page_soup = soup(html, features='lxml')

outage_list = page_soup.find(id='OutageListTable')
print(outage_list)

browser.quit()

How do you retrieve information that lives inside a flex/flexbox container? I have not found any resources online that help with this.
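A minimal diagnostic sketch, assuming the page renders its content with JavaScript: check whether the id appears in the page source Selenium hands back before parsing it. The pause length here is an arbitrary guess, not something from the original question.

from selenium import webdriver
from time import sleep

browser = webdriver.Firefox()
browser.get('https://www.pse.com/outage/outage-map')
sleep(10)  # arbitrary pause to let the JavaScript render (assumption, tune as needed)

html = browser.page_source
# If this prints False, the element is simply not in the HTML yet, so find()
# returns None regardless of any flexbox styling.
print('OutageListTable' in html)

browser.quit()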


2 Answers

慕哥6287543

You are overthinking this. First of all, there is no flexbox container to deal with. This is a simple case of targeting the right div class. You should look at div class_='col-xs-12 col-sm-6 col-md-4 listView-container'.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep

# create object for chrome options
chrome_options = Options()
base_url = 'https://www.pse.com/outage/outage-map'
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# To disable the message "Chrome is being controlled by automated test software"
chrome_options.add_argument('disable-infobars')
# Pass the argument 1 to allow and 2 to block
chrome_options.add_experimental_option('prefs', {
    'profile.default_content_setting_values.notifications': 2
})

# invoke the webdriver
browser = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                           options=chrome_options)
browser.get(base_url)
delay = 5  # seconds

while True:
    try:
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")
        for item_n in soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-4 listView-container'):
            for item_n_text in item_n.find_all(name="span"):
                print(item_n_text.text)
    except TimeoutException:
        print("Loading took too much time! - Try again")

# close the automated browser
browser.close()

Output (excerpt):

Cause: Accident
Status: Crew assigned
Last updated: 06/02 11:00 PM
9. Woodinville
Start time: 06/02 08:29 PM
Est. restoration time: 06/03 03:30 AM
Customers impacted: 2
Cause: Under Investigation
Status: Crew assigned
Last updated: 06/03 12:15 AM
Page is ready
1. Bellingham
Start time: 06/02 06:09 PM
Est. restoration time: 06/03 06:30 AM
Customers impacted: 1
Cause: Trees/Vegetation
Status: Crew assigned
Last updated: 06/02 11:50 PM
2. Deming
Start time: 06/02 07:10 PM
Est. restoration time: 06/03 03:30 AM
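One detail worth noting: constructing WebDriverWait(browser, delay) by itself does not wait for anything until .until() is called, so in the code above it is the sleep(5) that actually gives the page time to render. Below is a short sketch of an explicit wait on the same list-item class; the 10-second timeout is an assumption, and driver setup is left to Selenium's defaults.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.pse.com/outage/outage-map')

# Block until at least one outage card is present, or raise TimeoutException after 10 s.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'div.col-xs-12.col-sm-6.col-md-4.listView-container')))

soup = BeautifulSoup(browser.page_source, 'html.parser')
for card in soup.find_all('div', class_='col-xs-12 col-sm-6 col-md-4 listView-container'):
    for span in card.find_all('span'):
        print(span.text)

browser.close()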

BIG阳

The data is loaded dynamically via Javascript. You can use the requests module to get it. For example:

import json
import requests

url = 'https://www.pse.com/api/sitecore/OutageMap/AnonymoussMapListView'
data = requests.get(url).json()

# uncomment this to print all data:
#print(json.dumps(data, indent=4))

for d in data['PseMap']:
    print('{} - {}'.format(d['DataProvider']['PointOfInterest']['Title'], d['DataProvider']['PointOfInterest']['MapType']))
    for info in d['DataProvider']['Attributes']:
        print(info['Name'], info['Value'])
    print('-' * 80)

Prints:

Bellingham - Outage
Start time 06/02 06:09 PM
Est. restoration time 06/03 06:30 AM
Customers impacted 1
Cause Trees/Vegetation
Status Crew assigned
Last updated 06/02 11:50 PM
--------------------------------------------------------------------------------
Deming - Outage
Start time 06/02 07:10 PM
Est. restoration time 06/03 03:30 AM
Customers impacted 568
Cause Accident
Status Repair crew onsite
Last updated 06/02 11:50 PM
--------------------------------------------------------------------------------
Everest - Outage
Start time 06/02 10:42 AM
Customers impacted 4
Cause Scheduled Outage
Status Repair crew onsite
Last updated 06/02 10:50 AM
--------------------------------------------------------------------------------
Kenmore - Outage
Start time 06/02 09:59 PM
Est. restoration time 05/29 01:00 AM
Customers impacted 2
Cause Scheduled Outage
Status Repair crew onsite
Last updated 06/02 10:05 PM
--------------------------------------------------------------------------------
Kent - Outage
Start time 06/02 06:43 PM
Est. restoration time To Be Determined
Customers impacted 26
Cause Car/Equip Accident
Status Waiting for repairs
Last updated 06/02 10:15 PM
--------------------------------------------------------------------------------
Kent - Outage
Start time 06/02 10:09 PM
Est. restoration time To Be Determined
Customers impacted 13
Cause Under Investigation
Status Repair crew onsite
Last updated 06/02 10:15 PM
--------------------------------------------------------------------------------
Northwest Bellevue - Outage
Start time 06/02 11:28 PM
Est. restoration time To Be Determined
Customers impacted 14
Cause Under Investigation
Status Repair crew onsite
Last updated 06/02 11:30 PM
--------------------------------------------------------------------------------
Pacific - Outage
Start time 06/02 06:19 PM
Est. restoration time 06/03 02:30 AM
Customers impacted 3
Cause Accident
Status Crew assigned
Last updated 06/02 11:00 PM
--------------------------------------------------------------------------------
Woodinville - Outage
Start time 06/02 08:29 PM
Est. restoration time 06/03 03:30 AM
Customers impacted 2
Cause Under Investigation
Status Crew assigned
Last updated 06/03 12:15 AM
--------------------------------------------------------------------------------
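Since the question is ultimately about collecting fields such as time, cause, and status, here is a short follow-up sketch that flattens the same JSON into a CSV file. The endpoint and keys are the ones shown above; the output filename and column handling are assumptions for illustration.

import csv
import requests

url = 'https://www.pse.com/api/sitecore/OutageMap/AnonymoussMapListView'
data = requests.get(url).json()

rows = []
for d in data['PseMap']:
    # Start each row with the outage title, then add one column per attribute name.
    row = {'Title': d['DataProvider']['PointOfInterest']['Title']}
    for info in d['DataProvider']['Attributes']:
        row[info['Name']] = info['Value']
    rows.append(row)

# Collect every column name that appears in any row; 'outages.csv' is a hypothetical filename.
fieldnames = sorted({key for row in rows for key in row})
with open('outages.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)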
