Script throws an error while using one specific link among several

I've written a script in scrapy, in combination with selenium, to parse the names of the CEOs of different companies from a webpage. The names of the different companies are available on the landing page, but you can only get a CEO's name once you click on that company's link.

The script below can parse the links of the different companies and use those links to scrape the names of the CEOs, except for the second company. When the script tries to parse the CEO's name using the second company's link, it runs into a stale element reference error. The script fetches the rest of the results correctly even though it encounters that error along the way. Once again: it only throws the error when parsing the information using the second company link. How weird!!
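For reference, a stale element reference means that a previously located WebElement points at a DOM node that no longer exists, typically because the page was navigated or re-rendered after the lookup. A minimal standalone repro of the mechanism (not part of my spider; it assumes chromedriver is on PATH and uses example.com purely for illustration):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome()
driver.get("https://example.com")
anchor = driver.find_element(By.TAG_NAME, "a")  # a live reference into the current DOM
href = anchor.get_attribute("href")             # a plain string copy of the link
driver.get(href)                                # navigating away destroys the old DOM
try:
    anchor.get_attribute("href")                # the old handle is now stale
except StaleElementReferenceException:
    print("stale element reference, as expected")
driver.get(href)                                # the string copy is still perfectly usable
driver.quit()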

This is what I've tried so far:


import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):
            company_link = item.find_element_by_css_selector('a[class*="searchResults__cellWrapper--"]').get_attribute("href")
            yield scrapy.Request(company_link,callback=self.get_inner_content)

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

This is the kind of result I'm getting:


Jeffrey P. Bezos

    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=76.0.3809.132)

Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Brian S. Tyler
C. Douglas McMillon
David S. Wichmann
Randall L. Stephenson
Steven H. Collis

and so on------------

How can I fix the error my script encounters while dealing with the second company link?


PS: I could use their API to get all the information, but I'm curious to know why the script above faces this weird issue.


喵喔喔
3 Answers

慕桂英546537

A slightly modified approach should get you all the desired content from that site without any issue. All you need to do is store all the target links as a list within the get_links() method and use return or yield when making the callback to get_inner_content(). You can also disable images to make the script slightly faster. The original version most likely fails because Scrapy starts processing the first yielded request (navigating the shared driver away from the list page) while the for loop in get_links() is still iterating over live WebElements from that page; collecting the href strings up front means nothing stale is ever touched. The following attempt should get you all the results:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.crawler import CrawlerProcess

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        option = webdriver.ChromeOptions()
        chrome_prefs = {}
        option.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = webdriver.Chrome(options=option)
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
        return [scrapy.Request(link,callback=self.get_inner_content) for link in item_links]

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FortuneSpider)
    process.start()

Or using yield:

def get_links(self,response):
    self.driver.get(response.url)
    item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
    for link in item_links:
        yield scrapy.Request(link,callback=self.get_inner_content)

小唯快跑啊

To parse the names of the CEOs of the different companies from the webpage https://fortune.com/fortune500/search/, Selenium on its own is enough. You need to:

- Scroll to the last item on the webpage.
- Collect the href attributes and store them in a list.
- Open each href in an adjacent tab.
- Switch focus to the newly opened tab and induce WebDriverWait with visibility_of_element_located().

You can use the following Locator Strategies:

# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://fortune.com/fortune500/search/")
driver.execute_script("arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Explore Lists from Other Years']"))))
my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@class, 'searchResults__cellWrapper--') and contains(@href, 'fortune500')][.//span/div]")))]
windows_before = driver.current_window_handle
for my_href in my_hrefs:
    driver.execute_script("window.open('" + my_href + "');")
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]
    driver.switch_to.window(new_window)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table/tbody/tr//td[starts-with(@class, 'dataTable__value')]/div"))).text)
    driver.close() # close the child window
    driver.switch_to.window(windows_before) # switch back to the parent window handle
driver.quit()

Console output:

C. Douglas McMillon
Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Jeffrey P. Bezos
David S. Wichmann
Brian S. Tyler
Larry J. Merlo
Randall L. Stephenson
Steven H. Collis
Michael K. Wirth
James P. Hackett
Mary T. Barra
W. Craig Jelinek
Larry Page
Michael C. Kaufmann
Stefano Pessina
James Dimon
Hans E. Vestberg
W. Rodney McMullen
H. Lawrence Culp Jr.
Hugh R. Frater
Greg C. Garland
Joseph W. Gorder
Brian T. Moynihan
Satya Nadella
Craig A. Menear
Dennis A. Muilenburg
C. Allen Parker
Michael L. Corbat
Gary R. Heminger
Brian L. Roberts
Gail K. Boudreaux
Michael S. Dell
Marc Doyle
Michael L. Tipsord
Alex Gorsky
Virginia M. Rometty
Brian C. Cornell
Donald H. Layton
David P. Abney
Marvin R. Ellison
Robert H. Swan
Michel A. Khalaf
David S. Taylor
Gregory J. Hayes
Frederick W. Smith
Ramon L. Laguarta
Juan R. Luciano
..
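As a note on the design choice above: the open/switch/close/switch-back dance is a general pattern, independent of this site. A stripped-down, self-contained sketch of just that pattern (my own illustration, not taken from the answer; example.com and example.org are placeholder URLs):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")
parent = driver.current_window_handle

# open a second tab via JavaScript, then wait until its handle shows up
driver.execute_script("window.open('https://example.org');")
WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))

# switch to the handle that is not the parent, work there, then return
child = [h for h in driver.window_handles if h != parent][0]
driver.switch_to.window(child)
print(driver.title)               # any per-page work happens here
driver.close()                    # closes only the child tab
driver.switch_to.window(parent)   # focus must be restored explicitly
driver.quit()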

沧海一幻觉

Here is how you can get the company details much faster and more easily without using Selenium at all. Check how I obtain company_name and change_the_world; you can extract the other details the same way.

import requests
from bs4 import BeautifulSoup
import re
import html

with requests.Session() as session:
    response = session.get("https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932")
    items = response.json()[1]["items"]
    for item in items:
        company_name = html.unescape(list(filter(lambda x: x['key'] == 'name', item["fields"]))[0]["value"])
        change_the_world = list(filter(lambda x: x['key'] == 'change-the-world-y-n', item["fields"]))[0]["value"]

        response = session.get(item["permalink"])
        preload_data = BeautifulSoup(response.text, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload_data).groups()[0]
        print(f"Company: {company_name}, CEO: {ceo}, Change The World: {change_the_world}")

Results:

Company: Carvana, CEO: Ernest C. Garcia, Change The World: no
Company: ManTech International, CEO: Kevin M. Phillips, Change The World: no
Company: NuStar Energy, CEO: Bradley C. Barron, Change The World: no
Company: Shutterfly, CEO: Ryan O'Hara, Change The World: no
Company: Spire, CEO: Suzanne Sitherwood, Change The World: no
Company: Align Technology, CEO: Joseph M. Hogan, Change The World: no
Company: Herc Holdings, CEO: Lawrence H. Silber, Change The World: no
...
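If you want the output somewhere more durable than stdout, the same loop can feed csv.DictWriter directly. A sketch of one way to do that (assuming the endpoint and the field keys behave exactly as in the answer above; the dict comprehension over item["fields"] and the file name fortune500_ceos.csv are my own additions):

import csv
import html
import re

import requests
from bs4 import BeautifulSoup

URL = "https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932"

with requests.Session() as session, open("fortune500_ceos.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["company", "ceo", "change_the_world"])
    writer.writeheader()
    for item in session.get(URL).json()[1]["items"]:
        # flatten the key/value pairs into a plain dict for easier lookups
        fields = {field["key"]: field["value"] for field in item["fields"]}
        page = session.get(item["permalink"]).text
        preload = BeautifulSoup(page, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload).group(1)
        writer.writerow({
            "company": html.unescape(fields["name"]),
            "ceo": ceo,
            "change_the_world": fields["change-the-world-y-n"],
        })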
