Script throws an error while using one specific link among several

I've written a script in scrapy, in combination with selenium, to parse the names of the CEOs of different companies from a webpage. The names of the different companies are available on the landing page, but you can only get a CEO's name once you click on that company's link.

The script below can parse the links of the different companies and use those links to scrape the names of the CEOs, except for the second company. When the script tries to parse the CEO's name using the second company's link, it runs into a stale element reference error. The script fetches the rest of the results correctly even though it encounters that error along the way. Once again: it only throws the error when parsing the information using the second company link. How weird!!
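For reference, a stale element reference means that a previously located WebElement points at a DOM node that no longer exists, typically because the page was navigated or re-rendered after the lookup. A minimal standalone repro of the mechanism (not part of my spider; it assumes chromedriver is on PATH and uses example.com purely for illustration):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.Chrome()
driver.get("https://example.com")
anchor = driver.find_element(By.TAG_NAME, "a")  # a live reference into the current DOM
href = anchor.get_attribute("href")             # a plain string copy of the link
driver.get(href)                                # navigating away destroys the old DOM
try:
    anchor.get_attribute("href")                # the old handle is now stale
except StaleElementReferenceException:
    print("stale element reference, as expected")
driver.get(href)                                # the string copy is still perfectly usable
driver.quit()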

This is what I've tried so far:


import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"]'))):
            company_link = item.find_element_by_css_selector('a[class*="searchResults__cellWrapper--"]').get_attribute("href")
            yield scrapy.Request(company_link,callback=self.get_inner_content)

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

This is the kind of result I'm getting:


Jeffrey P. Bezos

    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=76.0.3809.132)

Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Brian S. Tyler
C. Douglas McMillon
David S. Wichmann
Randall L. Stephenson
Steven H. Collis

and so on------------

How can I fix the error my script encounters while dealing with the second company link?


PS: I could use their API to get all the information, but I'm curious to know why the script above faces this weird issue.


喵喔喔
3 Answers

慕桂英546537

A slightly modified approach should get you all the desired content from that site without any issue. All you need to do is store all the target links as a list within the get_links() method and use return or yield when making the callback to get_inner_content(). You can also disable images to make the script slightly faster. The original version most likely fails because Scrapy starts processing the first yielded request (navigating the shared driver away from the list page) while the for loop in get_links() is still iterating over live WebElements from that page; collecting the href strings up front means nothing stale is ever touched. The following attempt should get you all the results:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.crawler import CrawlerProcess

class FortuneSpider(scrapy.Spider):

    name = 'fortune'
    url = 'http://fortune.com/fortune500/list/'

    def start_requests(self):
        option = webdriver.ChromeOptions()
        chrome_prefs = {}
        option.experimental_options["prefs"] = chrome_prefs
        chrome_prefs["profile.default_content_settings"] = {"images": 2}
        chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
        self.driver = webdriver.Chrome(options=option)
        self.wait = WebDriverWait(self.driver,10)
        yield scrapy.Request(self.url,callback=self.get_links)

    def get_links(self,response):
        self.driver.get(response.url)
        item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
        return [scrapy.Request(link,callback=self.get_inner_content) for link in item_links]

    def get_inner_content(self,response):
        self.driver.get(response.url)
        chief_executive = self.wait.until(EC.presence_of_element_located((By.XPATH, '//tr[td[.="CEO"]]//td[contains(@class,"dataTable__value--")]/div'))).text
        yield {'CEO': chief_executive}

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(FortuneSpider)
    process.start()

Or using yield:

def get_links(self,response):
    self.driver.get(response.url)
    item_links = [item.get_attribute("href") for item in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[class*="searchResults__title--"] a[class*="searchResults__cellWrapper--"]')))]
    for link in item_links:
        yield scrapy.Request(link,callback=self.get_inner_content)

小唯快跑啊

To parse the names of the CEOs of the different companies from the webpage https://fortune.com/fortune500/search/, Selenium on its own is enough. You need to:

- Scroll to the last item on the webpage.
- Collect the href attributes and store them in a list.
- Open each href in an adjacent tab.
- Switch focus to the newly opened tab and induce WebDriverWait with visibility_of_element_located().

You can use the following Locator Strategies:

# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://fortune.com/fortune500/search/")
driver.execute_script("arguments[0].scrollIntoView(true);", WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Explore Lists from Other Years']"))))
my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@class, 'searchResults__cellWrapper--') and contains(@href, 'fortune500')][.//span/div]")))]
windows_before = driver.current_window_handle
for my_href in my_hrefs:
    driver.execute_script("window.open('" + my_href + "');")
    WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]
    driver.switch_to.window(new_window)
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table/tbody/tr//td[starts-with(@class, 'dataTable__value')]/div"))).text)
    driver.close() # close the child window
    driver.switch_to.window(windows_before) # switch back to the parent window handle
driver.quit()

Console output:

C. Douglas McMillon
Darren W. Woods
Timothy D. Cook
Warren E. Buffett
Jeffrey P. Bezos
David S. Wichmann
Brian S. Tyler
Larry J. Merlo
Randall L. Stephenson
Steven H. Collis
Michael K. Wirth
James P. Hackett
Mary T. Barra
W. Craig Jelinek
Larry Page
Michael C. Kaufmann
Stefano Pessina
James Dimon
Hans E. Vestberg
W. Rodney McMullen
H. Lawrence Culp Jr.
Hugh R. Frater
Greg C. Garland
Joseph W. Gorder
Brian T. Moynihan
Satya Nadella
Craig A. Menear
Dennis A. Muilenburg
C. Allen Parker
Michael L. Corbat
Gary R. Heminger
Brian L. Roberts
Gail K. Boudreaux
Michael S. Dell
Marc Doyle
Michael L. Tipsord
Alex Gorsky
Virginia M. Rometty
Brian C. Cornell
Donald H. Layton
David P. Abney
Marvin R. Ellison
Robert H. Swan
Michel A. Khalaf
David S. Taylor
Gregory J. Hayes
Frederick W. Smith
Ramon L. Laguarta
Juan R. Luciano
..
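As a note on the design choice above: the open/switch/close/switch-back dance is a general pattern, independent of this site. A stripped-down, self-contained sketch of just that pattern (my own illustration, not taken from the answer; example.com and example.org are placeholder URLs):

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")
parent = driver.current_window_handle

# open a second tab via JavaScript, then wait until its handle shows up
driver.execute_script("window.open('https://example.org');")
WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))

# switch to the handle that is not the parent, work there, then return
child = [h for h in driver.window_handles if h != parent][0]
driver.switch_to.window(child)
print(driver.title)               # any per-page work happens here
driver.close()                    # closes only the child tab
driver.switch_to.window(parent)   # focus must be restored explicitly
driver.quit()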

沧海一幻觉

Here is how you can get the company details much faster and more easily without using Selenium at all. Check how I obtain company_name and change_the_world; you can extract the other details the same way.

import requests
from bs4 import BeautifulSoup
import re
import html

with requests.Session() as session:
    response = session.get("https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932")
    items = response.json()[1]["items"]
    for item in items:
        company_name = html.unescape(list(filter(lambda x: x['key'] == 'name', item["fields"]))[0]["value"])
        change_the_world = list(filter(lambda x: x['key'] == 'change-the-world-y-n', item["fields"]))[0]["value"]

        response = session.get(item["permalink"])
        preload_data = BeautifulSoup(response.text, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload_data).groups()[0]
        print(f"Company: {company_name}, CEO: {ceo}, Change The World: {change_the_world}")

Results:

Company: Carvana, CEO: Ernest C. Garcia, Change The World: no
Company: ManTech International, CEO: Kevin M. Phillips, Change The World: no
Company: NuStar Energy, CEO: Bradley C. Barron, Change The World: no
Company: Shutterfly, CEO: Ryan O'Hara, Change The World: no
Company: Spire, CEO: Suzanne Sitherwood, Change The World: no
Company: Align Technology, CEO: Joseph M. Hogan, Change The World: no
Company: Herc Holdings, CEO: Lawrence H. Silber, Change The World: no
...
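If you want the output somewhere more durable than stdout, the same loop can feed csv.DictWriter directly. A sketch of one way to do that (assuming the endpoint and the field keys behave exactly as in the answer above; the dict comprehension over item["fields"] and the file name fortune500_ceos.csv are my own additions):

import csv
import html
import re

import requests
from bs4 import BeautifulSoup

URL = "https://content.fortune.com/wp-json/irving/v1/data/franchise-search-results?list_id=2611932"

with requests.Session() as session, open("fortune500_ceos.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["company", "ceo", "change_the_world"])
    writer.writeheader()
    for item in session.get(URL).json()[1]["items"]:
        # flatten the key/value pairs into a plain dict for easier lookups
        fields = {field["key"]: field["value"] for field in item["fields"]}
        page = session.get(item["permalink"]).text
        preload = BeautifulSoup(page, "html.parser").select_one("#preload").text
        ceo = re.search('"ceo","value":"(.*?)"', preload).group(1)
        writer.writerow({
            "company": html.unescape(fields["name"]),
            "ceo": ceo,
            "change_the_world": fields["change-the-world-y-n"],
        })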
