Unable to scrape a table from a website

I am trying to scrape the ranking table on this website: https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/scores_overall/sort_order/asc/cols/scores

But I can't get the data. This is the code I have so far:

import scrapy
from scrapy import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from logzero import logfile, logger


class ScrapeTableSpider(scrapy.Spider):
    name = "scrape-table"
    allowed_domains = ["toscrape.com"]
    start_urls = ['http://quotes.toscrape.com']

    def start_requests(self):
        # headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # driver = webdriver.Chrome()
        options = webdriver.ChromeOptions()
        options.add_argument("headless")
        desired_capabilities = options.to_capabilities()
        driver = webdriver.Chrome('C:/chromedriver', desired_capabilities=desired_capabilities)

        driver.get("https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/scores_overall/sort_order/asc/cols/scores")
        driver.implicitly_wait(2)
        for table in driver.find_element_by_xpath('//*[contains(@id,"datatable-1")]//tr'):
            data = [item.text for item in table.find_elements_by_xpath(".//*[self::td or self::th]")]
            print(data)

Any insight into how to get the data out of the table would be much appreciated.


守着一只汪
2 Answers

繁花不似锦

I don't really understand why you are using both scrapy and selenium, but let's say we use only selenium. To get the text from the table, you can do something as simple as this:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("headless")
desired_capabilities = options.to_capabilities()
driver = webdriver.Chrome('C:/chromedriver', desired_capabilities=desired_capabilities)
driver.get("https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/scores_overall/sort_order/asc/cols/scores")
driver.implicitly_wait(1)
table = driver.find_element_by_xpath('//*[@id="datatable-1"]')
print(table.text)

Now, if you want to separate the contents of the table, just use the find_element_by_xxx functions and select the other parts by xpath.
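For readers on a newer Selenium, here is a rough sketch of the same idea. As far as I know, recent Selenium 4 releases no longer provide `find_element_by_xpath` or the `desired_capabilities` argument, so this version uses `find_element(...)` and `options=` instead, and swaps the implicit wait for an explicit `WebDriverWait`. The names `get_table_text`, `main`, `TABLE_XPATH` and `URL` are made up for illustration, the 10-second timeout is arbitrary, and waiting for a `td` inside the table is only my assumption about when the data has finished rendering.

```python
TABLE_XPATH = '//*[@id="datatable-1"]'
URL = (
    "https://www.timeshighereducation.com/world-university-rankings/2021/"
    "world-ranking#!/page/0/length/25/sort_by/scores_overall/"
    "sort_order/asc/cols/scores"
)


def get_table_text(driver, xpath: str = TABLE_XPATH) -> str:
    """Return the rendered text of the first element matching xpath."""
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    return driver.find_element("xpath", xpath).text


def main() -> None:
    # Selenium is imported here so get_table_text can be exercised without it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # Assumption: chromedriver can be located (on PATH or by Selenium itself).
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(URL)
        # Wait up to 10 seconds (arbitrary) for at least one data cell to exist.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, TABLE_XPATH + "//td"))
        )
        print(get_table_text(driver))
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `main()` should print the table as one block of text if the page still uses the `datatable-1` id; if the wait times out, the id or the page structure has probably changed.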

慕慕森

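If you need to iterate over the results, you should select elements rather than element: find_element_by_xpath returns a single element, which cannot be looped over. Change your code from:

for table in driver.find_element_by_xpath('//*[contains(@id,"datatable-1")]//tr'):

to:

for table in driver.find_elements_by_xpath('//*[contains(@id,"datatable-1")]//tr'):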
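As a small sketch of that fix, the loop can be wrapped in a helper that returns every row as a list of cell texts. It uses the generic `find_elements(by, value)` call, which I believe is available in both Selenium 3 and 4, rather than `find_elements_by_xpath`. The name `table_to_rows` and the two XPath constants are only illustrative; the XPath expressions themselves are taken from the question.

```python
from typing import List

ROWS_XPATH = '//*[contains(@id,"datatable-1")]//tr'
CELLS_XPATH = ".//*[self::td or self::th]"


def table_to_rows(driver) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell texts."""
    # find_elements (plural) returns a list of matches, so it can be iterated.
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    rows = driver.find_elements("xpath", ROWS_XPATH)
    return [
        # The leading "." keeps the cell search relative to the current row.
        [cell.text for cell in row.find_elements("xpath", CELLS_XPATH)]
        for row in rows
    ]
```

Pass in a driver that has already loaded the page, for example `for data in table_to_rows(driver): print(data)`. If the table has not rendered yet, `find_elements` simply returns an empty list instead of raising.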