Selenium with Scrapy for dynamic pages

I'm trying to scrape product information from a website using Scrapy. The site I'm scraping works like this:

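It starts with a product list page containing 10 products
Clicking the "Next" button loads the next 10 products (the URL does not change between the two pages)
I use LinkExtractor to follow each product link to its product page and collect all the information I need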

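I tried to replicate the Ajax call made by the Next button but could not get it to work, so I'm giving Selenium a try. I can run Selenium's WebDriver in a separate script, but I don't know how to integrate it with Scrapy. Should the Selenium part go inside my Scrapy spider?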

My spider is pretty standard, as follows:

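import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        # Follow every product link in the list and hand each product page to parse_product
        Rule(LinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=logging.INFO)
        # response.xpath(...) replaces the removed HtmlXPathSelector(response)
        # actual data follows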
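
The pagination described above (click "Next", same URL, 10 more products) could be sketched roughly as below. This is only an illustration under assumptions, not a confirmed solution: `iter_page_sources`, `next_xpath`, `max_pages` and `pause` are hypothetical names, the driver is assumed to have already loaded the first list page, and the "Next" button is assumed to disappear on the last page.

```python
import time


def iter_page_sources(driver, next_xpath, max_pages=50, pause=1.0):
    """Yield the HTML of each list page, clicking "Next" between pages.

    `driver` is expected to behave like a Selenium WebDriver that has
    already opened the first product list page.
    """
    for _ in range(max_pages):
        # Snapshot of the DOM as currently rendered by the browser
        yield driver.page_source

        # "xpath" is the string value of Selenium's By.XPATH locator strategy.
        # find_elements returns an empty list when nothing matches.
        buttons = driver.find_elements("xpath", next_xpath)
        if not buttons:
            break  # assumption: no "Next" button means this was the last page

        buttons[0].click()
        # Crude fixed pause for the Ajax update; an explicit wait on a
        # changed element is usually more robust than sleeping.
        time.sleep(pause)
```

Inside a spider callback, each yielded HTML string could then be wrapped in `scrapy.Selector(text=html)`, the product links extracted with the same XPath as in the rule above, and one `scrapy.Request(url, callback=self.parse_product)` yielded per link. Keeping the click loop in a small helper like this means the driver can be created and quit in one place, and the loop can be exercised with a stand-in driver object.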

Any ideas are appreciated. Thank you!

慕斯709654

2 answers