Why does my spider_idle / on-demand / URL-feeding setup seem to gradually shut down?

I connected a signal handler to spider_idle so that it feeds the spider another batch of URLs. This seems to work fine at first, but then the Crawled (200)... messages appear less and less often and eventually stop showing up altogether. I have 115 test URLs to hand out, yet Scrapy reports Crawled 38 pages.... The spider code is shown below.


Generally speaking, I'm implementing a two-stage crawl: the first pass only downloads URLs into a urls.jl file, and the second pass does the actual scraping of those URLs. I'm currently working on the second spider.
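For reference, the first pass is conceptually something like the sketch below. The spider name, start URL and link selection are simplified placeholders, and it assumes Scrapy 2.1+ for the FEEDS setting; it only collects links and exports them as JSON lines, so urls.jl ends up with one {"URL": "..."} object per line, which is the format the second spider reads back.

import scrapy


class FirstPassSpider(scrapy.Spider):
    # Hypothetical first-pass spider: it only harvests links and yields
    # them as items; the jsonlines feed export writes them to urls.jl
    # as one {"URL": "..."} object per line.
    name = '1st_example_com'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/']
    custom_settings = {
        'FEEDS': {'urls.jl': {'format': 'jsonlines'}},
    }

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield {'URL': response.urljoin(href)}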


import json
import scrapy
import logging
from scrapy import signals
from scrapy.http.request import Request
from scrapy.exceptions import DontCloseSpider


class A2ndexample_comSpider(scrapy.Spider):
    name = '2nd_example_com'
    allowed_domains = ['www.example.com']

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        # read from file
        self.urls = []
        with open('urls.jl', 'r') as f:
            for line in f:
                self.urls.append(json.loads(line))
        # How many urls to return from start_requests()
        self.batch_size = 5

    def start_requests(self):
        for i in range(self.batch_size):
            if 0 == len(self.urls):
                return
            url = self.urls.pop(0)
            yield Request(url["URL"])

    def idle_consume(self):
        # Everytime spider is about to close check our urls
        # buffer if we have something left to crawl
        reqs = self.start_requests()
        if not reqs:
            return
        logging.info('Consuming batch... [left: %d])' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider

I expected the spider to crawl all 115 URLs, not just 38. Besides, if it no longer wants to crawl and the signal-handler function did not raise DontCloseSpider, shouldn't it at least have closed then?

