I set up a signal on spider_idle to feed the spider another batch of URLs. This seems to work fine at first, but then the Crawled (200)... messages show up less and less often and eventually stop appearing altogether. I have 115 test URLs to hand out, yet Scrapy reports Crawled 38 pages.... The spider code and the Scrapy log are below.
Generally speaking, I am implementing a two-stage crawl: the first pass only downloads the URLs into a urls.jl file, and the second pass scrapes those URLs. I am currently working on the code for the second spider.
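For context, the first-stage spider is not shown here; a minimal sketch of what it does (the name, start URL and link-extraction rule below are placeholders, not my real selectors) is just to emit every discovered URL as an item with a single URL field:

import scrapy

class FirstPassSpider(scrapy.Spider):
    # Placeholder first-stage spider: collects URLs only, no scraping.
    name = '1st_example_com'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # Follow in-site links and emit each one as {"URL": ...},
        # which is the line format the second spider reads from urls.jl.
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            yield {'URL': url}
            yield scrapy.Request(url, callback=self.parse)

Run with scrapy crawl 1st_example_com -o urls.jl, so each line of urls.jl is one JSON object such as {"URL": "https://www.example.com/..."}.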
import json
import scrapy
import logging
from scrapy import signals
from scrapy.http.request import Request
from scrapy.exceptions import DontCloseSpider


class A2ndexample_comSpider(scrapy.Spider):
    name = '2nd_example_com'
    allowed_domains = ['www.example.com']

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        # read from file
        self.urls = []
        with open('urls.jl', 'r') as f:
            for line in f:
                self.urls.append(json.loads(line))
        # How many urls to return from start_requests()
        self.batch_size = 5

    def start_requests(self):
        for i in range(self.batch_size):
            if 0 == len(self.urls):
                return
            url = self.urls.pop(0)
            yield Request(url["URL"])

    def idle_consume(self):
        # Everytime spider is about to close check our urls
        # buffer if we have something left to crawl
        reqs = self.start_requests()
        if not reqs:
            return
        logging.info('Consuming batch... [left: %d])' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
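In case the input format matters: urls.jl is just 115 JSON lines, each with a single URL key. A throwaway script like this (the page paths are made up) produces an equivalent test file:

import json

# Hypothetical generator for a 115-line test file in the format the
# spider above expects: one JSON object with a "URL" key per line.
with open('urls.jl', 'w') as f:
    for i in range(115):
        f.write(json.dumps({'URL': 'https://www.example.com/page/%d' % i}) + '\n')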
I expected the spider to crawl all 115 URLs, not just 38. Also, if it no longer wants to crawl and the signal-handler function does not raise DontCloseSpider, shouldn't it at least shut down then?