
Can't get rid of some errors raised by process_exception

I'm trying to suppress (not show) some of the errors thrown from within RetryMiddleware in Scrapy, namely the errors the script runs into once the maximum retry limit has been exceeded. I'm using proxies inside the middleware. The odd thing is that the exceptions the script throws are already in the EXCEPTIONS_TO_RETRY list. It's perfectly fine that the script may sometimes exceed the maximum number of retries without succeeding; I simply don't want to see that error even when it occurs, meaning I want to suppress or bypass it.


The error looks like this:


Traceback (most recent call last):
  File "middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

Here is what process_exception inside RetryMiddleware looks like:


from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError


class RetryMiddleware(object):
    cus_retry = 3

    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, TunnelError, ResponseFailed)

    def process_exception(self, request, exception, spider):
        # Only retry the exceptions listed above, and honour dont_retry.
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.meta['proxy'] = f'https://{ip:port}'  # placeholder proxy address
            r.dont_filter = True
            return r
        else:
            print("done retrying")

How can I get rid of the errors that are listed in EXCEPTIONS_TO_RETRY?


PS: No matter which site I pick, the script runs into that error when the maximum retry limit is reached.


Asked by 慕莱坞森 · 1599 views

3 Answers

缥缈止盈

Try fixing the code of the scraper itself. Sometimes a faulty parse function can cause the kind of error you are describing. Once I fixed my code, the error went away.
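To illustrate that point with a hypothetical sketch (the h1::text selector is made up): a parse callback that assumes a selector always matches will raise whenever it does not, and Scrapy then logs the resulting traceback; guarding the extraction keeps the callback from failing.

def parse(self, response):
    title = response.css("h1::text").get()  # may be None if nothing matches
    if title is None:
        return  # nothing to yield; avoids AttributeError on title.strip()
    yield {"title": title.strip()}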

白衣染霜花

When the maximum number of retries is reached, a method like parse_error() should handle any error that ends up in the spider:

def start_requests(self):
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, errback=self.parse_error, callback=self.parse, dont_filter=True)

def parse_error(self, failure):
    # print(repr(failure))
    pass

However, I would like to suggest a completely different approach here. If you take the route below, you do not need any custom middleware at all. Everything, including the retry logic, already lives in the spider.

class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "some url",
    ]

    proxies = []  # list of proxies here
    max_retries = 5
    retry_urls = {}

    def parse_error(self, failure):
        proxy = f'https://{ip:port}'
        retry_url = failure.request.url
        if retry_url not in self.retry_urls:
            self.retry_urls[retry_url] = 1
        else:
            self.retry_urls[retry_url] += 1

        if self.retry_urls[retry_url] <= self.max_retries:
            yield scrapy.Request(retry_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)
        else:
            print("gave up retrying")

    def start_requests(self):
        for start_url in self.start_urls:
            proxy = f'https://{ip:port}'
            yield scrapy.Request(start_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)

    def parse(self, response):
        for item in response.css().getall():  # selector left out in the original answer
            print(item)

Don't forget to add the following lines to the spider so the suggestion above works as intended (this disables the built-in RetryMiddleware):

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
}

By the way, I'm using Scrapy 2.3.0.
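As a small follow-up to the errback idea above: if the goal is only to silence the specific network errors from the question while still seeing anything unexpected, the errback can inspect the wrapped exception with Twisted's Failure.check(). This is just a minimal sketch, with the exception list mirroring part of EXCEPTIONS_TO_RETRY from the question:

from twisted.internet.error import TimeoutError, DNSLookupError, TCPTimedOutError

def parse_error(self, failure):
    # Failure.check() returns the matching exception class when the wrapped
    # exception is one of the given types, and None otherwise.
    if failure.check(TimeoutError, DNSLookupError, TCPTimedOutError):
        return  # an expected network error: swallow it silently
    self.logger.error(repr(failure))  # anything else still gets logged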

慕运维8079593

Maybe the problem is not on your side; the third-party website may be the one having trouble. There might be a connection error on their servers, or the site might be protected so that nobody can access it. The error message itself points at the connected party failing to respond, so first check whether the third-party site works normally when you make the request, and try contacting them if you can. As the error says, the fault is on the other party's side, not yours.