Can't get rid of some errors thrown from process_exception

I'm trying not to show/catch some of the errors thrown from process_exception within my custom RetryMiddleware in Scrapy - specifically, the error the script runs into when it exceeds the maximum retry limit. I use proxies inside the middleware. The weird thing is that the exception the script throws is already in the EXCEPTIONS_TO_RETRY list. It is perfectly fine that the script sometimes exceeds the maximum number of retries without success; I just don't want to see that error even when it occurs, meaning I want to suppress or bypass it.


The error looks like this:


Traceback (most recent call last):
  File "middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

Here is what process_exception inside my RetryMiddleware looks like:


from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError


class RetryMiddleware(object):
    cus_retry = 3
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, TunnelError,
                           ResponseFailed)

    def process_exception(self, request, exception, spider):
        # Retry only the exception types listed above, unless retrying has
        # been disabled for this particular request.
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            # Schedule another attempt through a fresh proxy.
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.meta['proxy'] = f'https://{ip:port}'  # placeholder proxy address
            r.dont_filter = True
            return r
        else:
            print("done retrying")

How can I get rid of the errors for the exceptions listed in EXCEPTIONS_TO_RETRY?


PS: The script runs into this error whenever it hits the maximum retry limit, no matter which site I scrape.
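
For context on why the traceback still shows up: Scrapy's process_exception() hook on a downloader middleware may return None, a Response object, or a Request object. Once the retries are used up, _retry() above returns None, so the exception keeps propagating and ends up being logged. Below is a rough, untested sketch of one way to swallow it by handing back a synthetic empty response once retrying is over; the SilentRetryMiddleware name, the 204 status, and the empty body are illustrative choices rather than anything from the question.

from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError
from scrapy.http import HtmlResponse


class SilentRetryMiddleware(object):
    # Same retry behaviour as the middleware above; only the "give up" branch
    # differs. Proxy rotation is omitted here for brevity.
    cus_retry = 3
    EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                           ConnectionRefusedError, ConnectionDone, ConnectError,
                           ConnectionLost, TCPTimedOutError, TunnelError,
                           ResponseFailed)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            return self._retry(request, exception, spider)

    def _retry(self, request, reason, spider):
        retries = request.meta.get('cus_retry', 0) + 1
        if retries <= self.cus_retry:
            r = request.copy()
            r.meta['cus_retry'] = retries
            r.dont_filter = True
            return r
        # Retries exhausted: returning a Response here (instead of None) tells
        # Scrapy the exception has been handled, so nothing gets logged except
        # this optional DEBUG message.
        spider.logger.debug("gave up retrying %s (%r)", request.url, reason)
        return HtmlResponse(url=request.url, status=204, request=request,
                            body=b"", encoding="utf-8")

The trade-off is that the spider callback will then be called with that empty 204 response for URLs that never succeeded, so parse() may need a guard for that case.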


慕莱坞森
1599 views

3 Answers

缥缈止盈

Try fixing the code of the scraper itself. Sometimes a faulty parse callback can cause the kind of error you're describing. Once I fixed my code, the error disappeared.

白衣染霜花

When the maximum number of retries is reached, an errback like parse_error() in the spider should handle whatever error is left:

def start_requests(self):
    for start_url in self.start_urls:
        yield scrapy.Request(start_url, errback=self.parse_error, callback=self.parse, dont_filter=True)

def parse_error(self, failure):
    # print(repr(failure))
    pass

However, I'd like to suggest a completely different approach here. If you go the following route, you don't need any custom middleware at all. Everything, including the retry logic, already lives in the spider.

import scrapy


class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "some url",
    ]

    proxies = []  # list of proxies here
    max_retries = 5
    retry_urls = {}

    def parse_error(self, failure):
        proxy = f'https://{ip:port}'
        retry_url = failure.request.url
        if retry_url not in self.retry_urls:
            self.retry_urls[retry_url] = 1
        else:
            self.retry_urls[retry_url] += 1

        if self.retry_urls[retry_url] <= self.max_retries:
            yield scrapy.Request(retry_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)
        else:
            print("gave up retrying")

    def start_requests(self):
        for start_url in self.start_urls:
            proxy = f'https://{ip:port}'
            yield scrapy.Request(start_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)

    def parse(self, response):
        for item in response.css().getall():
            print(item)

Don't forget to add the following lines to the spider to get the result described above:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    }
}

By the way, I'm using Scrapy 2.3.0.
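
If the goal is only to keep the expected network errors quiet while still surfacing anything unexpected, the errback can also filter on the failure type with failure.check(). A small sketch along those lines, reusing the exception tuple from the question (the EXPECTED_ERRORS name and the log messages are made up for illustration; the rest of the spider would stay as above):

import scrapy
from twisted.internet import defer
from twisted.internet.error import TimeoutError, DNSLookupError, \
    ConnectionRefusedError, ConnectionDone, ConnectError, \
    ConnectionLost, TCPTimedOutError
from twisted.web.client import ResponseFailed
from scrapy.core.downloader.handlers.http11 import TunnelError

EXPECTED_ERRORS = (defer.TimeoutError, TimeoutError, DNSLookupError,
                   ConnectionRefusedError, ConnectionDone, ConnectError,
                   ConnectionLost, TCPTimedOutError, TunnelError,
                   ResponseFailed)


class mySpider(scrapy.Spider):
    name = "myspider"

    def parse_error(self, failure):
        if failure.check(*EXPECTED_ERRORS):
            # An expected network error: keep it out of the logs
            # (or downgrade it to a DEBUG message like this one).
            self.logger.debug("ignoring %r for %s",
                              failure.value, failure.request.url)
        else:
            # Anything unexpected is still worth seeing in full.
            self.logger.error(repr(failure))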

慕运维8079593

Maybe the problem isn't on your side at all; the third-party site may be having issues. There could be connection errors on their server, or it may be locked down so that nobody can reach it. The error itself indicates that the other party's host has failed to respond, so first check whether the third-party site is actually working at the time of the request, and try contacting them if you can. As the error message says, the fault is on their end, not yours.
