Why do I get a 429 response in Scrapy even though the request count is only 1?

I am scraping a website with Scrapy, but I get a 429 response.


Here is the output log:


2020-06-06 21:39:45 [scrapy.core.engine] INFO: Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-06 21:39:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO:scrapy.extensions.telnet:Telnet console listening on 127.0.0.1:6023
2020-06-06 21:39:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
DEBUG:scrapy.core.engine:Crawled (429) <GET https://www.realestate.com.au/rent/in-aspendale+gardens,+vic+3195/list-1> (referer: None)
2020-06-06 21:39:46 [scrapy.core.engine] DEBUG: Crawled (429) <GET https://www.realestate.com.au/rent/in-aspendale+gardens,+vic+3195/list-1> (referer: None)
INFO:scrapy.spidermiddlewares.httperror:Ignoring response <429 https://www.realestate.com.au/rent/in-aspendale+gardens,+vic+3195/list-1>: HTTP status code is not handled or not allowed
2020-06-06 21:39:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 https://www.realestate.com.au/rent/in-aspendale+gardens,+vic+3195/list-1>: HTTP status code is not handled or not allowed
INFO:scrapy.core.engine:Closing spider (finished)
2020-06-06 21:39:46 [scrapy.core.engine] INFO: Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 343,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2030,
 'downloader/response_count': 1,
 'downloader/response_status_count/429': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 6, 6, 11, 39, 46, 255540),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/429': 1,
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'memusage/max': 50941952,
 'memusage/startup': 50941952,
 'response_received_count': 1,
 'scheduler/dequeued': 1,

As you can see, downloader/request_count is only 1.


暮色呼如
1 answer

斯蒂芬大帝

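Status code 429 means Too Many Requests. The downloader's request count is 1 because Scrapy sent exactly one request and the server rejected that request with 429; the spider middleware then ignored the response, as the log shows. The site appears to return 429 to any request it takes for a bot, not only to clients that really exceed a rate limit.

When I experimented, it rejected me because the Cookie header was missing. That cookie is set by the Set-Cookie header in the response to the initial GET request.

Here are some things to try, keeping Selenium as the last resort, as in any scraping project. Try sending a full set of browser headers like the ones below, together with COOKIES_ENABLED = True.

Host: www.realestate.com.au
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://duckduckgo.com/
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Pragma: no-cache
Cache-Control: no-cache
TE: Trailers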
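
For a Scrapy project, the suggestion of sending browser-like headers with cookies enabled could be expressed in settings.py roughly as follows. This is only a sketch: COOKIES_ENABLED, USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings, but which headers to include is my assumption. Host, Connection, Accept-Encoding and TE are left out here on the assumption that Scrapy and its HTTP client fill in what they need themselves, and it is not confirmed that this alone gets past the site's bot check.

```python
# settings.py (fragment): a sketch, not a verified fix for this site.

# Keep the cookie middleware on so cookies received via Set-Cookie are
# sent back on later requests. True is already Scrapy's default; it is
# spelled out here in case the project turned it off.
COOKIES_ENABLED = True

# Browser User-Agent string taken from the header list in the answer.
USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) "
    "Gecko/20100101 Firefox/77.0"
)

# Default headers added to every request that does not set them itself.
# Host, Connection, Accept-Encoding and TE are deliberately omitted
# (assumption: Scrapy and its HTTP client manage those).
DEFAULT_REQUEST_HEADERS = {
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://duckduckgo.com/",
    "Upgrade-Insecure-Requests": "1",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
}
```

The same three names can also go into a spider's custom_settings dict if only one spider should use them.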