由于scrapy,我正在尝试抓取lacentrale.fr 的网站,但即使我轮换了我的用户代理和IP 地址(感谢TOR),该网站也会检测到我的机器人并向我发送错误值。请您检查我在中间件和设置中使用的代码,并告诉我是否出现问题。
中间件中的代码:
from tutorial.settings import * #USER_AGENT_LIST
import random
from stem.control import Controller
from toripchanger import TorIpChanger
from stem import Signal
class RandomUserAgentMiddleware(object):
def process_request(self, request, spider):
ua = random.choice(USER_AGENT_LIST)
if ua:
request.headers.setdefault('User-Agent', ua)
def _set_new_ip():
with Controller.from_port(port=9051) as controller:
controller.authenticate(password='')
controller.signal(Signal.NEWNYM)
ip_changer = TorIpChanger(reuse_threshold=10)
class ProxyMiddleware(object):
_requests_count = 0
def process_request(self, request, spider):
self._requests_count += 1
if self._requests_count > 10:
self._requests_count = 0
ip_changer.get_new_ip()
print("New Tor connection processed")
request.meta['proxy'] = 'http://127.0.0.1:8118'
spider.log('Proxy : %s' % request.meta['proxy'])
设置中使用的代码:
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
'tutorial.middlewares.RandomUserAgentMiddleware': 400,
'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware':None, # to avoid the raise IOError, 'Not a gzipped file' exceptions.IOError: Not a gzipped file
'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
'tutorial.middlewares.ProxyMiddleware': 100
}
摇曳的蔷薇
相关分类