1 Introduction
A while back I read quite a few blog posts on the design of distributed crawler systems, and I wanted to build one myself for practice. Douban Movies, which everyone enjoys browsing, makes a good test target. The overall structure of the code is shown in the figure below.
(Figure: distributed architecture diagram)
Before starting, you should be familiar with:
Redis: basic installation and usage (the Python redis library)
MongoDB: basic installation and usage (the Python mongoengine library)
The RabbitMQ message queue: basic installation and usage (the Python pika library)
The Linux screen command — extremely handy for managing VPS sessions
The server side is developed in Python 3.
The crawler clients are built on Python 3 and Scrapy.
Before development I looked into the page format of Douban's movie category:
https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=电影&start=7100
start ranges from 0 to 9979 and is the offset of the first record returned; each request returns 20 records, and there are 10,000 movies in all. The response looks like this:
(Figure: sample JSON returned by the list API)
The URLs in the response data are then filtered through a Bloom filter and stored into our new task queue. Ideally, after roughly 10,500 requests (500 list pages plus 10,000 detail pages) the database would hold all 10,000 movie records. In practice I got 9,982, plus 18 that returned 404 after being censored. Douban's anti-crawling is genuinely tough: two machines took more than a day to finish. The speed problem is discussed later; the main causes were that I had no stable IP pool (the free ones are unreliable), too few clients, and Douban's strict policy of answering 302 or 403 as soon as one IP's request rate climbs too high. As an aside, no distribution was needed when crawling small adult-video sites: a single machine scraped about 130,000 title codes in one day.
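Since each list request returns 20 records and the catalogue holds 10,000 movies, the full set of start offsets can be enumerated up front. A small sketch of that arithmetic (the final offset is capped at 9979, the last value the API accepts):

```python
def list_start_offsets(total=10000, page=20, last_valid=9979):
    """Enumerate the `start` offsets for the list API: 0, 20, ..., 9960, then 9979."""
    offsets = list(range(0, total - page, page))  # 0 .. 9960
    offsets.append(last_valid)                    # start=9980 is past the end, so use 9979
    return offsets

offsets = list_start_offsets()
print(len(offsets))  # 500 list requests in total
```

Five hundred list pages plus one detail request per movie is where the 10,500-request figure comes from.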
2 Code walkthrough
The crawl hinges on the following pieces:
1. Redis (task queue management and the Bloom filter)
2. MongoDB (movie data storage and tracking of unfinished tasks)
3. RabbitMQ (RPC communication between the crawler clients and the server, for issuing and completing tasks)
2.1 Redis: task management and deduplication
I split the URL tasks into two priorities, a and b, with a > b. The seed URLs (Douban's movie list pages) live in the arank Redis set; the movie-detail URLs scraped from those lists are deduplicated through a Bloom filter and stored in the brank Redis set.
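The selection order (drain arank first, then fall back to brank, handing out at most five URLs per call) can be sketched without Redis. The in-memory sets below are hypothetical stand-ins for the real `r.spop` calls:

```python
def pop_tasks(arank, brank, limit=5):
    """Pop up to `limit` URLs, preferring the high-priority set."""
    source = arank if arank else brank  # arank wins whenever it is non-empty
    out = []
    for _ in range(limit):
        if not source:
            break
        out.append(source.pop())
    return out

# usage: high-priority list pages are handed out before detail pages
a = {"list1", "list2"}
b = {"detail1", "detail2", "detail3"}
print(pop_tasks(a, b))  # the two arank URLs; brank is untouched
```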
The code is roughly as follows:
redis_controller.py
```python
# encoding=utf-8
import datetime
import traceback
from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10
from hashlib import md5

import redis

from sql_model import monogo_controller

# connect to Redis
pool = redis.ConnectionPool(host='localhost', port=6379, decode_responses=True)
arank_str = 'arank'
brank_str = 'brank'
url_limit = 5
r = redis.Redis(connection_pool=pool)


def get_out_urls():
    '''
    Pop task URLs for a crawler client, at most five.
    '''
    arank_data_len = r.scard(arank_str)
    outdata = []
    # look for tasks in the arank set first
    if arank_data_len > 0:
        for i in range(url_limit):
            popdata = r.spop(arank_str)
            if popdata is not None:
                outdata.append(popdata)
                try:
                    # record the task as in-progress in MongoDB;
                    # it is deleted once the client reports completion
                    monogo_controller.TempJob(_id=popdata, work_start=datetime.datetime.now()).save(force_insert=True)
                except:
                    traceback.print_exc()
            else:
                break
    # no arank tasks: fall back to the brank set
    elif r.scard(brank_str) > 0:
        for i in range(url_limit):
            popdata = r.spop(brank_str)
            if popdata is not None:
                outdata.append(popdata)
                try:
                    monogo_controller.TempJob(_id=popdata, work_start=datetime.datetime.now()).save(force_insert=True)
                except:
                    traceback.print_exc()
            else:
                break
    # both sets empty: reclaim MongoDB tasks pending for over an hour
    else:
        for mogoi in range(5):
            timejobs = monogo_controller.TempJob.objects(
                work_start__lt=(datetime.datetime.now() - datetime.timedelta(hours=1))
            ).limit(1).modify(work_start=datetime.datetime.now())
            if not timejobs:
                break
            outdata.append(timejobs._id)
    return outdata


def puturl(rank, urls):
    '''
    Store tasks in Redis.
    :param rank: task priority
    :param urls: links
    '''
    assert isinstance(urls, Iterable)
    for url in urls:
        # deduplicate through the Bloom filter
        if not bf.isContains(url.encode()):
            bf.insert(url.encode())
            if rank == arank_str:
                r.sadd(arank_str, url)
            else:
                r.sadd(brank_str, url)


def puturl_safe(rank, urls):
    '''
    Put tasks into Redis without the Bloom filter;
    used for seeding the initial URLs by hand.
    '''
    assert isinstance(urls, Iterable)
    for url in urls:
        if rank == arank_str:
            r.sadd(arank_str, url)
        else:
            r.sadd(brank_str, url)


class SimpleHash(object):
    '''
    Hash function used by the Bloom filter (found online).
    '''
    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        ret = 0
        for i in range(len(value)):
            ret += self.seed * ret + ord(value[i])
        return (self.cap - 1) & ret


class BloomFilter(object):
    def __init__(self, blockNum=1, key='doubanbloomfilter'):
        """
        Initialize the Bloom filter.
        :param blockNum: one blockNum for about 90,000,000 strings; increase it if you need to filter more
        :param key: the key's name in Redis
        """
        self.server = r
        self.bit_size = 1 << 31  # Redis strings max out at 512 MB; this uses 256 MB
        self.seeds = [5, 7, 11, 13, 31, 37, 61]
        self.key = key
        self.blockNum = blockNum
        self.hashfunc = []
        for seed in self.seeds:
            self.hashfunc.append(SimpleHash(self.bit_size, seed))

    def isContains(self, str_input):
        '''
        Whether str_input has been seen before.
        '''
        if not str_input:
            return False
        m5 = md5()
        m5.update(str_input)
        str_input = m5.hexdigest()
        ret = True
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            ret = ret & self.server.getbit(name, loc)
        return ret

    def insert(self, str_input):
        '''
        Set the hashed bit positions for str_input in the Redis bitmap.
        '''
        m5 = md5()
        m5.update(str_input)
        str_input = m5.hexdigest()
        name = self.key + str(int(str_input[0:2], 16) % self.blockNum)
        for f in self.hashfunc:
            loc = f.hash(str_input)
            self.server.setbit(name, loc, 1)


bf = BloomFilter()
```
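To see the Bloom-filter mechanics without a Redis server, the same seeded-hash scheme can be run against a local bytearray. This is an illustrative stand-in, with the `getbit`/`setbit` calls replaced by plain bit arithmetic:

```python
from hashlib import md5

class LocalBloom:
    """In-memory Bloom filter using the same md5 + seeded polynomial hash scheme."""
    def __init__(self, bit_size=1 << 20, seeds=(5, 7, 11, 13, 31, 37, 61)):
        self.bit_size = bit_size
        self.seeds = seeds
        self.bits = bytearray(bit_size // 8)

    def _positions(self, raw):
        digest = md5(raw).hexdigest()
        for seed in self.seeds:
            ret = 0
            for ch in digest:  # same recurrence as SimpleHash.hash
                ret += seed * ret + ord(ch)
            yield (self.bit_size - 1) & ret

    def insert(self, raw):
        for loc in self._positions(raw):
            self.bits[loc // 8] |= 1 << (loc % 8)

    def contains(self, raw):
        return all(self.bits[loc // 8] >> (loc % 8) & 1 for loc in self._positions(raw))

bf_demo = LocalBloom()
bf_demo.insert(b"subject/1292052/")
print(bf_demo.contains(b"subject/1292052/"))  # True
print(bf_demo.contains(b"subject/999999/"))   # False (with overwhelming probability)
```

Membership answers are probabilistic: inserted items are always reported present, while unseen items have a tiny chance of a false positive, which is acceptable for crawl deduplication.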
2.2 MongoDB (movie data storage and tracking of unfinished tasks)
The database is MongoDB, managed through the Python mongoengine ORM library. The data classes are as follows:
```python
# movie data models
import datetime

import mongoengine
from mongoengine import StringField, DateTimeField, ListField, LongField, FloatField, IntField

# connect to MongoDB
mongoengine.connect('douban', username='pig', password='pig@123456', authentication_source="admin")


class TempJob(mongoengine.Document):
    '''
    ORM class for an unfinished task.
    '''
    _id = StringField(required=True, unique=True, primary_key=True)
    # pass the callable, not now(): otherwise the default is frozen at import time
    work_start = DateTimeField(required=True, default=datetime.datetime.now)


class MoiveDataModel(mongoengine.Document):
    '''
    Movie data class.
    '''
    director = ListField(StringField())
    douban_id = LongField(unique=True, primary_key=True, required=True)
    tags = ListField(StringField())
    stars = ListField(StringField())
    desc = StringField(required=True)
    douban_remark = FloatField()
    imdb_tag = StringField()
    contry = StringField()
    language = StringField()
    publictime = DateTimeField()
    runtime = IntField()
    votes = IntField()
    title = StringField(required=True)


def delete(urls):
    # remove tasks once they are reported finished
    for url in urls:
        TempJob.objects(_id=url).delete()
```
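The TempJob leasing idea — stamp a task's start time when it is handed out, then reclaim it if it has not been finished within an hour — is independent of MongoDB. A small pure-Python sketch of the same pattern, with a dict standing in for the TempJob collection (names here are illustrative):

```python
import datetime

def reclaim_stale(jobs, now=None, max_age=datetime.timedelta(hours=1), limit=5):
    """Return up to `limit` job ids whose lease has expired, renewing their lease.

    `jobs` maps job id -> work_start datetime (a stand-in for the TempJob collection).
    """
    now = now or datetime.datetime.now()
    stale = [jid for jid, started in jobs.items() if now - started > max_age]
    reclaimed = stale[:limit]
    for jid in reclaimed:
        jobs[jid] = now  # renew the lease, mirroring .modify(work_start=now)
    return reclaimed

now = datetime.datetime(2018, 3, 1, 12, 0)
jobs = {
    "subject/1/": now - datetime.timedelta(hours=2),    # lease expired
    "subject/2/": now - datetime.timedelta(minutes=10)  # still leased
}
print(reclaim_stale(jobs, now=now))  # ['subject/1/']
```

Renewing the lease on reclaim is what stops two clients from both picking up the same stale task in quick succession.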
2.3 RabbitMQ: RPC between the crawler clients and the main server
Communication is RPC over RabbitMQ. The rough flow: a crawler client requests task dispatch from the server via RPC, at the same time reporting which tasks it has finished and the data objects it scraped; on receiving the request, the server stores the data where it belongs, fetches the next batch of tasks from Redis and MongoDB, and returns it to the client.
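Concretely, the JSON payloads carried over the queue look like this. Field names are the ones the handlers use; the values are made-up examples:

```python
import json

# a client -> server report: finished tasks, newly discovered URLs, scraped records
request = {
    "done": ["j/new_search_subjects?sort=T&range=0,10&tags=电影&start=0"],
    "rankstr": "brank",
    "new_urls": ["subject/1292052/", "subject/1291546/"],  # example detail paths
    "result_map": [],  # movie dicts whose keys match MoiveDataModel's fields
}

# the server's reply: a success flag plus the next batch of task URLs
response = {"isok": True, "ans": ["subject/1292052/", "subject/1291546/"]}

# both sides ship these as UTF-8 JSON bodies over RabbitMQ
wire = json.dumps(request)
print(json.loads(wire)["rankstr"])  # brank
```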
The RPC server code is as follows:
```python
import json
import traceback

import pika

from main_server_side import redis_controller
from sql_model import monogo_controller
from sql_model.monogo_controller import MoiveDataModel

# connect to RabbitMQ and make sure the queue exists
cred = pika.PlainCredentials(username='pig', password='pig123')
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='xx.xxx.xxx.xxx', credentials=cred))
channel = connection.channel()
channel.queue_declare(queue='rpc_queue_douban')


def on_request(ch, method, props, body):
    '''
    Callback invoked on each client request.
    '''
    try:
        print("send_data")
        jsondata = json.loads(body.decode())
        print(jsondata)
        done_urls = jsondata.get("done")
        rankstr = jsondata.get('rankstr')
        rankurls = jsondata.get("new_urls")
        if done_urls is not None:
            print("del done_urls")
            print(done_urls)
            monogo_controller.delete(done_urls)
        if rankurls is not None:
            redis_controller.puturl(rankstr, rankurls)
        response = redis_controller.get_out_urls()
        print("response is :")
        print(response)
        ch.basic_publish(exchange='',
                         routing_key=props.reply_to,
                         properties=pika.BasicProperties(
                             correlation_id=props.correlation_id,
                             content_type='application/json',
                             content_encoding='utf-8'),
                         body=json.dumps({"isok": True, "ans": response}))
        ch.basic_ack(delivery_tag=method.delivery_tag)
        result_map = jsondata.get("result_map")
        if result_map is not None:
            for mogodata in result_map:
                try:
                    print(type(mogodata))
                    MoiveDataModel(**mogodata).save()
                except:
                    traceback.print_exc()
    except Exception as e:
        traceback.print_exc()


# handle one request at a time (single-threaded)
channel.basic_qos(prefetch_count=1)
# listen on rpc_queue_douban (pika 0.x callback-first signature)
channel.basic_consume(on_request, queue='rpc_queue_douban')
print(" Awaiting DOUBAN RPC requests")
# wait for requests
channel.start_consuming()
```
The matching RPC client looks like this:
```python
#!/usr/bin/env python
# encoding=utf-8
import json
import uuid

import pika


class RPCClient(object):
    def __init__(self):
        self.credentials = pika.PlainCredentials('pig', 'pig123')
        self.connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='xx.xx.xx.xx', credentials=self.credentials))
        self.channel = self.connection.channel()
        # use an exclusive, server-named anonymous queue for replies
        result_queue = self.channel.queue_declare(exclusive=True)
        self.callback_queue_name = result_queue.method.queue
        self.channel.basic_consume(self.onresponse, self.callback_queue_name, no_ack=True)
        self.responsedata = None

    def onresponse(self, channel, method, properties, body):
        # keep the reply only if it matches the request we sent
        if self.corrid == properties.correlation_id:
            self.responsedata = body

    def call(self, query_dict):
        # generate a fresh correlation_id (a uuid) for this request
        self.corrid = str(uuid.uuid4())
        self.channel.basic_publish(
            exchange='', routing_key='rpc_queue_douban', body=json.dumps(query_dict),
            properties=pika.BasicProperties(
                content_type='application/json', content_encoding='utf-8',
                correlation_id=self.corrid, reply_to=self.callback_queue_name))
        while self.responsedata is None:
            self.connection.process_data_events(time_limit=None)
        backresponse = self.responsedata
        self.responsedata = None
        return json.loads(backresponse.decode())
```
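The one subtlety in the client is correlation-id matching: replies arriving on the callback queue are ignored unless their correlation_id equals the one sent with the pending request. A stand-alone sketch of that filter, no broker required (class and names are illustrative):

```python
import uuid

class ReplyMatcher:
    """Keeps only the reply whose correlation_id matches the pending request."""
    def __init__(self):
        self.corrid = None
        self.responsedata = None

    def new_request(self):
        self.corrid = str(uuid.uuid4())
        self.responsedata = None
        return self.corrid

    def on_response(self, correlation_id, body):
        # mirror of RPCClient.onresponse: drop stale or foreign replies
        if self.corrid == correlation_id:
            self.responsedata = body

m = ReplyMatcher()
cid = m.new_request()
m.on_response("some-old-id", b"stale reply")  # ignored
m.on_response(cid, b"real reply")
print(m.responsedata)  # b'real reply'
```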
2.4 Crawler client: hooking Scrapy up to RPC
The Scrapy client fetches tasks from the server over RPC and extracts data from the pages with XPath. The code:
```python
# -*- coding: utf-8 -*-
import json
import logging
import random
import re
import time
from urllib.parse import unquote

import scrapy
from scrapy import Request

from scrapy_client_side.scrapy_client_side.client_side import RPCClient

logging.basicConfig(filename='douban_spider.log', filemode="a", level=logging.ERROR)


class DoubanSpider(scrapy.Spider):
    name = 'douban_spider'
    urlpre = "https://movie.douban.com/"
    done_urls = []
    result_map = []
    rankstr = None
    new_urls = []
    # Douban answers 403/302 when its anti-crawling kicks in;
    # pausing the spider for two hours usually clears it
    handle_httpstatus_list = [403, 302]

    def start_requests(self):
        while True:
            # after a successful RPC round trip, sleep 30-40s at random
            # to lower the chance of triggering the anti-crawler
            if self.rankstr is None:
                try:
                    rpc_response = RPCClient().call({"query": "start"})
                except:
                    # the RPC connection occasionally fails; retry after a minute
                    time.sleep(60)
                    continue
            else:
                try:
                    rpc_response = RPCClient().call(
                        {"done": self.done_urls, "rankstr": self.rankstr,
                         "new_urls": self.new_urls, "result_map": self.result_map})
                    print("get data from server sleep ")
                    time.sleep(random.randint(30, 40))
                except:
                    time.sleep(random.randint(55, 65))
                    continue
            try:
                ansurls = rpc_response.get("ans")
                print("ansis:")
                print(ansurls)
                # clear the batch once it has been handed to the server
                self.done_urls = []
                self.rankstr = None
                self.new_urls = []
                self.result_map = []
                if not ansurls:
                    time.sleep(30)
                else:
                    for url in ansurls:
                        print("yield")
                        yield Request(self.urlpre + url, callback=self.parse,
                                      errback=self.errback_httpbin)
            except:
                time.sleep(30)

    def errback_httpbin(self, failure):
        print(repr(failure))

    def parse(self, response):
        if not response.status == 200:
            time.sleep(7200)
            yield Request(response.url, callback=self.parse, errback=self.errback_httpbin)
        elif response.url.count(r'j/new_search_subjects') > 0:
            resjson = json.loads(response.text)
            urls = (unquote(data.get("url").replace("https://movie.douban.com/", ""))
                    for data in resjson.get('data'))
            self.new_urls.extend(urls)
            self.rankstr = 'brank'
            self.done_urls.append(unquote(response.url.replace("https://movie.douban.com/", "")))
        elif response.url.count(r'subject/') > 0:
            try:
                # the keys below mirror the fields of MoiveDataModel
                response_dict = {}
                response_dict["director"] = response.xpath("//a[contains(@rel,'v:directedBy')]/text()").extract()
                response_dict["douban_id"] = int(response.xpath("//a[@share-id]/@share-id").get())
                response_dict["tags"] = response.xpath("//div[contains(@class,'tags-body')]/a/text()").extract()
                response_dict["stars"] = response.xpath("//a[contains(@rel,'v:starring')]/text()").extract()
                response_dict["desc"] = "".join(
                    response.xpath("//span[contains(@property,'v:summary')]/text()").extract()).replace("\u3000", " ")
                response_dict["douban_remark"] = float(
                    response.xpath("//strong[contains(@property,'v:average')]/text()").get())
                response_dict["imdb_tag"] = response.xpath("//a[contains(@href,'imdb')]/text()").get()
                response_dict["contry"] = response.xpath("//span[contains(text(),'制片国家')]/following-sibling::text()").get()
                response_dict["language"] = response.xpath("//span[contains(text(),'语言')]/following-sibling::text()").get()
                datestr = response.xpath("//span[contains(@property,'v:initialReleaseDate')]/text()").get()
                try:
                    timestr = re.findall(r"\d{4}-\d{2}-\d{2}", datestr)[0]
                    response_dict["publictime"] = timestr
                except:
                    pass
                # runtime in minutes
                try:
                    response_dict["runtime"] = int(
                        response.xpath("//span[contains(@property,'v:runtime')]/@content").get())
                except:
                    response_dict["runtime"] = -1
                response_dict["votes"] = int(response.xpath("//span[contains(@property,'v:votes')]/text()").get())
                response_dict["title"] = response.xpath("//title/text()").get().replace("\n", "").replace('(豆瓣)', "").strip()
                print(response_dict)
                self.rankstr = ""
                self.result_map.append(response_dict)
                self.done_urls.append(unquote(response.url.replace("https://movie.douban.com/", "")))
            except Exception as e:
                logging.exception("spider parse error")
```
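The release-date field is the fussiest part of the parse: Douban renders strings like "1994-09-10(加拿大)", so the spider regex-extracts the first yyyy-mm-dd substring. That step in isolation (the helper name is mine):

```python
import re

def extract_release_date(raw):
    """Pull the first yyyy-mm-dd date out of Douban's release string, or None."""
    if not raw:
        return None
    matches = re.findall(r"\d{4}-\d{2}-\d{2}", raw)
    return matches[0] if matches else None

print(extract_release_date("1994-09-10(加拿大)"))  # 1994-09-10
print(extract_release_date("unknown"))            # None
```

Guarding the None/no-match cases here replaces the bare try/except around the same regex in the spider.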
The crawler can be launched from Python code, with the Scrapy settings applied programmatically:
```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy_client_side.scrapy_client_side.spiders.douban_spider import DoubanSpider

s = get_project_settings()
s.set("USER_AGENT", 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:58.0) Gecko/20100101 Firefox/58.0')
s.set("ROBOTSTXT_OBEY", False)
DOWNLOAD_DELAY = 10
RANDOMIZE_DOWNLOAD_DELAY = True
s.set('DOWNLOAD_DELAY', DOWNLOAD_DELAY)
s.set('RANDOMIZE_DOWNLOAD_DELAY', RANDOMIZE_DOWNLOAD_DELAY)
s.set('CONCURRENT_REQUESTS', 1)
s.set('DOWNLOAD_TIMEOUT', 60)

process1 = CrawlerProcess(s)
process1.crawl(DoubanSpider)
process1.start()
```
2.5 Seeding the initial URL data
```python
from main_server_side import redis_controller

urlstep = []
for i in range(0, 9981, 20):
    if i == 9980:
        num = 9979  # the API's last valid start offset is 9979
    else:
        num = i
    urlstep.append("j/new_search_subjects?sort=T&range=0,10&tags=电影&start=%s" % (num))
redis_controller.puturl_safe(redis_controller.arank_str, urlstep)
```
That is all the key code; to use it you only need to organize it into a proper project layout.
Generate the Scrapy project with the scrapy startproject xxxx command, and run the RPC server code and the crawler-launching Python script directly via python -m.
3 Afterword
In my view, distributed crawling solves the bandwidth and per-IP rate-limit problems, and under those constraints crawl throughput scales with the number of VPS nodes. For lack of disk space on my VPSes I did not cache the downloaded pages on the main server or an object-storage service. I consider that step quite important: with the raw pages cached, extracting additional fields later is much faster. I am writing this up partly to record my own design notes, and partly to discuss the approach with you, the readers.
Author: 战五渣_lei
Link: https://www.jianshu.com/p/2f5edab11059