Sequential request calls in Scrapy

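A website that requires authentication offers a search service. A search consists of two steps.

First, a request retrieves basic information (stock, dimensions, etc.) from a product serial number.

Second, given the previous search and a few additional fields, a second request returns the product price. The problem is that the steps must be called in strict order. For example, given two products A and B, the following sequence fails: basic_info(A), basic_info(B), get_price(A) => error, because the server expects get_price(B). Since authentication is required, I cannot discard the cookies. In the scenario below, is there a way to guarantee the order of the sequential requests?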


def after_auth_success(self, response):
    # Issue one basic-info request per product
    for product in prod_list:
        yield FormRequest("basic_info_url", ..., callback=self.on_basic_info)

def on_basic_info(self, response):
    # Follow up with the price request for the same product
    yield FormRequest("get_price_url", ..., callback=self.on_price_info)

def on_price_info(self, response):
    # Scrape result...
    # <price would be scraped correctly only if the requests are done in order>
    yield result

Expected result:


Only one thread running the sequence 

basic_info_url | get_price_url |  basic_info_url | get_price_url ...

Actual result:


With CONCURRENT_REQUESTS=1 => all basic_info_url requests are invoked first, and only afterwards all get_price_url requests


呼啦一阵风
1 answer

Smart猫小萌

In the end I found a way to get the desired behavior. The idea is a kind of recursion in which the last step returns the whole result. To drive the recursion, the list of remaining products is shared through the request's meta attribute.

result = list()

def after_auth_success(self, response):
    # Start the chain with the first product only
    first_prod = prod_list.pop(0)
    basic_url = build_url("basic_info_url", first_prod)
    yield FormRequest(basic_url, meta={'prod_list': prod_list}, callback=self.on_basic_info)

def on_basic_info(self, response):
    # Pass the remaining products along to the price step
    yield FormRequest("get_price_url", meta={'prod_list': response.meta['prod_list']}, callback=self.on_price_info)

def on_price_info(self, response):
    # Scrape the result into a dict called result_node
    result.append(result_node)
    prod_list = response.meta['prod_list']
    if prod_list:
        # Products remain: start the next basic_info -> get_price pair
        first_prod = prod_list.pop(0)
        basic_url = build_url("basic_info_url", first_prod)
        yield FormRequest(basic_url, meta={'prod_list': prod_list}, callback=self.on_basic_info)
    else:
        # Nothing left: emit everything collected so far
        yield {'data': result}
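To see why chaining forces the order, here is a small Scrapy-free simulation. Everything in it is hypothetical and for illustration only: `Request`, `crawl`, `fan_out`, and `chained` are made-up names, and `crawl` is a toy single-worker queue, not Scrapy's real scheduler. The sketch also yields each item as soon as it is scraped instead of collecting them in a shared list, which is a small variation on the code above.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for a Scrapy request: a label, a callback, and meta."""

    def __init__(self, label, callback, meta=None):
        self.label = label
        self.callback = callback
        self.meta = meta or {}


def crawl(start_requests, lifo=False):
    """Toy engine: handle one request at a time and enqueue whatever its callback yields."""
    queue = deque(start_requests)
    log, items = [], []
    while queue:
        request = queue.pop() if lifo else queue.popleft()
        log.append(request.label)
        # The "response" passed to the callback is simply the request itself.
        for out in request.callback(request):
            if isinstance(out, Request):
                queue.append(out)
            else:
                items.append(out)
    return log, items


def fan_out(products):
    """Pattern from the question: every basic_info request is created up front."""

    def on_basic_info(response):
        product = response.meta["product"]
        yield Request(f"get_price({product})", on_price_info, {"product": product})

    def on_price_info(response):
        yield {"product": response.meta["product"]}

    return [Request(f"basic_info({p})", on_basic_info, {"product": p}) for p in products]


def chained(products):
    """Pattern from the answer: the next product starts only after the previous price step."""

    def next_basic(remaining):
        product = remaining.pop(0)
        meta = {"product": product, "remaining": remaining}
        return Request(f"basic_info({product})", on_basic_info, meta)

    def on_basic_info(response):
        product = response.meta["product"]
        yield Request(f"get_price({product})", on_price_info, response.meta)

    def on_price_info(response):
        yield {"product": response.meta["product"]}
        if response.meta["remaining"]:
            yield next_basic(response.meta["remaining"])

    # Copy the list so the caller's products are not mutated.
    return [next_basic(list(products))] if products else []


if __name__ == "__main__":
    print(crawl(fan_out(["A", "B"]))[0])
    print(crawl(chained(["A", "B"]))[0])
```

In this toy model the fan-out version runs basic_info(A), basic_info(B), get_price(A), get_price(B), while the chained version runs basic_info(A), get_price(A), basic_info(B), get_price(B). Because the chained version never has more than one request pending, the order is the same whether the queue is FIFO or LIFO. In a real spider, note that requests identical in URL, method, and body are dropped by Scrapy's duplicate filter unless `dont_filter=True` is passed.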