I am trying to scrape a web page with the Python Scrapy library.
I have the following code:
class AutoscoutDetailsSpider(scrapy.Spider):
    name = "vehicle details"
    reference_url = ''
    reference = ''

    def __init__(self, reference_url, reference, *args, **kwargs):
        super(AutoscoutDetailsSpider, self).__init__(*args, **kwargs)
        self.reference_url = reference_url
        self.reference = reference
        destination_url = "https://www.autoscout24.be/nl/aanbod/volkswagen-polo-1-2i-12v-base-birthday-climatronic-benzine-zilver-8913b173-cad5-ec63-e053-e250040a09a8"
        self.start_urls = [destination_url]
        add_logs(self.start_urls)

    def handle_error_response(self):
        add_logs("NOT EXISTS. REFERENCE {} AND REFERENCE URL {}.".format(self.reference, self.reference_url))

    def handle_gone_response(self):
        add_logs("SOLD or NOT AVAILABLE Reference {} and reference_url {} is sold or not available.".format(self.reference, self.reference_url))

    def parse(self, response):
        add_logs("THIS IS RESPONSE {}".format(response))
        if response.status == 404:
            self.handle_error_response()
        if response.status == 410:
            self.handle_gone_response()
        if response.status == 200:
            pass
def start_get_vehicle_job(reference_url, reference):
    try:
        def f(q):
            try:
                runner = crawler.CrawlerRunner()
                deferred = runner.crawl(AutoscoutDetailsSpider, reference_url, reference)
                deferred.addBoth(lambda _: reactor.stop())
                reactor.run()
                q.put(None)
            except Exception as e:
                capture_error(str(e))
                q.put(e)
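The snippet cuts off before `f` is actually invoked. A common way to drive a function shaped like `f(q)` (a sketch with assumed names, since the rest of `start_get_vehicle_job` is not shown) is to run it in a child process with a `multiprocessing.Queue`, so the Twisted reactor gets a fresh process for every job and cannot hit the "reactor not restartable" problem:

```python
from multiprocessing import Process, Queue

def run_spider_job(f, *args):
    """Hypothetical driver: run f(q, *args) in a child process.

    The child is expected to put None on the queue on success, or the
    exception on failure; we re-raise the exception in the parent.
    """
    q = Queue()
    p = Process(target=f, args=(q,) + args)
    p.start()
    result = q.get()   # blocks until the child reports back
    p.join()
    if result is not None:
        raise result
    return result
```

With this shape, the tail of `start_get_vehicle_job` would just be `run_spider_job(f)`, and any exception captured inside `f` propagates back to the caller.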
So `main` runs first, and it calls `start_get_vehicle_job` with `reference_url` and `reference` as arguments. `start_get_vehicle_job` then launches the `AutoscoutDetailsSpider` Scrapy spider.
In `__init__` I add the URL that needs to be scraped. The parameters `reference` and `reference_url` received by `__init__` are correct. The `add_logs` function just writes some text to the database, and in my case the `add_logs` call in `__init__` records the correct URL.