Single Scrapy project vs. multiple projects

First of all, when I write a path such as '/path', it is because I am an Ubuntu user. If you are a Windows user, adapt it. That is only a matter of file system.

Light example

Let's imagine you want to scrape 2 or more different websites. The first one is a swimsuit retail website. The second one is about the weather. You want to scrape both because you want to observe the link between swimsuit prices and the weather, in order to anticipate lower purchase prices.

Note that in pipelines.py I use mongo collections, because that is what I use and I don't need SQL for the time being. If you don't know mongo, consider a collection to be the equivalent of a table in a relational database.

The scrapy project could look like the following.

spiderswebsites.py, where you can write as many spiders as you want.

    import scrapy
    from ..items import SwimItem, WeatherItem
    # if you sometimes have trouble importing from the parent directory you can do
    # import sys
    # sys.path.append('/path/parentDirectory')

    class SwimSpider(scrapy.Spider):
        name = "swimsuit"
        start_urls = ['https://www.swimsuit.com']

        def parse(self, response):
            # '//' searches the whole document; a bare 'span[...]' matches nothing
            price = response.xpath('//span[@class="price"]/text()').extract()
            model = response.xpath('//span[@class="model"]/text()').extract()
            ... # and so on
            item = SwimItem() # needs to be called -> ()
            item['price'] = price
            item['model'] = model
            ... # and so on
            return item

    class WeatherSpider(scrapy.Spider):
        name = "weather"
        start_urls = ['https://www.weather.com']

        def parse(self, response):
            temperature = response.xpath('//span[@class="temp"]/text()').extract()
            cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract()
            ... # and so on
            item = WeatherItem() # needs to be called -> ()
            item['temperature'] = temperature
            item['cloud'] = cloud
            ... # and so on
            return item

items.py, where you can write as many item schemas as you want.

    import scrapy

    class SwimItem(scrapy.Item):
        price = scrapy.Field()
        stock = scrapy.Field()
        ...
        model = scrapy.Field()

    class WeatherItem(scrapy.Item):
        temperature = scrapy.Field()
        cloud = scrapy.Field()
        ...
        pressure = scrapy.Field()

pipelines.py, where I use Mongo.

    import pymongo
    from .items import SwimItem, WeatherItem

    class ScrapePipeline(object):

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod # this is a decorator, a powerful tool in Python
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGODB_URL'),
                mongo_db=crawler.settings.get('MONGODB_DB', 'default-test')
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # route on the item type, not on the spider
            if isinstance(item, SwimItem):
                collection_name = 'swimwebsite'
            elif isinstance(item, WeatherItem):
                collection_name = 'weatherwebsite'
            else:
                return item # unknown item type: pass it through untouched
            # insert_one replaces the insert() method removed in PyMongo 4
            self.db[collection_name].insert_one(dict(item))
            return item # hand the item on to any later pipeline

So, when you look at my example project, you can see that the project does not depend on the item schema at all, because you can use several kinds of items in the same project. The advantage of the pattern above is that you can keep the same configuration in settings.py if you need to. But don't forget that you can "customize" the command of your spider. If you want a spider to run slightly differently from the default settings, you can run scrapy crawl spider -s DOWNLOAD_DELAY=35 instead of the 25 that you wrote in settings.py.

Functional programming

Moreover, my example here is light. In reality you are rarely interested in the raw data. You need a lot of processing, which represents a lot of lines. To improve the readability of your code, you can create functions in modules. But be careful with spaghetti code.

functions.py, a custom module

    import re

    def cloud_temp(response): # for WeatherSpider
        """returns a tuple (cloud, temperature): percentage of clouds and temperature"""
        # .get() returns the first match as a str (it assumes the element exists)
        temperature = response.xpath('//span[@class="temp"]/text()').get() # returns a str such as " 12°C"
        cloud = response.xpath('//span[@class="cloud_perc"]/text()').get() # returns a str such as "30%"
        # processing: you want to record them as integers
        temperature = int(re.search(r'[0-9]+', temperature).group()) # returns an int such as 12
        cloud = int(re.search(r'[0-9]+', cloud).group()) # returns an int such as 30
        return (cloud, temperature)

This gives, in spiders.py:

    import scrapy
    from ..items import SwimItem, WeatherItem
    from functions import cloud_temp # the directory of functions.py must be on sys.path
    ...

    class WeatherSpider(scrapy.Spider):
        name = "weather"
        start_urls = ['https://www.weather.com']

        def parse(self, response):
            cloud, temperature = cloud_temp(response) # this is shorter than the previous version
            ... # and so on
            item = WeatherItem() # needs to be called -> ()
            item['temperature'] = temperature
            item['cloud'] = cloud
            ... # and so on
            return item

Besides, it is a considerable improvement for debugging. Let's say I want to open a scrapy shell session.

    $ scrapy shell https://www.weather.com
    ...
    >>> # I check in sys.path whether the directory containing my functions.py module is present
    >>> import sys
    >>> sys.path # returns a list of paths
    >>> # if the directory is not present
    >>> sys.path.insert(0, '/path/directory')
    >>> # now I can import my module in this session and test it in the shell, while I modify the file functions.py itself
    >>> from functions import *
    >>> cloud_temp(response) # checking that it returns what I want

This is more comfortable than copying and pasting a piece of code. And because Python is a great language for functional programming, you should benefit from it. That is why I told you "More generally, any pattern is valid if you limit the number of lines, improve readability, and limit bugs." The more readable it is, the more you limit bugs. The fewer lines you write (for example by avoiding copying and pasting the same processing for different variables), the fewer bugs you introduce. Because when you correct a function itself, you correct everything that depends on it.

So for now, if you are not very comfortable with functional programming, I can understand that you make several projects for different item schemas. You work with your current skills, improve them, and then improve your code over time.
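To illustrate the point that correcting one function corrects everything that depends on it, here is a minimal sketch that runs without Scrapy. The names first_int and cloud_temp_from_text are hypothetical helpers introduced only for this illustration; the regular expression is the same one used in cloud_temp, applied to plain strings instead of a response.

```python
import re


def first_int(text):
    """Return the first run of digits in `text` as an int, or None if there is none.

    Hypothetical helper: it factors out the re.search + int() step that
    cloud_temp repeats for each value.
    """
    match = re.search(r'[0-9]+', text or '')
    return int(match.group()) if match else None


def cloud_temp_from_text(cloud_text, temp_text):
    """Return (cloud, temperature), in the same order as cloud_temp."""
    return (first_int(cloud_text), first_int(temp_text))


if __name__ == "__main__":
    # Sample strings taken from the comments in functions.py
    print(cloud_temp_from_text("30%", " 12°C"))
```

If the pattern has to change later, only first_int is edited and both values follow. As written, the pattern ignores a leading minus sign, so a string such as "-3°C" would be read as 3.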
