
How to run Scrapy from a Python script

I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two sources that explain this:


http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/


http://snipplr.com/view/67006/using-scrapy-from-a-script/


I can't figure out where to put my spider code or how to call it from the main function. Please help. Here is the example code:


# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings') #Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)
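
Editor's sketch: the shape of the snippet above (run each crawl in a fresh child process and hand the results back through a `Queue`) can be illustrated with the standard library alone. Everything named here is hypothetical: `fake_crawl` stands in for the real Scrapy crawl, and `crawl_in_subprocess` plays the role of `CrawlerScript.crawl`. The sketch also assumes a POSIX system, because it pins the `fork` start method. It is meant to show the pattern, not to replace the original code.

```python
import multiprocessing


def fake_crawl(spider_name):
    """Hypothetical stand-in for a real crawl: returns a list of item dicts."""
    return [{"spider": spider_name, "n": i} for i in range(3)]


def _worker(queue, spider_name):
    # Runs in the child process. A real version would start the Twisted
    # reactor here, so every crawl gets a brand-new reactor.
    queue.put(fake_crawl(spider_name))


def crawl_in_subprocess(spider_name, timeout=30):
    # Assumption: the "fork" start method, which exists on POSIX only.
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(queue, spider_name))
    proc.start()
    # Read the result before join(): a child that has put a large object
    # on the queue may not exit until that object has been consumed.
    items = queue.get(timeout=timeout)
    proc.join()
    return items
```

Because each call starts a new process, it can be repeated as often as needed, which is the point of the workaround. Note that the sketch calls `queue.get()` before `join()`, whereas the snippet above joins first; the Python `multiprocessing` documentation warns that joining first can deadlock when the child has queued a lot of data. On Windows, use the default start method instead and keep the call under `if __name__ == "__main__":`.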

智慧大石
3 answers

LEATH

All of the other answers refer to Scrapy v0.x. According to the updated documentation, Scrapy 1.0 requires:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished