Why isn't the last function executed?

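I'm scraping the Dmoz site. Scraping the About page works, but when I add another callback named parse_editor and try to scrape a third page with it, I get no results.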


from ..items import DmoztutorialItem
import scrapy


class DmozSpiderSpider(scrapy.Spider):
    name = 'Dmoz'
    start_urls = ['http://dmoz-odp.org/']
    about_page = 'http://dmoz-odp.org/docs/en/about.html'
    editor = 'http://dmoz-odp.org/docs/en/help/become.html'

    def parse(self, response):
        # collect data on first page
        items = {
            'Navbar': response.css('#main-nav a::text').extract(),
            'Category_names': response.css('.top-cat a::text').extract(),
            'Subcategories': response.css('.sub-cat a::text').extract(),
            'About_page': self.about_page,
            'Become_an_editor': self.editor
        }

        # save and call request to another page
        yield response.follow(self.about_page, self.parse_about, self.editor, self.parse_editor, meta={'items': items})

    def parse_about(self, response):
        # do your stuff on second page
        items = response.meta['items']
        items['Headings'] = response.css('h2::text , #mainContent h1::text').extract()  # add your logics
        items['Paragraphs'] = response.css('p::text').extract()
        items['3 Projects'] = response.css('li~ li+ li b a::text , li:nth-child(1) b a::text').extract()
        items['About Dmoz'] = response.css('.nav ul a::text , li:nth-child(2) b a::text').extract()
        items['Languages'] = response.css('.nav~ .nav a::text').extract()
        items['You can make a difference'] = response.css('dd::text , #about-contribute::text').extract()
        items['Further information'] = response.css('li::text , #about-more-info a::text').extract()
        yield items

    def parse_editor(self, response):
        # do your stuff on third page
        editor_items = response.meta['items']
        editor_items['Heading'] = response.css('#mainContent h1::text').extract()
        yield editor_items

明月笑刀无情
152 views, 1 answer
1 Answer

斯蒂芬大帝

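You put everything into a single response.follow call, which is wrong. Each call takes exactly one URL-callback pair. In your version the extra positional arguments self.editor and self.parse_editor are passed as the method and headers parameters instead of forming a second request. So write two separate calls.

Incorrect:

    yield response.follow(self.about_page, self.parse_about, self.editor, self.parse_editor, meta={'items': items})

Correct:

    yield response.follow(self.about_page, self.parse_about, meta={'items': items})
    yield response.follow(self.editor, self.parse_editor, meta={'items': items})

Note that with this variant both callbacks receive the same items dict and each yields it, so you get two items. Alternatively, chain the requests: call follow in parse with parse_about as the callback, then call follow inside parse_about with parse_editor as the callback, and yield the single final item from parse_editor.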
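The chained approach can be tried out without a crawl. Below is a rough, standard-library-only sketch of the control flow, not Scrapy itself: FakeRequest, FakeResponse and run are hypothetical stand-ins I made up for Scrapy's request, response and scheduler, and the scraped values are placeholder strings rather than real CSS selector results. In a real spider each FakeRequest(...) line would be a yield response.follow(url, callback, meta={'items': items}).

```python
from dataclasses import dataclass, field
from typing import Callable

ABOUT_PAGE = 'http://dmoz-odp.org/docs/en/about.html'
EDITOR_PAGE = 'http://dmoz-odp.org/docs/en/help/become.html'


# Hypothetical stand-ins for Scrapy's Request/Response, so the flow
# can run without a network connection or Scrapy installed.
@dataclass
class FakeRequest:
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    url: str
    meta: dict


def parse(response):
    items = {'About_page': ABOUT_PAGE, 'Become_an_editor': EDITOR_PAGE}
    # One URL and one callback per request.
    yield FakeRequest(ABOUT_PAGE, parse_about, meta={'items': items})


def parse_about(response):
    items = response.meta['items']
    items['Headings'] = ['placeholder headings from ' + response.url]
    # Do not yield the item yet; pass it on to the third page.
    yield FakeRequest(EDITOR_PAGE, parse_editor, meta={'items': items})


def parse_editor(response):
    items = response.meta['items']
    items['Heading'] = ['placeholder heading from ' + response.url]
    # Last callback in the chain: yield the single combined item.
    yield items


def run(start_url, start_callback):
    """Toy scheduler: 'fetch' each request and hand the response to its callback."""
    queue = [FakeRequest(start_url, start_callback)]
    scraped = []
    while queue:
        request = queue.pop(0)
        response = FakeResponse(request.url, request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)   # follow-up request
            else:
                scraped.append(result)  # finished item
    return scraped


results = run('http://dmoz-odp.org/', parse)
```

Here results holds one dict that carries the keys added by all three callbacks, which is the point of chaining: only the last callback yields the item.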