Scrapy: crawling the next page after logging in

I'm new to web scraping. After successfully logging in to Quotes.toscrape.com, I'm trying to scrape its pages. My code (scrapytest/spiders/quotes_spider.py) is below:


import scrapy
from scrapy.http import FormRequest
from ..items import ScrapytestItem
from scrapy.utils.response import open_in_browser
from scrapy.spiders.init import InitSpider


class QuoteSpider(scrapy.Spider):
    name = 'scrapyquotes'
    login_url = 'http://quotes.toscrape.com/login'
    start_urls = [login_url]

    def parse(self,response):
        token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
        yield scrapy.FormRequest(url=self.login_url,formdata={
            'csrf_token':token,
            'username':'roberthng',
            'password':'dsadsadsa'
        },callback = self.start_scraping)

    def start_scraping(self,response):
        items = ScrapytestItem()
        all_div_quotes=response.css('div.quote')

        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items

        #Go to Next Page:
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Whenever I run this from the terminal (VS Code) with $ scrapy crawl scrapyquotes, the spider only logs in and scrapes the first page. It never crawls the second page. Here is the error message that appears:

2020-10-10 12:26:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)

2020-10-10 12:26:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)

I suspect this has something to do with start_urls, but when I changed it to "http://quotes.toscrape.com/page/1" the code didn't even scrape the first page. Could anyone help me fix this code? Thanks in advance!


BIG阳
2 Answers

哈士奇WWW

You are passing the wrong function as the callback; your self.parse function only works on the login page.

    if next_page is not None:
        yield response.follow(next_page, callback=self.start_scraping)
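To see the whole chain in one place, here is a minimal sketch of what start_scraping would look like with that change applied. It is only an illustration built from the code posted in the question; it yields plain dicts instead of ScrapytestItem just to keep the example short:

    def start_scraping(self, response):
        # This callback handles every quotes page after login.
        for quote in response.css('div.quote'):
            yield {
                'title': quote.css('span.text::text').extract(),
                'author': quote.css('.author::text').extract(),
                'tag': quote.css('.tag::text').extract(),
            }

        # Follow the pagination link with the SAME callback, so page 2, 3, ...
        # are parsed here and never reach the login-only parse() method.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.start_scraping)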

MMTTMM

This is from your execution log:

      File "C:\Users\Robert\Documents\Demos\vstoolbox\scrapytest\scrapytest\spiders\quotes_spider.py", line 15, in parse
        yield scrapy.FormRequest(url=self.login_url,formdata={
      File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
        querystr = _urlencode(items, self.encoding)
      File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in _urlencode
        values = [(to_bytes(k, enc), to_bytes(v, enc))
      File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\http\request\form.py", line 71, in <listcomp>
        values = [(to_bytes(k, enc), to_bytes(v, enc))
      File "C:\Users\Robert\anaconda3\envs\condatest\lib\site-packages\scrapy\utils\python.py", line 104, in to_bytes
        raise TypeError('to_bytes must receive a str or bytes '
    TypeError: to_bytes must receive a str or bytes object, got NoneType

In short, it is telling you that one of the values in your formdata argument is None where it expects "a str or bytes object". Since your formdata has three fields and only one of them is a variable, token must be coming back empty.

        ...
        token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
        yield scrapy.FormRequest(url=self.login_url,formdata={
            'csrf_token':token,
            'username':'roberthng',
            'password':'dsadsadsa'
        },callback = self.start_scraping)

However, your selector does return the value correctly when you are on the login page. My assumption is that when you define the request for the next page you are setting the callback to your parse method (or not setting it at all, in which case parse is the default). I say assumption because you didn't post that part of the code. Your code sample stops here:

        #Go to Next Page:
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:

So make sure you set the callback of that request correctly after this point.
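One way to make this kind of mistake fail loudly in the future is to check the token before building the FormRequest, instead of letting to_bytes raise deep inside form.py. This is only a sketch, assuming you keep the structure from the question; it is not part of the original answer:

    def parse(self, response):
        token = response.css('input[name="csrf_token"]::attr(value)').get()
        if token is None:
            # parse() was called on a page without the login form
            # (for example /page/2/), i.e. the wrong callback was used.
            self.logger.error('No csrf_token found on %s - wrong callback?', response.url)
            return
        yield scrapy.FormRequest(
            url=self.login_url,
            formdata={
                'csrf_token': token,
                'username': 'roberthng',
                'password': 'dsadsadsa',
            },
            callback=self.start_scraping,
        )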