I am new to web scraping. After successfully logging in to the quotes.toscrape.com site, I tried to scrape its pages. My code (scrapytest/spiders/quotes_spider.py) is below:
import scrapy
from scrapy.http import FormRequest
from ..items import ScrapytestItem
from scrapy.utils.response import open_in_browser
from scrapy.spiders.init import InitSpider

class QuoteSpider(scrapy.Spider):
    name = 'scrapyquotes'
    login_url = 'http://quotes.toscrape.com/login'
    start_urls = [login_url]

    def parse(self, response):
        token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
        yield scrapy.FormRequest(url=self.login_url, formdata={
            'csrf_token': token,
            'username': 'roberthng',
            'password': 'dsadsadsa'
        }, callback=self.start_scraping)

    def start_scraping(self, response):
        items = ScrapytestItem()
        all_div_quotes = response.css('div.quote')

        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items

        # Go to next page:
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
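Whenever I run this code from the terminal (VS Code) with $ scrapy crawl scrapyquotes, the spider logs in and scrapes only the first page. It never manages to scrape the second page. This is the error message that appears:
2020-10-10 12:26:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
2020-10-10 12:26:42 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
I suspect this has something to do with start_urls, but when I change it to "http://quotes.toscrape.com/page/1", the code does not even scrape the first page. Can anyone help me fix this code? Thanks in advance!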