I want to remove the [ ] brackets that Scrapy adds to all of its output. To do that, you just add [0] to the end of the xpath statement, like this:
'a[@class="question-hyperlink"]/text()').extract()[0]
This solves the [] problem in some cases, but in other cases Scrapy returns every second row of output as blank, so when using [0] it raises an error as soon as it reaches the second row:

IndexError: list index out of range

How do I prevent Scrapy from creating these blank rows? This seems to be a common problem, but everyone else runs into it when exporting to CSV, whereas for me it happens with the Scrapy response itself, before exporting to CSV.
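For context, `extract()` always returns a list of strings, and it returns an empty list when the XPath matches nothing, so `[0]` blows up on those rows. A minimal pure-Python sketch of the failure mode and a guard (no Scrapy needed; the helper name `first_or_default` is my own, not a Scrapy API):

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or a default when the list is empty."""
    return values[0] if values else default

matched = ["How do I fix this?"]  # simulates extract() with one match
empty = []                        # simulates extract() on a non-matching node

print(first_or_default(matched))  # -> How do I fix this?
print(first_or_default(empty))    # -> None, instead of raising IndexError
# empty[0] would raise: IndexError: list index out of range
```

Scrapy selectors also provide `extract_first(default=...)`, which does exactly this: it returns the first match, or the default instead of raising `IndexError` when there is no match.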
items.py:
import scrapy
from scrapy.item import Item, Field

class QuestionItem(Item):
    title = Field()
    url = Field()

class PopularityItem(Item):
    votes = Field()
    answers = Field()
    views = Field()

class ModifiedItem(Item):
    lastModified = Field()
    modName = Field()
The spider that does not output every second row as blank, and therefore works with [0]:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import QuestionItem

class QuestionSpider(Spider):
    name = "questions"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = QuestionItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
The spider that outputs every second row as blank:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import PopularityItem

class PopularitySpider(Spider):
    name = "popularity"
    allowed_domains = ["stackoverflow.com"]
    start_urls = [
        "https://stackoverflow.com/",
    ]

    def parse(self, response):
        popularity = response.xpath('//div[contains(@class, "question-summary narrow")]/div')
        for poppart in popularity: