Scrapy returns junk data such as spaces and newlines. How do I filter it out?


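I have written a spider, and the data it returns is littered with spaces and newline characters. The newline characters also cause the extract() method to return a list. How do I filter them out before they reach the selector? Filtering after extract() is called breaks the DRY principle, because there is a lot of data I need to extract from the page that has no attributes, so the only way to parse it is by index.

How do I filter these out?

It returns bad data like this: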

{ 'aired': ['\n ', '\n Apr 3, 2016 to Jun 26, 2016\n '],
  'broadcast': [],
  'duration': ['\n ', '\n 24 min. per ep.\n '],
  'episodes': ['\n ', '\n 13\n '],
  'favourites': ['\n ', '\n 22,673\n'],
  'genres': ['Action', 'Comedy', 'School', 'Shounen', 'Super Power'],
  'image_url': ['https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',
                'https://myanimelist.cdn-dena.com/images/anime/10/78745.jpg',

海绵宝宝撒


2 answers

叮当猫咪

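Looking at your code, you could try XPath's normalize-space() function:

    mal_item['aired'] = border_class.xpath('normalize-space(.//div[11]/text())').extract()

This is untested, but it looks valid.

For a more general answer, yourString.strip('someChar') or yourString.replace('this', 'withThis') works well (although when you are operating on JSON objects it may be less efficient than other approaches). If those characters are present in the original data, you need to remove them manually or skip them.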
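As a small illustration of the strip() route, the sketch below cleans a list like the one extract() returned in the question. The helper name strip_nonempty is hypothetical and not part of Scrapy; the sample list is copied from the question's output. One caveat on the XPath route: in XPath 1.0, normalize-space() applied to a node-set only looks at the first node, so with the two text nodes shown in the question it may well return an empty string, which is one reason to clean up in Python instead.

```python
def strip_nonempty(texts):
    """Strip surrounding whitespace from each string and drop the ones left empty."""
    stripped = (text.strip() for text in texts)
    return [text for text in stripped if text]


# Sample values copied from the output shown in the question.
aired = ['\n ', '\n Apr 3, 2016 to Jun 26, 2016\n ']
print(strip_nonempty(aired))
```

The result is still a list, so index-based access keeps working, only without the whitespace-only entries.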


qq_花开花谢_0

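"The newline characters also cause the extract() method to return a list"

It is not the line breaks that cause this behavior, but the way the nodes appear in the document tree. Text nodes separated by element nodes (for example <a>, <br>, <hr>) are treated as separate entities, and Scrapy yields them as such (in fact, extract() is supposed to always return a list, even when only a single node was selected). XPath has a few basic value processing and filtering functions, but they are very limited.

"Filtering after extract() is called breaks the DRY principle"

You seem to be convinced that the only right way to filter these outputs is to do it inside the selector expression. There is no use in being that strict about the principle, though: you are selecting text nodes from inside your target nodes, and those are bound to carry excess whitespace or be scattered all around their containers. Filtering by content in XPath is very slow, so it should be done outside of XPath. Post-processing scraped fields is common practice. You might want to read about Scrapy Item Loaders and processors.

Otherwise, the simplest way is:

    import re

    def join_clean(texts):
        # Join the extracted text nodes, collapse whitespace runs, trim the ends.
        return re.sub(r'\s+', ' ', ' '.join(texts)).strip()

    mal_item['type'] = join_clean(border_class.xpath('.//div[8]/a/text()').extract())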
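To address the DRY concern, the cleaning step can live in one place and be applied to every field after extraction. The following is only a sketch in plain Python: clean_fields is a hypothetical helper, the sample dictionary is taken from the question's output, and join_clean is repeated so the block runs on its own. Scrapy's Item Loaders provide a built-in mechanism with a similar purpose, which this sketch does not use.

```python
import re


def join_clean(texts):
    """Join text fragments, collapse whitespace runs to one space, trim the ends."""
    return re.sub(r'\s+', ' ', ' '.join(texts)).strip()


def clean_fields(raw_item, fields):
    """Return a copy of raw_item with join_clean applied to the named fields."""
    cleaned = dict(raw_item)
    for name in fields:
        cleaned[name] = join_clean(raw_item.get(name, []))
    return cleaned


# Sample values copied from the output shown in the question.
raw = {
    'aired': ['\n ', '\n Apr 3, 2016 to Jun 26, 2016\n '],
    'duration': ['\n ', '\n 24 min. per ep.\n '],
    'episodes': ['\n ', '\n 13\n '],
    'genres': ['Action', 'Comedy', 'School', 'Shounen', 'Super Power'],
}
print(clean_fields(raw, ['aired', 'duration', 'episodes']))
```

Fields that are not named, such as genres here, are left as lists, so only the single-value fields are collapsed into strings.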
