Code and comments
content_s = ''
for i_content in content:
    # strip the whitespace and concatenate
    content_s += "".join(i_content.split())
douban_item['introduce'] = content_s
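As a quick illustration of the `"".join(s.split())` idiom used above (the sample strings below are made up; the real values come from the scraped page), it removes all leading, trailing, and internal whitespace in one pass:

```python
# Made-up lines standing in for text scraped from the page
content = ["  Director: Frank Darabont  ", " 1994 / USA / Crime Drama "]

# split() breaks on any run of whitespace, join() glues the pieces back
cleaned = ["".join(line.split()) for line in content]
print(cleaned)  # ['Director:FrankDarabont', '1994/USA/CrimeDrama']
```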
Writing the parser file
First crawl
When a new URL request needs to be parsed in the spider module:
yield scrapy.Request(url, callback=self.parse)
xpath:
Start with //, followed by the tag name, then square brackets; an attribute predicate inside the brackets starts with @, e.g. //div[@class='info'].
/p then steps down to a child <p> element.
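Scrapy uses full XPath, but the same //tag[@attr='value'] pattern can be tried with the standard library's ElementTree, which supports a limited XPath subset. The HTML fragment below is made up to mimic the Douban list structure:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the Douban Top 250 list markup
html = """
<ol class="grid_view">
  <li><div class="item"><em>1</em></div></li>
  <li><div class="item"><em>2</em></div></li>
</ol>
"""
root = ET.fromstring(html)
# // becomes .// (search anywhere below the current node);
# [@class='item'] filters on the class attribute
items = root.findall(".//div[@class='item']")
numbers = [div.find("em").text for div in items]
print(numbers)  # ['1', '2']
```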
from ***.items import ***item
l = len(content)
for i in range(l):
    for j in range(i + 1, l):
        content_s = "".join(content[i].split()) + " " + "".join(content[j].split())
        douban_item['introduce'] = content_s
        print(douban_item)
Douban now also appends the director, so each entry's description spans two lines.
My code merges them into a single line.
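A minimal sketch of that two-line merge, with made-up strings standing in for the scraped description lines:

```python
# Two description lines as they might be scraped (made-up examples)
content = [
    "  Director: Frank Darabont  Starring: Tim Robbins  ",
    "  1994 / USA / Crime Drama  ",
]

l = len(content)
merged = None
for i in range(l):
    for j in range(i + 1, l):
        # strip internal whitespace from each line, then join with one space
        merged = "".join(content[i].split()) + " " + "".join(content[j].split())
print(merged)
# Director:FrankDarabontStarring:TimRobbins 1994/USA/CrimeDrama
```

With exactly two lines the nested loop runs once, producing the single merged string.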
'/text()' extracts the text content.
A leading "." makes the XPath that follows relative to the current node.
content = i_item.xpath(".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
for i_content in content:
    content_s = "".join(i_content.split())  # ... (rest omitted)
In the video there was no .extract(); on my machine (Ubuntu 16 + Python 3) it raised an error saying there is no split attribute. You must add extract(): xpath() returns Selector objects, and extract() converts them to plain strings.
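The distinction can be mimicked with a tiny stand-in class (FakeSelector is hypothetical, not Scrapy's real Selector): only the extracted string has split().

```python
class FakeSelector:
    """Hypothetical stand-in for a Scrapy Selector wrapping matched text."""

    def __init__(self, text):
        self._text = text

    def extract(self):
        # The real Selector.extract() likewise returns a plain string
        return self._text


sel = FakeSelector("  1994 / USA  ")
try:
    sel.split()  # a Selector is not a string, so this fails
except AttributeError:
    print("Selector has no split()")

# extract() first, then the string methods work
print("".join(sel.extract().split()))  # 1994/USA
```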
Scrapy framework structure
import scrapy
# adjust the package name to match your own project
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['movie.douban.com']
    start_urls = ['http://movie.douban.com/top250']

    # the default parse method
    def parse(self, response):
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']/li")
        for i_item in movie_list:
            douban_item = DoubanItem()
            douban_item['serial_number'] = i_item.xpath(
                ".//div[@class='item']//em/text()").extract_first()
            douban_item['movie_name'] = i_item.xpath(
                ".//div[@class='info']/div[@class='hd']/a/span[1]/text()").extract_first()
            content = i_item.xpath(
                ".//div[@class='info']//div[@class='bd']/p[1]/text()").extract()
            for i_content in content:
                content_s = ''.join(i_content.split())
                douban_item['introduce'] = content_s
            douban_item['star'] = i_item.xpath(
                ".//span[@class='rating_num']/text()").extract_first()
            douban_item['evaluate'] = i_item.xpath(
                ".//div[@class='star']/span[4]/text()").extract_first()
            douban_item['describe'] = i_item.xpath(
                ".//p[@class='quote']//span/text()").extract_first()
            yield douban_item
        # next-page rule: take the XPath of the "next" link
        next_link = response.xpath("//span[@class='next']/link/@href").extract()
        if next_link:
            next_link = next_link[0]
            yield scrapy.Request('http://movie.douban.com/top250' + next_link,
                                 callback=self.parse)
douban_item['evaluate']=i_item.xpath(".//div[@class='star']//span[4]/text()").extract_first()
Here span[4] means the 4th <span> under <div class='star'>, i.e. the "<span>xxx people rated" line.
Similarly, douban_item['star'] = i_item.xpath(".//div[@class='star']//span[@class='rating_num']/text()").extract_first() can be changed to douban_item['star'] = i_item.xpath(".//div[@class='star']//span[2]/text()").extract_first(), with the same result.
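That equivalence of an attribute predicate and a 1-based positional index can be checked with ElementTree on a made-up fragment mimicking Douban's rating block:

```python
import xml.etree.ElementTree as ET

# Made-up fragment mimicking the <div class="star"> rating block
html = """
<div class="star">
  <span class="rating5-t">x</span>
  <span class="rating_num">9.7</span>
  <span>x</span>
  <span>2000000 people rated</span>
</div>
"""
root = ET.fromstring(html)
# attribute predicate vs. 1-based positional index pick the same node
by_class = root.find(".//span[@class='rating_num']").text
by_index = root.find(".//span[2]").text
print(by_class, by_index)  # 9.7 9.7
```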
from scrapy import cmdline
cmdline.execute('scrapy crawl douban_spider'.split())
douban_spider.py
Complete the parse method
Loop over the movie entries
Import the item file
Write the XPath expressions and parse the content
Handle multi-line data
yield the data to the pipelines
Next-page rule: take the XPath of the "next" link; if it exists, request it with a callback
cmdline.execute('scrapy crawl douban_spider'.split())
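cmdline.execute() expects the command as an argv-style list rather than a single string, which is why .split() is applied:

```python
cmd = 'scrapy crawl douban_spider'
argv = cmd.split()
print(argv)  # ['scrapy', 'crawl', 'douban_spider']
# cmdline.execute(argv) then receives the list form Scrapy expects
```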