I am trying to scrape Yellow Pages by category, so I load the categories from a text file and feed them to start_urls. The problem I am facing is saving the output separately for each category. Here is the code I am trying to implement:
CATEGORIES = []
with open('Catergories.txt', 'r') as f:
    data = f.readlines()
    for category in data:
        CATEGORIES.append(category.strip())
The file is opened in settings.py, and the list is made available for the spider to access.
The spider:
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.utils.project import get_project_settings

from ..items import YellowItem

settings = get_project_settings()


class YpSpider(CrawlSpider):
    categories = settings.get('CATEGORIES')
    name = 'yp'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms={0}&geo_location_terms=New%20York'
                  '%2C '
                  '%20NY'.format(*categories)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="business-name"]', allow=''), callback='parse_item',
             follow=True),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next ajax-page"]', allow=''),
             follow=True),
    )

    def parse_item(self, response):
        categories = settings.get('CATEGORIES')
        print(categories)
        item = YellowItem()
        # for data in response.xpath('//section[@class="info"]'):
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['phone'] = response.xpath('//p[@class="phone"]/text()').extract_first()
        item['street_address'] = response.xpath('//h2[@class="address"]/text()').extract_first()
        email = response.xpath('//a[@class="email-business"]/@href').extract_first()
        try:
            item['email'] = email.replace("mailto:", '')
        except AttributeError:
            pass
        return item
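What I have in mind is something along the lines of the rough sketch below (just an idea, not code I have working): the spider would tag every item with the category its request came from, and an item pipeline would route each item into its own CSV file. The category field on YellowItem and the PerCategoryExportPipeline name are placeholders I made up:

# pipelines.py -- rough sketch only, using placeholder names
from scrapy.exporters import CsvItemExporter


class PerCategoryExportPipeline:
    """Route each item to a CSV exporter keyed by its 'category' field."""

    def open_spider(self, spider):
        self.files = {}      # category -> open file handle
        self.exporters = {}  # category -> CsvItemExporter

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # assumes YellowItem has an extra 'category' field filled in by the spider
        category = item.get('category') or 'uncategorized'
        if category not in self.exporters:
            # one output file per category, created when the first item arrives
            f = open('%s.csv' % category, 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.files[category] = f
            self.exporters[category] = exporter
        self.exporters[category].export_item(item)
        return item

The pipeline would still need to be enabled under ITEM_PIPELINES in settings.py, and the spider would have to fill in item['category'] (for example by passing the category along in each request's meta), which is the part I am not sure how to wire up, since my start_urls formatting above only ever uses the first category.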