
Scraping stock details from Business Insider with Scrapy

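I'm trying to extract the "Name", "Latest Price", and "%" fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500

However, I get no data, even though I've confirmed in the Chrome console that my XPaths match these fields.

For reference, I've been following this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/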

items.py

from scrapy.item import Item, Field


class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()

investment_spider.py


from scrapy import Spider
from scrapy.selector import Selector

from investment.items import InvestmentItem


class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
        "https://markets.businessinsider.com/index/components/s&p_500",
    ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')

        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]

            yield item

Console output:

...
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)


噜噜哒
2 Answers

蝴蝶不菲

You are missing "./" at the start of your XPath expressions. I have also simplified your XPaths:

    def parse(self, response):
        stocks = response.xpath('//table[@class="table table-small"]/tr')
        # Skip the first row (the table header).
        for stock in stocks[1:]:
            item = InvestmentItem()
            item['name'] = stock.xpath('./td[1]/a/text()').get()
            item['px'] = stock.xpath('./td[2]/text()[1]').get().strip()
            item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()
            yield item
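Both answers in this thread also drop the `tbody` step from the row XPath, and that is probably what matters most here. Browsers add a `<tbody>` element to the DOM even when the page source has none, so a path that works in the Chrome console can match nothing in the raw HTML that Scrapy downloads. The sketch below illustrates the effect with the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, so that it runs without any install. The markup is a hypothetical, simplified sample, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw HTML a crawler
# downloads: the rows sit directly under <table>, with no <tbody>.
RAW_HTML = """
<div id="index-list-container">
  <table class="table table-small">
    <tr><th>Name</th><th>Latest Price</th></tr>
    <tr><td><a>3M</a></td><td>146.44</td></tr>
    <tr><td><a>AO Smith</a></td><td>42.22</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Path in the style copied from browser dev tools: it includes tbody,
# which does not exist in this source, so nothing matches.
rows_with_tbody = root.findall(".//table/tbody/tr")

# Same path without tbody: matches the header row plus the two data rows.
rows_without_tbody = root.findall(".//table/tr")

# Skip the header row, then read the link text from the first cell.
names = [row.find("./td[1]/a").text for row in rows_without_tbody[1:]]

print(len(rows_with_tbody), len(rows_without_tbody), names)
```

To check the real page, running the same two row expressions in `scrapy shell` against the downloaded response should show which one returns rows.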

阿波罗的战车

XPath version

    def parse(self, response):
        rows = response.xpath('//*[@id="index-list-container"]/div[2]/table/tr')
        for row in rows:
            yield {
                'name': row.xpath('td[1]/a/text()').extract(),
                'price': row.xpath('td[2]/text()[1]').extract(),
                'pct': row.xpath('td[5]/span[2]/text()').extract(),
                'datetime': row.xpath('td[7]/span[2]/text()').extract(),
            }

CSS version

    def parse(self, response):
        table = response.css('div#index-list-container table.table-small')
        rows = table.css('tr')
        for row in rows:
            name = row.css("a::text").get()
            high_low = row.css('td:nth-child(2)::text').get()
            date_time = row.css('td:nth-child(7) span:nth-child(2) ::text').get()
            yield {
                'name': name,
                'high_low': high_low,
                'date_time': date_time,
            }

Result

{"high_low": "\r\n146.44", "name": "3M", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n42.22", "name": "AO Smith", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n91.47", "name": "Abbott Laboratories", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n92.10", "name": "AbbVie", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n193.71", "name": "Accenture", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n73.08", "name": "Activision Blizzard", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
{"high_low": "\r\n385.26", "name": "Adobe", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
{"high_low": "\r\n133.48", "name": "Advance Auto Parts", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
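The scraped values above still carry leading "\r\n" whitespace, and the timestamps are plain strings. A minimal sketch of one way to clean them before storing, using only the standard library: `clean_row` is a hypothetical helper name, the date format is inferred from the sample output, and commas are assumed to be thousands separators.

```python
from datetime import datetime


def clean_row(row):
    """Normalize one scraped row: trim whitespace, parse price and timestamp."""
    # Header rows or missing cells may yield None, so fall back to "".
    price_text = (row.get("high_low") or "").strip()
    date_text = (row.get("date_time") or "").strip()
    return {
        "name": (row.get("name") or "").strip(),
        # Assumption: commas are thousands separators and can be dropped.
        "price": float(price_text.replace(",", "")) if price_text else None,
        # Format inferred from samples like "05/26/2020 04:15:11 PM UTC-0400".
        "timestamp": (
            datetime.strptime(date_text, "%m/%d/%Y %I:%M:%S %p UTC%z")
            if date_text
            else None
        ),
    }


sample = {
    "high_low": "\r\n146.44",
    "name": "3M",
    "date_time": "05/26/2020 04:15:11 PM UTC-0400",
}
cleaned = clean_row(sample)
print(cleaned["name"], cleaned["price"], cleaned["timestamp"].isoformat())
```

In a Scrapy project this kind of cleanup would typically live in an item pipeline rather than in `parse`, so the spider stays focused on extraction.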