
Scraping Amazon reviews with Beautiful Soup

I need to scrape some information from this Amazon page:


https://www.amazon.com/dp/B07Q6H83VY/ref=sspa_dk_detail_6?pd_rd_i=B07Q6H83VY&pd_rd_w=n4cqh&pf_rd_p=48d372c1-f7e1-4b8b-9d02-4bd86f5158c5&pd_rd_wg=8d6Pd&pf_rd_r=AES6X22PPPPREK5DD60G&pd_rd_r=2a4ff4e6-f8ce-4d62-8106-cffd53838b9e&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEyTTZUQzQ0Q05TOVZJJmVuY3J5cHRlZElkPUEwMDU2MjE0Q05HOUFSMkdQTkhPJmVuY3J5cHRlZEFkSWQ9QTA4NTIyNzAxOVZYM1dISEVBUk1DJndpZGdldE5hbWU9c3BfZGV0YWlsJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ&th=1


Specifically, I am interested in these fields:


Author | Star | Date | Title | Review

For example:


    Gi

1.0 out of 5 stars Unacceptable Launch State for PS4


Reviewed in the United States on September 14, 2019


Platform: PlayStation 4 | Edition: Super Deluxe | Verified Purchase


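I have never done this before, so I would like to know whether I can do it with Scrapy/BeautifulSoup/Selenium or whether I need an API, given that the information comes from the following elements: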


Author under <span class="a-profile-name">Gi</span>


Rating <span class="a-icon-alt">1.0 out of 5 stars</span>


Review <div data-hook="review-collapsed" aria-expanded="false" class="a-expander-content a-expander-partial-collapse-content" style="padding-bottom: 19px;"> ...TEXT...</div>


人到中年有点甜
2 Answers

慕姐8265434

Scrapy would be a good choice for this task. The following is a very simple spider that collects the required information.

    import scrapy


    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = ['https://www.amazon.com/dp/B07Q6H83VY']

        def parse(self, response):
            # Each review on the page sits in its own div.review block.
            for row in response.css('div.review'):
                item = {}
                item['author'] = row.css('span.a-profile-name::text').extract_first()
                # "1.0 out of 5 stars" -> "1.0" -> 1
                rating = row.css('i.review-rating > span::text').extract_first().strip().split(' ')[0]
                item['rating'] = int(float(rating.strip().replace(',', '.')))
                item['title'] = row.css('span.review-title > span::text').extract_first()
                created_date = row.css('span.review-date::text').extract_first().strip()
                item['created_date'] = created_date
                # Join the non-empty text fragments of the review body.
                review_content = row.css('div.reviewText ::text').extract()
                review_content = [rc.strip() for rc in review_content if rc.strip()]
                item['content'] = ', '.join(review_content)
                yield item

Sample output:

    {
        "author": "Jhona Diaz",
        "rating": 4,
        "title": "Recomendable solo si eres fan ya que si está algo caro",
        "created_date": "Reviewed in Mexico on November 23, 2019",
        "content": "Buena calidad y pues muy completo"
    },
    {
        "author": "MANUEL MENDOZA OLVERA",
        "rating": 5,
        "title": "Perfecto Estado",
        "created_date": "Reviewed in Mexico on September 28, 2019",
        "content": "excelente, la edición es de caja  metálica y llegó intacta"
    },

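The spider above keeps created_date as a raw string and truncates the rating to an int. As a minimal sketch using only the standard library, and assuming the strings always follow the two English formats quoted in this thread, they could be normalized as follows. The helper names parse_rating and parse_review_date are made up for illustration and are not part of Scrapy.

```python
import re
from datetime import datetime

# Assumed input formats, taken from the examples in this thread:
#   "1.0 out of 5 stars"
#   "Reviewed in the United States on September 14, 2019"
RATING_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s+out of\s+5\s+stars")
DATE_RE = re.compile(r"^\s*Reviewed in (?:the )?(.+?) on (\w+ \d{1,2}, \d{4})\s*$")


def parse_rating(text):
    """Return the star rating as a float, or None if the text does not match."""
    match = RATING_RE.match(text or "")
    if match is None:
        return None
    # Some locales write the rating with a decimal comma ("4,0").
    return float(match.group(1).replace(",", "."))


def parse_review_date(text):
    """Return (country, date), or (None, None) if the text does not match."""
    match = DATE_RE.match(text or "")
    if match is None:
        return None, None
    country, raw_date = match.groups()
    # %B parses English month names under the default C locale.
    return country, datetime.strptime(raw_date, "%B %d, %Y").date()


if __name__ == "__main__":
    print(parse_rating("1.0 out of 5 stars"))
    print(parse_review_date("Reviewed in the United States on September 14, 2019"))
```

Inside parse(), these could replace the int(float(...)) conversion and the raw created_date assignment. Returning None on a mismatch avoids an exception when a review uses a different wording.
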
神不在的星期二

First, run pip install selenium. Second, to scrape JavaScript-driven websites you need a headless browser; the Python library dryscrape is one option, and the code below uses PhantomJS, which can be downloaded from https://phantomjs.org/download.html.

    from selenium import webdriver

    # Path to the PhantomJS executable downloaded in step 2.
    # webdriver.PhantomJS and find_element_by_id need Selenium 3.x;
    # both were removed during the Selenium 4 releases.
    driver = webdriver.PhantomJS(executable_path='C:\\Users\\nayef\\Desktop\\New folder\\phantomjs-2.1.1-windows\\bin\\phantomjs')
    driver.get('https://www.amazon.com/dp/B07Q6H83VY')
    p_element = driver.find_element_by_id('deliveryMessageMirId')
    print(p_element.text)

    # result:
    # Arrives: Friday, Aug 7 Details
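
Since the question asks about Beautiful Soup: once the page HTML is available (for example from Selenium's driver.page_source), the review fields can be pulled out with CSS selectors. This is a minimal sketch, not a tested scraper for the live site. SAMPLE_HTML is a hypothetical, trimmed-down fragment assembled from the markup quoted in the question and the class names used in the Scrapy answer, and the helpers text_of and extract_reviews are names invented for illustration. Real Amazon markup is larger and may change.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the markup quoted in this thread;
# the class names on a live Amazon page may differ or change over time.
SAMPLE_HTML = """
<div class="review">
  <span class="a-profile-name">Gi</span>
  <i class="review-rating"><span class="a-icon-alt">1.0 out of 5 stars</span></i>
  <span class="review-title"><span>Unacceptable Launch State for PS4</span></span>
  <span class="review-date">Reviewed in the United States on September 14, 2019</span>
  <div data-hook="review-collapsed" class="a-expander-content">...TEXT...</div>
</div>
"""


def text_of(node, selector):
    """Return the stripped text of the first match for selector, or None."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found else None


def extract_reviews(html):
    """Collect author, star, date, title and review text for each review block."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for row in soup.select("div.review"):
        reviews.append({
            "author": text_of(row, "span.a-profile-name"),
            "star": text_of(row, "span.a-icon-alt"),
            "date": text_of(row, "span.review-date"),
            "title": text_of(row, "span.review-title"),
            "review": text_of(row, 'div[data-hook="review-collapsed"]'),
        })
    return reviews


if __name__ == "__main__":
    for review in extract_reviews(SAMPLE_HTML):
        print(review)
```

With the Selenium session from this answer, the call would be extract_reviews(driver.page_source). Missing elements come back as None instead of raising, which helps when a review lacks one of the fields.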