I've spent most of my day researching and testing the best way to loop through a set of products on a retailer's website.

While I successfully collected the set of products (and their attributes) on the first page, I've been struggling to find the best way to page through the site so my scrape can continue.

Based on my code below, I've tried using a while loop and Selenium to click the site's "Next page" button and then keep collecting products.

The problem is that my code still never gets past page 1.

Am I making a silly mistake here? I've read 4 or 5 similar examples on this site, but none were specific enough to give me a solution.
from selenium import webdriver
from bs4 import BeautifulSoup
import re
driver = webdriver.Chrome()
driver.get('https://www.kohls.com/catalog/mens-button-down-shirts-tops-clothing.jsp?CN=Gender:Mens+Silhouette:Button-Down%20Shirts+Category:Tops+Department:Clothing&cc=mens-TN3.0-S-buttondownshirts&kls_sbp=43160314801019132980443403449632772558&PPP=120&WS=0')
products = []
hyperlinks = []
reviewCounts = []
starRatings = []
pageCounter = 0

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
maxPageCount = int(html_soup.find('a', class_ = 'totalPageNum').text) + 1
prod_containers = html_soup.find_all('li', class_ = 'products_grid')
while (pageCounter < maxPageCount):
    for product in prod_containers:
        # If the product has a review count, then extract:
        if product.find('span', class_ = 'prod_ratingCount') is not None:
            # The product name
            name = product.find('div', class_ = 'prod_nameBlock')
            name = re.sub(r"\s+", " ", name.text)
            products.append(name)
            # The product hyperlink
            hyperlink = product.find('span', class_ = 'prod_ratingCount')
            hyperlink = hyperlink.a
            hyperlink = hyperlink.get('href')
            hyperlinks.append(hyperlink)
            # The product review count
            reviewCount = product.find('span', class_ = 'prod_ratingCount').a.text
            reviewCounts.append(reviewCount)
            # The product overall star rating
            starRating = product.find('span', class_ = 'prod_ratingCount')
            starRating = starRating.a
            starRating = starRating.get('alt')
            starRatings.append(starRating)
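For reference, the pattern I'm aiming for is roughly the sketch below (continuing from the code above): click the "Next page" control, re-parse the page source Selenium now holds, refresh the product containers, and increment the counter. Note the selector 'a.nextArrow' and the 10-second wait are just my assumptions; I haven't confirmed the actual class name of the next-page button on the Kohl's page. I just can't get my own version of this loop to actually move past page 1.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

while pageCounter < maxPageCount:
    # ... collect the products on the current page, as in the loop above ...

    # Wait for the next-page control and click it
    # ('a.nextArrow' is a hypothetical selector, not verified on the site)
    next_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.nextArrow'))
    )
    next_button.click()

    # Re-parse the new page source and refresh the containers for the next pass
    html_soup = BeautifulSoup(driver.page_source, 'html.parser')
    prod_containers = html_soup.find_all('li', class_ = 'products_grid')
    pageCounter += 1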