How can I extract links from a paginated site? (Using Selenium)


I want to extract links from the site below, which is paginated. Specifically, I want the links under the More Info button.

I am using the following code snippet:

    import time
    import requests
    import csv
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.action_chains import ActionChains
    import re

    browser = webdriver.Chrome()
    time.sleep(5)
    browser.get('https://www.usta.com/en/home/play/facility-listing.html?searchTerm=&distance=5000000000&address=Palo%20Alto,%20%20CA')
    wait = WebDriverWait(browser,15)

    def extract_data(browser):
        links = browser.find_elements_by_xpath("//div[@class='seeMoreBtn']/a")
        return [link.get_attribute('href') for link in links]

    element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, "//a[@class='glyphicon glyphicon-chevron-right']")))
    max_pages = int(re.search(r'\d+ de (\d+)', element.text).group(1), re.UNICODE)

    # extract from the current (1) page
    print("Page 1")
    print(extract_data(browser))

    for page in range(2, max_pages + 1):
        print("Page %d" % page)
        next_page = browser.find_element_by_xpath("//a[@class='glyphicon glyphicon-chevron-right']").click()
        print(extract_data(browser))
        print("-----")

When I run the script above, I get this error **(I don't know much about regular expressions and am only exploring the concept)**:

    Traceback (most recent call last):
      File "E:/Python/CSV/testingtesting.py", line 29, in <module>
        max_pages = int(re.search(r'\d+ de (\d+)', element.text).group(1), re.UNICODE)
    AttributeError: 'NoneType' object has no attribute 'group'
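A note on the traceback: `re.search` returns `None` when the pattern finds nothing, and calling `.group(1)` on `None` raises exactly this `AttributeError`. The pattern `\d+ de (\d+)` only matches text such as "1 de 12", so it will fail if the chevron link has no such text (what `element.text` actually contains is not shown here, so that part is an assumption). Separately, `re.UNICODE` is being passed as the second argument to `int()`, where it would be treated as a numeric base, rather than to `re.search`. A minimal sketch of a guarded version follows; the helper name `parse_max_pages` and the sample strings are hypothetical.

```python
import re

# Hypothetical pattern: accepts pager text such as "1 de 12" or "Page 3 of 40".
PAGER_RE = re.compile(r"\d+\s+(?:de|of)\s+(\d+)")

def parse_max_pages(text):
    """Return the total page count found in text, or None if there is none."""
    match = PAGER_RE.search(text or "")
    if match is None:
        # No pager text: let the caller decide what to do instead of crashing.
        return None
    return int(match.group(1))

print(parse_max_pages("1 de 12"))       # 12
print(parse_max_pages("Page 3 of 40"))  # 40
print(parse_max_pages(""))              # None
```

If the helper returns `None`, the page count cannot be read from that element, and the loop needs another stopping condition.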

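If possible, please suggest a solution. I did manage to extract the links by waiting and then clicking the pagination link, but that approach adds almost 13 seconds of waiting.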
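One possible direction, sketched under assumptions rather than tested against the site: skip the page count entirely and keep clicking "next" until that is no longer possible, using explicit waits instead of fixed sleeps. The names `collect_links` and `make_selenium_hooks` are hypothetical; the XPaths are the ones from the snippet above; and the sketch assumes that paging replaces the result elements in the DOM, so `EC.staleness_of` can signal that the new page has loaded.

```python
def collect_links(extract, go_next, page_limit=1000):
    """Gather links page by page until go_next() reports no further page."""
    seen = []
    for _ in range(page_limit):  # hard cap so a stuck pager cannot loop forever
        for href in extract():
            if href not in seen:
                seen.append(href)
        if not go_next():
            break
    return seen

def make_selenium_hooks(browser, timeout=10):
    """Build extract/go_next callables for a Selenium driver (assumed page layout)."""
    # Imported lazily so collect_links can be tried without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    links_xpath = "//div[@class='seeMoreBtn']/a"
    next_xpath = "//a[@class='glyphicon glyphicon-chevron-right']"

    def extract():
        anchors = browser.find_elements(By.XPATH, links_xpath)
        return [a.get_attribute("href") for a in anchors]

    def go_next():
        old = browser.find_elements(By.XPATH, links_xpath)
        try:
            WebDriverWait(browser, timeout).until(
                EC.element_to_be_clickable((By.XPATH, next_xpath))
            ).click()
            if old:
                # Assumption: the old results are detached when the next page loads.
                WebDriverWait(browser, timeout).until(EC.staleness_of(old[0]))
        except TimeoutException:
            return False  # no clickable "next" link, or the page never changed
        return True

    return extract, go_next

# Stand-in pager so the loop can be exercised without a browser.
pages = [["a", "b"], ["c"], ["d", "e"]]
state = {"i": 0}

def fake_extract():
    return pages[state["i"]]

def fake_next():
    if state["i"] + 1 >= len(pages):
        return False
    state["i"] += 1
    return True

result = collect_links(fake_extract, fake_next)
print(result)  # ['a', 'b', 'c', 'd', 'e']
```

With a real driver this would be called as `collect_links(*make_selenium_hooks(browser))`. Each wait returns as soon as its condition holds, so the only full timeout is paid on the last page.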

慕容森
