When web scraping with Selenium and BeautifulSoup

3 Answers

九州编程

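You don't need the overhead of Selenium. You can make the same GET request that the page makes, pull the HTML out of the returned JSON, parse it with bs4, and extract the links.

    import requests
    from bs4 import BeautifulSoup as bs

    # The endpoint returns JSON; the markup we want is under the 'html' key
    r = requests.get('https://epl.bibliocommons.com/item/load_ugc_content/2300646980').json()
    soup = bs(r['html'], 'lxml')
    # Every element with an href inside the div that directly follows the heading
    links = [i['href'] for i in soup.select('[data-test-id="staff-lists-that-include-this-title"] + div [href]')]
    print(len(links))
    print(links)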
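To make the JSON-then-HTML flow easier to follow without hitting the site, here is a minimal offline sketch that uses only the standard library. The payload below is invented for illustration (only the 'html' key and the data-test-id value come from the answer), and html.parser stands in for bs4, so the class StaffListLinks and the function links_from_payload are hypothetical helpers that only approximate the `+ div [href]` selector.

```python
import json
from html.parser import HTMLParser

TARGET = "staff-lists-that-include-this-title"


class StaffListLinks(HTMLParser):
    """Collect href values inside the first <div> that follows the target heading."""

    def __init__(self):
        super().__init__()
        self.seen_heading = False  # True once the target heading has been seen
        self.div_depth = 0         # > 0 while inside the div being captured
        self.done = False          # True once that div has been closed
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        if self.div_depth:
            # Inside the captured div: track nested divs and record hrefs
            if tag == "div":
                self.div_depth += 1
            if attrs.get("href"):
                self.links.append(attrs["href"])
        elif self.seen_heading and tag == "div":
            self.div_depth = 1
        elif attrs.get("data-test-id") == TARGET:
            self.seen_heading = True

    def handle_endtag(self, tag):
        if self.div_depth and tag == "div":
            self.div_depth -= 1
            if self.div_depth == 0:
                self.done = True


def links_from_payload(text):
    """Parse a JSON string, then scan the markup stored under its 'html' key."""
    parser = StaffListLinks()
    parser.feed(json.loads(text)["html"])
    parser.close()
    return parser.links


# Made-up payload shaped like the response described above (assumption)
payload = json.dumps({"html": (
    '<h4 data-test-id="staff-lists-that-include-this-title">Staff Lists</h4>'
    '<div><ul>'
    '<li><a href="/list/show/1">List one</a></li>'
    '<li class="extra"><a href="/list/show/2">List two</a></li>'
    '</ul></div>'
    '<div><a href="/unrelated">Other</a></div>'
)})

links = links_from_payload(payload)
print(len(links))
print(links)
```

Like the CSS selector in the answer, this sketch does not filter on the li class, so hidden items are included in the result.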


炎炎设计

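I scraped your page and wrote an XPath that finds all the li elements under "Staff Lists that include this Title". Updated to include a wait for all the relevant li elements.

    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[h4[text()='Staff Lists that include this Title']]/div[2]/ul/li[@class='']")))
    driver.find_elements(By.XPATH, "//div[h4[text()='Staff Lists that include this Title']]/div[2]/ul/li[not(contains(@class, 'extra'))]")

This XPath first selects the main div that contains an h4 element with the text "Staff Lists that include this Title". We then step into div[2], which holds the relevant li items. The last part of the query in the wait targets li elements with an EMPTY class name. As you can see in the page source, there are many hidden li elements with the class='extra' attribute. We don't want those li elements, so we go on to query with not(contains(@class, 'extra')) to get only the li elements without the extra class name.

If the XPath above doesn't work, I also modified the other XPath you posted in your original question:

    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//*[@id='rightBar']/div[3]/div/div[2]/ul/li[not(contains(@class, 'extra'))]")))
    driver.find_elements(By.XPATH, "//*[@id='rightBar']/div[3]/div/div[2]/ul/li[not(contains(@class, 'extra'))]")

For the URL you provided, both queries retrieved 5 results.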
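The path structure described above can be tried out without a browser. The sketch below is only an illustration under stated assumptions: the SAMPLE markup is invented, and it uses the standard library's xml.etree.ElementTree, which supports just a subset of XPath. Because that subset has no not() or contains(), the "no extra class" filter is applied in Python here rather than inside the XPath as Selenium allows.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the relevant part of the page (assumption)
SAMPLE = """
<root>
  <div>
    <h4>Staff Lists that include this Title</h4>
    <div>intro</div>
    <div>
      <ul>
        <li class=""><a href="/list/show/1">One</a></li>
        <li class=""><a href="/list/show/2">Two</a></li>
        <li class="extra"><a href="/list/show/3">Three</a></li>
      </ul>
    </div>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)

HEADING = "Staff Lists that include this Title"
# The div that has the h4 as a child, then its second div child, then ul/li
base = f".//div[h4='{HEADING}']/div[2]/ul/li"

all_items = root.findall(base)
# Same idea as li[@class=''] in the wait condition
empty_class = root.findall(base + "[@class='']")
# Python-side equivalent of li[not(contains(@class, 'extra'))] for this sample
visible = [li for li in all_items if "extra" not in li.get("class", "").split()]
visible_links = [li.find("a").get("href") for li in visible]

print(len(all_items), len(empty_class))
print(visible_links)
```

On this sample both filters keep the same two items, which mirrors why the two queries in the answer agree on the live page.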


慕斯709654

To get all the anchor tags under "Staff Lists that Include this Title", induce WebDriverWait with presence_of_all_elements_located(). This will give you 100 links.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://epl.bibliocommons.com/item/show/2300646980")
    # Wait up to 10 seconds for the anchors in the first div after the heading
    elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//h4[@data-test-id="staff-lists-that-include-this-title"]/following::div[1]//li/a')))
    print(len(elements))
    for ele in elements:
        print(ele.get_attribute('href'))
