
Unable to get the relevant links while discarding the others

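I've written a Python script that combines selenium and BeautifulSoup to fetch the links leading to property details from a webpage. Because the content is highly dynamic, I use selenium to get the page source. When I run the script, I get a lot of links: the required ones, plus many others that I don't want.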


How can I get only the relevant links from each of the three containers?


My attempt:


from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")]
    return linklist

if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    for newlink in fetch_info(url):
        print(newlink)
    driver.quit()

The result I get:


/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-at-silverstone
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye
/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/fusion-ii-at-the-meadows
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/aire
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-at-silverstone
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/montage-at-the-meadows
/find-new-homes/arizona/sun-city/85373/four-seasons/k.-hovnanian-s-four-seasons-at-ventana-lakes



ABOUTYOU
3 answers

隔江千里

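You need to include both the featured-results ID and the results ID. You can combine the two with a CSS "or" (a comma). Recent versions of bs4 support :not.

#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]

This can also be shortened to

#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer

but the shortened form may be less robust.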
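As a rough illustration of how the combined selector behaves, here is a minimal sketch run against hand-written HTML instead of the live page. The markup, the `nearbyResultsContainer` ID, the `relevant_links` helper and the "Save" link are all made up for the demo; only the two container IDs and the selector come from the answer above. The real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for driver.page_source, NOT the real page.
SAMPLE_HTML = """
<div id="community-search-homes">
  <div id="propertyFeaturedResultsContainer">
    <div class="propertyWrapper">
      <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills">Aspire</a>
      <a href="/find-new-homes/save" onclick="save()">Save</a>
    </div>
  </div>
  <div id="propertyResultsContainer">
    <div class="propertyWrapper">
      <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado">Affinity</a>
    </div>
  </div>
  <div id="nearbyResultsContainer">
    <div class="propertyWrapper">
      <a href="/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16">Pointe 16</a>
    </div>
  </div>
</div>
"""

# The comma acts as "or": an element matching either half is selected.
SELECTOR = (
    "#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], "
    "#propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]"
)


def relevant_links(html):
    """Return hrefs of links inside the two named containers only."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.select(SELECTOR)]


links = relevant_links(SAMPLE_HTML)
for link in links:
    print(link)
```

In this sample, the link with an `onclick` attribute and the link in the third (made-up) container are both skipped. In the original script, the same selector string would be passed to `soup.select` inside `fetch_info`.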

杨魅力

Would list slicing work? This keeps only the first three matches, so it relies on the relevant links coming first in the page.

def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")][:3]
    return linklist

慕婉清6462132

You can simply check each link for the required keyword and print only the matches, ignoring the rest:

if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    for newlink in fetch_info(url):
        if url.split('/')[-1] in newlink:
            print(newlink)
    driver.quit()

Output:

/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
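The keyword check can also be pulled into a small helper that needs no browser, which makes it easier to try out. The sketch below is a variation on the answer, not the answerer's code: it compares whole path segments instead of doing a substring test, and it turns the relative hrefs into absolute URLs with `urljoin`. The name `filter_links` is made up for illustration.

```python
from urllib.parse import urljoin, urlsplit


def filter_links(page_url, hrefs):
    """Keep hrefs whose path has the last segment of page_url as a segment.

    Matching hrefs are returned as absolute URLs resolved against page_url.
    """
    keyword = urlsplit(page_url).path.rstrip("/").split("/")[-1]
    kept = []
    for href in hrefs:
        segments = urlsplit(href).path.split("/")
        if keyword in segments:
            kept.append(urljoin(page_url, href))
    return kept


if __name__ == "__main__":
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    sample = [
        "/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado",
        "/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye",
    ]
    for link in filter_links(url, sample):
        print(link)
```

Segment matching avoids accidental hits when the keyword happens to appear inside a longer community name, at the cost of being slightly stricter than the substring test.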