
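Can't parse some names and their profile links from a webpage

I've written a Python script using requests and BeautifulSoup to parse the profile names, and the links to those profiles, from a webpage. The content looks dynamically generated, but it is present in the page source. I tried the approach below, but unfortunately I get nothing back.

Website link

My attempt so far: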

import requests
from bs4 import BeautifulSoup

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

def get_info(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".media__content"):
        profileUrl = item.get("href")
        profileName = item.select_one("[itemprop='name']").get_text()
        print(profileUrl,profileName)

if __name__ == '__main__':
    get_info(URL)

How can I get the content from that page?


交互式爱情
3 Answers

largeQ

The required content is available in the page source. The site seems to block requests that keep reusing the same user-agent, so I used fake_useragent to supply a random user-agent with each request. It works as long as you don't hit the site too often.

Working solution:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from fake_useragent import UserAgent

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

def get_info(s,link):
    # Pick a fresh random user-agent for this request
    s.headers["User-Agent"] = ua.random
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    # Select the profile anchors themselves, not their container
    for item in soup.select(".media__content a[itemprop='url']"):
        profileUrl = urljoin(link,item.get("href"))
        profileName = item.select_one("span[itemprop='name']").get_text()
        print(profileUrl,profileName)

if __name__ == '__main__':
    ua = UserAgent()
    with requests.Session() as s:
        get_info(s,URL)

Partial output:

https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a Stewart Kipness
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Andrea-Anglin-Bulin-2631495a Andrea Anglin Bulin
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Betty-DeVinney-2631507a Betty DeVinney
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Sabra-Waldman-2657945a Sabra Waldman
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Russell-Berry-2631447a Russell Berry
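If installing fake_useragent is not an option, the same idea can be approximated with the standard library alone. This is only a rough sketch of the rotation technique, not what the answer above uses: the USER_AGENTS pool and the random_headers helper are hypothetical names, and the strings in the pool are merely illustrative.

```python
import random

# Hypothetical pool of user-agent strings (an assumption for illustration);
# replace with whatever current browser strings you prefer.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Safari/605.1.15",
]

def random_headers(rng=random):
    """Return a headers dict whose user-agent is picked at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

# Build one headers dict; call random_headers() again before each new request.
headers = random_headers()
print(headers["User-Agent"])
```

The resulting dict can be passed as headers=random_headers() to requests.get or assigned into a Session's headers. A small fixed pool is much less varied than what fake_useragent provides, so treat it as a fallback.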

心有法竹

It looks like you can also construct the URL yourself (although simply grabbing it from the page seems easier).

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

r = requests.get(URL, headers = headers)
soup = bs(r.content, 'lxml')
items = soup.select('.media')
ids = []
names = []
urls = []

for item in items:
    # Only cards that carry an agent id are agent profiles
    if item.select_one('[data-agent-id]') is not None:
        anId = item.select_one('[data-agent-id]')['data-agent-id']
        ids.append(anId)
        name = item.select_one('[itemprop=name]').text.replace(' ','-')
        names.append(name)
        # Note: the office part of the path is hardcoded here
        url = 'https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/' + name + '-' + anId + 'a'
        urls.append(url)

results = list(zip(names,  urls))
print(results)
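One caveat with building the URL by hand: the outputs posted in this thread show more than one office segment in the path (CENTURY-21-Judge-Fite-Company-14501c and CENTURY-21-Realty-Advisors-47551c), so a hardcoded office prefix would presumably be wrong for some agents. A minimal sketch that takes the office segment as a parameter is shown here. The build_profile_url helper is a hypothetical name, the URL pattern is inferred only from the examples in this thread, and the agent ids are read off the example URLs rather than taken from the page.

```python
BASE_URL = "https://www.century21.com"

def build_profile_url(office_slug, agent_name, agent_id):
    """Assemble <base>/<office>/<Name-With-Hyphens>-<id>a (pattern inferred, not documented)."""
    name_slug = agent_name.strip().replace(" ", "-")
    return f"{BASE_URL}/{office_slug}/{name_slug}-{agent_id}a"

# Agent id taken from the example URL in this thread (an assumption).
url = build_profile_url(
    "CENTURY-21-Judge-Fite-Company-14501c", "Stewart Kipness", "2657107"
)
print(url)
```

Where the office segment would come from on the listing page is not shown in the thread, which is one more reason reading the href directly is the simpler route.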

慕哥6287543

The page content is not rendered by JavaScript, and your code runs fine for me. You just have a couple of problems: how you look up profileUrl, and how you handle the NoneType case. You need to target the a tag to get the data. Try this:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

def get_info(link):
    res = requests.get(link,headers=headers)
    soup = BeautifulSoup(res.text,"lxml")
    results = []
    for item in soup.select(".media__content"):
        a_link = item.find('a')
        # Skip containers without a link instead of failing on None
        if a_link:
            result = {
                    'profileUrl': a_link.get('href'),
                    'profileName' : a_link.get_text()
                }
            # Append only when a link was actually found
            results.append(result)
    return results

if __name__ == '__main__':
    info = get_info(URL)
    print(info)
    print(len(info))

Output:

[{'profileName': 'Stewart Kipness',  'profileUrl': '/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a'},  ...., {'profileName': 'Courtney Melkus',  'profileUrl': '/CENTURY-21-Realty-Advisors-47551c/Courtney-Melkus-7389925a'}]
941
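To see why the question's version prints no links, the selector logic can be tried offline. The following is a small sketch, assuming markup shaped like the selectors used in this thread: SAMPLE_HTML is made up for illustration and is not copied from the real page, and parse_profiles is a hypothetical helper. It uses the built-in "html.parser" so that lxml is not needed.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "https://www.century21.com/real-estate-agents/Dallas,TX"

# Hypothetical markup modelled on the selectors in this thread (an assumption).
SAMPLE_HTML = """
<div class="media__content">
  <a itemprop="url" href="/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a">
    <span itemprop="name">Stewart Kipness</span>
  </a>
</div>
<div class="media__content">
  <p>No profile link in this card</p>
</div>
"""

def parse_profiles(html, base_url):
    """Return (absolute_url, name) pairs, skipping cards that have no profile link."""
    soup = BeautifulSoup(html, "html.parser")
    profiles = []
    for card in soup.select(".media__content"):
        # The container div has no href of its own, so card.get("href") is None;
        # the link lives on the nested <a> element.
        anchor = card.select_one("a[itemprop='url']")
        if anchor is None:
            continue  # guard against the NoneType case
        name = anchor.get_text(strip=True)
        profiles.append((urljoin(base_url, anchor["href"]), name))
    return profiles

profiles = parse_profiles(SAMPLE_HTML, BASE)
print(profiles)
```

The second card has no anchor and is skipped rather than raising, and urljoin turns the site-relative href into an absolute URL.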