BeautifulSoup does not find all div tags

I have started a private project: web scraping with Python and BeautifulSoup in Visual Studio Code (1.41.0).

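I was able to scrape another website that has the same structure as my "problem website". Now, however, BeautifulSoup does not find all of the div tags: each page should have 20 of them, but I only find 3.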


I get all of the <div class="css-15dj4ut"></div> elements from <div class="css-fh99y9 excbu0j0">...</div>, but none from <div class="css-roynbj excbu0j0"></div>. Do you know why?


Iterating over each URL to visit each page:


import re
import urllib.request
from bs4 import BeautifulSoup

for i in range(0, endIndex):
    try:
        # The first page uses the base URL; later pages append a page number.
        if i == 0:
            urls.append(basicUrl)
        else:
            urls.append(basicUrl + urlAddon + str(i + 1))
        page = urllib.request.urlopen(urls[i])
        soup = BeautifulSoup(page, 'html.parser')
        getSurgeonName(soup)
    except Exception:
        # Note: this also catches errors raised inside getSurgeonName.
        print("A URL request error occurred.")
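The URL scheme used in the loop can be separated from the fetching, which makes it easier to check that the right pages are requested. The following is only a minimal sketch: `build_page_urls` is a hypothetical helper, and the example base URL and `?page=` suffix are placeholders, not the values from the question.

```python
def build_page_urls(basic_url, url_addon, page_count):
    """Return one URL per results page: the base URL first, then numbered pages."""
    if page_count < 1:
        return []
    urls = [basic_url]
    # Page 1 is the base URL itself, so numbering starts at 2.
    for page_number in range(2, page_count + 1):
        urls.append(basic_url + url_addon + str(page_number))
    return urls


# Hypothetical values, for illustration only.
example_urls = build_page_urls("https://example.com/list", "?page=", 3)
print(example_urls)
```

Printing the list before fetching anything shows whether the numbering matches the site's pagination.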

Function, version 1:


def getSurgeonName(soup):
    # gets just the first 3 surgeons of the page
    docName = re.compile('css-15dj4ut')
    docNameTags = soup.find_all('div', attrs={'class': docName})
    for a in docNameTags:
        docNameList.append(a.getText())

Function, version 2:


def getSurgeonName(soup):
    # First parent container class
    parentClass = re.compile('css-fh99y9 excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})

    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        # Raises IndexError if the parent has no matching child
        docNameList.append(children[0].getText())

    # Second parent container class
    parentClass = re.compile('css-roynbj excbu0j0')
    parentItems = soup.find_all('div', attrs={'class': parentClass})

    for parent in parentItems:
        children = parent.findChildren('div', {'class': 'css-15dj4ut'})
        docNameList.append(children[0].getText())
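One way to narrow a problem like this down is to check whether the missing markup is in the raw server response at all, before any parsing. The sketch below is an assumption-laden illustration: `count_class_mentions` is a hypothetical helper and `sample_html` is a made-up stand-in for the downloaded page, not the real site's HTML.

```python
def count_class_mentions(raw_html, class_names):
    """Count how often each class name appears in the raw HTML text."""
    return {name: raw_html.count(name) for name in class_names}


# Made-up stand-in for a server response: two rendered entries, and the
# rest of the data only present as JSON inside a script tag.
sample_html = (
    '<div class="css-fh99y9 excbu0j0"><div class="css-15dj4ut">A</div></div>'
    '<div class="css-fh99y9 excbu0j0"><div class="css-15dj4ut">B</div></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"names": ["A", "B", "C"]}</script>'
)

counts = count_class_mentions(sample_html, ["css-15dj4ut", "css-roynbj"])
print(counts)
```

If a class shows up in the browser's inspector but its count in the raw response is zero, those elements are probably added by JavaScript after loading, and no parser setting will find them in the downloaded HTML.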


慕桂英4014372
1 answer

大话西游666

The data you want is actually loaded dynamically by JavaScript after the page loads, so the requests package (like urllib) cannot see it, because it does not execute JavaScript. However, I was able to find the script tag that holds the data and load its contents into a JSON dict, from which you can parse whatever you want :).

import requests
from bs4 import BeautifulSoup
import json

r = requests.get("https://www.comparis.ch/gesundheit/arzt/pathologie")
soup = BeautifulSoup(r.content, 'html.parser')
script = soup.find("script", {'id': '__NEXT_DATA__'}).text
data = json.loads(script)
print(data.keys())  # JSON dict
dumper = json.dumps(data, indent=4)
print(dumper)  # to see it in human-readable format

Something like:

for item in data['props']['pageProps']['doctorResults']['doctorModels']:
    print(item['name'])

Output:

Mohamed Abdou
Dr. med. Heiner Adams
Dr. med. Franziska Aebersold
Prof. Dr. med. Adriano Aguzzi
Dr. med. Maria Ammann
Prosper Anani
Dr. med. Max Arnaboldi
Dr. med. Walter Arnold
Dr. med. Irena Baltisser
Dr. med. Fridolin Bannwart
Dr. med. Yara Banz
Dr. med. André Barghorn
Dr. Jessica Barizzi
Prof. Dr. med. Daniel Baumhoer
Audrey Baur Chaubert
Dr. med. Christian Georg Bayerl
Dr. med. Marc Beer
Dr. med. Sabina Berezowska
Dr. med. Steffen Bergelt
Dr. med. Barbara Elisabeth Berger-Denzler
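Because the layout of this embedded JSON is controlled by the site and may change, it can help to walk the key path defensively instead of indexing straight into it. The following is a minimal sketch under that assumption: `extract_doctor_names` is a hypothetical helper, and the sample payload is invented to mirror the key path shown above, not copied from the real page.

```python
import json


def extract_doctor_names(next_data_json):
    """Parse the JSON text and return the doctor names, or [] if the path is missing."""
    node = json.loads(next_data_json)
    # Follow the same key path as above, stopping early if a level is absent.
    for key in ("props", "pageProps", "doctorResults", "doctorModels"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    if not isinstance(node, list):
        return []
    # Skip entries that are not dicts or that have no 'name' field.
    return [item["name"] for item in node
            if isinstance(item, dict) and "name" in item]


# Invented payload for illustration; the real script tag holds far more data.
sample = json.dumps({"props": {"pageProps": {"doctorResults": {"doctorModels": [
    {"name": "Dr. Example One"},
    {"name": "Dr. Example Two"},
    {"id": 3},
]}}}})

print(extract_doctor_names(sample))
```

In real use, the text of the script tag found with BeautifulSoup would be passed in place of `sample`. An empty result then signals that the page structure has changed, rather than raising a KeyError deep inside the scraper.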