Unable to get a div tag with Beautiful Soup in Python

I'm trying to download all of the Pokémon images available on the official website, because I want high-quality images. Here is the code I wrote:


from bs4 import BeautifulSoup as bs4
import requests

request = requests.get('https://www.pokemon.com/us/pokedex/')
soup = bs4(request.text, 'html')
print(soup.findAll('div',{'class':'container       pokedex'}))

The output is:


[]

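Am I doing something wrong? Also, is it legal to scrape the official website? Is there a tag or anything else that indicates whether it is allowed? Thanks.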
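
On the "is there a tag or anything" part: many sites publish their crawling preferences in a robots.txt file at the site root, and Python's standard library can read that format. The sketch below is only an illustration. The rules and the example.com URLs are made up for the demo and are not taken from pokemon.com. Note also that robots.txt only states what the site asks crawlers to do; it does not answer the legal question, for which the site's terms of use are the place to look.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been obtained."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each hypothetical URL.
allowed_public = parser.can_fetch("*", "https://example.com/gallery/")
allowed_private = parser.can_fetch("*", "https://example.com/private/page.html")
print(allowed_public, allowed_private)
```

For a live site, the same parser can load the file itself with `parser.set_url(...)` followed by `parser.read()` instead of `parse()`.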


PS: I'm new to BS and HTML.


翻阅古今
147 views · 2 answers
2 Answers

噜噜哒

The images are loaded dynamically, so you have to use selenium to scrape them. Here is the complete code to do that:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import requests

driver = webdriver.Chrome()
driver.get('https://www.pokemon.com/us/pokedex/')
time.sleep(4)  # crude wait for the dynamic content to render

# each Pokémon card is an <li> with class "animating"; the last 3 matches are skipped
li_tags = driver.find_elements(By.CLASS_NAME, 'animating')[:-3]
li_num = 1
for li in li_tags:
    img_link = li.find_element(By.XPATH, './/img').get_attribute('src')
    name = li.find_element(By.XPATH, f'/html/body/div[4]/section[5]/ul/li[{li_num}]/div/h5').text
    r = requests.get(img_link)
    with open(f"D:\\{name}.png", "wb") as f:
        f.write(r.content)
    li_num += 1

driver.close()

Output: 12 Pokémon images.

Also, I noticed that there is a Load More button at the bottom of the page. Clicking it loads more images, and after that we have to keep scrolling down for further images to load. If I remember correctly, there are 893 images on the site in total. To scrape all 893 images, you can use the following code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import requests

# scrolls to the bottom of the page and returns the page height
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;"

driver = webdriver.Chrome()
driver.get('https://www.pokemon.com/us/pokedex/')
time.sleep(3)

# click "Load More" once, then keep scrolling until the page stops growing
load_more = driver.find_element(By.XPATH, '//*[@id="loadMore"]')
driver.execute_script("arguments[0].click();", load_more)

lenOfPage = driver.execute_script(SCROLL_JS)
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(1.5)
    lenOfPage = driver.execute_script(SCROLL_JS)
    if lastCount == lenOfPage:
        match = True

li_tags = driver.find_elements(By.CLASS_NAME, 'animating')[:-3]
li_num = 1
for li in li_tags:
    img_link = li.find_element(By.XPATH, './/img').get_attribute('src')
    name = li.find_element(By.XPATH, f'/html/body/div[4]/section[5]/ul/li[{li_num}]/div/h5').text
    r = requests.get(img_link)
    with open(f"D:\\{name}.png", "wb") as f:
        f.write(r.content)
    li_num += 1

driver.close()
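
The scroll-until-nothing-changes loop in the second script can be pulled out into a small helper, which makes it easier to test and adds an upper bound on the number of rounds. This is a minimal sketch, not part of the original answer: the names `scroll_until_stable`, `scroll_and_measure`, and `max_rounds` are made up here, and the demo uses a fake list of page heights instead of a real browser.

```python
import time
from typing import Callable


def scroll_until_stable(scroll_and_measure: Callable[[], int],
                        pause: float = 1.5,
                        max_rounds: int = 200) -> int:
    """Call scroll_and_measure repeatedly until its result stops changing.

    scroll_and_measure should scroll to the bottom of the page and return
    the current page height. The loop also stops after max_rounds rounds,
    so a page that never stops growing cannot hang the script.
    """
    last_height = scroll_and_measure()
    for _ in range(max_rounds):
        time.sleep(pause)  # give the page time to load more items
        new_height = scroll_and_measure()
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the end
        last_height = new_height
    return last_height


# Demo with simulated page heights (no browser needed).
fake_heights = iter([1000, 2000, 3000, 3000])
final_height = scroll_until_stable(lambda: next(fake_heights), pause=0)
print(final_height)
```

With Selenium, the callable would be something like `lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight); return document.body.scrollHeight;")`.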

元芳怎么了

This may be easier to do if you inspect the Network tab first:

import time
import requests

endpoint = "https://www.pokemon.com/us/api/pokedex/kalos"
# contains all metadata
data = requests.get(endpoint).json()
# collect the keys needed to save each picture
items = [{"name": item["name"], "link": item["ThumbnailImage"]} for item in data]
# remove duplicates
d = [dict(t) for t in {tuple(d.items()) for d in items}]
assert len(d) == 893

for pokemon in d:
    response = requests.get(pokemon["link"])
    time.sleep(1)  # pause between requests
    with open(f"{pokemon['name']}.png", "wb") as f:
        f.write(response.content)
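
Two details of the script above can be sketched separately. Removing duplicates through a set of tuples does not keep the order of the API response, and a name that contains characters such as `:` cannot be used directly as a file name on Windows. The helpers below are one possible way to handle both. The function names and the sample records are hypothetical, and unlike the original, which compares name and link together, this version treats two entries as duplicates when only the names match.

```python
import re


def unique_by_name(items):
    """Keep the first entry seen for each name, preserving input order."""
    seen = {}
    for item in items:
        seen.setdefault(item["name"], item)
    return list(seen.values())


def safe_filename(name, extension=".png"):
    """Replace runs of characters that are awkward in file names with '_'."""
    cleaned = re.sub(r"[^\w\-]+", "_", name).strip("_")
    return f"{cleaned}{extension}"


# Hypothetical sample records shaped like the "items" list above.
sample = [
    {"name": "Bulbasaur", "link": "https://example.com/001.png"},
    {"name": "Ivysaur", "link": "https://example.com/002.png"},
    {"name": "Bulbasaur", "link": "https://example.com/001.png"},
]

unique_items = unique_by_name(sample)
filenames = [safe_filename(item["name"]) for item in unique_items]
print(filenames)
```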