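Web scraping: trouble expressing the element hierarchy in my code

Goal:

I am trying to scrape more than 100 web pages, specifically the recipe ingredients on each page. Take one example, a recipe for an egg salad sandwich (url). I'm using several Python dependencies for this, including BeautifulSoup, splinter.Browser, and ChromeDriverManager.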


Expected output:

Once I have collected the ingredients, I want to store them in a dictionary. For example:


recipes = {"quick_and_easy_egg_salad_sandwich_recipe":
    ['1-2 tablespoons mayonnaise (to taste)',
     '2 tablespoons chopped celery',
     '2 slices white, wheat, multigrain, or rye bread, toasted or plain']}
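A key in the style shown above could be derived from the recipe title. The following is only a minimal sketch: `make_recipe_key` is a hypothetical helper name, and the title string is an assumption inferred from the key in the expected output.

```python
import re


def make_recipe_key(title):
    """Lower-case a title and join its words with underscores."""
    # Replace every run of non-alphanumeric characters with one underscore
    key = re.sub(r"[^0-9a-z]+", "_", title.lower())
    # Drop underscores left over from leading or trailing punctuation
    return key.strip("_")


recipes = {}
recipes[make_recipe_key("Quick and Easy Egg Salad Sandwich Recipe")] = [
    "1-2 tablespoons mayonnaise (to taste)",
    "2 tablespoons chopped celery",
]
```

Normalizing the key this way keeps it stable even if a title contains punctuation or mixed case.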

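What I have achieved so far:

1. I have been able to "roughly" identify (via the Web Inspector) what I need to target. It looks like each ingredient has its own <li class='ingredient'> element, but it seems I have either misunderstood the hierarchy or written incorrect code.

http://img2.mukewang.com/6459b7760001093d06530254.jpg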


2. My code is as follows:

import time

from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)

webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
ingredients = website_soup.find('h3', class_="Ingredients")
ingredientsList = ingredients.find('li', class_="ingredient")
print({ingredients})

When I try to print {ingredients}, I get AttributeError: 'NoneType' object has no attribute 'find'.


I understand the error means my code is flawed, but I just don't know how to fix it. Does anyone have any suggestions?


达令说
2 Answers

慕森卡

Try this:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.simplyrecipes.com/recipes/egg_salad_sandwich/")
soup = BeautifulSoup(resp.text, "html.parser")
div_ = soup.find("div", attrs={"class": "recipe-callout"})
recipes = {"_".join(div_.find("h2").text.split()):
           [x.text for x in div_.find_all("li", attrs={"class": "ingredient"})]}
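Since the question mentions more than 100 pages, the same idea could be wrapped in a loop. This is a sketch under assumptions, not a tested scraper for the real site: `extract_ingredients` and `scrape_recipes` are hypothetical names, the dictionary key is taken from the last URL path segment rather than the h2 text, and the page download is passed in as a `fetch` callable so the parsing can be tried without network access.

```python
from bs4 import BeautifulSoup


def extract_ingredients(html):
    """Return the text of every <li class="ingredient"> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("li", class_="ingredient")
    return [li.get_text(strip=True) for li in items]


def scrape_recipes(urls, fetch):
    """Build {recipe_key: [ingredients]} for each URL.

    `fetch` is any callable that takes a URL and returns its HTML text.
    """
    recipes = {}
    for url in urls:
        # Assumption: the last path segment identifies the recipe
        key = url.rstrip("/").rsplit("/", 1)[-1]
        recipes[key] = extract_ingredients(fetch(url))
    return recipes
```

With requests installed, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`. A page with no matching elements simply yields an empty list instead of raising an error.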

慕标5832272

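It sounds like your code should look like the following, after I removed the unnecessary h3 lookup:

import time

from bs4 import BeautifulSoup
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path)

webpage_url = 'https://www.simplyrecipes.com/recipes/egg_salad_sandwich/'
browser.visit(webpage_url)
time.sleep(1)
website_html = browser.html
website_soup = BeautifulSoup(website_html, 'html.parser')
# find_all returns every matching <li>, not only the first one
ingredientsList = website_soup.find_all('li', class_="ingredient")
print([li.text for li in ingredientsList])

You are trying to find an h3 element with the class name Ingredients, which does not exist.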
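The cause of the AttributeError can be reproduced without a browser. The sketch below parses a small inline HTML fragment; that fragment is a hypothetical stand-in for the real page (its structure is assumed from the class names mentioned in this thread), so treat it as an illustration rather than the site's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page's markup
SAMPLE_HTML = """
<div class="recipe-callout">
  <h2>Quick and Easy Egg Salad Sandwich Recipe</h2>
  <ul>
    <li class="ingredient">1-2 tablespoons mayonnaise (to taste)</li>
    <li class="ingredient">2 tablespoons chopped celery</li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No <h3 class="Ingredients"> exists, so find() returns None;
# calling .find() on that None is what raises the AttributeError
heading = soup.find("h3", class_="Ingredients")
if heading is None:
    print("No matching h3; searching the whole document instead")

# find() returns only the first match; find_all() returns every match
first_item = soup.find("li", class_="ingredient")
all_items = soup.find_all("li", class_="ingredient")
ingredients = [li.get_text(strip=True) for li in all_items]
print(ingredients)
```

Checking the result of `find()` for None before chaining another call makes a missing element show up as a clear message rather than a traceback.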
