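I am working on a large web scraping project in which the HTML structure differs from page to page. I want to scrape the product description from each page, and I am using the BeautifulSoup package.
For example, the product descriptions I am trying to scrape are stored in HTML structures like these: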
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>
<div class="product-description">
<p> "Title" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Some-content" </p>
<p> "Product description" </p>
</div>
I wrote a for loop that reads the data from the div with class "product-description", depending on the page structure. My sample code snippet:
import grequests
from bs4 import BeautifulSoup

# urls: the list of product page URLs (defined elsewhere)
requests = (grequests.get(url) for url in urls)
responses = grequests.imap(requests, grequests.Pool(1000))
for response in responses:
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # Try the deepest sibling first, then fall back to shallower ones
    if html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.next_sibling.text
    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.next_sibling.text
    elif html_soup.find('div', class_='product-description').next_element.next_sibling.next_sibling:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.next_sibling.text
    else:
        product_description = html_soup.find(
            'div', class_='product-description').next_element.next_sibling.text
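I expected each if condition to check whether a sibling exists at the current level of the HTML and, if not, to fall through to the next condition. However, after 3000 iterations I get an AttributeError: 'NoneType' object has no attribute 'next_sibling'.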
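The error can be reproduced in isolation. The sketch below uses a hypothetical two-paragraph page written without whitespace between tags, so that every sibling is a <p> element. It suggests that the chained lookup fails before the if can evaluate to False: once one `next_sibling` returns None, the next `.next_sibling` in the same chain raises.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page: a title followed directly by the description.
html = '<div class="product-description"><p>Title</p><p>Product description</p></div>'
soup = BeautifulSoup(html, "html.parser")

# next_element of the div is its first child, here the first <p>.
node = soup.find("div", class_="product-description").next_element

second = node.next_sibling         # the second <p>
third = node.next_sibling.next_sibling  # None, because there is no third sibling

try:
    # Chaining one step further calls .next_sibling on None.
    node.next_sibling.next_sibling.next_sibling
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

On real pages the whitespace between tags also counts as sibling text nodes, so the number of `next_sibling` steps needed to reach a given <p> may differ from page to page.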
I know there must be a simpler way to handle this dynamic page structure. Any help would be greatly appreciated. Thanks in advance!
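One direction that might work, sketched here under a stated assumption: in all four sample layouts above, the product description is the last <p> inside the div. If that holds for the real pages, collecting the paragraphs with `find_all` and taking the last one avoids sibling counting entirely. The function name `extract_description` is hypothetical, and returning None for pages without the div or without paragraphs is an illustrative choice.

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Return the text of the last <p> in the product-description div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="product-description")
    if container is None:
        return None  # the page has no such div
    paragraphs = container.find_all("p")
    if not paragraphs:
        return None  # the div contains no paragraphs
    # Assumption: the description is always the last paragraph.
    return paragraphs[-1].get_text(strip=True)

# One of the sample layouts from the question.
sample = """<div class="product-description">
<p> "Title" </p>
<p> "Some content" </p>
<p> "Product description" </p>
</div>"""
print(extract_description(sample))
```

Inside the loop this would replace the whole if/elif chain with `product_description = extract_description(response.text)`, followed by a check for None before the value is used.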
斯蒂芬大帝