Web scraping pages with a dynamic HTML structure

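I am working on a large web scraping project in which every page has a different HTML structure. I want to scrape the product description from each page, and I am using the BeautifulSoup package.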

For example, the product description I am trying to scrape is stored in HTML structures like these:

<div class="product-description">

  <p> "Title" </p>

  <p> "Some content" </p>

  <p> "Product description" </p>

</div>



<div class="product-description">

  <p> "Title" </p>

  <p> "Product description" </p>

</div>


<div class="product-description">

  <p> "Title" </p>

  <p> "Some content" </p>

  <p> "Some content" </p>

  <p> "Product description" </p>

</div>



<div class="product-description">

  <p> "Title" </p>

  <p> "Some-content" </p>

  <p> "Some-content" </p>

  <p> "Some-content" </p>

  <p> "Product description" </p>

</div>

I wrote a for loop that, depending on the page structure, reads the data from the div with class "product-description". My sample code snippet:


import grequests
from bs4 import BeautifulSoup

# urls is the list of page URLs to fetch (defined elsewhere)
pending = (grequests.get(url) for url in urls)
responses = grequests.imap(pending, size=1000)

for response in responses:
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # In the sample markup above, this is the whitespace text node before the first <p>
    node = html_soup.find('div', class_='product-description').next_element

    # Walk along the siblings, trying the longest chain first
    if node.next_sibling.next_sibling.next_sibling.next_sibling:
        product_description = node.next_sibling.next_sibling.next_sibling.next_sibling.text
    elif node.next_sibling.next_sibling.next_sibling:
        product_description = node.next_sibling.next_sibling.next_sibling.text
    elif node.next_sibling.next_sibling:
        product_description = node.next_sibling.next_sibling.text
    else:
        product_description = node.next_sibling.text


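I expected each if condition to check whether a sibling exists at the current level of the HTML and, if not, to fall through to the next condition. However, after 3000 iterations I get an AttributeError saying 'NoneType' object has no attribute 'next_sibling'. Screenshot attached below: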

https://img1.sycdn.imooc.com/6595025300019e1d16000141.jpg

I am sure there is a simpler way to handle this dynamic page structure. Any help would be greatly appreciated. Thanks in advance!



慕侠2389804
1 answer

斯蒂芬大帝

Try this:

for i in soup.find_all('div', class_="product-description"):
    try:
        print(i.find_all('p')[-1].text)
    except IndexError:
        # a div with no <p> has nothing at index [-1]
        pass

Here, soup is the parsed form of:

<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>
<div class="product-description">
  <p> "Title" </p>
  <p> "Product description" </p>
</div>
<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>
<div class="product-description">
  <p> "Title" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Some-content" </p>
  <p> "Product description" </p>
</div>
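
The idea in this answer (take the last <p> inside the div, however many come before it) can also be wrapped in a small helper that returns None instead of raising when a page has no matching div or no paragraphs. This is only a sketch: the function name last_paragraph_text is made up for illustration, and it assumes the product description is always the last <p> in the div, as in the samples from the question.

```python
from bs4 import BeautifulSoup


def last_paragraph_text(html):
    """Return the text of the last <p> in the first product-description div.

    Hypothetical helper: returns None when the div or its paragraphs are missing.
    Assumes the description is always the last <p>, as in the question's samples.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="product-description")
    if div is None:
        # The page has no div with this class; find() returned None
        return None
    paragraphs = div.find_all("p")
    if not paragraphs:
        # The div exists but contains no <p> elements
        return None
    return paragraphs[-1].get_text(strip=True)


sample = """
<div class="product-description">
  <p> "Title" </p>
  <p> "Some content" </p>
  <p> "Product description" </p>
</div>
"""

print(last_paragraph_text(sample))
print(last_paragraph_text("<div class='other'></div>"))
```

In the scraping loop, this would be called as last_paragraph_text(response.text). Checking for None explicitly keeps one malformed page from stopping the whole run, and it avoids counting siblings, which is fragile because the whitespace between tags is itself a sibling node.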