Scraping the text below a heading inside list items into columns with BeautifulSoup and pandas

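I'm trying to use BeautifulSoup and pandas to scrape and store a few items. The code below only partly works: as you can see, it scrapes "Engine426/425 HP", whereas I only want the string "426/425 HP" stored in the "engine" column. I want to scrape all four strings that follow the h5 headings in the HTML below (see the desired output further down). I hope someone can help me, thanks!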


import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

main_url = "https://www.example.com/"

def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return(soup)

soup = getAndParseURL(main_url)

engine = []
engine.append(soup.find("ul", class_ = re.compile('list-inline lot-breakdown-list')).li.text)

scraped_data = pd.DataFrame({'engine': engine})
scraped_data.head()

              engine
0   Engine426/425 HP


HTML


<div class="lot-breakdown">
    <ul class="list-inline lot-breakdown-list">
        <li>
            <h5>Engine</h5>426/425 HP</li>
        <li>
            <h5>Trans</h5>Automatic</li>
        <li>
            <h5>Color</h5>Alpine White</li>
        <li>
            <h5>Interior</h5>Black</li>
    </ul>
</div>

Desired output


scraped_data[['engine', 'trans', 'color', 'interior']] = pd.DataFrame([['426/425 HP', 'Automatic', 'Alpine White', 'Black']], index=scraped_data.index)
scraped_data

              engine        trans          color  interior
0         426/425 HP    Automatic   Alpine White     Black


一只甜甜圈
1 answer

蝴蝶刀刀

You can do this in several ways. Each li holds an h5 heading followed by a bare text node, so the value you want is the text that belongs directly to the li, not to its h5 child. Indexing li.contents[1] only works when there is no whitespace between <li> and <h5>; with the HTML shown in the question, that position is the h5 tag itself, so the version below filters by node type instead.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup, NavigableString

    main_url = "https://www.example.com/"

    def getAndParseURL(url):
        result = requests.get(url)
        soup = BeautifulSoup(result.text, 'html.parser')
        return soup

    soup = getAndParseURL(main_url)

    # Every li inside the breakdown list
    items = soup.select('ul[class="list-inline lot-breakdown-list"] li')

    # Alternative: li.find(string=True, recursive=False) returns the first direct
    # text node of the li and skips the text of child tags.

    values = []
    for li in items:
        # li.contents mixes tags and strings, e.g. [<h5>Engine</h5>, '426/425 HP'];
        # keep only the NavigableString items, i.e. the li's own text.
        own_text = ' '.join(t for t in li.contents if type(t) == NavigableString)
        values.append(own_text.strip())

    # Order follows the HTML: Engine, Trans, Color, Interior
    engine, trans, color, interior = values

    scraped_data = pd.DataFrame({'engine': [engine], 'trans': [trans], 'color': [color], 'interior': [interior]})
    scraped_data
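
As a variation on the idea above, here is a minimal sketch that reads the column names from the h5 headings instead of hard-coding four lists. It parses the sample HTML from the question directly, so it runs without a network request. The names `HTML` and `parse_breakdown` are hypothetical and chosen only for illustration, and the selector assumes the real page uses the same `lot-breakdown-list` class as the sample.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Sample markup copied from the question, so the sketch needs no network call.
HTML = """
<div class="lot-breakdown">
    <ul class="list-inline lot-breakdown-list">
        <li>
            <h5>Engine</h5>426/425 HP</li>
        <li>
            <h5>Trans</h5>Automatic</li>
        <li>
            <h5>Color</h5>Alpine White</li>
        <li>
            <h5>Interior</h5>Black</li>
    </ul>
</div>
"""


def parse_breakdown(html):
    """Hypothetical helper: return {heading: value} for each li in the list."""
    soup = BeautifulSoup(html, "html.parser")
    row = {}
    for li in soup.select("ul.lot-breakdown-list > li"):
        # The h5 text becomes the column name (lower-cased to match the question).
        heading = li.h5.get_text(strip=True).lower()
        # Only the li's direct text nodes, so the h5 text is left out;
        # strip() drops the indentation whitespace around the value.
        value = "".join(li.find_all(string=True, recursive=False)).strip()
        row[heading] = value
    return row


row = parse_breakdown(HTML)
scraped_data = pd.DataFrame([row])  # one row, one column per heading
print(scraped_data)
```

Because the columns come from the headings, a page with an extra list item would simply gain an extra column. To scrape several pages, one option is to collect one such dict per page in a list and pass that list to `pd.DataFrame`.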