How to get each value from <li> and <span> tags using Python

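I am trying to scrape some data from https://www.cellartracker.com/m/wines/12344. I can't figure out how to get the values that sit inside a tag but don't belong to any class. Here is the part of the site's HTML I'm interested in: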


<ul class="twin-set-list">

        <li><span>Vintage</span> 2000</li>

        <li><span>Type</span> Red</li>

        <li><span>Producer</span> Balnaves of Coonawarra</li>

        <li><span>Varietal</span> Cabernet Sauvignon</li>

        <li><span>Designation</span> The Tally Reserve</li>

        <li><span>Vineyard</span> n/a</li>

        <li><span>Country</span> Australia</li>

        <li><span>Region</span> South Australia</li>

        <li><span>SubRegion</span> Limestone Coast</li>

        <li><span>Appellation</span> Coonawarra</li>

    </ul>

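Values such as 2000 and Red don't have a class of their own, so how can I get at them? I tried the following code in Python (only the relevant part of the HTML is included below):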


from bs4 import BeautifulSoup


html = """<ul class="twin-set-list">

            <li><span>Vintage</span> 2000</li>

            <li><span>Type</span> Red</li>

            <li><span>Producer</span> Balnaves of Coonawarra</li>

            <li><span>Varietal</span> Cabernet Sauvignon</li>

            <li><span>Designation</span> The Tally Reserve</li>

            <li><span>Vineyard</span> n/a</li>

            <li><span>Country</span> Australia</li>

            <li><span>Region</span> South Australia</li>

            <li><span>SubRegion</span> Limestone Coast</li>

            <li><span>Appellation</span> Coonawarra</li>

        </ul>"""


soup = BeautifulSoup(html, 'html.parser')


need = {}


for li_tag in soup.find_all('ul', {'class':'twin-set-list'}):

    for span_tag in li_tag.find_all('li'):

        field = span_tag.find('span').text

        value = span_tag.find('span').text

        need[field] = value


print(need)

Can anyone suggest how I can extract this data?


Helenr
946 views
3 Answers

狐的传说

You can replace the two assignments in your loop with:

field = span_tag.find('span').text
value = span_tag.text.replace(field, '')

It's not very clean, but it works with your code.
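For context, here is a minimal sketch of the whole loop with that change applied, using a shortened copy of the HTML from the question. The loop variables are renamed here (`ul_tag`, `li_tag`) purely for readability, and the `count=1` argument and `.strip()` call are additions that the answer itself does not include: without `.strip()` each value keeps its leading space.

```python
from bs4 import BeautifulSoup

# Shortened copy of the list from the question.
html = """<ul class="twin-set-list">
    <li><span>Vintage</span> 2000</li>
    <li><span>Type</span> Red</li>
    <li><span>Producer</span> Balnaves of Coonawarra</li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')
need = {}

for ul_tag in soup.find_all('ul', {'class': 'twin-set-list'}):
    for li_tag in ul_tag.find_all('li'):
        # The label is the text of the <span>.
        field = li_tag.find('span').text
        # The <li> text is label + value; drop the first occurrence of the
        # label and trim the whitespace that separated the two.
        value = li_tag.text.replace(field, '', 1).strip()
        need[field] = value

print(need)
# {'Vintage': '2000', 'Type': 'Red', 'Producer': 'Balnaves of Coonawarra'}
```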

慕桂英4014372

You can iterate over the contents of each bs4 tag object:

from bs4 import BeautifulSoup as soup
d = [[getattr(c, 'text', c).strip() for c in i] for i in soup(html, 'html.parser').find_all('li')]

Output:

[['Vintage', '2000'], ['Type', 'Red'], ['Producer', 'Balnaves of Coonawarra'],
 ['Varietal', 'Cabernet Sauvignon'], ['Designation', 'The Tally Reserve'],
 ['Vineyard', 'n/a'], ['Country', 'Australia'], ['Region', 'South Australia'],
 ['SubRegion', 'Limestone Coast'], ['Appellation', 'Coonawarra']]
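If a dictionary like the `need` dict in the question is what you are after, the list of pairs can be passed straight to `dict()`. The sketch below assumes every `<li>` contains exactly two child nodes (the `<span>` and the text after it), as in the question's HTML; the names `pairs` and `details` are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the list from the question.
html = """<ul class="twin-set-list">
    <li><span>Vintage</span> 2000</li>
    <li><span>Country</span> Australia</li>
    <li><span>Region</span> South Australia</li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')

# Iterating over a tag yields its direct children: here the <span> tag
# followed by the bare text node. getattr falls back to the node itself
# when it has no .text attribute.
pairs = [[getattr(child, 'text', child).strip() for child in li]
         for li in soup.find_all('li')]

# Each inner list is [label, value], so dict() can consume it directly.
details = dict(pairs)
print(details)
# {'Vintage': '2000', 'Country': 'Australia', 'Region': 'South Australia'}
```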

一只甜甜圈

Maybe you can try this:

for li_tag in soup.find_all('ul', {'class':'twin-set-list'}):
    for span_tag in li_tag.find_all('li'):
        field = span_tag.find('span').text
        value = span_tag.text
        value = value[len(field)+1:]
        need[field] = value

Just in case the field name also appears inside the value, don't use replace; take a substring instead.
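To see why the substring approach is safer, consider a made-up list item whose label also occurs in its value. The HTML below is hypothetical and not taken from the site; it only illustrates the difference between the two approaches.

```python
from bs4 import BeautifulSoup

# Hypothetical item: the label "Red" is repeated inside the value.
html = '<ul class="twin-set-list"><li><span>Red</span> Red Blend</li></ul>'

li_tag = BeautifulSoup(html, 'html.parser').find('li')
field = li_tag.find('span').text   # 'Red'
text = li_tag.text                 # 'Red Red Blend'

# replace() removes every occurrence of the label, damaging the value.
by_replace = text.replace(field, '').strip()

# Slicing skips only the leading label plus the single separating space.
by_slice = text[len(field) + 1:]

print(by_replace)  # Blend
print(by_slice)    # Red Blend
```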