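I'm trying to scrape and store some items using BeautifulSoup and pandas. The code below only partially works. As you can see, it scrapes "Engine426/425 HP", whereas I only want the string "426/425 HP" stored in the 'engine' column. I'd like to scrape the strings for all 4 h5 entries in the HTML below (see the desired output further down). I hope someone can help, thanks!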
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
main_url = "https://www.example.com/"
def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return soup
soup = getAndParseURL(main_url)
engine = []
engine.append(soup.find("ul", class_ = re.compile('list-inline lot-breakdown-list')).li.text)
scraped_data = pd.DataFrame({'engine': engine})
scraped_data.head()
             engine
0  Engine426/425 HP
HTML
<div class="lot-breakdown">
  <ul class="list-inline lot-breakdown-list">
    <li>
      <h5>Engine</h5>426/425 HP</li>
    <li>
      <h5>Trans</h5>Automatic</li>
    <li>
      <h5>Color</h5>Alpine White</li>
    <li>
      <h5>Interior</h5>Black</li>
  </ul>
</div>
Desired output
scraped_data[['engine', 'trans', 'color', 'interior']] = pd.DataFrame([['426/425 HP', 'Automatic', 'Alpine White', 'Black']], index=scraped_data.index)
scraped_data
       engine      trans         color interior
0  426/425 HP  Automatic  Alpine White    Black
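
One possible way to get there, offered as a sketch rather than a definitive answer: loop over each `<li>`, use the `<h5>` text as the column name, and take only the text nodes that are direct children of the `<li>` as the value, which leaves out the `<h5>` label. The HTML is inlined here as a string so the example runs without a network request, and the names `html` and `row` are placeholders of mine, not from the original code. It assumes every `<li>` contains exactly one `<h5>` followed by plain text, as in the fragment above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# The HTML fragment from the question, inlined in place of requests.get(main_url).
html = """
<div class="lot-breakdown">
<ul class="list-inline lot-breakdown-list">
<li>
<h5>Engine</h5>426/425 HP</li>
<li>
<h5>Trans</h5>Automatic</li>
<li>
<h5>Color</h5>Alpine White</li>
<li>
<h5>Interior</h5>Black</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
breakdown = soup.find("ul", class_="lot-breakdown-list")

row = {}
for li in breakdown.find_all("li"):
    # Column name: the <h5> label, lower-cased ("Engine" -> "engine").
    label = li.h5.get_text(strip=True).lower()
    # Value: only the text nodes directly inside <li>, so the <h5> text is excluded.
    value = "".join(li.find_all(string=True, recursive=False)).strip()
    row[label] = value

# One dict becomes one DataFrame row; the keys become the columns.
scraped_data = pd.DataFrame([row])
print(scraped_data)
```

When scraping several pages, the same loop could build one `row` dict per page and append each to a list, with `pd.DataFrame(list_of_rows)` called once at the end.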
蝴蝶刀刀