Extracting text from HTML with BeautifulSoup, excluding the contents of script tags

I have HTML like this:


<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>

I am trying to extract "Ages 15" using BeautifulSoup, so I wrote the following Python code.

Code:


from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'  # placeholder for the page URL

http = urllib3.PoolManager()
page = http.request('GET', URL)

soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})
print(age.text)

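Output:

Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);

I only want the text "Ages 15", not the function call inside the script tag. Is there a way to get only the text "Ages 15"? Or is there a way to exclude the contents of script tags?

PS: There are many script tags, and the URLs differ from page to page. I would rather not remove the unwanted text from the output with string replacement.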


呼啦一阵风
343 views, 2 answers

幕布斯7119047

Use .find(text=True).

Example:

from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())

Output:

Ages 15
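
Note that .find(text=True) returns only the first text node it encounters. If the wanted text might be split across several text nodes of the outer span, a variation is to collect every text node that is a direct child of that span. The following is a minimal sketch under that assumption; direct_text is a hypothetical helper name, not part of BeautifulSoup, and string= is the newer spelling of the text= argument.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""


def direct_text(tag):
    """Join the non-empty text nodes that are direct children of `tag`.

    Hypothetical helper for illustration. recursive=False restricts the
    search to direct children, so text inside nested tags (including
    <script>) is never visited.
    """
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(HTML, "html.parser")
age = soup.find("span", {"class": "age"})
print(direct_text(age))
```

This also skips text inside any other nested tag, not just script, so it only fits when the wanted text sits directly inside the outer span.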

临摹微笑

A late answer, but for future reference: you can also use decompose() to remove all script elements from the HTML, i.e.:

soup = BeautifulSoup(html, "html.parser")

# remove script and style elements
for script in soup(["script", "style"]):
    script.decompose()

print(soup.find("span", {"class": "age"}).text.strip())
# Ages 15
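
Keep in mind that decompose() modifies the parsed tree in place. If the script elements are still needed later, one possible alternative is to leave the tree alone and skip text nodes whose parent is a script or style tag. This is only a sketch; visible_text and SKIPPED_PARENTS are hypothetical names introduced for the example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
     </span>
     <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
     </script>
</span>"""

# Tags whose text content should be ignored (assumed set for this example).
SKIPPED_PARENTS = {"script", "style"}


def visible_text(tag):
    """Collect text under `tag`, ignoring text whose parent is script/style.

    Hypothetical helper for illustration; the tree itself is not modified.
    """
    pieces = []
    for node in tag.find_all(string=True):
        if node.parent.name in SKIPPED_PARENTS:
            continue
        text = node.strip()
        if text:
            pieces.append(text)
    return " ".join(pieces)


soup = BeautifulSoup(HTML, "html.parser")
age = soup.find("span", {"class": "age"})
print(visible_text(age))
```

Unlike the direct-children variant, this keeps text from nested tags such as the inner loc span, which is empty in this sample.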