Extracting text from HTML with BeautifulSoup, excluding the contents of script tags



I have HTML like this:

<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
    </span>
    <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
    </script>
</span>

I am trying to extract Ages 15 using BeautifulSoup, so I wrote the following Python code.

Code:

from bs4 import BeautifulSoup as bs
import urllib3

URL = 'html file'  # placeholder for the page URL

http = urllib3.PoolManager()
page = http.request('GET', URL)

soup = bs(page.data, 'html.parser')
age = soup.find("span", {"class": "age"})
print(age.text)

Output:

Ages 15 getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);

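I only want Ages 15, not the function call inside the script tag. Is there a way to get only the text Ages 15? Or is there a way to exclude the contents of script tags?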

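PS: There are many script tags and the URLs differ, so I would rather not remove the script text from the output by string replacement.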

呼啦一阵风


2 answers

幕布斯7119047

Use .find(text=True).

Example:

from bs4 import BeautifulSoup

html = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
    </span>
    <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
    </script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())

Output:

Ages 15
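Note that .find(text=True) returns only the first text node under the tag, which is enough here because Ages 15 comes first. If the wanted text could be spread over several nodes, one possible variation is to collect every text node and skip those that sit inside a script or style tag. The sketch below is only an illustration under that assumption: the helper name text_without_scripts and the SKIPPED_TAGS list are made up for this example, and it uses string=True, the name that newer Beautiful Soup releases (4.4.0 and later) use for the text argument.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
HTML = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
    </span>
    <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
    </script>
</span>"""

# Hypothetical list of tags whose text should be ignored.
SKIPPED_TAGS = ["script", "style"]


def text_without_scripts(tag):
    """Join every non-empty text node under `tag` that is not inside a skipped tag."""
    parts = []
    for node in tag.find_all(string=True):
        # find_parent() walks up the tree; None means no <script>/<style> ancestor.
        if node.find_parent(SKIPPED_TAGS) is not None:
            continue
        stripped = node.strip()
        if stripped:
            parts.append(stripped)
    return " ".join(parts)


soup = BeautifulSoup(HTML, "html.parser")
age = soup.find("span", {"class": "age"})
result = text_without_scripts(age)
print(result)  # Ages 15
```

This leaves the parsed tree untouched, which may matter if the script tags are still needed later in the same program.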


临摹微笑

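A late answer, but for future reference: you can also use decompose() to remove all script elements from the HTML, i.e.:

from bs4 import BeautifulSoup

# html is the markup string from the answer above
soup = BeautifulSoup(html, "html.parser")

# remove script and style elements
for script in soup(["script", "style"]):
    script.decompose()

print(soup.find("span", {"class": "age"}).text.strip())  # Ages 15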
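Since the question mentions many pages with different URLs, the decompose() approach could be wrapped in a small function and applied to each downloaded page. The following is a rough sketch, not a definitive implementation: the function name extract_age and the SAMPLE string are invented for illustration, and the function takes markup directly (for example page.data from urllib3) so that it can be tried without a network connection.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
SAMPLE = """<span class="age">
    Ages 15
    <span class="loc" id="loc_loads1">
    </span>
    <script>
        getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
    </script>
</span>"""


def extract_age(markup, parser="html.parser"):
    """Return the text of the first span.age without script/style content, or None."""
    soup = BeautifulSoup(markup, parser)
    age = soup.find("span", {"class": "age"})
    if age is None:
        # The page has no matching span; let the caller decide what to do.
        return None
    # Remove script and style elements, but only inside the matched span.
    for tag in age.find_all(["script", "style"]):
        tag.decompose()
    # Join the remaining text nodes with single spaces, dropping empty ones.
    return age.get_text(" ", strip=True)


print(extract_age(SAMPLE))  # Ages 15
```

Returning None when the span is missing avoids the AttributeError that calling .text on a failed find() would raise. Depending on the Beautiful Soup version and parser, .text may already leave out script contents (the documentation describes this for 4.9.0 and later with html.parser or lxml), so removing the tags explicitly keeps the result the same either way.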

