BeautifulSoup: keep some text, but remove the rest of the tag

I am working on a bot that scrapes data from a forum. This is the input I have to work with:

<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>

From this, I want to get:

This is a test post with a few emotes :grin: :heart:

How can I do this? I would also like it to work when the emotes appear in the middle of the text.


喵喔喔
1 Answer

青春有我

from bs4 import BeautifulSoup, CData

txt = '''<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>'''

# load main soup:
soup = BeautifulSoup(txt, 'html.parser')

# find the CDATA node inside <description> and parse its contents as a new soup
soup2 = BeautifulSoup(soup.find('description').find(string=lambda t: isinstance(t, CData)), 'html.parser')

# replace each <img> with the text of its alt=... attribute
for img in soup2.select('img'):
    img.replace_with(img['alt'])

# print text
print(soup2.p.text)

Prints:

This is a test post with a few emotes :grin: :heart:
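
The question also asks about emotes that sit in the middle of the text. The sketch below is not the BeautifulSoup approach from the answer. It swaps in Python's standard library (`xml.etree.ElementTree` and `html.parser`) to illustrate the same idea of substituting each `<img>` with its `alt` text wherever it occurs. The names `AltTextExtractor` and `description_to_text`, and the `example.com` URLs in the sample, are hypothetical and exist only for illustration. It also assumes that collapsing runs of whitespace into single spaces is acceptable.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect text nodes, substituting each <img> with its alt attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # An <img> contributes its alt text (or nothing if alt is missing).
        if tag == "img":
            self.parts.append(dict(attrs).get("alt") or "")

    def handle_data(self, data):
        # Plain text between tags is kept in document order.
        self.parts.append(data)


def description_to_text(xml_fragment):
    """Return the text of a <description> element with emoji images as alt text."""
    # The XML parser unwraps the CDATA section and returns the inner HTML string.
    inner_html = ET.fromstring(xml_fragment).text or ""
    parser = AltTextExtractor()
    parser.feed(inner_html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace (assumption: exact spacing does not matter).
    return " ".join("".join(parser.parts).split())


# Hypothetical sample with emotes in the middle of the sentence.
sample = (
    '<description><![CDATA[ <p>Hello '
    '<img src="https://example.com/grin.png" class="emoji" alt=":grin:"> world '
    '<img src="https://example.com/heart.png" class="emoji" alt=":heart:"> again</p> '
    ']]></description>'
)
print(description_to_text(sample))
```

Because the text and the `alt` values are collected in document order, the position of an emote within the sentence does not matter.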