Fastest way to remove \n, \, \t, \xa0, â\x80\x93 characters from text in Python

I am using BeautifulSoup to convert HTML data: I collect the text of every "p" tag and join it into a single string. This is how I do it:

from bs4 import BeautifulSoup

source = BeautifulSoup(response.text, "html.parser")
content = ""
for section in source.findAll('p'):
    content += section.get_text()

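However, after the conversion the characters mentioned above are scattered throughout the string. I have tried several ways of removing all of them from the string, for example: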

unicodedata.normalize('NFKC', text)

content = u" ".join(content.split())

text.strip(), text.rstrip()

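Is there a library that can remove these characters from a string? Some of these methods fixed part of the problem, but most of the characters are still there.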

Edit: here is a sample string: https://pastebin.com/2DGECKXa

哈士奇WWW


2 Answers

摇曳的蔷薇

You can write a function that does this with the .replace method.

unwanted_chars = ['\n', '\t', '\r', '\xa0', 'â\x80\x93']  # Edit this to include all characters you want to remove

def clean_up_text(text, unwanted_chars=unwanted_chars):
    # Remove every occurrence of each unwanted character or sequence
    for char in unwanted_chars:
        text = text.replace(char, '')
    return text

Then you can apply the clean_up_text function to remove all the unwanted characters.

new_text = clean_up_text(old_text)
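As a possible alternative to the replace loop, here is a minimal sketch that handles the whitespace-like characters in one pass. It swaps in a different technique (Unicode normalization plus a regular expression), and it replaces each run of whitespace with a single space instead of deleting it, so adjacent words are not glued together. The name normalize_whitespace and the sample string are made up for illustration. It does not deal with the 'â\x80\x93' sequence.

```python
import re
import unicodedata


def normalize_whitespace(text):
    """Collapse newlines, tabs, carriage returns and no-break spaces into single spaces."""
    # NFKC maps compatibility characters such as \xa0 (no-break space) to a plain space.
    text = unicodedata.normalize("NFKC", text)
    # \s matches \n, \t, \r and other whitespace; each run becomes one space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample, not taken from the pastebin link in the question.
sample = "First\xa0paragraph.\n\n\tSecond\r\nparagraph.  "
print(normalize_whitespace(sample))
```

If you really want the characters deleted rather than turned into spaces, the replace-based function is the closer fit.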

森栏

See if this works:

from simplified_scrapy.simplified_doc import SimplifiedDoc

doc = SimplifiedDoc(response.text)
content = ""
for section in doc.ps:
    content += section.text
    # content += section.unescape()
print(content)
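One more thing that may be worth checking: the 'â\x80\x93' sequence in the question looks like an en dash (U+2013) whose UTF-8 bytes were decoded as Latin-1, which suggests an encoding mismatch and not stray characters. The sketch below only demonstrates that effect with the standard library; the sample text is invented, and whether this is what happens on your page is an assumption.

```python
# The en dash U+2013 is three bytes in UTF-8: e2 80 93.
raw = "2010 \u2013 2020".encode("utf-8")

# Decoding those bytes with the wrong codec produces the debris from the question.
wrong = raw.decode("latin-1")
print(repr(wrong))

# Decoding with the right codec gives the intended text.
right = raw.decode("utf-8")

# Text that was already mis-decoded can often be repaired by reversing the mistake.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired == right)
```

If response comes from the requests library (the question does not say), passing response.content (the raw bytes) to BeautifulSoup, or setting response.encoding = response.apparent_encoding before reading response.text, may avoid the problem at the source.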
