Accented characters come out wrong when using Beautiful Soup in Python on a local HTML file


I am quite familiar with Beautiful Soup in Python and have always used it to scrape live websites.

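Now I am scraping a local HTML file (link, in case you want to test the code), and the only problem is that accented characters are not rendered correctly (this never happened to me when scraping live websites).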

Here is a simplified version of the code:

import requests, urllib.request, time, unicodedata, csv

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('AH.html'), "html.parser")

tables = soup.find_all('table')

titles = tables[0].find_all('tr')

print(titles[55].text)

It prints the following output:

2:22 - Il Destino Ãˆ GiÃ Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]

whereas the correct output should be:

2:22 - Il Destino È Già Scritto (2017 ITA/ENG) [1080p] [BLUWORLD]
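The garbled output above has the typical shape of UTF-8 bytes that were decoded with a single-byte Windows codec. The sketch below reproduces it in memory, assuming the wrong codec is cp1252 (an assumption; the question does not state which codec was actually used). The variable names are illustrative only.

```python
# The title as it should appear.
correct = "2:22 - Il Destino È Già Scritto"

# Encode to UTF-8 bytes, then decode those bytes with the wrong codec.
# cp1252 is assumed here; it maps each byte to exactly one character.
garbled = correct.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the two steps recovers the original text, which shows that
# no information was lost, only misinterpreted.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == correct)
```

Each accented letter occupies two bytes in UTF-8, so under a one-byte-per-character codec it shows up as two characters, which is why "È" turns into a pair beginning with "Ã".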

I searched for a solution, read many questions and answers, and found this answer, which I implemented as follows:

import requests, urllib.request, time, unicodedata, csv

from bs4 import BeautifulSoup

import codecs

response = open('AH.html')

content = response.read()

html = codecs.decode(content, 'utf-8')

soup = BeautifulSoup(html, "html.parser")

However, running it produces the following error:

Traceback (most recent call last):

File "C:\Users\user\AppData\Local\Programs\Python\Python37-32\lib\encodings\utf_8.py", line 16, in decode

return codecs.utf_8_decode(input, errors, True)

TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

File "C:\Users\user\Desktop\score.py", line 8, in <module>

html = codecs.decode(content, 'utf-8')

TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'str')

I suppose this is easy to fix, but how?

慕桂英4014372


2 Answers

慕姐8265434

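Calling open('AH.html') decodes the file using the default encoding, which may not be the file's actual encoding. BeautifulSoup understands HTML headers; in particular, the following tag indicates that the file is UTF-8 encoded:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Open the file in binary mode and let BeautifulSoup work it out:

with open("AH.html","rb") as f:
    soup = BeautifulSoup(f, 'html.parser')

Sometimes a website declares its encoding incorrectly. In that case, if you know what the encoding should be, you can specify it yourself:

with open("AH.html",encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')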
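To make the difference between the file modes concrete, here is a minimal standard-library sketch that leaves Beautiful Soup out entirely. It writes a small UTF-8 file standing in for AH.html (the file name sample.html and its content are hypothetical) and reads it back three ways. Passing encoding="cp1252" explicitly simulates a Windows default; that default is an assumption, not something confirmed by the question.

```python
import os
import tempfile

# Hypothetical stand-in for one cell of the real AH.html file.
text = "<td>Il Destino È Già Scritto</td>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: raw bytes, nothing is decoded yet.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with the correct encoding stated explicitly.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

    # Text mode with the wrong encoding, simulating a cp1252 default.
    with open(path, encoding="cp1252") as f:
        wrong = f.read()

print(type(raw).__name__)   # bytes: decoding is left to the consumer
print(decoded == text)      # the explicit encoding round-trips
print(wrong == text)        # the wrong codec does not
```

In binary mode the decision about the encoding is deferred to whoever consumes the bytes, which is what allows a parser to look at the charset declaration first.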


梦里花落0921

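from bs4 import BeautifulSoup

with open("AH.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)

I have to say that your first code is actually fine and should work.

As for the second code, you are trying to decode a str, which is wrong, because the decode function is meant for bytes objects.

I believe you are using Windows, where the default encoding is cp1252 rather than UTF-8.

Could you run the following code:

import sys
print(sys.getdefaultencoding())
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)

and check whether your output is UTF-8 or cp1252.

Note that if you are using VSCode with Code-Runner, run your code in the terminal as py code.py.

Solutions (from chat)

(1) If you are on Windows 10:

Open Control Panel and change the view to Small icons
Click Region
Click the Administrative tab
Click Change system locale...
Tick the box "Beta: Use Unicode UTF-8..."
Click OK and restart your PC

(2) If you are not on Windows 10, or simply don't want to change your settings, then in the first code change open("AH.html") to open("AH.html", encoding="UTF-8"), that is, write:

from bs4 import BeautifulSoup

with open("AH.html", encoding="UTF-8") as f:
    soup = BeautifulSoup(f, 'html.parser')
    tb = soup.find("table")
    for item in tb.find_all("tr")[55]:
        print(item.text)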
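As a complement to the checks above, the sketch below prints the encoding that text-mode open() falls back to when none is given, and reproduces the TypeError from the question in isolation. It relies on two points worth verifying for your own setup: sys.getdefaultencoding() reports 'utf-8' on Python 3 regardless of platform, and the Python 3.7 documentation describes the open() default as locale.getpreferredencoding(False). The sample string is illustrative.

```python
import codecs
import locale

# Encoding used by open() in text mode when no encoding is passed.
# On many Windows setups this is expected to be cp1252 (assumption).
print(locale.getpreferredencoding(False))

# Bytes, as they would come from a file opened with "rb".
data = "Già".encode("utf-8")

# Decoding bytes works and yields a str.
as_text = codecs.decode(data, "utf-8")

# Decoding something that is already a str fails, as in the traceback.
try:
    codecs.decode(as_text, "utf-8")
    failed = False
except TypeError:
    failed = True

print(failed)
```

If the first line prints cp1252, passing encoding="UTF-8" to open() as in solution (2) avoids depending on that default.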

