Requests library crashes on Python 2 and Python 3

I am trying to parse arbitrary web pages with the requests and BeautifulSoup libraries, using the following code:

    try:
        response = requests.get(url)
    except Exception as error:
        return False

    if response.encoding is None:
        soup = bs4.BeautifulSoup(response.text)  # This is line 155
    else:
        soup = bs4.BeautifulSoup(response.text, from_encoding=response.encoding)

This works fine on most web pages. However, on a small fraction of arbitrary pages (<1%), I get this crash:

    Traceback (most recent call last):
      File "/home/dotancohen/code/parser.py", line 155, in has_css
        soup = bs4.BeautifulSoup(response.text)
      File "/usr/lib/python3/dist-packages/requests/models.py", line 809, in text
        content = str(self.content, encoding, errors='replace')
    TypeError: str() argument 2 must be str, not None

For reference, this is the relevant method of the requests library:

    @property
    def text(self):
        """Content of the response, in unicode.

        if Response.encoding is None and chardet module is available, encoding
        will be guessed.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            if chardet is not None:
                encoding = chardet.detect(self.content)['encoding']

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')  # This is line 809
        except LookupError:
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content

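As can be seen, I am not passing an encoding when this error is thrown. How am I using the library incorrectly, and what can I do to prevent this error? This is on Python 3.2.3, but I get the same results on Python 2.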
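The failure can probably be reproduced without any network access. The sketch below is not the library's code. It uses a hypothetical helper, decode_like_text_property, that copies only the decoding step of the method quoted above, on the assumption that the encoding ends up as None when neither the headers nor chardet supply one:

```python
def decode_like_text_property(raw_bytes, encoding):
    """Mimic the decoding step of the quoted `text` property.

    Hypothetical helper for illustration only, not part of requests.
    """
    try:
        return str(raw_bytes, encoding, errors="replace")
    except LookupError:
        # Only unknown codec names land here. A None encoding raises
        # TypeError instead, which this handler does not catch.
        return str(raw_bytes, errors="replace")


# An unknown codec name is handled by the LookupError fallback.
fallback_text = decode_like_text_property(b"abc", "no-such-codec")

# A None encoding escapes the handler as a TypeError.
try:
    decode_like_text_property(b"abc", None)
except TypeError as error:
    failure = error
```

On Python 3 the second call raises a TypeError, which matches the last frame of the traceback; the exact wording of the message differs between Python versions.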

临摹微笑

1 Answer

天涯尽头无女友

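This means that the server did not send an encoding for the content in its headers, and the chardet library was also unable to determine an encoding for the content. You are in fact explicitly testing for the lack of an encoding; why try to get decoded text when no encoding is available?

You can try leaving the decoding to the BeautifulSoup parser:

    if response.encoding is None:
        soup = bs4.BeautifulSoup(response.content)

There is also no need to pass the encoding in to BeautifulSoup: if .text does not fail, you are working with Unicode, and BeautifulSoup will ignore the encoding parameter anyway:

    else:
        soup = bs4.BeautifulSoup(response.text)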
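Putting the two branches together, a minimal sketch might look like the following. The helper name markup_for_parser is hypothetical, and the code only assumes that the response object exposes .encoding, .content and .text the way a requests response does:

```python
def markup_for_parser(response):
    """Pick what to hand to the HTML parser (hypothetical helper).

    `response` is assumed to behave like a requests response: it has
    `.encoding`, `.content` (raw bytes) and `.text` (decoded str).
    """
    if response.encoding is None:
        # No declared encoding: return the raw bytes and let the
        # parser work out the encoding itself. `.text` is never touched.
        return response.content
    # An encoding is known, so `.text` is already Unicode.
    return response.text
```

It could then be used as soup = bs4.BeautifulSoup(markup_for_parser(response), "html.parser"), where "html.parser" is just one possible choice of parser. Because .text is never read when the encoding is missing, the failing str() call in the quoted method is not reached on that path.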
