I'm writing a script that goes through a list of links and parses information from each page.
It works for most sites, but on some of them it chokes with: "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)"
It dies inside client.py, in Python 3's urllib stack.
The exact link is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html
There are plenty of similar posts here, but none of the answers seem to work for me.
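For reference, the '\xe9' in the message looks like it is the é in "cafés"; as a quick isolated check (not part of my script), encoding that string with the ASCII codec reproduces the same kind of error:

    # isolated check, assuming the é is what the ASCII codec rejects
    'cafés'.encode('ascii')
    # UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)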
My code is:
from urllib import request
from urllib.error import HTTPError, URLError  # needed for the except clauses below
import socket                                 # needed for socket.timeout

def __request(link, debug=0):
    try:
        html = request.urlopen(link, timeout=35).read()  # made this long as I was getting lots of timeouts
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE the except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print('The server couldn\'t fulfill the request for ' + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        return ''
    else:
        return unicode_html
This is how the request function gets called:

    link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
    page = __request(link)
The traceback is:
Traceback (most recent call last):
File "<string>", line 250, in run_nodebug
File "C:\reader\get_news.py", line 276, in <module>
main()
File "C:\reader\get_news.py", line 255, in main
body = get_article_body(item['link'],debug=0)
File "C:\reader\get_news.py", line 155, in get_article_body
page = __request('na',url)
File "C:\reader\get_news.py", line 50, in __request
html = request.urlopen(link, timeout=35).read()
File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\Lib\urllib\request.py", line 469, in open
response = self._open(req, data)
File "C:\Python33\Lib\urllib\request.py", line 487, in _open
Any help is appreciated, it's driving me crazy; I think I've tried every combination of x.decode and the like.
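A rough sketch of one idea (not something from my script, and I'm not sure it's the right fix): assuming the unencoded é in the URL path is what urlopen trips over, percent-encoding the path with urllib.parse.quote before making the request should keep the URL pure ASCII. The _ascii_safe helper below is only an illustration:

    from urllib.parse import urlsplit, urlunsplit, quote

    def _ascii_safe(url):
        # Hypothetical helper: percent-encode any non-ASCII characters in the
        # path/query (é becomes %C3%A9) so the request urllib builds stays ASCII.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc,
                           quote(parts.path, safe='/%'),
                           quote(parts.query, safe='=&%'),
                           parts.fragment))

    link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
    print(_ascii_safe(link))
    # http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html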