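At least one related question on SO proved helpful when I was trying to decode Unicode escape sequences.
I am preprocessing a large amount of text of different kinds. Some of it is economic, some technical, and so on. One of the caveats is converting Unicode escape sequences: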
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek.
A string like this needs to be converted to the actual characters:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojtĕch Čamek.
This can be done like so:
s = "'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek."
s = s.encode('utf-8').decode('unicode-escape')
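(At least, this works when s gets its input lines from a UTF-8 encoded text file. I can't seem to get it to work on an online service such as REPL.it, where the output is encoded/decoded differently.)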
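A small sketch of the difference between text read from a file and a string literal typed into a REPL. This is my own guess at why the literal version misbehaves, not something I have confirmed for REPL.it. The raw string stands in for a line read from a file, where the backslashes are literal characters; the variable names are only for illustration.

```python
# Raw string: backslashes are literal, like a line read from a text file.
raw = r"says Vojt\u0115ch \u010camek."
decoded = raw.encode('utf-8').decode('unicode-escape')
print(decoded)  # says Vojtĕch Čamek.

# Normal literal: Python has already interpreted the \u escapes,
# so the string contains the real characters before any decoding.
literal = "says Vojt\u0115ch \u010camek."
print(literal == decoded)  # True

# Running the round trip on already-decoded text mangles it, because
# 'unicode-escape' reads the non-ASCII UTF-8 bytes as Latin-1.
mangled = literal.encode('utf-8').decode('unicode-escape')
print(mangled == literal)  # False
```

So the encode/decode round trip only seems safe when the input still contains the escapes as plain ASCII text and no other non-ASCII characters.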
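In most cases this works fine. However, when a directory path appears in the input string (which is often the case for the technical documents in my data set), a UnicodeDecodeError is raised.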
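A minimal reproduction of the failure, assuming that a Windows-style path containing a backslash followed by "u" is the trigger. The fragment is taken from my data; the variable names are only for illustration.

```python
# A backslash followed by "u" looks like the start of a \uXXXX escape,
# but "dfs\" is not four hex digits, so the codec gives up.
path_line = r"d:\udfs\math.dll (op Windows)"

try:
    path_line.encode('utf-8').decode('unicode-escape')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(type(error).__name__)  # UnicodeDecodeError
print(error.reason)
```

The UNIX path /u/slick/udfs/math.a does not cause a problem, since it contains no backslash.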
Given the following data in unicode.txt:
'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\u0115ch \u010camek, Financial Director and Director of Controlling.
Voor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows)).
with this bytestring representation:
b"'Korado's output has gone up from 180,000 radiators per year to almost 1.7 million today,' says Vojt\\u0115ch \\u010camek, Financial Director and Director of Controlling.\r\nVoor alle bestanden kan de naam met de volledige padnaam (bijvoorbeeld: /u/slick/udfs/math.a (op UNIX), d:\\udfs\\math.dll (op Windows))."
the following script fails when decoding the second line of the input file:
with open('unicode.txt', 'r', encoding='utf-8') as fin, open('unicode-out.txt', 'w', encoding='utf-8') as fout:
    lines = ''.join(fin.readlines())
    lines = lines.encode('utf-8').decode('unicode-escape')
    fout.write(lines)
with this traceback:
Traceback (most recent call last):
  File "C:/Python/files/fast_aligning/unicode-encoding.py", line 3, in <module>
    lines = lines.encode('utf-8').decode('unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 275-278: truncated \uXXXX escape

Process finished with exit code 1
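How can I make sure that the first sentence is still "translated" correctly, as shown before, while the second sentence stays unchanged? In other words, for the two lines given, the expected output has the first line converted and the second line left exactly as it is.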
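One direction that might work, sketched here under assumptions rather than as a confirmed fix: only replace sequences that are a backslash, "u", and exactly four hex digits, and leave everything else alone. The helper name decode_u_escapes is hypothetical, and the two test strings are shortened versions of the lines in unicode.txt.

```python
import re

# Matches a backslash, "u", and exactly four hex digits; nothing else.
U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_u_escapes(text):
    """Replace well-formed backslash-u escapes with the characters they name."""
    return U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


line1 = r"says Vojt\u0115ch \u010camek, Financial Director"
line2 = r"/u/slick/udfs/math.a (op UNIX), d:\udfs\math.dll (op Windows))."

print(decode_u_escapes(line1))           # says Vojtĕch Čamek, Financial Director
print(decode_u_escapes(line2) == line2)  # True
```

This is not a complete answer. A path such as d:\udead\file.txt would still be converted by mistake, since "dead" is four valid hex digits. Surrogate pairs written as two escapes would come out as two lone surrogates, and other escapes such as \n or \x41 are not touched at all.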