Different code points for the same character on macOS and Windows


I have a small piece of code in which I check the code point of the character Ü.

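import java.nio.charset.Charset;
import java.util.Locale;

public class CodePointCheck {
    public static void main(String[] args) {
        // Print the platform defaults that influence how text is decoded
        Locale lc = Locale.getDefault();
        System.out.println(lc.toString());
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
        // Same character twice: once as an escape, once as a raw literal
        String inUnicode = "\u00dc";
        String glyph = "Ü";
        System.out.println("inUnicode " + inUnicode + " code point " + inUnicode.codePointAt(0));
        System.out.println("glyph " + glyph + " code point " + glyph.codePointAt(0));
    }
}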

When I run this code on macOS and on Windows 10, I get different code point values; see the output below.

Output on macOS

en_US

UTF-8

inUnicode Ü code point 220

glyph Ü code point 220

Output on Windows

en_US

windows-1252

Cp1252

in unicode Ü code point 220

glyph ?? code point 195

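I checked the windows-1252 code page at https://en.wikipedia.org/wiki/Windows-1252#Character_set, and the code point listed there for Ü is 220. For String glyph = "Ü"; why is the code point 195 on Windows? As I understand it, glyph should have been rendered correctly and its code point should be 220, since the character is defined in Windows-1252.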

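If I replace String glyph = "Ü"; with String glyph = new String("Ü".getBytes(), Charset.forName("UTF-8")); then glyph is rendered correctly and the code point value is 220. Is this a correct and efficient way to standardize String behavior on any OS, regardless of locale and charset?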

汪汪一只猫


1 Answer

狐的传说

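195 is 0xC3 in hexadecimal. In UTF-8, Ü is encoded as the bytes 0xC3 0x9C. System.getProperty("file.encoding") says the default file encoding on Windows is not UTF-8, but your Java file is clearly encoded in UTF-8. The fact that println() outputs glyph ?? (note the 2 ?s, which means 2 chars are present), and that you are able to decode the raw string bytes using the UTF-8 Charset, proves this.

glyph should contain a single char whose value is 0x00DC, not 2 chars. The first of those 2 chars is 0x00C3; the second is whatever Windows-1252 assigns to byte 0x9C, which is œ (0x0153).

On Windows, codePointAt(0) returns 0x00C3 (195) because your Java file is encoded in UTF-8 but is being loaded as if it were encoded in Windows-1252, so the 2 bytes 0xC3 0x9C are decoded as two separate characters instead of the single character 0x00DC.

You need to tell Java the actual file encoding, for example: java -Dfile.encoding=UTF-8 ... Because the string literal is decoded when the source file is compiled, the same setting has to reach the compiler as well, for example: javac -encoding UTF-8 ...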
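The mis-decoding described above can be reproduced without touching any compiler settings. The following is a minimal sketch, not code from the question: the class name MojibakeDemo and the helper methods misdecode and repair are hypothetical, and it assumes the JVM ships the windows-1252 charset (it is not one of the charsets guaranteed by the Java specification, although common JDKs include it). It encodes Ü as UTF-8, decodes those bytes as windows-1252 the way the compiler did, and then reverses the damage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumption: this charset is available in the running JVM.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates reading UTF-8 encoded source bytes as if they were windows-1252.
    static String misdecode(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, CP1252);
    }

    // Reverses the damage: re-encode as windows-1252, then decode as UTF-8.
    static String repair(String garbled) {
        byte[] cp1252Bytes = garbled.getBytes(CP1252);
        return new String(cp1252Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00dc");
        System.out.println(garbled.length());                            // 2
        System.out.println(garbled.codePointAt(0));                      // 195
        System.out.println(Integer.toHexString(garbled.codePointAt(1))); // 153
        System.out.println(repair(garbled).codePointAt(0));              // 220
    }
}
```

The repair step is essentially what new String("Ü".getBytes(), Charset.forName("UTF-8")) does on the Windows machine in the question, where getBytes() happens to use windows-1252. It only works when the platform default charset matches the charset that caused the wrong decoding and every byte survives the round trip, so it is a workaround rather than a portable fix. Writing the literal as "\u00dc" or compiling with the correct source encoding avoids the problem entirely.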
