Invalid UTF-8 encoding when processing an XML file

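I have Java code that processes an XML file to read some values. I get an error: invalid UTF-8 encoding. I tried copying the file's contents into another file in Notepad++, and processing that copy works fine, but if I merely save the file under another name, I get the same error. Sorry, I can't post my XML file here because it is too large, so I will only include the header and trailer. Thanks for any help resolving this error. My Java code for processing the XML file: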


    XPathFactory f = XPathFactory.newInstance();
    XPath x = f.newXPath();

    // Each evaluate() call consumes its stream, so the file is opened twice
    InputSource source = new InputSource(new FileInputStream("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml"));
    InputSource source2 = new InputSource(new FileInputStream("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml"));

    XPathExpression trlr = x.compile("pers/trailer/text()");
    XPathExpression hdr = x.compile("pers/header/CD/text()");

    String s = trlr.evaluate(source);   // trailer value
    String s2 = hdr.evaluate(source2);  // header value

    System.out.println("header :" + s2 + " trailer" + s);

pers is the root tag of the XML file.


The XML file looks like this:


<?xml version = '1.0' encoding = 'UTF-8'?>

<pers>

 <header>555</header>

 .

 .

 .

 .

 <trailer>666</trailer>


</pers>


jeck猫
2 Answers

智慧大石

Write a small Java snippet to find the offending lines. Reading the file as ISO-8859-1 maps every byte to exactly one char, so getBytes recovers each line's raw bytes. A strict UTF-8 decoder then throws on malformed input, whereas new String(b, StandardCharsets.UTF_8) silently substitutes U+FFFD and never throws.

    AtomicInteger lineno = new AtomicInteger();
    Path path = Paths.get("... .xml");
    Files.lines(path, StandardCharsets.ISO_8859_1)
        .forEach(line -> {
            int no = lineno.incrementAndGet();
            byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
            try {
                // A fresh decoder reports malformed input by throwing
                StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b));
            } catch (CharacterCodingException e) {
                System.out.printf("[%d] %s%n%s%n", no, line, e.getMessage());
                //throw new IllegalStateException(e);
            }
        });

One would assume this is a data error. In general, it could also be faulty buffered reading: when a multi-byte sequence is split at a buffer boundary, two invalid half-sequences can result. That is unlikely in standard library code.

To make sure the decoding is not optimized away by the JVM, you could use its result:

    int sowhat = Files.lines(path, StandardCharsets.ISO_8859_1)
        .mapToInt(line -> {
            int no = lineno.incrementAndGet();
            byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
            try {
                return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(b)).length();
            } catch (CharacterCodingException e) {
                System.out.printf("[%d] %s%n%s%n", no, line, e.getMessage());
                throw new IllegalStateException(e); // Must throw or return int
            }
        }).sum();
    System.out.println("Ignore this: " + sowhat);

Illegal XML characters (in version 1.0)? The control-character ranges to look for are listed below. Strictly speaking, XML 1.0 forbids only the ones below #x20 and merely discourages #x7F-#x84 and #x86-#x9F.

    [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

    int sowhat = Files.lines(path, StandardCharsets.ISO_8859_1)
        .mapToInt(line -> {
            int no = lineno.incrementAndGet();
            byte[] b = line.getBytes(StandardCharsets.ISO_8859_1);
            if (!legal(b)) {
                System.out.printf("[%d] %s%n", no, line);
                throw new IllegalStateException("Illegal XML character in line " + no); // Must throw or return int
            }
            return b.length;
        }).sum();

    static boolean legal(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        for (char ch : s.toCharArray()) {
            int x = ch;
            if ((0 <= x && x <= 8)               // ASCII control chars
                    || (0xB <= x && x <= 0xC)
                    || (0xE <= x && x <= 0x1F)
                    || (0x7f <= x && x <= 0x84)  // DEL + Unicode control chars
                    || (0x86 <= x && x <= 0x9F)) {
                return false;
            }
        }
        return true;
    }

If this does not work, I have kept you long enough. Split the file and validate the parts.
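The line-by-line check can also be done on raw bytes, which gives a byte offset instead of a line number. The following is only a sketch: the class name `Utf8Check` and the method `firstInvalidOffset` are made up for illustration, and it assumes the bytes to check are already in memory (for the real file, something like `Files.readAllBytes` could supply them).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8Check {

    /** Returns the byte offset of the first invalid UTF-8 sequence, or -1 if everything decodes. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true, so a truncated trailing sequence is reported as well
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // the malformed bytes start at the current position
            }
            if (result.isUnderflow()) {
                return -1;            // all input consumed without errors
            }
            out.clear();              // overflow: discard the decoded chars and keep going
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xE9 is "é" in ISO-8859-1 but an incomplete lead byte in UTF-8
        byte[] bad = {'a', 'b', (byte) 0xE9, 'c', 'd', 'e'};
        System.out.println(firstInvalidOffset(good));
        System.out.println(firstInvalidOffset(bad));
    }
}
```

With an offset in hand, a hex editor can show the exact bytes, which usually makes it obvious whether the file is really ISO-8859-1 or Windows-1252 rather than UTF-8.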

守着一只汪

I used this code to convert the file to UTF-8:

    File source = new File("C:\\Users\\cc\\eclipse-workspace\\data\\file.xml");
    String srcEncoding = "ISO-8859-1";
    File target = new File("C:\\Users\\cc\\eclipse-workspace\\data\\file2.xml");
    String tgtEncoding = "UTF-8";
    try (
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(source), srcEncoding));
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), tgtEncoding))) {
        char[] buffer = new char[16384];
        int read;
        while ((read = br.read(buffer)) != -1)
            bw.write(buffer, 0, read);
    }
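If the file really is ISO-8859-1, as the conversion assumes, it may be possible to skip the rewrite and hand the parser a Reader instead. When an `InputSource` wraps a character stream, the parser reads characters from it directly and disregards the encoding named in the XML declaration. A minimal sketch under that assumption (the class name `XPathWithCharset` and its `evaluate` helper are invented for this example, and the sample document is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XPathWithCharset {

    /** Evaluates an XPath expression over XML bytes that are decoded with the given charset. */
    static String evaluate(byte[] xml, Charset charset, String expression)
            throws XPathExpressionException {
        // The Reader fixes the charset; the declaration's encoding attribute is not used
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), charset);
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Declared as UTF-8 but actually encoded as ISO-8859-1 (the byte 0xE9 stands for "é")
        byte[] xml = ("<?xml version='1.0' encoding='UTF-8'?>"
                + "<pers><header>caf\u00e9</header><trailer>666</trailer></pers>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(evaluate(xml, StandardCharsets.ISO_8859_1, "pers/trailer/text()"));
    }
}
```

For the real file, a `FileInputStream` would take the place of the `ByteArrayInputStream`. Correcting the file itself, or its declaration, is still the cleaner long-term fix, since other tools will trust the declared encoding.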