Java regex matching on a UTF-8 string (without a copy)

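I am loading large UTF-8 text from a SocketChannel and need to extract some values from it. Pattern matching with java.util.regex works well for this, but decoding to Java's UTF-16 with `CharBuffer cb = UTF_8.decode(buffer);` copies the buffer, which uses double the space.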

Is there a way to create a CharBuffer "view" over the UTF-8 data, or some other way to pattern match against a given charset?


神不在的星期二
1 Answer

吃鸡游戏

You can create a lightweight CharSequence that wraps a ByteBuffer and performs a naive byte-to-char conversion, with no real UTF-8 decoding. As long as your regular expression contains only ASCII characters, it works on the "naively" converted text (anything above U+007F is multi-byte in UTF-8 and would not line up with the one-char-per-byte view). Only the ranges matched by the regex need to be properly decoded from UTF-8. The code below illustrates this approach.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Assert;
import org.junit.Test;

public class RegExSnippet {

    private static final Charset UTF8 = StandardCharsets.UTF_8;

    @Test
    public void testByteBufferRegEx() {
        // UTF-8 byte encoding of the test string
        byte[] bytes = ("lkfmd;wmf;qmfqv amwfqwmf;c "
                + "<tag>This is some non ASCII text 'кирилицеский текст'</tag>"
                + "kjnfdlwncdlka-lksnflanvf ").getBytes(UTF8);

        ByteBuffer bb = ByteBuffer.wrap(bytes);
        ByteSeqWrapper bsw = new ByteSeqWrapper(bb);

        // the pattern should contain only ASCII characters
        Matcher m = Pattern.compile("<tag>(.*)</tag>").matcher(bsw);
        Assert.assertTrue(m.find());
        String body = m.group(1);

        // the extracted part is properly decoded as UTF-8
        Assert.assertEquals("This is some non ASCII text 'кирилицеский текст'", body);
    }

    public static class ByteSeqWrapper implements CharSequence {

        final ByteBuffer buffer;

        public ByteSeqWrapper(ByteBuffer buf) {
            this.buffer = buf;
        }

        @Override
        public int length() {
            return buffer.remaining();
        }

        @Override
        public char charAt(int index) {
            // index is relative to the buffer's position, so that
            // sub-sequences created below read the right bytes
            return (char) (0xFF & buffer.get(buffer.position() + index));
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // duplicate() shares the underlying bytes; only position and limit differ
            ByteBuffer bb = buffer.duplicate();
            bb.position(bb.position() + start);
            bb.limit(bb.position() + (end - start));
            return new ByteSeqWrapper(bb);
        }

        @Override
        public String toString() {
            // a little hack to apply proper decoding to the parts
            // extracted by the matcher; decoding a duplicate keeps
            // this wrapper's own position untouched
            return UTF8.decode(buffer.duplicate()).toString();
        }
    }
}
```
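If the pattern itself has to contain a non-ASCII literal, one possible workaround is to re-express that literal as one char per UTF-8 byte, so that it lines up with the byte-per-char view. The following is only a minimal sketch under that assumption: `Utf8LiteralDemo`, `ByteView`, `utf8Literal` and `extract` are hypothetical names introduced for illustration, and the sample input is made up. It uses `[^;]*` rather than `.`, because a raw byte 0x85 would appear as the char U+0085, which `.` treats as a line terminator by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Utf8LiteralDemo {

    /** Hypothetical minimal byte-per-char view, same idea as ByteSeqWrapper. */
    static final class ByteView implements CharSequence {
        private final ByteBuffer buf;

        ByteView(ByteBuffer buf) {
            this.buf = buf;
        }

        @Override
        public int length() {
            return buf.remaining();
        }

        @Override
        public char charAt(int i) {
            // one char per raw byte, relative to the buffer's position
            return (char) (0xFF & buf.get(buf.position() + i));
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // shares the underlying bytes; only position and limit differ
            ByteBuffer d = buf.duplicate();
            d.limit(buf.position() + end);
            d.position(buf.position() + start);
            return new ByteView(d);
        }

        @Override
        public String toString() {
            // real UTF-8 decoding happens only here, on the matched slice
            return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
        }
    }

    /** Turns a literal into a quoted pattern fragment with one char per UTF-8 byte. */
    static String utf8Literal(String literal) {
        byte[] utf8 = literal.getBytes(StandardCharsets.UTF_8);
        // ISO-8859-1 maps every byte 0x00..0xFF to the char with the same value
        return Pattern.quote(new String(utf8, StandardCharsets.ISO_8859_1));
    }

    /** Returns the decoded value that follows the Cyrillic key, or null if absent. */
    static String extract(ByteBuffer utf8Text) {
        // Cyrillic key written with Unicode escapes, so the source encoding does not matter
        String key = "\u0438\u043C\u044F=";
        Pattern p = Pattern.compile(utf8Literal(key) + "([^;]*);");
        Matcher m = p.matcher(new ByteView(utf8Text));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String sample = "id=7;\u0438\u043C\u044F=\u041C\u0430\u0440\u0438\u044F;x=1;";
        ByteBuffer bb = ByteBuffer.wrap(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(bb));
    }
}
```

Here the key is "имя=" and `extract` returns the decoded value "Мария" for the sample input. The rest of the pattern still has to stay ASCII: character classes and quantifiers operate on single bytes in this view, not on whole code points.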