Creating a tokenizer with the split method

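Since your requirements don't say "max 20" anywhere, change [a-z]{3,20} to [a-z]{3,} to allow unlimited length.

A regex cannot lowercase the tokens, so that needs a separate call to toLowerCase(). If you make that call before applying the regex, your regex will work as is. If you intend to call toLowerCase() on each token after matching, you need to change [a-z] to [a-zA-Z]. The simplest option is to lowercase first.

This means your code should be modified as follows:

    Pattern pattern = Pattern.compile("[a-z]{3,}");
    Matcher matcher = pattern.matcher(EXAMPLE_TEST.toLowerCase());

Output

    Start index: 0 End index: 4 this
    Start index: 5 End index: 12 mariana
    Start index: 13 End index: 17 john
    Start index: 18 End index: 21 bar
    Start index: 22 End index: 26 barr
    Start index: 33 End index: 38 fffff
    Start index: 39 End index: 43 aaaa

To do the same thing with split, you need to split on any sequence of characters made up of non-alphabetic characters or runs of at most 2 consecutive alphabetic characters.

    String[] split = EXAMPLE_TEST.toLowerCase().split("(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+");
    System.out.println(Arrays.toString(split));

Output

    [this, mariana, john, bar, barr, fffff, aaaa]

Explanation:

    (?:                 Start non-capturing repeating group:
      [^a-z]+             Match one or more non-alphabetic characters
     |                    Or
      (?<![a-z])          Not preceded by an alphabetic character
      [a-z]{1,2}          Match 1-2 alphabetic characters
      (?![a-z])           Not followed by an alphabetic character
    )+                  Match one or more of the above

Note: the + after [^a-z] could be removed, because the + at the end does the repeating anyway, but the regex should perform better with the + present.

The difference between the original code and the split code is that split returns an empty string as the first result if the input starts with a delimiter, that is, with a non-alphabetic character or with a word shorter than three letters.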
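To compare the two approaches side by side, here is a minimal sketch, not a definitive implementation. The class name SplitTokenizerSketch, the method names, and the sample input are hypothetical (the original EXAMPLE_TEST string is not shown in the question); the two regexes are the ones from the answer. The split variant filters out empty strings so that a leading delimiter does not produce an empty first token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitTokenizerSketch {
    // Tokens: runs of three or more lowercase ASCII letters.
    private static final Pattern TOKEN = Pattern.compile("[a-z]{3,}");

    // Delimiters: non-letters, or isolated runs of one or two letters.
    private static final Pattern DELIMITER =
            Pattern.compile("(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+");

    // Matcher-based version: collect every match of the token pattern.
    static List<String> tokenizeWithMatcher(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input.toLowerCase());
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Split-based version: cut on delimiters, then drop empty pieces
    // (split yields a leading "" when the input starts with a delimiter).
    static List<String> tokenizeWithSplit(String input) {
        List<String> tokens = new ArrayList<>();
        for (String part : DELIMITER.split(input.toLowerCase())) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, chosen to start with a delimiter.
        String sample = "  This is Mariana, John-bar 42 ab fffff";
        System.out.println(tokenizeWithMatcher(sample));
        System.out.println(tokenizeWithSplit(sample));
    }
}
```

With this sample, both methods print [this, mariana, john, bar, fffff]. Without the isEmpty() filter, the split version would return an extra empty string in first position because the input begins with spaces.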
