Creating a tokenizer with the split method

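I am trying to create a simple tokenizer that splits on whitespace, lowercases the tokens, removes all non-alphabetic characters, and keeps only terms of 3 or more characters. I wrote the code below, which handles the lowercase, non-alphabetic and 3-or-more-characters requirements. However, I would like to use the split method and I don't know how. Any suggestions are welcome.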


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class main {

    public static final String EXAMPLE_TEST = "This Mariana John bar Barr "
        + "12364 FFFFF aaaa a s d f g.";

    public static void main(String[] args) {
        // Matches a whitespace character followed by 3 to 20 lowercase letters.
        Pattern pattern = Pattern.compile("(\\s[a-z]{3,20})");
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);

        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
    }
}


MYYA
2 answers

慕田峪7331174

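If you don't need to keep track of the indices:

    // needs java.util.Arrays, java.util.List, java.util.stream.Collectors
    List<String> processed = Arrays.stream(EXAMPLE_TEST.split(" "))
            .map(String::toLowerCase)
            .map(s -> s.replaceAll("[^a-z]", ""))
            .filter(s -> s.length() >= 3)
            .collect(Collectors.toList());

    for (String s : processed) {
        System.out.println(s);
    }

However, your example output also shows the indices. In that case you have to store them in another container (for example a Map):

    // additionally needs java.util.Map
    Map<Integer, String> processed = Arrays.stream(EXAMPLE_TEST.split(" "))
            .collect(Collectors.toMap(
                    s -> EXAMPLE_TEST.indexOf(s),
                    s -> s.toLowerCase().replaceAll("[^a-z]", "")));

    Map<Integer, String> filtered = processed.entrySet().stream()
            .filter(entry -> entry.getValue().length() >= 3)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

    for (Map.Entry<Integer, String> entry : filtered.entrySet()) {
        System.out.println("Start index: " + entry.getKey() + " " + entry.getValue());
    }

Note that indexOf returns the first occurrence of a token, so a token that appears more than once (or inside an earlier word) gets the wrong index, and two tokens that map to the same index make toMap throw an IllegalStateException. The map returned by toMap also makes no guarantee about iteration order.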
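If the start indices matter, one way to avoid indexOf is to keep a running offset while walking over the split result. The following is only a minimal sketch under that assumption: the class name OffsetTokenizer and the helper type Token are made up for illustration, and the reported index is where the raw token starts in the input, before non-letters are stripped.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetTokenizer {

    /** Hypothetical helper: a cleaned token plus the offset of its raw form. */
    static final class Token {
        final int start;
        final String text;

        Token(int start, String text) {
            this.start = start;
            this.text = text;
        }

        @Override
        public String toString() {
            return "Start index: " + start + " " + text;
        }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int offset = 0; // position of the current raw token in the input
        for (String raw : input.split(" ")) {
            String cleaned = raw.toLowerCase().replaceAll("[^a-z]", "");
            if (cleaned.length() >= 3) {
                tokens.add(new Token(offset, cleaned));
            }
            // Advance past the raw token and the single space that followed it.
            offset += raw.length() + 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String example = "This Mariana John bar Barr 12364 FFFFF aaaa a s d f g.";
        for (Token t : tokenize(example)) {
            System.out.println(t);
        }
    }
}
```

Because the offset is advanced by the length of each raw piece plus one, repeated tokens get their own index and the result keeps the input order. This relies on splitting on a single space character; with a different delimiter pattern the bookkeeping would have to change.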

四季花海

Since your requirements don't say "max 20" anywhere, change [a-z]{3,20} to [a-z]{3,} to allow unlimited length.

A regex cannot lowercase the tokens, so you need to call toLowerCase() separately. If you do that before applying the regex, your regex will work as is. If you plan to call toLowerCase() on each token after applying the regex, you need to change [a-z] to [a-zA-Z]. Doing it first is the simplest.

The above means your code should be modified as follows:

    Pattern pattern = Pattern.compile("[a-z]{3,}");
    Matcher matcher = pattern.matcher(EXAMPLE_TEST.toLowerCase());

Output

    Start index: 0 End index: 4 this
    Start index: 5 End index: 12 mariana
    Start index: 13 End index: 17 john
    Start index: 18 End index: 21 bar
    Start index: 22 End index: 26 barr
    Start index: 33 End index: 38 fffff
    Start index: 39 End index: 43 aaaa

To do the same thing with split, you need to split on any sequence of characters that consists of non-alphabetic characters or of at most 2 consecutive alphabetic characters.

    String[] split = EXAMPLE_TEST.toLowerCase().split("(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+");
    System.out.println(Arrays.toString(split));

Output

    [this, mariana, john, bar, barr, fffff, aaaa]

Explanation:

    (?:              Start non-capturing repeating group:
      [^a-z]+          Match one or more non-alphabetic characters
     |                 Or
      (?<![a-z])       Not preceded by an alphabetic character
      [a-z]{1,2}       Match 1-2 alphabetic characters
      (?![a-z])        Not followed by an alphabetic character
    )+               Match one or more of the above

Note: the + after [^a-z] can be removed, since the + at the end will do the repeating anyway, but the regex should perform better with the + present.

One difference between the original code and the split code is that split returns an empty string as the first result if the input starts with a delimiter, that is, with a non-alphabetic character or with a word of fewer than 3 letters.
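If the empty leading string is unwanted, it can be filtered out after splitting. The sketch below reuses the delimiter pattern from this answer; the class name SplitTokenizer and the method name tokenize are assumptions made for illustration, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitTokenizer {

    // Same delimiter pattern as in the answer: runs of non-letters
    // and of stand-alone words of one or two letters.
    private static final String DELIMITER =
            "(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+";

    static List<String> tokenize(String input) {
        return Arrays.stream(input.toLowerCase().split(DELIMITER))
                .filter(s -> !s.isEmpty()) // drop the empty leading element, if any
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // This input starts with a delimiter ("12 is a "), so a plain split
        // would yield an empty first element.
        System.out.println(tokenize("12 is a Number, right?"));
    }
}
```

Trailing empty strings need no special handling here, because String.split(String) already discards them.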