Node.js regex engine fails on large input

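This problem is somewhat complicated, and searching Google has not really helped. I will try to cover only the relevant aspects.

I have a large document in roughly the following format:

Sample input: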

ABC is a word from one line of this document. It is followed by

some random line

PQR which happens to be another word.

This is just another line

I have to fix my regular expression.

Here GHI appears in the middle.

This may be yet another line.

VWX is a line

this is the last line

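I am trying to remove sections of the text based on the following:

From any one of:

ABC

DEF

GHI

To any one of (keeping this word):

PQR

STU

VWX

The words that make up the "from" set can appear anywhere in a line (see GHI). For removal, however, the entire line must be removed. (The whole line containing GHI is removed, as the sample output below shows.)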

Sample output:

PQR which happens to be another word.

This is just another line

I have to fix my regular expression.

VWX is a line

this is the last line
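
For reference, the removal rule described above can be sketched without a single large regex, by scanning the input line by line. This is only a minimal sketch under stated assumptions: the word lists are the simplified ABC/DEF/GHI and PQR/STU/VWX sets from the sample, `stripSections` is a hypothetical helper name, and lines are assumed to be separated by `\n`. It approximates the rule rather than reproducing any particular regex exactly.

```javascript
// Simplified word lists taken from the sample; the real lists are longer.
const FROM = /\b(?:ABC|DEF|GHI)\b/i;
const TO = /\b(?:PQR|STU|VWX)\b/i;

// Hypothetical helper: drops everything from a line containing a "from" word
// up to (but not including) the next "to" word.
function stripSections(text) {
  const out = [];
  let skipping = false;
  for (const line of text.split('\n')) {
    if (skipping) {
      const t = TO.exec(line);
      if (t === null) continue;       // still inside a removed section
      out.push(line.slice(t.index));  // keep the "to" word and what follows
      skipping = false;
      continue;
    }
    const f = FROM.exec(line);
    if (f === null) {
      out.push(line);                 // ordinary line, keep it
      continue;
    }
    // The whole "from" line is dropped, unless a "to" word follows on it.
    const after = line.slice(f.index + f[0].length);
    const t = TO.exec(after);
    if (t === null) {
      skipping = true;
    } else {
      out.push(after.slice(t.index));
    }
  }
  return out.join('\n');
}
```

Because each line is tested at most twice with small patterns, the work grows roughly linearly with the input size, which is the main reason to consider this shape for a large file.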

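The example above actually seemed easy to me until I ran it against a very large input file (49KB).

What I have tried:

The regular expression I am currently using (with the case-insensitive and multiline modifiers) is: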

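Problem

The regexp above works fine on small text files, but it fails or breaks the engine on large files. I have tried it against the following:

V8 (Node.js): hangs

Rhino: hangs

Python: hangs

Java: StackOverflowError (stack trace posted at the end of this question)

IonMonkey (Firefox): works!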

Actual input:

My original input: http://ideone.com/W4sZmB

My regular expression (split across multiple lines for clarity):

^.*\\b(patient demographics|electronically signed|md|rn|mspt|crnp|rt)\\b

(.|\\s)*?

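Questions:

Is my regular expression correct? Can it be optimized further to avoid this problem?

If it is correct, why do the other engines hang indefinitely? Part of the stack trace is below:

Stack trace: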

Exception in thread "main" java.lang.StackOverflowError

at java.util.regex.Pattern$GroupTail.match(Pattern.java:4218)

at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)

at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)

at java.util.regex.Pattern$Branch.match(Pattern.java:4114)

at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)

at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)

at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)

at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
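
The repeating LazyLoop, GroupHead and Branch frames in the trace suggest that the Java engine recurses once for every character consumed by the `(.|\s)*?` group, which would be consistent with a StackOverflowError on a long input. In addition, `.` and `\s` both match a space, so a backtracking engine has two ways to consume each space when the overall match fails. One commonly suggested variation is to replace the alternation with the character class `[\s\S]`. The following is a minimal sketch of that idea, assuming the simplified word lists from the sample rather than the real pattern, a lookahead for the "to" word, and a hypothetical function name `stripWithRegex`. Whether it resolves the hang on the real input has not been verified here.

```javascript
// Simplified word lists from the sample; the real pattern uses longer lists.
// [\s\S] matches any character, including newlines, without an alternation.
// Flags: g = all matches, i = case-insensitive, m = ^ matches at line starts.
const pattern = /^.*\b(?:ABC|DEF|GHI)\b[\s\S]*?(?=\b(?:PQR|STU|VWX)\b)/gim;

// Hypothetical helper: removes each matched section and keeps the "to" word,
// because the lookahead does not consume it.
function stripWithRegex(text) {
  return text.replace(pattern, '');
}
```

Note that a "from" word with no later "to" word still forces the lazy part to scan to the end of the input before failing, so the worst case of this pattern is not linear.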

墨色风雨
