How do I split text into words using a regular expression?

OpenFileDialog openFileDialog = new OpenFileDialog();
if (openFileDialog.ShowDialog() == true)
{
    //your code
}

I have an .srt file, which has a certain text structure. Example:


1
00:00:01,514 --> 00:00:04,185
I'm investigating
Saturday night's shootings.

2
00:00:04,219 --> 00:00:05,754
What's to investigate?
Innocent people

I want to get the individual words, such as "I'm", "investigating", "Saturday" and so on.


I created the pattern


@"[a-zA-Z']"

This splits my text almost correctly. But the .srt file also contains some useless tag constructs, like this


<i>

which I want to remove.


How can I build a pattern that splits the text into words and removes all text between "<" and ">" (including the angle brackets themselves)?


Cats萌萌
2 answers

HUX布斯

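Well, it is hard to do this in a single regular expression (at least for me), but you can do it in two steps. First remove the HTML tags from the string, then extract the words from what is left. Take a look below.

    var text = "00:00:01,514 --> 00:00:04,185 I'm investigating Saturday night's shootings.<i>";

    // remove all HTML tags (only the tags, so any text after a tag is kept)
    var noHtml = Regex.Replace(text, @"<[^>]*>", "");

    // now you can get only the words by using @"[a-zA-Z']+" on noHtml.
    // You should get "I'm", "investigating", "Saturday", "night's", "shootings".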
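The two steps can be put together in a small self-contained program. This is only a sketch under a few assumptions: the class name SubtitleWords and the method Extract are hypothetical names chosen for illustration, the tag pattern `<[^>]*>` deletes just the tags so that text following a tag survives, and the sample string is the first cue from the question with an `<i>` tag added.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubtitleWords
{
    // Hypothetical helper (not part of the original answer):
    // strips tags first, then collects the words.
    public static List<string> Extract(string text)
    {
        // Step 1: delete every "<...>" tag, keeping the text around it.
        string noTags = Regex.Replace(text, @"<[^>]*>", "");

        // Step 2: collect runs of letters and apostrophes as words.
        // Digits, punctuation and the "-->" arrow never match.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(noTags, @"[a-zA-Z']+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        // Sample adapted from the question; the <i> tag is an assumption.
        string sample =
            "1\n00:00:01,514 --> 00:00:04,185\n" +
            "<i>I'm investigating</i>\nSaturday night's shootings.";

        foreach (string word in Extract(sample))
        {
            Console.WriteLine(word);
        }
    }
}
```

For the sample above this should print I'm, investigating, Saturday, night's and shootings, one per line. Note that the word pattern needs the trailing `+`; without it every letter is returned as a separate match.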

LEATH

You can use negative lookarounds to assert that the word is not preceded by a < followed by a run of non-> characters, and is not followed by a run of non-< characters and then a >.

    using System;
    using System.Text.RegularExpressions;

    public class Program
    {
        public static void Main()
        {
            string input = @"<garbage>Hello world, <rubbish>it's a wonderful day.<trash>";
            foreach (Match match in Regex.Matches(input, @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)"))
            {
                Console.WriteLine(match.Value);
            }
        }
    }

Output:

    Hello
    world
    it's
    a
    wonderful
    day
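One thing to watch when applying the lookaround pattern to a whole .srt file: the lookahead `(?![^<]*>)` rejects any word that is followed by a `>` with no `<` in between, and the `-->` arrow of the next timestamp line is exactly such a `>`. Matching line by line avoids that. The sketch below assumes hypothetical names (LookaroundWords, Extract) and a sample built from the question's two cues with an `<i>` tag added. It also relies on variable-length lookbehind, which .NET supports but many other regex engines do not.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundWords
{
    // Pattern copied from the answer above.
    private const string WordOutsideTags = @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)";

    // Hypothetical helper: matches each line separately so that the
    // "-->" of a later timestamp line cannot satisfy the lookahead.
    public static List<string> Extract(string srtText)
    {
        var words = new List<string>();
        foreach (string line in srtText.Split('\n'))
        {
            foreach (Match m in Regex.Matches(line, WordOutsideTags))
            {
                words.Add(m.Value);
            }
        }
        return words;
    }

    public static void Main()
    {
        // Sample adapted from the question; the <i> tag is an assumption.
        string sample =
            "1\n00:00:01,514 --> 00:00:04,185\n" +
            "<i>I'm investigating</i>\nSaturday night's shootings.\n\n" +
            "2\n00:00:04,219 --> 00:00:05,754\n" +
            "What's to investigate?\nInnocent people";

        Console.WriteLine(string.Join(", ", Extract(sample)));
    }
}
```

With the sample above, the ten words from "I'm" through "people" should come out and the tag name "i" should not. Running the same pattern once over the whole string instead would lose the words of the first cue, because the second timestamp's arrow follows them.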