How can I split text into words using a regular expression?



OpenFileDialog openFileDialog = new OpenFileDialog();
if (openFileDialog.ShowDialog() == true)
{
    // your code
}

I have an .srt file, which has a fixed text structure. Example:

00:00:01,514 --> 00:00:04,185

I'm investigating

Saturday night's shootings.

00:00:04,219 --> 00:00:05,754

What's to investigate?

Innocent people

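I want to get the individual words, such as "I'm", "investigating", "Saturday".

I created the pattern

@"[a-zA-Z']"

This splits my text almost correctly. But the .srt file also contains some useless tag constructs, like this

<i>

which I want to remove.

How do I build my pattern so that it splits the text into words and removes all text between "<" and ">" (including the angle brackets themselves)?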

Cats萌萌


2 Answers

HUX布斯

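Well, it is hard to do this in a single regex (at least for me), but you can do it in two steps. First remove the HTML tags from the string, then extract the words from what is left. Take a look below.

    var text = "00:00:01,514 --> 00:00:04,185 I'm investigating Saturday night's shootings.<i>";
    // remove every <...> tag, leaving the surrounding text intact
    var noHtml = Regex.Replace(text, @"<[^>]*>", "");
    // now you can get only the words by using @"[a-zA-Z']+" on noHtml
    // (your character class with a + quantifier, so each match is a whole word).
    // You should get the words of "I'm investigating Saturday night's shootings."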
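The two steps above can be sketched as a complete program. This is only a minimal illustration under a few assumptions: `SrtWords` and `Extract` are hypothetical names, the sample string is adapted from the question, and tags are replaced with a space rather than an empty string so that words on either side of a tag do not run together.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrtWords
{
    // Hypothetical helper: strips <...> tags, then collects the words.
    public static List<string> Extract(string srt)
    {
        // Step 1: replace each tag such as <i> or </i> with a space.
        string noTags = Regex.Replace(srt, @"<[^>]*>", " ");

        // Step 2: treat every run of ASCII letters and apostrophes as a word.
        // Timing lines contain no letters, so they produce no matches.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(noTags, @"[a-zA-Z']+"))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        string srt = "00:00:01,514 --> 00:00:04,185\n"
                   + "<i>I'm investigating</i>\n"
                   + "Saturday night's shootings.\n";

        // Prints: I'm|investigating|Saturday|night's|shootings
        Console.WriteLine(string.Join("|", Extract(srt)));
    }
}
```

Because the character class only covers ASCII letters, accented or non-Latin subtitles would need a broader class such as `\p{L}`.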


LEATH

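You can use negative lookarounds to assert that the match is not preceded by a < followed by a run of non-> characters, and is not followed by a run of non-< characters ending in a >.

    using System;
    using System.Text.RegularExpressions;

    public class Program
    {
        public static void Main()
        {
            string input = @"<garbage>Hello world, <rubbish>it's a wonderful day.<trash>";
            // The lookbehind rejects letters inside a tag; the lookahead rejects letters followed by a closing >.
            foreach (Match match in Regex.Matches(input, @"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)"))
            {
                Console.WriteLine(match.Value);
            }
        }
    }

Output:

    Hello
    world
    it's
    a
    wonderful
    day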
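One caveat when running this pattern over a whole .srt file rather than a single line: `[^<]*` in the lookahead also matches line breaks, so the `>` in the `-->` of a later timing line can cause words on the preceding subtitle lines to be rejected. A possible workaround, sketched below, is to match line by line and skip the timing lines. It assumes tags never span lines; `SrtLookaround` and `WordsFromSrt` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrtLookaround
{
    // Same lookaround pattern as in the answer above.
    private static readonly Regex Word =
        new Regex(@"(?<!<[^>]*)[a-zA-Z']+(?![^<]*>)");

    // Hypothetical helper: applies the pattern one line at a time.
    public static List<string> WordsFromSrt(string srt)
    {
        var words = new List<string>();
        foreach (string line in srt.Split('\n'))
        {
            // Skip timing lines so their "-->" never reaches the lookahead.
            if (line.Contains("-->")) continue;

            foreach (Match m in Word.Matches(line))
            {
                words.Add(m.Value);
            }
        }
        return words;
    }

    public static void Main()
    {
        string srt = "00:00:01,514 --> 00:00:04,185\n"
                   + "<i>I'm investigating</i>\n"
                   + "Saturday night's shootings.\n"
                   + "00:00:04,219 --> 00:00:05,754\n"
                   + "What's to investigate?\n";

        // Prints: I'm investigating Saturday night's shootings What's to investigate
        Console.WriteLine(string.Join(" ", WordsFromSrt(srt)));
    }
}
```

A subtitle line that contains a stray `>` outside a tag would still lose the words before it, which is one reason the two-step approach may be easier to reason about.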

