
Semantically detecting text blocks with Python

I have this sample block of log text:


20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 10:05,000 number3 string1 string2 string3
20190122 10:33,960 number4 string1 string2 string3
20190122 11:00,321 number5 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 08:24,670 number2 string1 string2 string3
20190123 09:05,000 number3 string1 string2 string3
20190123 10:33,960 number4 string1 string2 string3
20190123 10:00,321 number5 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
20190124 10:00,000 ###PERFORMANCE string1 string2 string3
20190124 10:10,500 number1 string1 string2 string3
20190124 10:24,670 number2 string1 string2 string3
20190124 11:05,000 number3 string1 string2 string3
20190124 12:33,960 number4 string1 string2 string3
20190124 13:00,321 number5 string1 string2 string3
20190124 13:40,256 ###PERFORMANCE string1 string2 string3

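What I want to do with Python is detect each ###PERFORMANCE text block in this example.

As you can see, there are 3 blocks of interest, each delimited by lines that contain the ###PERFORMANCE string. The first block starts at line 1 and ends at line 7. The content between line 7 and line 10 must not be treated as a block of interest. The number of lines in each block can also vary (so going by line numbers is not a good idea).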


So far, all I have done is read the text file line by line:


logFile = "testLog.txt"

with open(logFile) as f:
    content = f.readlines()

# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

for line in content:
    print(line)

How can I accomplish this task? Is using NLTK a good idea? Is it even suited to this task? Any general advice?


桃花长相依
2 answers

一只萌萌小番薯

I think you can do what you need with a simple check. Let me explain how I understood the problem. You can keep a flag (a True/False value) that tracks whether you are inside an interesting block, and flip it every time you find "###PERFORMANCE". You can then store the two kinds of lines in two lists, or in any structure you prefer. Here is the code snippet:

logFile = "logfile.txt"

with open(logFile) as f:
    content = f.readlines()

# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

# flag
are_we_in_the_interesting_block = False

# two lists to save the lines
interesting_block = []
non_interesting_block = []

for line in content:
    # check if the line contains the text ###PERFORMANCE
    # (`in` also matches when the text sits at position 0, unlike `find(...) > 0`)
    if '###PERFORMANCE' in line:
        are_we_in_the_interesting_block = not are_we_in_the_interesting_block
    elif are_we_in_the_interesting_block:
        # here I append to a list, but you can do your processing
        interesting_block.append(line)
    else:
        # here processing of the non interesting parts
        non_interesting_block.append(line)

print('Interesting blocks')
print(interesting_block)
print('\n')
print('Non interesting blocks')
print(non_interesting_block)

The output produced will be:

Interesting blocks
['20190122 09:10,500 number1 string1 string2 string3', '20190122 09:24,670 number2 string1 string2 string3', '20190122 10:05,000 number3 string1 string2 string3', '20190122 10:33,960 number4 string1 string2 string3', '20190122 11:00,321 number5 string1 string2 string3', '20190123 08:10,500 number1 string1 string2 string3', '20190123 08:24,670 number2 string1 string2 string3', '20190123 09:05,000 number3 string1 string2 string3', '20190123 10:33,960 number4 string1 string2 string3', '20190123 10:00,321 number5 string1 string2 string3', '20190124 10:10,500 number1 string1 string2 string3', '20190124 10:24,670 number2 string1 string2 string3', '20190124 11:05,000 number3 string1 string2 string3', '20190124 12:33,960 number4 string1 string2 string3', '20190124 13:00,321 number5 string1 string2 string3']


Non interesting blocks
['20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2', '20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2']

Then, if needed, you can access interesting_block[n] to get the line at index n.
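The flag approach above puts every interesting line into one flat list, so the boundaries between the 3 blocks are lost. If each block should stay separate, one possible variation is sketched below. The function name `split_blocks` and the shortened `sample` list are made up for illustration, and the sketch assumes the marker lines themselves are not part of the block.

```python
MARKER = "###PERFORMANCE"  # delimiter string taken from the question


def split_blocks(lines, marker=MARKER):
    """Return a list of blocks. Each block is the list of lines found
    between an opening and a closing marker line (markers excluded)."""
    blocks = []
    current = None  # None means we are currently outside a block
    for line in lines:
        if marker in line:
            if current is None:
                current = []            # opening marker: start a new block
            else:
                blocks.append(current)  # closing marker: store the block
                current = None
        elif current is not None:
            current.append(line)        # ordinary line inside a block
    return blocks


# Shortened, hypothetical sample in the same shape as the log in the question
sample = [
    "20190122 09:00,000 ###PERFORMANCE string1 string2 string3",
    "20190122 09:10,500 number1 string1 string2 string3",
    "20190122 09:24,670 number2 string1 string2 string3",
    "20190122 11:40,256 ###PERFORMANCE string1 string2 string3",
    "20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2",
    "20190123 08:00,000 ###PERFORMANCE string1 string2 string3",
    "20190123 08:10,500 number1 string1 string2 string3",
    "20190123 13:40,256 ###PERFORMANCE string1 string2 string3",
]

blocks = split_blocks(sample)
print(len(blocks))  # number of complete blocks found
for block in blocks:
    print(block)
```

With this layout, `blocks[0]` is the first block and `blocks[0][n]` is a line inside it. A block that is opened but never closed before the end of the input is silently dropped in this sketch; whether that is the right behaviour depends on how the log is produced.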

慕后森

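Since you are only matching on the PERFORMANCE delimiter, using NLTK seems like overkill. A simple approach is a plain substring match (is the expected string in the line?) that toggles your capture mode. For example:

logfile = "testLog.txt"
in_block = False
IDENTIFIER = 'PERFORMANCE'

with open(logfile) as f:
    for line in f:
        if IDENTIFIER in line:
            # Toggle the boolean
            in_block = not in_block
        if in_block:
            # note: the opening delimiter line is printed too, the closing one is not
            print(line, end='')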
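The question describes the first block as running from line 1 to line 7, which suggests both delimiter lines belong to the block. A minimal sketch of the same toggle idea written as a generator, under that assumption, could look like this. The name `iter_blocks` and the shortened `log_text` are hypothetical, and `io.StringIO` only stands in for an opened log file.

```python
import io

MARKER = "###PERFORMANCE"  # delimiter string taken from the question


def iter_blocks(lines, marker=MARKER):
    """Yield each block as a list of lines, both marker lines included."""
    block = None  # None means we are currently outside a block
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            if block is None:
                block = [line]       # opening marker starts a block
            else:
                block.append(line)   # closing marker ends it
                yield block
                block = None
        elif block is not None:
            block.append(line)       # ordinary line inside a block


# Shortened, hypothetical log in the same shape as the one in the question
log_text = """\
20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
"""

# A real file object from open(...) can be passed in the same way
for number, block in enumerate(iter_blocks(io.StringIO(log_text)), start=1):
    print(f"block {number}: {len(block)} lines")
```

Because the generator consumes the file one line at a time, only the block currently being collected is held in memory, which may matter for large logs.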