I have this sample block of log text:
20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 10:05,000 number3 string1 string2 string3
20190122 10:33,960 number4 string1 string2 string3
20190122 11:00,321 number5 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 08:24,670 number2 string1 string2 string3
20190123 09:05,000 number3 string1 string2 string3
20190123 10:33,960 number4 string1 string2 string3
20190123 10:00,321 number5 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
20190124 10:00,000 ###PERFORMANCE string1 string2 string3
20190124 10:10,500 number1 string1 string2 string3
20190124 10:24,670 number2 string1 string2 string3
20190124 11:05,000 number3 string1 string2 string3
20190124 12:33,960 number4 string1 string2 string3
20190124 13:00,321 number5 string1 string2 string3
20190124 13:40,256 ###PERFORMANCE string1 string2 string3
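What I want to do with Python is detect each ###PERFORMANCE block of text, as in this example:
As you can see, there are 3 blocks of interest, each delimited by lines containing the ###PERFORMANCE string. The first starts at line 1 and ends at line 7. Whatever sits between line 7 and line 10 must not be treated as a block of interest. The number of lines in each block can also vary (so going by line number is not a good idea).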
So far, all I have done is read the text file line by line:
logFile = "testLog.txt"
with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
for line in content:
    print(line)
How can I accomplish this task? Is using NLTK a good idea? Is it even suited to this task? Any general advice?
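For reference, here is a minimal sketch of the kind of detection described above. It assumes that blocks are delimited by pairs of lines containing ###PERFORMANCE (the first marker opens a block, the next one closes it) and that anything between a closing marker and the next opening marker is ignored. The names `extract_blocks`, `MARKER`, and `sample` are hypothetical, chosen only for illustration, and the sample is a shortened version of the log above. Plain substring matching seems to be enough here, so NLTK is probably not needed.

```python
MARKER = "###PERFORMANCE"  # delimiter string taken from the sample log


def extract_blocks(lines, marker=MARKER):
    """Return a list of blocks; each block is a list of lines that
    starts and ends with a line containing the marker.

    Assumption: markers come in open/close pairs. A block that is
    opened but never closed before the input ends is discarded.
    """
    blocks = []
    current = None  # None means we are currently outside any block
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            if current is None:
                current = [line]          # opening marker: start a block
            else:
                current.append(line)      # closing marker: finish the block
                blocks.append(current)
                current = None
        elif current is not None:
            current.append(line)          # ordinary line inside a block
        # lines outside any block are skipped
    return blocks


# Shortened version of the sample log, for illustration only
sample = [
    "20190122 09:00,000 ###PERFORMANCE string1 string2 string3",
    "20190122 09:10,500 number1 string1 string2 string3",
    "20190122 11:40,256 ###PERFORMANCE string1 string2 string3",
    "20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2",
    "20190123 08:00,000 ###PERFORMANCE string1 string2 string3",
    "20190123 08:10,500 number1 string1 string2 string3",
    "20190123 13:40,256 ###PERFORMANCE string1 string2 string3",
]

blocks = extract_blocks(sample)
for i, block in enumerate(blocks, start=1):
    print(f"Block {i}: {len(block)} lines")
```

With a real file, the same function could be fed the file object directly, e.g. `with open("testLog.txt") as f: blocks = extract_blocks(f)`, since it only iterates over lines and does not depend on line numbers.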