How do I extract data from a text file into sentences, where a sentence is a run of data lines between blank lines?

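The data is in a text file, and I want to group it into sentences. A sentence is defined as a run of consecutive lines, each containing at least one character. Blank lines separate the runs of data lines, so I want blank lines to mark the start and end of each sentence. Is there a way to do this with a list comprehension?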


Example from the text file. The data looks like this:


This is the
first sentence.

This is a really long sentence
and it just keeps going across many
rows there will not necessarily be
punctuation
or consistency in word length
the only difference in ending sentence
is the next row will be blank

here would be the third sentence
as
you see
the blanks between rows of data
help define what a sentence is

this would be sentence 4
i want to pull data
from text file
as such (in sentences)
where sentences are defined with
blank records in between

this would be sentence 5 since blank row above it
and continues but ends because blank row(s) below it


子衿沉夜
2 Answers

GCT1015

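You can read the whole file into a single string with file_object.read(). Since you want to split this string on blank lines, which is equivalent to splitting on two consecutive newline characters, you can use split("\n\n"). Finally, you probably want to remove the newlines that remain inside each sentence. A list comprehension does this; replacing each newline with a space (rather than with nothing) keeps the words on adjacent lines from running together. Altogether:

    # "data.txt" is a placeholder; use the path to your own file.
    with open("data.txt") as file_object:
        file_as_string = file_object.read()
    sentences = file_as_string.split("\n\n")
    sentences = [s.replace("\n", " ") for s in sentences]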
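One limitation of splitting on "\n\n" is that a run of several blank lines yields empty strings or leftover newlines. The following is a minimal sketch of the same idea that filters those out. The function name split_sentences and the sample string are illustrative assumptions, not part of the original answer, and the sketch collapses all whitespace inside a sentence to single spaces.

```python
def split_sentences(text):
    """Split text into sentences separated by one or more blank lines."""
    # Normalize Windows line endings so that "\n\n" matches blank lines.
    text = text.replace("\r\n", "\n")
    chunks = text.split("\n\n")
    # Join the words of each chunk with single spaces and skip chunks that
    # are empty, which happens when several blank lines occur in a row.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


# Hypothetical sample: two lines, a blank line, two lines, two blank lines.
sample = (
    "This is the\nfirst sentence.\n\n"
    "This is a really long sentence\nand it just keeps going\n\n\n"
    "third sentence\n"
)
sentences = split_sentences(sample)
print(sentences)
```

Note that this sketch still treats a line containing only spaces or tabs as data, not as a separator, because such a line does not produce two consecutive newline characters.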

蝴蝶不菲

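You can do this very efficiently with a regular-expression split. To split only on blank lines (lines that are empty or contain nothing but spaces and tabs), use this pattern:

    ^[ \t]*$

In Python, you can do:

    import re

    # fn is the path to the text file
    with open(fn) as f_in:
        sentences = re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)

If you also want to remove the single \n characters within each sentence:

    with open(fn) as f_in:
        sentences = [re.sub(r'[ \t]*(?:\r?\n){1,}', ' ', s)
                     for s in re.split(r'\r?\n^[ \t]*$', f_in.read(), flags=re.M)]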
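To see how the regular-expression split behaves without needing a file, here is a small sketch that runs the same split pattern on an in-memory string. The names BLANK_LINE and regex_sentences and the sample text are assumptions made for illustration. The cleanup step uses `\s+` and `strip()` instead of the answer's substitution pattern, which is a simplification that also drops the leading newline left on pieces after the first.

```python
import re

# Same split pattern as above: a newline followed by a line that is empty
# or contains only spaces and tabs.
BLANK_LINE = re.compile(r"\r?\n^[ \t]*$", flags=re.M)


def regex_sentences(text):
    """Split text on blank lines and flatten each piece to one line."""
    pieces = BLANK_LINE.split(text)
    # Collapse every run of whitespace (including newlines) to one space
    # and trim the ends of each piece.
    cleaned = (re.sub(r"\s+", " ", piece).strip() for piece in pieces)
    # Drop empty pieces produced by consecutive blank lines.
    return [sentence for sentence in cleaned if sentence]


# Hypothetical sample: the second separator line holds a space and a tab.
sample = "first line\nsecond line\n \t\nthird line\n\n\nfourth line\n"
result = regex_sentences(sample)
print(result)
```

Unlike a plain split on "\n\n", this version also treats a line of only spaces or tabs as a sentence boundary.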