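Problem definition
Split each line into sentences. Assume the following characters delimit sentences: period ('.'), question mark ('?'), and exclamation point ('!'). These delimiters should also be omitted from the returned sentences. Remove any leading or trailing whitespace from each sentence. If, after the above, a sentence is blank (the empty string, ''), it should be omitted. Return the list of sentences. The sentences must appear in the same order as they do in the file.
Here is my current code: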
import re

def get_sentences(doc):
    assert isinstance(doc, list)
    result = []
    for line in doc:
        result.extend(
            [sentence.strip() for sentence in re.split(r'\.|\?|\!', line) if sentence]
        )
    return result

# Demo:
get_sentences(demo_input)
Input
demo_input = [" This is a phrase; this, too, is a phrase. But this is another sentence.",
"Hark!",
" ",
"Come what may <-- save those spaces, but not these --> ",
"What did you say?Split into 3 (even without a space)? Okie dokie."]
Expected output
["This is a phrase; this, too, is a phrase",
"But this is another sentence",
"Hark",
"Come what may <-- save those spaces, but not these -->",
"What did you say",
"Split into 3 (even without a space)",
"Okie dokie"]
However, my code produces this:
['This is a phrase; this, too, is a phrase',
'But this is another sentence',
'Hark',
'',
'Come what may <-- save those spaces, but not these -->',
'What did you say',
'Split into 3 (even without a space)',
'Okie dokie']
Question: why do I still get that empty sentence '' in the result, even though my code is supposed to skip it?
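A likely explanation, sketched below on a whitespace-only line like the third entry of the input: the `if sentence` condition is evaluated on the piece as returned by `re.split`, before `strip()` is applied. A string made only of spaces is non-empty, so it passes the filter and only then gets stripped down to ''. The variable names `line` and `pieces` are just for illustration.

```python
import re

line = "   "  # whitespace-only line, similar to the third entry of demo_input

# No delimiter matches, so the whole line comes back as one piece.
pieces = re.split(r'\.|\?|\!', line)
print(pieces)

# The filter sees the unstripped piece, which is non-empty and therefore truthy.
print([bool(p) for p in pieces])

# strip() runs only in the output expression, after the filter has already passed.
print([p.strip() for p in pieces if p])
```

Under this reading, the filter does drop truly empty pieces (such as the one after the trailing '!' in "Hark!"), but not pieces that become empty only after stripping.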
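I can fix the problem with the following code, but it makes me walk through the list a second time, which I would rather avoid. I want to do this in a single pass.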
import re

def get_sentences(doc):
    assert isinstance(doc, list)
    result = []
    for line in doc:
        result.extend([sentence.strip() for sentence in re.split(r'\.|\?|\!', line)])
    result = [s for s in result if s]
    return result

# Demo:
get_sentences(demo_input)
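For comparison, here is a minimal sketch of a single-pass variant: strip each piece first, then test the stripped value, so the filter and the output see the same string. The name `get_sentences_single_pass` and the shortened `sample` input are hypothetical, and the character class `[.?!]` is assumed to be equivalent to the alternation `\.|\?|\!` used above.

```python
import re

def get_sentences_single_pass(doc):
    assert isinstance(doc, list)
    result = []
    for line in doc:
        # Lazily strip each piece; nothing is stored until it passes the filter.
        stripped = (piece.strip() for piece in re.split(r'[.?!]', line))
        # Keep only pieces that are still non-empty after stripping.
        result.extend(s for s in stripped if s)
    return result

# Shortened, hypothetical input with the same kinds of cases as demo_input.
sample = ["Hark!",
          "   ",
          "What did you say?Split into 3 (even without a space)? Okie dokie."]
print(get_sentences_single_pass(sample))
```

Because the inner generator is lazy, each piece is stripped and tested once as it is produced, and the finished list is never traversed again.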