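I have a string made up of sentences. If the string contains more characters than a given limit, I want to split it into several strings, each shorter than the maximum character count but still containing only complete sentences.
I did the following, and it seems to work, but I am not sure whether I will run into bugs once it goes to production. Does the code below look OK?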
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)
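Two things in the loop above look worth checking before production. A sentence that does not fit in the current chunk seems to be skipped rather than carried over to the next chunk, and the second condition re-adds the length of a sentence that was already appended. Sentences are also concatenated without a separator. Below is a minimal sketch of a greedy variant that avoids these issues. The names `pack_sentences`, `max_chars`, and `sep` are hypothetical, and the handling of a single sentence longer than the limit (it is kept whole as its own oversized chunk) is an assumption, not something the question specifies.

```python
def pack_sentences(sentences, max_chars=5120, sep=" "):
    """Greedily group whole sentences into chunks of at most max_chars characters.

    Assumption: a single sentence longer than max_chars is kept intact
    as its own (oversized) chunk instead of being cut or dropped.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        # Build the chunk we would get by adding this sentence.
        candidate = sentence if not current else current + sep + sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The sentence does not fit: close the current chunk (if any)
        # and start a new one with this sentence, so nothing is lost.
        if current:
            chunks.append(current)
        current = sentence
    # Flush whatever is left after the last sentence.
    if current:
        chunks.append(current)
    return chunks
```

With NLTK this could be called as `pack_sentences(sent_tokenize(my_text))`. Taking a list of sentences rather than raw text keeps the packing logic independent of the tokenizer, which makes it easier to test with small inputs and a small `max_chars`.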