
Splitting text into sentences in Python without exceeding a character limit

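I have a string containing sentences. If the string has more characters than a given number, I want to split it into several strings, each with fewer characters than the maximum but still made up of complete sentences.

I did the following, and it seems to work, but I'm not sure whether I'll run into bugs once it goes to production. Does the code below look OK?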


from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""

for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence

    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""

print(sentences_split)


湖上湖
1 Answer

哔哔one

To better explain my point (made in the comments) about the problem with the second if block, see the example below. We want strings with a maximum length of 15, so the 5120 from the question becomes 16 in this example. As you can see, the first 3 items in the list add up to 5 + 6 + 4 = 15, so the first shortened_sentence should consist of the first 3 items of the list. But it doesn't, because the logic of the second if is incorrect: it runs after the sentence has already been appended to shortened_sentence, so len(sentence) is counted twice.

sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(sentence) for sentence in sentences])

sentences_split = []
shortened_sentence = ""

for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence

    if (len(shortened_sentence) + len(sentence) > 16) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""

print(sentences_split)
print([len(sentence) for sentence in sentences_split])

Output

[5, 6, 4, 3]
['abcdefghijk', 'lmnopqr']
[11, 7]

Compare that with

sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with less than 16 chars
print([len(word) for word in sentences])

sentences_split = []
shortened_sentence = ""

for sentence in sentences:
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    else:
        sentences_split.append(shortened_sentence)
        shortened_sentence = sentence
sentences_split.append(shortened_sentence)

print(sentences_split)
print([len(sentence) for sentence in sentences_split])

Output

[5, 6, 4, 3]
['abcdefghijklmno', 'pqr']
[15, 3]

Finally, if you are not sure "whether you'll run into bugs once it goes to production": write tests, lots of tests. That is what tests are for, to help minimize bugs in production. Also note that the second snippet is just one sample implementation; other implementations are possible.
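As one possible way to package the second approach for reuse and testing, here is a minimal sketch. The function name split_into_chunks and the parameters max_chars and sep are hypothetical, not from the original post. It also makes two assumptions the post does not state: the limit is inclusive (a chunk may be exactly max_chars long, unlike the strict < above), and a single sentence longer than the limit raises an error rather than being silently dropped or truncated.

```python
def split_into_chunks(sentences, max_chars, sep=" "):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: the name, the inclusive limit, and the ValueError
    for oversized sentences are assumptions made for illustration.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Assumption: fail loudly instead of losing or cutting a sentence.
            raise ValueError(f"sentence of {len(sentence)} chars exceeds {max_chars}")
        # Join with the separator only when the chunk already has content.
        candidate = current + sep + sentence if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this sentence.
            chunks.append(current)
            current = sentence
    if current:
        # Flush the last, partially filled chunk.
        chunks.append(current)
    return chunks


sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
print(split_into_chunks(sentences, 15, sep=""))   # ['abcdefghijklmno', 'pqr']
print(split_into_chunks(sentences, 15, sep=" "))  # ['abcde fghijk', 'lmno pqr']
```

Wrapping the loop in a function makes it straightforward to write the tests suggested above, for example for an empty list, a sentence that exactly fills the limit, and a sentence that is too long on its own.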
