How can I tokenize a paragraph that contains a numbered list into sentences using Python?

I want to split a paragraph into individual sentences. The paragraph contains numbered sentences, like this:


Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John. 


Product is good but the managemnt is very lazy very bad. I dont like company service. They are giving fake promises. Next time i will not take any product. For Amazon service i will give 5 star dey give awsome service. But for sony company i will give 0 star... 1. Doesn't support all file formats when you connect USB 2. No other apps than YouTube and Netflix (requires subscription) 3. Screen mirroring is not up to the mark ( getting connected after once in 10 attempts 4. Good screen quality 5. Audio is very good 6. Bulky compared to other similar range 7. Price bit high due to brand value 8. its 1/4 smart TV. Not a full smart TV 9. Bad customer support 10. Remote control is very horrible to operate. it might be good for non smart TV 11. See the exchange value on amazon itself. LG gets 2ooo/- more than TV's 12. Also it was mentioned like 1+1 year warranty. But either support or Amazon support aren't clear about it. 13. Product information isn't up to 30% at least.There no installation. While I contact costumer Care.

I am using the following code to split the sentences:


import nltk

tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read the input file (and close it automatically)
with open("/Users/Desktop/sample.txt", encoding='utf-8') as fp:
    data = fp.read()

# Write one sentence per line; joining on '' would glue the
# sentences back together with no separator at all
with open("/Users/Desktop/output.txt", 'a', encoding='utf-8') as f:
    f.write('\n'.join(tokenizer.tokenize(data)))

This code splits the paragraph on periods. But the numbered sentences are causing a problem: because each list number is followed by a period, the text gets split in the wrong places.


Can anyone suggest a fix?


繁华开满天机
242 views · 2 answers

2 Answers

梦里花落0921

You need sent_tokenize:

from nltk.tokenize import sent_tokenize

text = "Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John."
print(sent_tokenize(text))

Output:

['Hello, How are you?', 'Hope everything is good.', "I'm fine.", '1.Hello World.', '2.Good Morning John.']
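If the Punkt models still break after list markers in your data (for example when the number is followed by a space, as in "1. Doesn't support..."), a plain-regex fallback can keep the marker attached to its sentence. This is a sketch of one possible approach, not part of the answer above: it splits after sentence-ending punctuation followed by whitespace, except when the period is immediately preceded by a digit.

```python
import re

text = ("Hope everything is good. I'm fine. "
        "1. Hello World. 2. Good Morning John.")

# Split after . ! or ? followed by whitespace, but NOT when the
# period belongs to a list marker such as "1." or "12."
# (negative lookbehind rejects a digit-plus-period before the space).
parts = re.split(r'(?<!\d\.)(?<=[.!?])\s+', text)
print(parts)
# → ['Hope everything is good.', "I'm fine.",
#    '1. Hello World.', '2. Good Morning John.']
```

One limitation to be aware of: a genuine sentence that happens to end in a bare number ("I waited 10.") would not be split either, so this trade-off only makes sense for review-style text full of numbered lists.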
