How to tokenize text data into words and sentences without getting a TypeError

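My end goal is to use an NER model to recognize custom entities. Before that, I want to tokenize my text data into words and sentences. I have a folder of text (.txt) files, which I open and read into Jupyter using the os library. After reading the text files, I get a TypeError whenever I try to tokenize them. What am I doing wrong? My code is below, thanks.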

import os

outfile = open('result.txt', 'w')

path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"

files = os.listdir(path)

for file in files:
    outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')

outfile.close()

This code runs fine. Whenever I evaluate outfile afterwards, I get the following:

outfile

<_io.TextIOWrapper name='result.txt' mode='w' encoding='cp1252'>

Next, tokenization.

from nltk.tokenize import sent_tokenize, word_tokenize

sent_tokens = sent_tokenize(outfile)
print(outfile)

word_tokens = word_tokenize(outfile)
print(outfile)

But after running the code above I get an error. The error is shown below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-62f66183895a> in <module>
      1 from nltk.tokenize import sent_tokenize, word_tokenize
----> 2 sent_tokens = sent_tokenize(outfile)
      3 print(outfile)
      5 #word_tokens = word_tokenize(text)

~\AppData\Local\Continuum\anaconda3\envs\nlp_course\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
     93     """
     94     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 95     return tokenizer.tokenize(text)
     97 # Standard word tokenizer.

TypeError: expected string or bytes-like object
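The message "expected string or bytes-like object" is the one Python's re module raises when it is handed something that is not text, so the same kind of failure can be reproduced without NLTK. The following is a minimal sketch of that idea, not the NLTK internals: the file name demo.txt, the temporary folder, and the `\w+` pattern are assumptions made up for the illustration.

```python
import os
import re
import tempfile

# Hypothetical scratch file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

with open(demo_path, "w") as handle:
    handle.write("First sentence. Second sentence.")

# `handle` is now a closed file object, not the text that was written to it.
print(type(handle).__name__)

# Passing the file object to regex-based code fails with a TypeError.
try:
    re.findall(r"\w+", handle)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Reading the file back yields a str, which regex-based code accepts.
with open(demo_path) as handle:
    text = handle.read()

words = re.findall(r"\w+", text)
print(words)
```

The first call fails because it receives the file handle, and the second succeeds because it receives the string returned by read(). The same distinction applies to the `outfile` variable passed to `sent_tokenize` above.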

烙印99


1 answer

不负相思意

(Moved from a comment to an answer.) You are trying to process the file object rather than the text in the file. After creating the text file, reopen it and read the entire file before tokenizing. Try this code:

import os

outfile = open('result.txt', 'w')
path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
files = os.listdir(path)
for file in files:
    with open(path + "/" + file) as f:
        outfile.write(f.read() + '\n')
        #outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')
outfile.close()  # done writing

from nltk.tokenize import sent_tokenize, word_tokenize

with open('result.txt') as outfile:  # open for read
    alltext = outfile.read()  # read entire file
    print(alltext)
    sent_tokens = sent_tokenize(alltext)  # process file text. tokenize sentences
    word_tokens = word_tokenize(alltext)  # process file text. tokenize words
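As a possible variation on the code above, the intermediate result.txt can be skipped by reading each file straight into memory. This is only a sketch under stated assumptions: `read_folder_text` is a hypothetical helper written for this example, the files are assumed to be UTF-8 encoded (the question's environment defaulted to cp1252, so the `encoding` argument may need changing), and the two sample files exist only for the demonstration.

```python
import tempfile
from pathlib import Path


def read_folder_text(folder, pattern="*.txt", encoding="utf-8"):
    """Return a dict mapping each matching file name to its text content."""
    texts = {}
    for file_path in sorted(Path(folder).glob(pattern)):
        if file_path.is_file():
            texts[file_path.name] = file_path.read_text(encoding=encoding)
    return texts


# Demonstration on a throwaway folder with two hypothetical files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.txt").write_text("Alpha one. Alpha two.", encoding="utf-8")
    Path(folder, "b.txt").write_text("Beta one.", encoding="utf-8")
    texts = read_folder_text(folder)

# Join the per-file strings into one string, separated by newlines.
all_text = "\n".join(texts.values())
print(sorted(texts))
print(all_text)
```

Each value in `texts`, as well as `all_text`, is a plain string, which is the type `sent_tokenize` and `word_tokenize` expect. Keeping the text per file may also make it easier to trace recognized entities back to their source document later.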
