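I want to list all the text files in a directory. Then I want to create a separate list of contents for each file, for example document1=[] then document2=[], and so on. Then, using the document1 and document2 names, I want to compute word frequencies and do other processing. The code runs, but it cannot assign different names such as document1 and so on to the lists.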
import glob
import math
import re

a = 0  # number of documents read so far
# get all the .txt files matching the pattern
flist = glob.glob(r'D:/Final Year Project/Development process/Text_data_extraction/MyFolder/*.txt')

# open each file >> tokenize the content >> and store it in a list
for fname in flist:
    with open(fname, "r") as tfile:  # the file is closed automatically
        line = tfile.read()
    a += 1
    line = line.lower()  # lowercase
    line = re.sub("</?.*?>", " <> ", line)  # remove tags
    line = re.sub("(\\d|\\W)+", " ", line)  # remove special characters and digits
    # note: the substitution above also replaces newlines with spaces,
    # so this split returns a single-element list
    l_ist = line.split("\n")
    print('document')
    print(l_ist)

print("Number of documents:")
print(a)
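One possible way to get a separate, named list per file is to store the lists in a dictionary keyed by 'document1', 'document2', and so on, rather than trying to create variables dynamically. The sketch below is only an illustration under some assumptions: the helper names tokenize and load_documents are made up here, the files are assumed to be UTF-8, the files are numbered in sorted filename order, and the text is split on whitespace so that each list holds individual words. Word frequencies are then counted with collections.Counter.

```python
import glob
import re
from collections import Counter


def tokenize(text):
    """Lowercase, strip tags, drop digits and punctuation; return a list of words."""
    text = text.lower()
    text = re.sub(r"</?.*?>", " ", text)    # remove tags
    text = re.sub(r"(\d|\W)+", " ", text)   # remove special characters and digits
    return text.split()                     # split on whitespace into words


def load_documents(pattern):
    """Return a dict mapping 'document1', 'document2', ... to word lists."""
    documents = {}
    # sorted() makes the numbering stable; glob itself gives no ordering guarantee
    for i, fname in enumerate(sorted(glob.glob(pattern)), start=1):
        # encoding="utf-8" is an assumption about the input files
        with open(fname, "r", encoding="utf-8") as tfile:
            documents["document%d" % i] = tokenize(tfile.read())
    return documents


if __name__ == "__main__":
    docs = load_documents(
        r"D:/Final Year Project/Development process/Text_data_extraction/MyFolder/*.txt"
    )
    print("Number of documents:", len(docs))
    for name, words in docs.items():
        # Counter maps each word to how often it occurs in that document
        print(name, Counter(words).most_common(5))
```

With this layout, docs["document1"] and docs["document2"] can be used like ordinary lists, and Counter(docs["document1"]) gives the word frequencies for the first file.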