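I need to compare two output strings: the original transcript and the transcript produced by a Speech-to-Text service. Numbers are often written either as digits or as words, for example "four" or "4". Given these different transcription conventions, how can I compare the strings?
So far I have only converted both strings to lowercase and split them into words on spaces.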
# Read the two files and store them in s1_raw and s2_raw
with open('original.txt', 'r') as f:
    s1_raw = f.read()
with open('comparison.txt', 'r') as f:
    s2_raw = f.read()
# Convert all letters to lowercase
s1 = s1_raw.lower()
s2 = s2_raw.lower()
# Split the texts on whitespace (spaces, tabs, newlines) to get lists of words
s1_set = s1.split()
s2_set = s2.split()
# Used later for the confidence calculation
count1 = len(s1_set)
count2 = 0
# Compare word by word, stopping at the end of the shorter list
# to avoid running out of indices
for x in range(min(len(s1_set), len(s2_set))):
    if s1_set[x] == s2_set[x]:
        count2 += 1
# Confidence level = correct words divided by total words in the original
confidence = count2 / count1
# Print out the result
print('The confidence level of this service is {:.2f}%'.format(confidence * 100))
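I want to measure the transcription accuracy for several *.txt files while accounting for all the different ways in which different Speech-to-Text services may transcribe the same content.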
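One possible direction, sketched here under assumptions rather than as a definitive solution: normalize both transcripts to a common form before comparing them, then score with a word-level edit distance instead of a position-by-position match, so that one inserted or dropped word does not shift every later comparison. The names `NUMBER_WORDS`, `normalize`, `word_errors`, and `word_accuracy` are hypothetical helpers, not part of any library, and the number table only covers the English words "zero" through "twenty"; compound numbers such as "twenty one" would need a fuller parser.

```python
import re

# Assumption: only these English number words are normalized to digits.
NUMBER_WORDS = {
    'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
    'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
    'ten': '10', 'eleven': '11', 'twelve': '12', 'thirteen': '13',
    'fourteen': '14', 'fifteen': '15', 'sixteen': '16', 'seventeen': '17',
    'eighteen': '18', 'nineteen': '19', 'twenty': '20',
}


def normalize(text):
    """Lowercase, drop punctuation, and map number words to digits."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [NUMBER_WORDS.get(word, word) for word in words]


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # word missing from hyp
                cur[j - 1] + 1,                        # extra word in hyp
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 minus the word error rate, clamped at 0."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError('reference transcript is empty')
    return max(0.0, 1 - word_errors(ref, hyp) / len(ref))


print(word_accuracy('I have four cats.', 'i have 4 cats'))
```

With this normalization, "four" and "4" count as the same word, and punctuation differences between services no longer lower the score. To cover several files, the same function could be called once per pair of original and service transcript.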