Python - Finding unique counts of words and letters using dictionaries and tuples

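I am currently trying to write a script that reads the text in a file, counts the total number of words and the number of distinct words, lists the 10 most frequent words with their counts, and sorts character frequencies from most to least frequent.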


Here is what I have so far:


import sys
import os
os.getcwd()
import string

path = ""
os.chdir(path)

# Prompt for user to input filename:
fname = input('Enter the filename: ')

try:
    fhand = open(fname)
except IOError:
    # Invalid filename error
    print('\n')
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()

# Initialize char counts and word counts dictionary
counts = {}
worddict = {}

# For character and word frequency count
for line in fhand:
    # Remove leading spaces
    line = line.strip()
    # Convert everything in the string to lowercase
    line = line.lower()
    # Take into account punctuation
    line = line.translate(line.maketrans('', '', string.punctuation))
    # Take into account white spaces
    line = line.translate(line.maketrans('', '', string.whitespace))
    # Take into account digits
    line = line.translate(line.maketrans('', '', string.digits))

    # Splitting line into words
    words = line.split(" ")

    for word in words:
        # Is the word already in the word dictionary?
        if word in worddict:
            # Increase by 1
            worddict[word] += 1
        else:
            # Add word to dictionary with count of 1 if not there already
            worddict[word] = 1

    # Character count
    for word in line:
        # Increase count by 1 if letter
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

# Initialize dictionaries
lst = []
countlst = []
freqlst = []

# Count up the number of letters
for ltrs, c in counts.items():
    lst.append((c, ltrs))
    countlst.append(c)

# Sum up the count
totalcount = sum(countlst)

# Calculate the frequency in each dictionary
for ec in countlst:
    efreq = (ec / totalcount) * 100
    freqlst.append(efreq)

# Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)


慕娘9325324
2 answers

扬帆大鱼

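line = line.translate(line.maketrans('', '', string.whitespace))

This line removes all whitespace from each line, so there are no spaces left to split on. Delete it and the code should work as expected.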
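To see why that one line breaks the word count, here is a minimal sketch. The sample string is made up for illustration and is not from the original file.

```python
import string

line = "the quick brown fox"

# Deleting every whitespace character also deletes the spaces between words,
# so the later split(" ") has nothing left to split on.
squashed = line.translate(str.maketrans("", "", string.whitespace))
print(squashed.split(" "))  # ['thequickbrownfox']

# Without that step, the split still sees the original word boundaries.
print(line.split(" "))  # ['the', 'quick', 'brown', 'fox']
```

Each line of the file therefore ends up counted as a single "word", which is why the word dictionary looks wrong.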

跃然一笑

Your code removes the whitespace and then splits on a space, which does not make sense. Since you want to extract every word from the given text, I suggest lining up all the words next to each other with a single space between them. That means removing not only newlines, unnecessary whitespace, special/unwanted characters, and digits, but also control characters. This should do the trick:

import sys
import os
os.getcwd()
import string

path = "/your/path"
os.chdir(path)

# Prompt for user to input filename:
fname = input("Enter the filename: ")

try:
    fhand = open(fname)
except IOError:
    # Invalid filename error
    print("\n")
    print("Sorry, file can't be opened! Please check your spelling.")
    sys.exit()

# Initialize char counts and word counts dictionary
counts = {}
worddict = {}

# create one liner with undesired characters removed
text = fhand.read().replace("\n", " ").replace("\r", "")
text = text.lower()
text = text.translate(text.maketrans("", "", string.digits))
text = text.translate(text.maketrans("", "", string.punctuation))
text = " ".join(text.split())

words = text.split(" ")

for word in words:
    # Is the word already in the word dictionary?
    if word in worddict:
        # Increase by 1
        worddict[word] += 1
    else:
        # Add word to dictionary with count of 1 if not there already
        worddict[word] = 1

# Character count
for word in text:
    # Increase count by 1 if letter
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

# Initialize dictionaries
lst = []
countlst = []
freqlst = []

# Count up the number of letters
for ltrs, c in counts.items():
    # skip spaces
    if ltrs == " ":
        continue
    lst.append((c, ltrs))
    countlst.append(c)

# Sum up the count
totalcount = sum(countlst)

# Calculate the frequency in each dictionary
for ec in countlst:
    efreq = (ec / totalcount) * 100
    freqlst.append(efreq)

# Sort lists by count and percentage frequency
freqlst.sort(reverse=True)
lst.sort(reverse=True)

# Print out word counts sorted
for key in sorted(worddict.keys(), key=worddict.get, reverse=True)[:10]:
    print(key, ":", worddict[key])

# Print out all letters and counts (lst holds (count, letter) tuples):
for c, ltrs in lst:
    print(ltrs, "-", c, "-", round(c / totalcount * 100, 2), "%")
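As an alternative to the hand-rolled dictionaries, the same counts can be sketched with `collections.Counter` from the standard library. This is only one possible approach, not the answer's method: the function name `text_stats`, its `top_n` parameter, and the sample sentence are assumptions made up for illustration.

```python
import string
from collections import Counter


def text_stats(text, top_n=10):
    """Hypothetical helper: return (total words, distinct words,
    top_n most common words, letter frequencies)."""
    # Lowercase, then drop digits and punctuation in a single translate pass.
    cleaned = text.lower().translate(
        str.maketrans("", "", string.digits + string.punctuation)
    )
    # split() with no argument splits on any run of whitespace,
    # so newlines and repeated spaces need no special handling.
    words = cleaned.split()
    word_counts = Counter(words)
    # Count only the characters inside words, so whitespace is never counted.
    letter_counts = Counter("".join(words))
    total_letters = sum(letter_counts.values())
    # most_common() already orders entries from most to least frequent.
    letter_freqs = [
        (letter, count, round(count / total_letters * 100, 2))
        for letter, count in letter_counts.most_common()
    ]
    return len(words), len(word_counts), word_counts.most_common(top_n), letter_freqs


# Made-up sample text; in the original script this would be fhand.read().
total, distinct, top_words, letters = text_stats("The cat and the hat. The end!")
print(total, distinct)  # 7 5
print(top_words[0])     # ('the', 3)
print(letters[0][:2])   # ('t', 5)
```

`Counter.most_common` takes care of the sorting that the original code does by building `(count, letter)` tuples and calling `sort(reverse=True)`.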