How do I get the non-alphabetic, non-numeric characters appended to the list?

This is about simple word counting: collecting the words that occur in a document together with how often each one occurs.

I'm trying to write a function whose input is a list of text lines. I iterate over all the lines, split them into words, accumulate the recognized words, and finally return the complete list.

First, a while loop walks over all the characters in a line but skips whitespace. Inside this loop I also try to identify what kind of word I'm dealing with. In this case there are three kinds of words:

  • those that start with a letter;

  • those that start with a digit;

  • and those consisting of a single character that is neither a letter nor a digit.

I have three if statements that check what kind of character I'm looking at. Once I know what kind of word I've encountered, I try to extract the word itself. When a word starts with a letter or a digit, I take all consecutive characters of the same kind as part of the word.

However, in the third case, where the current character is neither a letter nor a digit, I run into problems.

When I call

wordfreq.tokenize(['15,    delicious&   Tarts.'])

I expect the output to be

['15', ',', 'delicious', '&', 'tarts', '.']

When I test the function in the Python console, it looks like this:

PyDev console: starting.

Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) 

[Clang 6.0 (clang-600.0.57)] on darwin

>>> import wordfreq

>>> wordfreq.tokenize(['15,    delicious&   Tarts.'])

['15', 'delicious', 'tarts']

The function accounts for neither the comma, nor the ampersand, nor the period! How do I fix this? See the code below.


(The lower() call is there because I want to ignore case, e.g. 'Tarts' and 'tarts' are really the same word.)


# wordfreq.py

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            if line[start].isalpha():
                end = start
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                words.append(word.lower())
                start = end
            elif line[start].isdigit():
                end = start
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                words.append(line[start])
            start = start + 1
    return words


慕容3067478

3 Answers

qq_遁去的一_1

I found the problem: the line start = start + 1 should sit inside the final else statement. With that change my code looks like the following and produces the desired output specified above:

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            end = start
            if line[start].isalpha():
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                word = word.lower()
                words.append(word)
                start = end
            elif line[start].isdigit():
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                word = line[start]
                words.append(word)
                start = start + 1
    return words

However, when I run the test script below to make sure no corner case of tokenize is missed:

import io
import sys
import importlib.util

def test(fun, x, y):
    global pass_tests, fail_tests
    if type(x) == tuple:
        z = fun(*x)
    else:
        z = fun(x)
    if y == z:
        pass_tests = pass_tests + 1
    else:
        if type(x) == tuple:
            s = repr(x)
        else:
            s = "(" + repr(x) + ")"
        print("Condition failed:")
        print("   " + fun.__name__ + s + " == " + repr(y))
        print(fun.__name__ + " returned/printed:")
        print(str(z))
        fail_tests = fail_tests + 1

def run(src_path=None):
    global pass_tests, fail_tests
    if src_path == None:
        import wordfreq
    else:
        spec = importlib.util.spec_from_file_location("wordfreq", src_path + "/wordfreq.py")
        wordfreq = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(wordfreq)
    pass_tests = 0
    fail_tests = 0
    fun_count = 0
    def printTopMost(freq, n):
        saved = sys.stdout
        sys.stdout = io.StringIO()
        wordfreq.printTopMost(freq, n)
        out = sys.stdout.getvalue()
        sys.stdout = saved
        return out
    if hasattr(wordfreq, "tokenize"):
        fun_count = fun_count + 1
        test(wordfreq.tokenize, [], [])
        test(wordfreq.tokenize, [""], [])
        test(wordfreq.tokenize, ["   "], [])
        test(wordfreq.tokenize, ["This is a simple sentence"], ["this", "is", "a", "simple", "sentence"])
        test(wordfreq.tokenize, ["I told you!"], ["i", "told", "you", "!"])
        test(wordfreq.tokenize, ["The 10 little chicks"], ["the", "10", "little", "chicks"])
        test(wordfreq.tokenize, ["15th anniversary"], ["15", "th", "anniversary"])
        test(wordfreq.tokenize, ["He is in the room, she said."], ["he", "is", "in", "the", "room", ",", "she", "said", "."])
    else:
        print("tokenize is not implemented yet!")
    if hasattr(wordfreq, "countWords"):
        fun_count = fun_count + 1
        test(wordfreq.countWords, ([], []), {})
        test(wordfreq.countWords, (["clean", "water"], []), {"clean": 1, "water": 1})
        test(wordfreq.countWords, (["clean", "water", "is", "drinkable", "water"], []), {"clean": 1, "water": 2, "is": 1, "drinkable": 1})
        test(wordfreq.countWords, (["clean", "water", "is", "drinkable", "water"], ["is"]), {"clean": 1, "water": 2, "drinkable": 1})
    else:
        print("countWords is not implemented yet!")
    if hasattr(wordfreq, "printTopMost"):
        fun_count = fun_count + 1
        test(printTopMost, ({}, 10), "")
        test(printTopMost, ({"horror": 5, "happiness": 15}, 0), "")
        test(printTopMost, ({"C": 3, "python": 5, "haskell": 2, "java": 1}, 3), "python                  5\nC                       3\nhaskell                 2\n")
    else:
        print("printTopMost is not implemented yet!")
    print(str(pass_tests) + " out of " + str(pass_tests + fail_tests) + " passed.")
    return (fun_count == 3 and fail_tests == 0)

if __name__ == "__main__":
    run()

...I get the following output:

/usr/local/bin/python3.7 "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py"
Traceback (most recent call last):
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 81, in <module>
    run()
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 50, in run
    test(wordfreq.tokenize, ["   "], [])
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 10, in test
    z = fun(x)
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/wordfreq.py", line 44, in tokenize
    while line[start].isspace():
IndexError: string index out of range

Why does it say the string index is out of range, and how do I fix it?
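The IndexError comes from the inner whitespace loop: on a line containing only spaces (the test case ["   "]), start is incremented past the last index and line[start] is evaluated one more time. A minimal sketch of one way to restructure the function so that every index is bounds-checked before use (this restructuring is my own, not part of the lab code):

```python
def tokenize(lines):
    """Split lines into lowercase letter runs, digit runs, and single punctuation marks."""
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            # Whitespace is handled inside the outer loop, so the
            # `start < len(line)` bound is re-checked after every step.
            if line[start].isspace():
                start = start + 1
                continue
            end = start
            if line[start].isalpha():
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
            elif line[start].isdigit():
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
            else:
                end = start + 1
                words.append(line[start])
            start = end
    return words
```

With this shape there is exactly one place where start advances per token, so the stray start = start + 1 at the bottom of the loop disappears as well.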

回首忆惘然

I'm not sure why you are doing it that way, but here is how you could split it:

input = ['15,    delicious&   Tarts.']
line = input[0]
words = line.split(' ')
words = [word for word in words if word]

out:

['15,', 'delicious&', 'Tarts.']

Edit: I see you changed how you want your output. To get that output, just skip this line:

    words = [word for word in words if word]
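Note that splitting on spaces can never detach punctuation that is glued to a word, which is why '15,' and 'delicious&' stay in one piece above. As an alternative sketch (the pattern and the name tokenize_re are mine, not from this thread), a single regular expression can yield the desired tokens directly:

```python
import re

# Three alternatives, tried left to right: a run of digits,
# a run of letters ([^\W\d_] is "word character, minus digits
# and underscore"), or one character that is neither a word
# character nor whitespace, i.e. a punctuation mark.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")

def tokenize_re(line):
    return [token.lower() for token in TOKEN_RE.findall(line)]

print(tokenize_re('15,    delicious&   Tarts.'))
# → ['15', ',', 'delicious', '&', 'tarts', '.']
```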

素胚勾勒不出你

itertools.groupby can simplify this considerably. Essentially, you group the characters in the string according to their category or type: letter, digit, or punctuation. In this example I define only those three categories, but you can define as many as you need. Any character that doesn't match any category (whitespace, in this case) is ignored:

def get_tokens(string):
    from itertools import groupby
    from string import ascii_lowercase, ascii_uppercase, digits, punctuation as punct

    alpha = ascii_lowercase + ascii_uppercase
    yield from ("".join(group) for key, group in
                groupby(string, key=lambda char: next((category for category in (alpha, digits, punct)
                                                       if char in category), ""))
                if key)

print(list(get_tokens("15,    delicious&   Tarts.")))

Output:

['15', ',', 'delicious', '&', 'Tarts', '.']
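One small difference from the desired output in the question: the version above keeps 'Tarts' capitalized. A sketch of the same groupby idea that uses the str predicate methods as the grouping key and lowercases alphabetic runs (the function and category names here are my own):

```python
from itertools import groupby

def tokenize_grouped(line):
    # Coarse category of a character; consecutive characters
    # with the same category form one token.
    def category(char):
        if char.isalpha():
            return "alpha"
        if char.isdigit():
            return "digit"
        if char.isspace():
            return "space"
        return "punct"

    tokens = []
    for key, group in groupby(line, key=category):
        if key == "space":
            continue  # whitespace separates tokens but is not one
        token = "".join(group)
        tokens.append(token.lower() if key == "alpha" else token)
    return tokens
```

Like the answer above, this merges adjacent punctuation marks (e.g. "?!") into one token; if each mark should be its own token, the punctuation group would need to be split character by character.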