What is the best way to remove accents in a Python Unicode string?

3 Answers

Helenr

Unidecode (a third-party package) is the right answer for this. It transliterates any Unicode string into the closest possible ASCII representation. Example:

    import unidecode

    accented_string = u'Málaga'
    # accented_string is of type 'unicode'
    unaccented_string = unidecode.unidecode(accented_string)
    # unaccented_string contains 'Malaga' and is of type 'str'


米琪卡哇伊

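How about this:

    import unicodedata

    def strip_accents(s):
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')

This also works for Greek letters:

    >>> strip_accents(u"A \u00c0 \u0394 \u038E")
    u'A A \u0394 \u03a5'
    >>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it is more explicit).

Keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".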
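To make the limits of the category-based approach concrete, here is a minimal Python 3 sketch of the same idea. The sample strings are my own illustrations, not from the answer above. It shows that characters such as "ø", which have no canonical decomposition, pass through unchanged, so this is an accent stripper and not a general transliterator.

```python
import unicodedata


def strip_accents(s):
    """Drop nonspacing marks (category 'Mn') after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Accented Latin letters lose their marks.
print(strip_accents("M\u00e1laga"))

# Greek letters are kept; only the tonos mark is removed.
print(strip_accents("A \u00c0 \u0394 \u038e"))

# "\u00f8" (o with stroke) has no decomposition, so it is left as-is.
print(strip_accents("S\u00f8ren") == "S\u00f8ren")
```

If letters like "ø" also need to become ASCII, a transliteration table or a library such as Unidecode is the more likely fit.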


慕仙森

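I just found this answer on the web:

    import unicodedata

    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        # Note: the result is a byte string (bytes in Python 3)
        only_ascii = nfkd_form.encode('ASCII', 'ignore')
        return only_ascii

It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping all non-ASCII characters, because that fails for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as diacritics.

Edit: this does the trick:

    import unicodedata

    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

unicodedata.combining(c) returns a truthy value (the nonzero combining class) if the character c can be combined with the preceding character, which is mainly the case when it is a diacritic.

Edit 2: remove_accents expects a Unicode string, not a byte string. If you have a byte string, you must first decode it into a Unicode string like this:

    encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding your data uses
    byte_string = b"caf\xc3\xa9"  # "café" encoded as UTF-8; before Python 3, simply "café"
    unicode_string = byte_string.decode(encoding)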
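The difference between the two versions is easiest to see side by side. Below is a rough Python 3 sketch. The name remove_accents_ascii and the sample text are my own, chosen only for illustration. It decodes a UTF-8 byte string first, as Edit 2 describes, then runs both variants on text that mixes French and Greek.

```python
import unicodedata


def remove_accents_ascii(input_str):
    """First version: decompose, then drop every non-ASCII character."""
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    # Decode back to str so both helpers return the same type in Python 3.
    return nfkd_form.encode("ascii", "ignore").decode("ascii")


def remove_accents(input_str):
    """Edited version: decompose, then drop only combining characters."""
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))


# Illustrative input: "café άλφα" as UTF-8 bytes, decoded before processing.
byte_string = "caf\u00e9 \u03ac\u03bb\u03c6\u03b1".encode("utf-8")
unicode_string = byte_string.decode("utf-8")

# The ASCII-only version discards the whole Greek word.
print(repr(remove_accents_ascii(unicode_string)))

# The combining-based version keeps the Greek letters and drops only the tonos.
print(repr(remove_accents(unicode_string)))
```

Note that NFKD also applies compatibility mappings (for example, it splits ligatures such as "ﬁ" into separate letters), so use NFD instead if only canonical decomposition is wanted.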
