Error when removing punctuation from a string with string.punctuation


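Quick question:

As part of data preprocessing, I'm using string and nltk.stopwords to strip all punctuation and stopwords from a block of text before feeding it into some natural language processing algorithms.

Since I'm still getting used to the process, I tested each component separately on a few raw text blocks, and it seemed to work fine.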

import string
from nltk.corpus import stopwords

def text_process(text):
    """
    Takes in string of text, and does following operations:
    1. Removes punctuation.
    2. Removes stopwords.
    3. Returns a list of cleaned "tokenized" text.
    """
    # Keep only characters that are not ASCII punctuation
    nopunc = [char for char in text.lower() if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Split on whitespace and drop English stopwords
    return [word for word in nopunc.split()
            if word not in stopwords.words('english')]

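However, when I apply this function to the text column of my dataframe (text from a bunch of Pitchfork reviews), I can see that the punctuation isn't actually being removed, even though the stopwords are.

Unprocessed: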

pitchfork['content'].head(5)
0    “Trip-hop” eventually became a ’90s punchline,...
1    Eight years, five albums, and two EPs in, the ...
2    Minneapolis’ Uranium Club seem to revel in bei...
3    Minneapolis’ Uranium Club seem to revel in bei...
4    Kleenex began with a crash. It transpired one ...
Name: content, dtype: object

Processed:

pitchfork['content'].head(5).apply(text_process)
0    [“triphop”, eventually, became, ’90s, punchlin...
1    [eight, years, five, albums, two, eps, new, yo...
2    [minneapolis’, uranium, club, seem, revel, agg...
3    [minneapolis’, uranium, club, seem, revel, agg...
4    [kleenex, began, crash, it, transpired, one, n...
Name: content, dtype: object

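Any ideas about what's going wrong here? I've looked through the documentation, but I haven't seen anyone run into this in quite the same way, so I'd love to understand how to fix it. Thanks a lot!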

摇曳的蔷薇

1 Answer

侃侃无极

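The problem here is that the left and right quotation marks in your text (both single and double) are separate Unicode characters. UTF-8 encodes them differently from the plain ASCII quotes, and string.punctuation only contains the ASCII ones. I would do something like this:

punctuation = [c for c in string.punctuation] + [u'\u201c', u'\u201d', u'\u2018', u'\u2019']
nopunc = [char for char in text.decode('utf-8').lower() if char not in punctuation]

This adds the code points of the non-ASCII quotes to a list named punctuation, then decodes the text from UTF-8 and filters those characters out.

Note: this is Python 2. In Python 3, str is already Unicode, so the .decode('utf-8') call should not be needed (str has no decode method there).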
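For Python 3, a minimal sketch of the same idea might look like the following. The names PUNCTUATION, STOPWORDS, strip_punctuation and strip_unicode_punctuation are made up for illustration, and the tiny STOPWORDS set is only a stand-in for stopwords.words('english') so the snippet runs without the NLTK corpus downloaded. The second helper is an alternative I'm suggesting rather than something from the question: it drops every character whose Unicode category is a punctuation class, which covers curly quotes and dashes without listing them one by one.

```python
import string
import unicodedata

# ASCII punctuation plus the four curly quotes mentioned in the answer.
PUNCTUATION = set(string.punctuation) | {"\u201c", "\u201d", "\u2018", "\u2019"}

# Hypothetical stand-in for stopwords.words('english'); swap in the real
# NLTK list (ideally converted to a set once, outside the function).
STOPWORDS = {"a", "an", "the", "and", "in", "it", "to"}


def strip_punctuation(text):
    """Remove ASCII punctuation and the four curly quotes."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)


def strip_unicode_punctuation(text):
    """Broader alternative: remove any character in a Unicode 'P*' category.

    Note that symbols such as '$' or '+' are in 'S*' categories, so unlike
    string.punctuation this helper leaves them in place.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def text_process(text, stop_words=STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    nopunc = strip_punctuation(text.lower())
    return [word for word in nopunc.split() if word not in stop_words]


if __name__ == "__main__":
    sample = "\u201cTrip-hop\u201d eventually became a \u201990s punchline,"
    print("\u2019" in string.punctuation)  # curly apostrophe is not ASCII punctuation
    print(text_process(sample))
```

With the dataframe from the question, this would be applied the same way as before, e.g. pitchfork['content'].apply(text_process). If other typographic characters show up (en dashes, ellipses and so on), the category-based helper is probably the less fiddly option.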
