我的删除@user 和标点符号的代码不起作用

首页课程实战体系课手记专栏慕课教程

我的删除@user 和标点符号的代码不起作用

我为推文数据集编写了下面的代码，我想进行预处理，我删除了#，网站但是我的删除@user 和标点符号的代码不起作用，我是 python 的新手，有人可以帮助我吗？

from nltk.corpus import stopwords

import spacy, re

nlp = spacy.load('en')

stop_words = [w.lower() for w in stopwords.words()]

def sanitize(input_string):

""" Sanitize one string """

# normalize to lowercase

string = input_string.lower()

# spacy tokenizer

string_split = [token.text for token in nlp(string)]

# in case the string is empty

if not string_split:

return ''

names = re.compile('@[A-Za-z0-9_][A-Za-z0-9_]+')

string = [re.sub(names, '@USER', tweet) for tweet in input_string()]

#remove # and @

for punc in '":!#':

string = string.replace(punc, '')

# remove 't.co/' links

string = re.sub(r'http//t.co\/[^\s]+', '', string, flags=re.MULTILINE)

# removing stop words

string = ' '.join([w for w in string.split() if w not in stop_words])

#punctuation

# string = [''.join(w for w in string.split() if w not in #string.punctuation) for w in string]

return string

list = ['@Jeff_Atwood Thank you for #stackoverflow', 'All hail @Joel_Spolsky t.co/Gsb7V1oVLU #stackoverflow' ]

list_sanitized = [sanitize(string) for string in tweets[:300]]

list_sanitized[:50]

慕标5832272

浏览 102回答 2

2回答

千万里不及你

正则表达式需要修复。尝试类似的东西：names = re.compile('@[A-Za-z0-9_]+') string = re.sub(names, '@USER', input_string)input_string是一个变量而不是一个函数，它也是一个单数字符串，所以你不想遍历它。这将在这里显示得很好：https ://regexr.com/55u44您的标点符号删除工作正常，请参阅：https ://ideone.com/zScVPJ

0 0

Helenr

试试这个：string = [names.sub('@USER', tweet) for tweet in input_string()]

0 0

随时随地看视频慕课网APP