
TfidfVectorizer and SelectKBest error

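I'm following this tutorial on sentiment analysis, and I'm fairly sure my code matches it exactly so far. However, my BOW (bag-of-words) matrix comes out very different from the tutorial's.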

https://www.tensorscience.com/nlp/sentiment-analysis-tutorial-in-python-classifying-reviews-on-movies-and-products

Here is my code so far.

import nltk
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2


def openFile(path):
    #param path: path/to/file.ext (str)
    #Returns contents of file (str)
    with open(path) as file:
        data = file.read()
    return data


imdb_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/imdb_labelled.txt')
amzn_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/amazon_cells_labelled.txt')
yelp_data = openFile('C:/Users/Flengo/Desktop/sentiment/data/yelp_labelled.txt')

datasets = [imdb_data, amzn_data, yelp_data]

combined_dataset = []
# separate samples from each other
for dataset in datasets:
    combined_dataset.extend(dataset.split('\n'))

# separate each label from each sample
dataset = [sample.split('\t') for sample in combined_dataset]

df = pd.DataFrame(data=dataset, columns=['Reviews', 'Labels'])
df = df[df["Labels"].notnull()]
df = df.sample(frac=1)

labels = df['Labels']
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(df['Reviews'])
len(vectorizer.get_feature_names())

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)
bow = vectorizer.fit_transform(df['Reviews'])

bow

Here is my result.

Here is the tutorial's result.

https://img1.mukewang.com/64f71de60001489e06490156.jpg

I've been trying to work out what might be going wrong, but I haven't made any progress yet.



www说
1 answer

LEATH

The problem is that you are passing indices as the vocabulary; pass the actual terms instead. Try this:

import numpy as np

selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)
# on scikit-learn 1.0 and later, use get_feature_names_out() instead of get_feature_names()
vocabulary = np.array(vectorizer.get_feature_names())[selected_features]
vectorizer = TfidfVectorizer(min_df=15, vocabulary=vocabulary)  # you need to supply a real vocab here
bow = vectorizer.fit_transform(df['Reviews'])
bow

<3000x200 sparse matrix of type '<class 'numpy.float64'>'
    with 12916 stored elements in Compressed Sparse Row format>
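To make the index-versus-term distinction concrete, here is a minimal, self-contained sketch on toy data. The `reviews` and `labels` lists are made-up stand-ins for `df['Reviews']` and `df['Labels']`, and `k=3` is chosen only to fit the tiny vocabulary. The `hasattr` check is an assumption about which scikit-learn release is installed: newer releases expose `get_feature_names_out()`, older ones only `get_feature_names()`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data standing in for df['Reviews'] and df['Labels'] (hypothetical).
reviews = [
    "great phone, works great",
    "terrible battery, bad phone",
    "great food and great service",
    "bad food, terrible service",
]
labels = ["1", "0", "1", "0"]

vectorizer = TfidfVectorizer()
bow = vectorizer.fit_transform(reviews)

# get_support(indices=True) returns integer column positions, not words.
selector = SelectKBest(chi2, k=3).fit(bow, labels)
selected_idx = selector.get_support(indices=True)

# Look up the terms at those positions, using whichever accessor exists.
if hasattr(vectorizer, "get_feature_names_out"):
    terms = np.asarray(vectorizer.get_feature_names_out())
else:
    terms = np.asarray(vectorizer.get_feature_names())
vocabulary = terms[selected_idx].tolist()

# Refit with the real vocabulary: one column per selected term.
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
bow_selected = vectorizer.fit_transform(reviews)
print(vocabulary, bow_selected.shape)
```

If refitting the vectorizer is not required, `selector.transform(bow)` also keeps only the selected columns. Its values can differ from the refit matrix, because the refit normalizes each row over the selected terms only.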
