我正在编写一个处理情绪分析的 python 脚本,我对文本进行了预处理并对分类特征进行了矢量化并拆分了数据集,然后我使用了 LogisticRegression 模型,准确率达到了 84 %
当我上传一个新的数据集并尝试部署创建的模型时,我得到了51.84% 的准确率
代码:
import pandas as pd
import numpy as np
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# ML Libraries
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
stop_words = set(stopwords.words('english'))
import joblib
def load_dataset(filename, cols):
dataset = pd.read_csv(filename, encoding='latin-1')
dataset.columns = cols
return dataset
dataset = load_dataset("F:\AIenv\sentiment_analysis\input_2_balanced.csv", ["id","label","date","text"])
dataset.head()
dataset['clean_text'] = dataset['text'].apply(processTweet)
# create doc2vec vector columns
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(dataset["clean_text"].apply(lambda x: x.split(" ")))]
# train a Doc2Vec model with our text data
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
# transform each document into a vector data
doc2vec_df = dataset["clean_text"].apply(lambda x: model.infer_vector(x.split(" "))).apply(pd.Series)
doc2vec_df.columns = ["doc2vec_vector_" + str(x) for x in doc2vec_df.columns]
dataset = pd.concat([dataset, doc2vec_df], axis=1)
手掌心
守着星空守着你
largeQ
幕布斯7119047
相关分类