Introduction
I recently entered a competition and spent most of last week working on it, reading some papers and other material and digging around on GitHub. I picked up a lot of new knowledge in a short time, so I'm striking while the iron is hot and writing it up here.
Picking up from the previous article, let's continue exploring text classification methods. Text classification is one of the most classic NLP tasks, and by now industry and academia have accumulated a large number of methods for it, falling into two broad categories:
Text classification based on traditional machine learning
Text classification based on deep learning
Traditional machine-learning approaches typically extract TF-IDF or bag-of-words features and feed them to a logistic regression (LR) model for training; many other models work here too, such as Naive Bayes and SVM. Deep-learning approaches mainly use CNN, RNN, LSTM, Attention, and so on.
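To make the feature-extraction step concrete before the full experiment below, here is a minimal sketch (using two made-up toy sentences) contrasting bag-of-words counts with TF-IDF weights in sklearn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up toy documents
corpus = ["the cat sat on the mat",
          "the dog ate my homework"]

# Bag-of-words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted by inverse document frequency, so words that
# appear in every document (like "the") are down-weighted
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())

Either matrix can be handed to any of the classifiers used in the experiment below.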
Text Classification with Traditional Machine Learning and Deep Learning
Text Classification Based on Traditional Machine Learning
The basic idea: extract TF-IDF features, then feed them to a variety of classification models for training.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC, LinearSVR
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Use the following 8 categories
selected_categories = [
    'comp.graphics',
    'rec.motorcycles',
    'rec.sport.baseball',
    'misc.forsale',
    'sci.electronics',
    'sci.med',
    'talk.politics.guns',
    'talk.religion.misc']

# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=selected_categories,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=selected_categories,
                                     remove=('headers', 'footers', 'quotes'))
train_texts = newsgroups_train['data']
train_labels = newsgroups_train['target']
test_texts = newsgroups_test['data']
test_labels = newsgroups_test['target']
print(len(train_texts), len(test_texts))

# Fit the same TF-IDF pipeline with each classifier and report its accuracy
classifiers = [
    ('MultinomialNB', MultinomialNB()),
    ('SGDClassifier', SGDClassifier()),
    ('LogisticRegression', LogisticRegression()),
    ('SVC', SVC()),
    ('LinearSVC', LinearSVC()),
    # LinearSVR is a regressor, so its "accuracy" below is essentially zero
    ('LinearSVR', LinearSVR()),
    ('MLPClassifier', MLPClassifier()),
    ('KNeighborsClassifier', KNeighborsClassifier()),
    ('RandomForestClassifier', RandomForestClassifier(n_estimators=8)),
    ('GradientBoostingClassifier', GradientBoostingClassifier()),
    ('AdaBoostClassifier', AdaBoostClassifier()),
    ('DecisionTreeClassifier', DecisionTreeClassifier()),
]
for name, clf in classifiers:
    text_clf = Pipeline([('tfidf', TfidfVectorizer(max_features=10000)),
                         ('clf', clf)])
    text_clf.fit(train_texts, train_labels)
    predicted = text_clf.predict(test_texts)
    print(name, "accuracy:", np.mean(predicted == test_labels))
The output is:
MultinomialNB accuracy: 0.8960196779964222
SGDClassifier accuracy: 0.9724955277280859
LogisticRegression accuracy: 0.9304561717352415
SVC accuracy: 0.13372093023255813
LinearSVC accuracy: 0.9749552772808586
LinearSVR accuracy: 0.00022361359570661896
MLPClassifier accuracy: 0.9758497316636852
KNeighborsClassifier accuracy: 0.45840787119856885
RandomForestClassifier accuracy: 0.9680232558139535
GradientBoostingClassifier accuracy: 0.9186046511627907
AdaBoostClassifier accuracy: 0.5916815742397138
DecisionTreeClassifier accuracy: 0.9758497316636852
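A single accuracy number can hide large per-class differences. As an optional aside, sklearn's classification_report prints per-class precision, recall, and F1 for the most recently fitted pipeline from the loop above:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the last fitted classifier
print(classification_report(test_labels, predicted,
                            target_names=newsgroups_test['target_names']))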
As these results show, different classifiers perform very differently on this dataset, so it is worth trying several methods when doing text classification; you may be pleasantly surprised. Note also that TfidfVectorizer, LogisticRegression, and the other estimators expose many parameters, and these can have a large effect on the results. For example, the ngram_range parameter of TfidfVectorizer directly controls which features are extracted. This is an area that rewards plenty of practice and experimentation.
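To make that concrete, here is a minimal sketch of tuning ngram_range and the classifier's regularization strength together with GridSearchCV; the candidate values are arbitrary illustrations, not tuned recommendations:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=10000)),
                     ('clf', LogisticRegression(max_iter=1000))])

# Pipeline parameters are addressed as <step>__<param>;
# the candidate values here are illustrative only
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],   # unigrams vs unigrams+bigrams
    'clf__C': [0.1, 1.0, 10.0],               # inverse regularization strength
}
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(train_texts, train_labels)
print(search.best_params_, search.best_score_)

GridSearchCV cross-validates every parameter combination on the training set, so the held-out test set stays untouched until the final evaluation.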
Author: 致Great
Link: https://www.jianshu.com/p/3da3f5608a7c