Text Classification (Part 1) - Text Classification with Traditional Machine Learning Methods

Introduction

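I recently entered a competition and spent most of last week on it, reading some papers and other material and digging through GitHub. I ran into a lot of new material all at once, so I am writing it up while it is still fresh.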

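Following on from the previous article, let us keep looking at text classification methods. Text classification is one of the most classic scenarios in NLP, and industry and academia have accumulated many methods for it so far. They fall into two main categories: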

  • Text classification based on traditional machine learning

  • Text classification based on deep learning

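Text classification with traditional machine learning usually extracts TF-IDF or bag-of-words features and then feeds them to a model such as LR for training; many models can be used here, for example naive Bayes and SVM. Text classification with deep learning mainly uses CNN, RNN, LSTM, Attention, and so on.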
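
To make the feature-extraction step concrete, here is a minimal sketch that computes bag-of-words counts and TF-IDF weights by hand on a toy corpus. The corpus, the helper names `bag_of_words` and `tf_idf`, and the plain `tf * log(N / df)` weighting are assumptions for illustration only; scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the bike is fast",
    "the bike is for sale",
    "the doctor is busy",
]


def bag_of_words(doc):
    """Count how often each whitespace-separated token occurs in one document."""
    return Counter(doc.split())


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    counts = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())  # number of tokens in this document
        weights.append(
            {t: (n / total) * math.log(n_docs / df[t]) for t, n in c.items()}
        )
    return weights


weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Terms that occur in every document ("the", "is") get a weight of zero, while a term that occurs in a single document ("doctor") is weighted more heavily than one shared by two documents ("bike"). That is the property that makes TF-IDF features useful as classifier input.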

Text classification with traditional machine learning and deep learning

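  • Text classification with traditional machine learning methods
    The basic idea: extract TF-IDF features, then feed them to various classification models for training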

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
# Import from the public packages; private submodule paths such as
# sklearn.linear_model.logistic are not available in recent scikit-learn releases
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC,LinearSVC,LinearSVR
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Select the following 8 categories
selected_categories = [
    'comp.graphics',
    'rec.motorcycles',
    'rec.sport.baseball',
    'misc.forsale',
    'sci.electronics',
    'sci.med',
    'talk.politics.guns',
    'talk.religion.misc'
]

# Load the dataset
newsgroups_train=fetch_20newsgroups(subset='train',
                                    categories=selected_categories,
                                    remove=('headers','footers','quotes'))
# NOTE: this also loads subset='train', so every accuracy below is measured
# on the training data; use subset='test' for a held-out evaluation
newsgroups_test=fetch_20newsgroups(subset='train',
                                    categories=selected_categories,
                                    remove=('headers','footers','quotes'))

train_texts=newsgroups_train['data']
train_labels=newsgroups_train['target']
test_texts=newsgroups_test['data']
test_labels=newsgroups_test['target']
print(len(train_texts),len(test_texts))

# Naive Bayes
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',MultinomialNB())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("MultinomialNB accuracy:",np.mean(predicted==test_labels))

# SGD
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',SGDClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("SGDClassifier accuracy:",np.mean(predicted==test_labels))

# LogisticRegression
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',LogisticRegression())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("LogisticRegression accuracy:",np.mean(predicted==test_labels))

# SVM
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',SVC())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("SVC accuracy:",np.mean(predicted==test_labels))

text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',LinearSVC())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("LinearSVC accuracy:",np.mean(predicted==test_labels))

# LinearSVR is a regressor: it predicts continuous values, so an exact-match
# comparison against the integer labels is almost never true
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',LinearSVR())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("LinearSVR accuracy:",np.mean(predicted==test_labels))

# MLPClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',MLPClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("MLPClassifier accuracy:",np.mean(predicted==test_labels))

# KNeighborsClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',KNeighborsClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("KNeighborsClassifier accuracy:",np.mean(predicted==test_labels))

# RandomForestClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',RandomForestClassifier(n_estimators=8))])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("RandomForestClassifier accuracy:",np.mean(predicted==test_labels))

# GradientBoostingClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',GradientBoostingClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("GradientBoostingClassifier accuracy:",np.mean(predicted==test_labels))

# AdaBoostClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',AdaBoostClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("AdaBoostClassifier accuracy:",np.mean(predicted==test_labels))

# DecisionTreeClassifier
text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                   ('clf',DecisionTreeClassifier())])
text_clf=text_clf.fit(train_texts,train_labels)
predicted=text_clf.predict(test_texts)
print("DecisionTreeClassifier accuracy:",np.mean(predicted==test_labels))

The output is:

MultinomialNB accuracy: 0.8960196779964222
SGDClassifier accuracy: 0.9724955277280859
LogisticRegression accuracy: 0.9304561717352415
SVC accuracy: 0.13372093023255813
LinearSVC accuracy: 0.9749552772808586
LinearSVR accuracy: 0.00022361359570661896
MLPClassifier accuracy: 0.9758497316636852
KNeighborsClassifier accuracy: 0.45840787119856885
RandomForestClassifier accuracy: 0.9680232558139535
GradientBoostingClassifier accuracy: 0.9186046511627907
AdaBoostClassifier accuracy: 0.5916815742397138
DecisionTreeClassifier accuracy: 0.9758497316636852
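
The near-zero figure for LinearSVR follows from how accuracy is computed here: np.mean(predicted==test_labels) counts exact matches, and a regressor returns floating-point values that almost never equal an integer label exactly. A minimal sketch with made-up numbers (the arrays below are hypothetical, not output from the models above):

```python
import numpy as np

# Hypothetical integer class labels and regressor-style float predictions
labels = np.array([0, 1, 2, 3, 3])
predicted = np.array([0.02, 0.97, 2.0, 3.41, 2.6])

# Exact-match accuracy, computed the same way as in the experiments above
exact = np.mean(predicted == labels)

# Rounding to the nearest valid class index turns the floats back into labels
rounded = np.clip(np.rint(predicted), 0, 3).astype(int)
rounded_acc = np.mean(rounded == labels)

print("exact-match accuracy:", exact)
print("accuracy after rounding:", rounded_acc)
```

Even with rounding, treating class ids as a numeric target imposes an arbitrary ordering on the categories, so a classifier such as LinearSVC is the more natural choice for this task.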

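The results above show that different classifiers perform quite differently on this dataset, so when doing text classification it is worth trying several methods; you may get a pleasant surprise. In addition, TfidfVectorizer, LogisticRegression, and the other components expose many parameters, and these also have a large effect on the results. For example, the ngram_range parameter of TfidfVectorizer directly affects which features are extracted. Getting these settings right takes practice and experimentation.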
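
As a small illustration of the parameter point, the sketch below first shows how ngram_range changes the vocabulary that TfidfVectorizer extracts, then searches over ngram_range and the LogisticRegression C value together with GridSearchCV. The toy texts, labels, and parameter grid are assumptions chosen only to keep the example quick; on real data the grid and the number of folds would need more care.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# 1) ngram_range controls which n-grams become features
sentence = ["text classification is fun"]
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(sentence)
uni_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(sentence)
print(sorted(unigrams.vocabulary_))     # single words only
print(sorted(uni_bigrams.vocabulary_))  # single words plus adjacent word pairs

# 2) Search over vectorizer and classifier parameters together.
# Toy data (hypothetical): label 0 = motorcycles, label 1 = medicine
texts = [
    "new motorcycle engine for sale",
    "the motorcycle ride was fast",
    "bike engine needs new tires",
    "fast bike ride on the road",
    "the doctor prescribed new medicine",
    "patient visits the doctor today",
    "medicine helps the sick patient",
    "hospital doctor treats the patient",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
# "<step name>__<parameter name>" addresses a parameter inside the pipeline
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2)  # 2 folds because the data is tiny
search.fit(texts, labels)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Because GridSearchCV scores each setting on folds that were held out from fitting, the reported score is less optimistic than accuracy measured on the data the model was trained on.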



Author: 致Great
Link: https://www.jianshu.com/p/3da3f5608a7c

