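This dataset contains information about users of a social network: user ID, gender, age, and estimated salary. A car company has just launched a new luxury SUV, and we want to predict which users will buy it. The last column records whether each user made the purchase. We will build a model that predicts the purchase from two variables, age and estimated salary, so our feature matrix consists of these two columns. The goal is to find how a user's age and estimated salary relate to their decision to buy the SUV.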
      User ID  Gender  Age  EstimatedSalary  Purchased
0    15624510    Male   19            19000          0
1    15810944    Male   35            20000          0
2    15668575  Female   26            43000          0
3    15603246  Female   27            57000          0
4    15804002    Male   19            76000          0
5    15728773    Male   27            58000          0
6    15598044  Female   27            84000          0
7    15694829  Female   32           150000          1
8    15600575    Male   25            33000          0
9    15727311  Female   35            65000          0
10   15570769  Female   26            80000          0
11   15606274  Female   26            52000          0
12   15746139    Male   20            86000          0
13   15704987    Male   32            18000          0
14   15628972    Male   18            82000          0
15   15697686    Male   29            80000          0
16   15733883    Male   47            25000          1
17   15617482    Male   45            26000          1
18   15704583    Male   46            28000          1
19   15621083  Female   48            29000          1
20   15649487    Male   45            22000          1
21   15736760  Female   47            49000          1
22   15714658    Male   48            41000          1
23   15599081  Female   45            22000          1
24   15705113    Male   46            23000          1
25   15631159    Male   47            20000          1
26   15792818    Male   49            28000          1
27   15633531  Female   47            30000          1
28   15744529    Male   29            43000          0
29   15669656    Male   31            18000          0
..        ...     ...  ...              ...        ...
370  15611430  Female   60            46000          1
371  15774744    Male   60            83000          1
372  15629885  Female   39            73000          0
373  15708791    Male   59           130000          1
374  15793890  Female   37            80000          0
375  15646091  Female   46            32000          1
376  15596984  Female   46            74000          0
377  15800215  Female   42            53000          0
378  15577806    Male   41            87000          1
379  15749381  Female   58            23000          1
380  15683758    Male   42            64000          0
381  15670615    Male   48            33000          1
382  15715622  Female   44           139000          1
383  15707634    Male   49            28000          1
384  15806901  Female   57            33000          1
385  15775335    Male   56            60000          1
386  15724150  Female   49            39000          1
387  15627220    Male   39            71000          0
388  15672330    Male   47            34000          1
389  15668521  Female   48            35000          1
390  15807837    Male   48            33000          1
391  15592570    Male   47            23000          1
392  15748589  Female   45            45000          1
393  15635893    Male   60            42000          1
394  15757632  Female   39            59000          0
395  15691863  Female   46            41000          1
396  15706071    Male   51            23000          1
397  15654296  Female   50            20000          1
398  15755018    Male   36            33000          0
399  15594041  Female   49            36000          1

[400 rows x 5 columns]
Full code

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

dataset = pd.read_csv('/Users/xiehao/Desktop/100-Days-Of-ML-Code-master/datasets/Social_Network_Ads.csv')

# Data preprocessing: columns 2 and 3 (Age, EstimatedSalary) are the features, column 4 (Purchased) is the label
X = dataset.iloc[:, [2, 3]].values
Y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit logistic regression to the training set
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_pred)
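Step 1: Data preprocessing
As usual:

# Import the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
Y = dataset.iloc[:, 4].values

# Split the dataset into a training set and a test set: 25% test, 75% training (a 1:3 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# Feature scaling: fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)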
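To make the feature-scaling step concrete, here is a minimal sketch of the arithmetic that standardization performs, written with plain NumPy. The small arrays are made-up stand-ins for the Age and EstimatedSalary columns, not rows taken from the actual split, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Made-up stand-ins for the Age and EstimatedSalary columns (illustration only).
X_train_demo = np.array([[19, 19000], [35, 20000], [26, 43000], [47, 25000]], dtype=float)
X_test_demo = np.array([[27, 57000], [45, 26000]], dtype=float)

# "fit": learn the per-column mean and standard deviation from the training set only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same training statistics to both sets.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```

The test set is scaled with the training set's mean and standard deviation, which is why the code above calls fit_transform on X_train but only transform on X_test.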
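Step 2: Logistic regression model
The library for this step is a linear-model library (sklearn.linear_model). It is called linear because logistic regression is a linear classifier: in our two-dimensional feature space, the two classes of users (buyers and non-buyers) are separated by a straight line. We import the logistic regression class, then create an object of that class, which will serve as the classifier for our training set.

# Create a LogisticRegression object and call its fit method on the training set
classifier = LogisticRegression()
classifier.fit(X_train, y_train)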
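The "straight line" can be read directly off a fitted model. The sketch below is an illustration on synthetic data (the arrays X_demo and y_demo are invented here and are not the SUV dataset): it fits a LogisticRegression, reads coef_ and intercept_, and checks that predicting class 1 is the same as landing on the positive side of the line w1*x1 + w2*x2 + b = 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, standing in for scaled age and salary (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# Learned weights and bias of the linear decision function.
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

# A point is assigned class 1 when w1*x1 + w2*x2 + b > 0.
scores = X_demo @ clf.coef_[0] + b
manual_pred = (scores > 0).astype(int)
same_as_predict = np.array_equal(manual_pred, clf.predict(X_demo))

# The boundary itself, solved for x2: x2 = -(w1 * x1 + b) / w2
print(f"boundary: x2 = {-w1 / w2:.3f} * x1 + {-b / w2:.3f}")
print(same_as_predict)
```

Applied to the tutorial's own classifier, the same two attributes (classifier.coef_ and classifier.intercept_) would give the line that separates buyers from non-buyers in the scaled age and salary plane.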
Step 3: Predict the test set results

y_pred = classifier.predict(X_test)
>>> print(y_pred)
[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1]
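The 0/1 labels returned by predict come from thresholding an estimated probability. As a hedged illustration on synthetic data (X_demo and y_demo are invented for this sketch), predict_proba returns one column per class, and for a binary model the predicted label is 1 when the second column exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (assumption, not the SUV dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)    # columns: P(class 0), P(class 1)
labels = clf.predict(X_demo)         # hard 0/1 labels
thresholded = (proba[:, 1] > 0.5).astype(int)

print(proba[:3])
print(np.array_equal(labels, thresholded))
```

Looking at the probabilities rather than only the labels shows how confident the model is about each user, which the 0/1 output alone hides.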
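Step 4: Evaluate the predictions
We have predicted the test set. Now we evaluate whether the logistic regression model learned the relationship correctly. The confusion matrix contains the counts of the model's correct and incorrect predictions.

cm = confusion_matrix(y_test, y_pred)
>>> print(cm)
[[65  3]
 [ 8 24]]

The rows are the true classes and the columns are the predicted classes, so the diagonal entries (65 and 24) are correct predictions and the off-diagonal entries (3 and 8) are errors.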
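The confusion matrix can be reduced to a few summary numbers. The sketch below hard-codes the matrix printed above and derives accuracy, precision, and recall from it with NumPy; it assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm_values = np.array([[65, 3],
                      [8, 24]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm_values.ravel()

accuracy = (tn + tp) / cm_values.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)               # of predicted buyers, how many really bought
recall = tp / (tp + fn)                  # of real buyers, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.3f} recall={recall:.2f}")
# → accuracy=0.89 precision=0.889 recall=0.75
```

So the model is right on 89 of the 100 test users, but it misses 8 of the 32 actual buyers, which accuracy alone would not reveal.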
Author: raphah
Link: https://www.jianshu.com/p/bdfa3a2ea2ab