
How do I predict when the number of features does not match the number of features available in the test set?

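I am using pandas get_dummies to convert categorical variables into dummy/indicator variables, which introduces new features (columns) into the dataset. We then fit/train a model on this dataset.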
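As a minimal sketch of what "introduces new features" means here, the toy frame below is hypothetical (it is not taken from train.csv); it only shows how get_dummies turns one string column into one column per category.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({"Age": [22, 38, 26], "Sex": ["male", "female", "female"]})

encoded = pd.get_dummies(df)

# The numeric column is kept as-is; the string column is replaced
# by one indicator column per distinct category
print(list(encoded.columns))
```

The number of resulting columns therefore depends on which category values happen to appear in the data being encoded.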


Because X_train and X_test are split from the same encoded dataset, they have the same columns, so predicting on X_test works fine.


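Now suppose we have test data (with unknown outcomes) in a separate csv file. When we convert this test data with get_dummies, the resulting dataset may not have the same number of features as the one the model was trained on. When we later use the model on this dataset, prediction fails because the number of features in the test set does not match what the model expects.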
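The mismatch can be reproduced with a small sketch. The two frames below are hypothetical stand-ins for train.csv and the separate test file; the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the training file and the separate test file
train_raw = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
new_raw = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [7.90, 13.00]})

# Each frame is encoded independently, so each only gets columns
# for the categories it actually contains
train_enc = pd.get_dummies(train_raw)
new_enc = pd.get_dummies(new_raw)

# Columns the model was trained on that the new data lacks
missing_in_new = sorted(set(train_enc.columns) - set(new_enc.columns))

print(train_enc.shape[1], new_enc.shape[1])
print(missing_in_new)
```

Because the second frame never contains the categories C and Q, its encoded version has fewer columns than the training data.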


Any idea how we can handle this?


Code:


import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis=1)

# One-hot encode the categorical columns, then fill missing values
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)

X_train, X_test, y_train, y_test = train_test_split(
    features, outcomes, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=50, min_samples_leaf=6, min_samples_split=2)
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
# print(X_train.shape)
y_test_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

# Doing it again to test another set of data
test_data = 'test.csv'
test_data1 = pd.read_csv(test_data)

test_data2 = pd.get_dummies(test_data1)
test_data3 = test_data2.fillna(0.0)
print(test_data2.shape)
# This call fails when test_data3 does not have the same columns as X_train
print(model.predict(test_data3))

白衣染霜花
1 Answer

陪伴而非守候

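A similar question seems to have been asked before, but the most effective and simplest approach is the one described by Thibault Clement:

# Get the columns that are in the training set but missing from the test set
missing_cols = set(X_train.columns) - set(X_test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    X_test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
X_test = X_test[X_train.columns]

It is also worth noting that your model can only use the features it was trained on. If X_test has extra columns compared with X_train, rather than fewer, those columns must be dropped before predicting. The last line above already does this, because selecting X_train.columns keeps only the columns seen during training.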
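As an alternative sketch of the same alignment step (not the method from the linked answer), pandas DataFrame.reindex can add, drop, and reorder columns in one call. The frames and the name X_new are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the training data and the separate test file
train_raw = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
new_raw = pd.DataFrame({"Embarked": ["S", "T"], "Fare": [7.90, 13.00]})

X_train = pd.get_dummies(train_raw)
X_new = pd.get_dummies(new_raw)  # has Embarked_T, lacks Embarked_C and Embarked_Q

# reindex keeps only the training columns, in training order:
# - training columns absent from X_new are created and filled with 0
# - columns never seen in training (Embarked_T) are dropped
X_new_aligned = X_new.reindex(columns=X_train.columns, fill_value=0)

print(list(X_new_aligned.columns))
```

After this step X_new_aligned has the same columns as X_train, so it can be passed to model.predict. Dropping an unseen category such as Embarked_T means those rows are encoded as all zeros for that variable, which may or may not be acceptable for a given model.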