Ridge regression RMSE on every subset is higher than on the full set

I trained a model on one data set and then tried to apply it to each of that set's subsets.

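Mathematically, the total RMSE and MAE (mean absolute error) should lie between the smallest and the largest of the individual subset RMSE and MAE values. However, every individual RMSE and MAE is higher than the total RMSE and MAE.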
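The expectation above can be checked on synthetic data. This is only a sketch: the arrays `y`, `y_pred` and `groups` are made up for illustration and are not the data from the question. The pooled MSE is the size-weighted mean of the subset MSEs, so the pooled RMSE must fall between the smallest and largest subset RMSE, provided the same predictions are used in both evaluations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets, predictions, and a subset label for each row.
y = rng.normal(size=1000)
y_pred = y + rng.normal(scale=0.5, size=1000)
groups = np.arange(1000) % 4  # four equally sized subsets


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def mae(a, b):
    return float(np.mean(np.abs(a - b)))


total_rmse = rmse(y, y_pred)
total_mae = mae(y, y_pred)

masks = [groups == g for g in range(4)]
sizes = np.array([m.sum() for m in masks])
subset_rmse = np.array([rmse(y[m], y_pred[m]) for m in masks])
subset_mae = np.array([mae(y[m], y_pred[m]) for m in masks])

# Pooled metrics rebuilt from the per-subset metrics.
rebuilt_rmse = float(np.sqrt(np.sum(sizes * subset_rmse ** 2) / sizes.sum()))
rebuilt_mae = float(np.sum(sizes * subset_mae) / sizes.sum())

print(subset_rmse.min() <= total_rmse <= subset_rmse.max())
print(subset_mae.min() <= total_mae <= subset_mae.max())
```

If the subset metrics all land above the pooled metric, the predictions being scored on the subsets cannot be the same ones that were scored on the full set.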

This is what I did:

%pyspark

import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import RobustScaler


def preprocessing(features, attributes):
    features_2 = features[attributes]
    y = features['y'].values
    x = features_2.values

    # Scale every column except the first (customers), then clip to [-2, 2].
    robustScaler = RobustScaler(quantile_range=(25.0, 75.0))
    xScaled = robustScaler.fit_transform(x[:, 1:x.shape[1]])
    xScaled[xScaled < -2.0] = -2.0
    xScaled[xScaled > 2.0] = 2.0

    x_TS = xScaled
    x_T0 = xScaled[:, :]

    # Constant term plus linear, quadratic and cubic terms.
    x_T0_all = np.hstack((np.ones((x_T0.shape[0], 1)), x_T0, x_T0**2, x_T0**3))

    # The same terms multiplied by the customers column.
    xCustR = x[:, 0].reshape((x[:, 0].size, 1))
    x_TS_all = np.hstack((xCustR * np.ones((x_TS.shape[0], 1)), xCustR * x_TS,
                          xCustR * (x_TS**2), xCustR * (x_TS**3)))

    x_all = np.hstack((x_T0_all, x_TS_all))
    variable_names = features_2.columns[1:].tolist()
    return x_all, variable_names, y


def trainModel(features, attributes, optAlpha):
    x_all, variable_names, y = preprocessing(features, attributes)
    ridge = linear_model.Ridge(fit_intercept=False, copy_X=True, alpha=optAlpha, solver='auto')
    ridge.fit(x_all, y)
    return ridge


def useModel(features, ridge, attributes):
    x_all, variable_names, y = preprocessing(features, attributes)
    y_pred = ridge.predict(x_all)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    mae = mean_absolute_error(y, y_pred)
    print("RMSE on test set:", round(rmse, 2))
    print("MAE on test set:", round(mae, 2))
    return y_pred, y, rmse, mae


ridge = trainModel(df_features_train, attributes, optAlpha)
useModel(df_features_train, ridge, attributes)

RMSE on test set: 67.05

Any idea what is going wrong?

小怪兽爱吃肉


月关宝盒

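I found it myself. The RobustScaler in the preprocessing step behaves differently on different sets/subsets, because preprocessing calls fit_transform and so refits the scaler on whatever data it is given. The values in a subset are therefore scaled differently from the training data and no longer fit the model.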
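One way to avoid this, sketched below under assumptions, is to fit the scaler once on the training data and only call `transform` on each subset. The data, the coefficients, `alpha=1.0`, and the helper `subset_rmse` are hypothetical and chosen just to make the effect visible; the question's polynomial features and clipping are left out for brevity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical data: the two halves have different feature distributions.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
               rng.normal(3.0, 2.0, size=(200, 3))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=400)

# Fit the scaler ONCE on the training data and keep it next to the model.
scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(X)
model = Ridge(alpha=1.0).fit(scaler.transform(X), y)


def subset_rmse(X_part, y_part, refit):
    if refit:
        # What the question's preprocessing does: a new scaler per call.
        X_scaled = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X_part)
    else:
        # Reuse the medians and quartile ranges learned on the training set.
        X_scaled = scaler.transform(X_part)
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_scaled))))


subsets = [slice(0, 200), slice(200, 400)]
total = subset_rmse(X, y, refit=False)
reused = [subset_rmse(X[s], y[s], refit=False) for s in subsets]
refit = [subset_rmse(X[s], y[s], refit=True) for s in subsets]

print("total:", total)
print("subsets, scaler reused:", reused)
print("subsets, scaler refit:", refit)
```

With the reused scaler the subset RMSEs bracket the total again, while refitting on each subset gives the model inputs on a different scale and pushes the errors up. In the question's code, this would probably mean returning the fitted scaler from the training step and passing it into `useModel`.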
