
Negative accuracy of linear regression

The coefficient of determination R² of my linear regression model is negative.


How can this happen? Any ideas would be helpful.


Here is my dataset:


year,population

1960,22151278.0

1961,22671191.0

1962,23221389.0

1963,23798430.0

1964,24397022.0

1965,25013626.0

1966,25641044.0

1967,26280132.0

1968,26944390.0

1969,27652709.0

1970,28415077.0

1971,29248643.0

1972,30140804.0

1973,31036662.0

1974,31861352.0

1975,32566854.0

1976,33128149.0

1977,33577242.0

1978,33993301.0

1979,34487799.0

1980,35141712.0

1981,35984528.0

1982,36995248.0

1983,38142674.0

1984,39374348.0

1985,40652141.0

1986,41965693.0

1987,43329231.0

1988,44757203.0

1989,46272299.0

1990,47887865.0

1991,49609969.0

1992,51423585.0

1993,53295566.0

1994,55180998.0

1995,57047908.0

1996,58883530.0

1997,60697443.0

1998,62507724.0

1999,64343013.0

2000,66224804.0

2001,68159423.0

2002,70142091.0

2003,72170584.0

2004,74239505.0

2005,76346311.0

2006,78489206.0

2007,80674348.0

2008,82916235.0

2009,85233913.0

2010,87639964.0

2011,90139927.0

2012,92726971.0

2013,95385785.0

2014,98094253.0

2015,100835458.0

2016,103603501.0

2017,106400024.0

2018,109224559.0

Here is the code for my LinearRegression model:


import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv", header=None)
data = data.drop(0, axis=0)

X = data[0]
Y = data[1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, shuffle=False)

lm = LinearRegression()
lm.fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1))
Y_pred = lm.predict(X_test.values.reshape(-1, 1))

accuracy = lm.score(Y_test.values.reshape(-1, 1), Y_pred)
print(accuracy)

Output:

-3592622948027972.5


Asked by 喵喵时光机 · Viewed 112 · 2 Answers

慕盖茨4494581

Here is the formula for the R² score:

    R² = 1 - SUM((y_i - ŷ_i)²) / SUM((y_i - ȳ)²)

where ŷ_i is the predicted value for the i-th observation y_i, and ȳ is the mean of all observations. A negative R² therefore means that if someone knew the mean of your y_test sample and always used it as the "prediction", that "prediction" would be more accurate than your model.

Moving on to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let us take a quick look at the data:

df.population.plot()

It looks like a logarithmic transformation could help.

import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()

Now let us perform a linear regression using OpenTURNS.

import openturns as ot
sam = ot.Sample(np.array(df_log))  # convert the DataFrame to an openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)

Output:

Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469

This is an almost exact fit.

EDIT

As suggested by @Prayson W. Daniel, here is the model fit once transformed back to the original scale.

# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)

# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam)  # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0]))  # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1])  # compute the error
r2 = 1.0 - (error**2).mean() / df.population.var()  # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))

# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph

Output:

R2 score in original scale = 0.9979032805107133
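The mean-baseline interpretation of R² can be checked directly with scikit-learn's r2_score; a minimal sketch with made-up numbers (not the question's dataset):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean of y_true gives R² = 0 by definition.
mean_pred = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_pred))  # 0.0

# Predictions that fit worse than the mean give a negative R².
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
print(r2_score(y_true, bad_pred))   # -3.0
```

Here SS_tot = 5 and the reversed predictions give SS_res = 20, so R² = 1 - 20/5 = -3: any model worse than the constant-mean baseline scores below zero, with no lower bound.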

繁华开满天机

Scikit-learn's LinearRegression score uses the R² score. A negative R² means that the model fits your data very badly. Since R² compares the fit of the model against that of the null hypothesis (a horizontal straight line), R² is negative whenever the model fits worse than a horizontal line.

    R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))

So if SUM((y - ypred)**2) is larger than SUM((y - AVG(y))**2), R² will be negative.

Reasons and how to correct them

Problem 1: You are performing a random split of time-series data. A random split ignores the time dimension.
Solution: preserve the flow of time (see the code below).

Problem 2: The target values are too large.
Solution: unless we use tree-based models, you will have to do some target feature engineering to scale the data into a range the model can learn.

Here is a code example. Using the default parameters of LinearRegression and a log|exp transformation of the target values, my attempt yields an R² score of about 87%:

import pandas as pd
import numpy as np
# We need to transform/feature-engineer our target.
# I will use log from numpy: np.log and np.exp make the values learnable.
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# your data, df
# transform year to a reference year
df = df.assign(ref_year=lambda x: x.year - 1960)
df.population = df.population.astype(int)

split = int(df.shape[0] * .9)  # split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]

X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population

# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log, inverse_func=np.exp)
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))

For those interested in improving this, here is a way to read the dataset:

import pandas as pd
import io

df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0'''))
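The two fixes above (a chronological split plus a log/exp target transform) can be seen working together on synthetic, exactly exponential data, where the log transform makes the relationship perfectly linear. The growth constants below are illustrative, not fitted to the question's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# Synthetic population-like series with exact exponential growth,
# so log(pop) is an exactly linear function of the year.
years = np.arange(1960, 2019).reshape(-1, 1)
pop = 2.2e7 * np.exp(0.027 * (years.ravel() - 1960))

# Chronological split: train on the first 90%, test on the last 10%.
split = int(len(years) * 0.9)
X_train, X_test = years[:split], years[split:]
y_train, y_test = pop[:split], pop[split:]

# Fit a linear model on log(y), predict with exp to return to the
# original scale.
lr = TransformedTargetRegressor(regressor=LinearRegression(),
                                func=np.log, inverse_func=np.exp)
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))  # ≈ 1.0 on exactly exponential data
```

On real data the fit is not exact, but the same pipeline lifts the held-out R² from large-and-negative to the ~0.87 reported in this answer.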