慕盖茨4494581
Here is the formula for the R² score:

    R² = 1 - Σᵢ(yᵢ - ŷᵢ)² / Σᵢ(yᵢ - ȳ)²

where \hat{y}_i is the predicted value for the i-th observation y_i, and \bar{y} is the mean of all observations. A negative R² therefore means that if someone knew the mean of your y_test sample and always used it as the "prediction", that constant "prediction" would be more accurate than your model.

Moving on to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let's take a quick look at the data.

```python
df.population.plot()
```

It looks like a log transform could help.

```python
import numpy as np

df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
```

Now let's perform linear regression using OpenTURNS.

```python
import openturns as ot

sam = ot.Sample(np.array(df_log))  # convert the DataFrame to an openturns Sample
sam.setDescription(['year', 'logarithm of the population'])

linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
```

Output:

```
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
```

This is an almost exact fit.

EDIT

As suggested by @Prayson W. Daniel, here is the model fit after transforming back to the original scale.

```python
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)

# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam)  # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0]))  # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1])  # compute the error
r2 = 1.0 - (error**2).mean() / df.population.var()  # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))

# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
```

Output:

```
R2 score in original scale = 0.9979032805107133
```
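As a minimal sketch of the point about the mean baseline (the numbers here are invented for illustration, not taken from your data): predicting the mean of the observations always gives R² = 0, and any model that does worse than that goes negative.

```python
import numpy as np

def r2(y, y_pred):
    # R² = 1 - sum((y - y_pred)²) / sum((y - mean(y))²)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

y_test = np.array([10.0, 12.0, 14.0, 16.0])          # hypothetical observations
mean_baseline = np.full_like(y_test, y_test.mean())  # always predict the mean
bad_model = np.array([30.0, 1.0, 40.0, 2.0])         # predictions worse than the mean

print(r2(y_test, mean_baseline))  # 0.0 -- the mean baseline scores exactly zero
print(r2(y_test, bad_model))      # negative -- worse than predicting the mean
```

This is the same quantity sklearn's `score` reports, so a negative score from `LinearRegression.score` means exactly this situation.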
繁华开满天机
Scikit-learn's LinearRegression score uses the R² score. A negative R² means the model fits your data very badly. Since R² compares the fit of the model against that of the null hypothesis (a horizontal straight line), R² is negative whenever the model fits worse than a horizontal line.

    R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))

So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), R² will be negative.

Reasons and how to correct them

Problem 1: you are performing a random split of time-series data. A random split ignores the temporal dimension.
Solution: preserve the flow of time (see the code below).

Problem 2: the target values are too large.
Solution: unless we use tree-based models, you will have to do some target feature engineering to scale the data into a range the model can learn.

Here is a code example. Using the default parameters of LinearRegression and a log|exp transformation of the target values, my attempt yields an R² score of about 87%:

```python
import pandas as pd
import numpy as np

# We need to transform/feature-engineer our target.
# I will use log from numpy: np.log and np.exp make the values learnable.
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# your data, df
# transform year to a reference year
df = df.assign(ref_year=lambda x: x.year - 1960)
df.population = df.population.astype(int)

split = int(df.shape[0] * .9)  # split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]

X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population

# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log,
    inverse_func=np.exp)

lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
```

For those interested in making it better, here is a way to read that dataset:

```python
import pandas as pd
import io

df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0'''))
```

Result:
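The chronological split from Problem 1 can be illustrated on a toy upward-trending series (the series, seed, and slope here are invented for the example, not taken from the question's dataset): train on the past, evaluate on the held-out future, and R² stays meaningful.

```python
import numpy as np

# A toy upward-trending series standing in for the population data.
rng = np.random.default_rng(0)
n = 60
t = np.arange(n, dtype=float)
y = 100.0 + 3.0 * t + rng.normal(0.0, 2.0, n)

# Chronological split: train on the first 90%, test on the last 10%.
split = int(n * 0.9)
t_train, t_test = t[:split], t[split:]
y_train, y_test = y[:split], y[split:]

# Fit a straight line on the training part only (np.polyfit: least squares).
slope, intercept = np.polyfit(t_train, y_train, 1)
y_pred = slope * t_test + intercept

# R² on the held-out future segment.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print(1.0 - ss_res / ss_tot)  # out-of-sample R² (high here: the trend is linear)
```

A random split would instead scatter future points into the training set, so the model effectively "sees" the future it is later tested on, which hides problems like the one in the question.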