Why can't I match LGBM's CV score?

I am unable to match LGBM's cross-validation score manually.


Here is an MCVE:


from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

folds = KFold(5, random_state=42)

params = {'random_state': 42}

results = lgb.cv(params, lgb.Dataset(X_train, y_train),
                 folds=folds, num_boost_round=1000,
                 early_stopping_rounds=100, metrics=['auc'])
print('LGBM\'s cv score: ', results['auc-mean'][-1])

clf = lgb.LGBMClassifier(**params, n_estimators=len(results['auc-mean']))

val_scores = []
for train_idx, val_idx in folds.split(X_train):
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    val_scores.append(roc_auc_score(y_train.iloc[val_idx],
                                    clf.predict_proba(X_train.iloc[val_idx])[:, 1]))
print('Manual score: ', np.mean(np.array(val_scores)))

I expected the two CV scores to be identical: I set the random seeds and did exactly the same thing. Yet they differ.


Here is the output I get:


LGBM's cv score:  0.9851513530737058

Manual score:  0.9903622177441328

Why? Am I not using LGBM's cv module correctly?


沧海一幻觉

1 Answer

元芳怎么了

You split X into X_train and X_test. For cv you split X_train into 5 folds, while manually you split X into 5 folds, i.e. you use more points manually than with cv. Change results = lgb.cv(params, lgb.Dataset(X_train, y_train) to results = lgb.cv(params, lgb.Dataset(X, y).

Furthermore, there can be different parameters. For example, the number of threads used by lightgbm changes the result. During cv the models are fitted in parallel, so the number of threads used might differ from your sequential manual training.

EDIT after the first correction:

You can achieve the same results using a manual split / cv with the following code:

from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import numpy as np

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

folds = KFold(5, random_state=42)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
}

data_all = lgb.Dataset(X_train, y_train)
results = lgb.cv(params, data_all,
                 folds=folds.split(X_train),
                 num_boost_round=1000,
                 early_stopping_rounds=100)
print('LGBM\'s cv score: ', results['auc-mean'][-1])

val_scores = []
for train_idx, val_idx in folds.split(X_train):
    data_trd = lgb.Dataset(X_train.iloc[train_idx],
                           y_train.iloc[train_idx],
                           reference=data_all)
    gbm = lgb.train(params,
                    data_trd,
                    num_boost_round=len(results['auc-mean']),
                    verbose_eval=100)
    val_scores.append(roc_auc_score(y_train.iloc[val_idx],
                                    gbm.predict(X_train.iloc[val_idx])))
print('Manual score: ', np.mean(np.array(val_scores)))

which yields

LGBM's cv score:  0.9914524426410262
Manual score:  0.9914524426410262

What makes the difference is this line: reference=data_all. During cv, the binning of the variables (refer to the lightgbm doc) is constructed using the whole dataset (X_train), whereas in your manual for loop it was built on the training subset (X_train.iloc[train_idx]). By passing a reference to the Dataset containing all the data, lightGBM will reuse the same binning, giving the same results.
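The key point, that lgb.cv bins features once on the full training Dataset and shares those bin boundaries with every fold via reference, can also be sketched in isolation. This is a toy illustration of the mechanism, not part of the original answer; the every-other-row subset simply stands in for a CV training fold:

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Bin boundaries are computed from ALL rows of the training data.
data_all = lgb.Dataset(X, y)

# A fold-like subset that reuses those boundaries via `reference`,
# mirroring what lgb.cv does internally for each training fold.
idx = np.arange(0, len(X), 2)
data_fold = lgb.Dataset(X[idx], y[idx], reference=data_all)

params = {'objective': 'binary', 'verbose': -1, 'seed': 42}
booster = lgb.train(params, data_fold, num_boost_round=10)

# Predicted probabilities for every row of the full dataset.
preds = booster.predict(X)
print(preds.shape)

Without reference=data_all, the subset Dataset would compute its own bin boundaries from the subset's values alone, which is exactly the discrepancy the question ran into.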