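I have a dataset with a 70-30 class balance, and I am trying to split it into stratified train/test sets with sklearn's train_test_split function. It works as expected in Python 3.5, but not in 3.7.
Here is the code I used to reproduce it: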
import numpy as np
from sklearn.model_selection import train_test_split
data = np.random.rand(1000000).reshape(100000, 10)
y_0 = [0]*30000
y_1 = [1]*70000
y_2 = y_0 + y_1
x_train, x_test, y_train, y_test = train_test_split(data, y_2, test_size=0.2, random_state=0, stratify=y_2)
print('Train set size : {}'.format(len(y_train)))
print('Value 1 repartition in train set : {}'.format(sum(y_train)/len(y_train)))
print('Test set size : {}'.format(len(y_test)))
print('Value 1 repartition in test set : {}'.format(sum(y_test)/len(y_test)))
Output with Python 3.7:
Train set size : 24102
Value 1 repartition in train set : 0.5414903327524687
Test set size : 20000
Value 1 repartition in test set : 0.53775
Output with Python 3.5:
Train set size : 80000
Value 1 repartition in train set : 0.7
Test set size : 20000
Value 1 repartition in test set : 0.7
Library versions for 3.7:
Python 3.7.2
numpy==1.16.1
pandas==0.24.1
python-dateutil==2.8.0
pytz==2018.9
scikit-learn==0.20.2
scipy==1.2.1
six==1.12.0
Library versions for 3.5:
Python 3.5.1
numpy==1.16.1
pandas==0.24.1
python-dateutil==2.8.0
pytz==2018.9
scikit-learn==0.20.2
scipy==1.2.1
six==1.12.0
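To narrow this down, a small diagnostic like the following could be run in both environments. It is only a sketch: class_fractions is a hypothetical helper, not part of sklearn, and passing the labels as a NumPy array instead of a list is an assumption made for the check, not a confirmed fix. It prints the library versions actually imported at runtime, along with the size and class fractions of each split.

```python
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split


def class_fractions(labels):
    """Hypothetical helper: fraction of samples in each class (labels 0..k)."""
    counts = np.bincount(np.asarray(labels))
    return counts / counts.sum()


# Same shape and class balance as the reproduction above, with a seeded RNG.
rng = np.random.RandomState(0)
data = rng.rand(100000, 10)
y = np.array([0] * 30000 + [1] * 70000)  # assumption: array instead of list

x_train, x_test, y_train, y_test = train_test_split(
    data, y, test_size=0.2, random_state=0, stratify=y)

# Versions imported at runtime, which may differ from what pip freeze lists.
print('scikit-learn version : {}'.format(sklearn.__version__))
print('numpy version : {}'.format(np.__version__))
print('Train size : {}, class fractions : {}'.format(
    len(y_train), class_fractions(y_train)))
print('Test size : {}, class fractions : {}'.format(
    len(y_test), class_fractions(y_test)))
```

If the versions printed in the two interpreters differ, or if the train size is not 80000, that would point to the environment rather than to train_test_split itself.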