How can I stratify a dataset while avoiding ID contamination between train and test?

As a reproducible example, I have the following dataset:


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])
df = df.set_index(['ID'])

df.head()

Out:
     A   B   C   D
ID
12   3  14   4   7
9    5   9   8   4
12  18  17   3  14
1    0  10   1   0
9   10   5  11   9

I need to perform a 70%-30% stratified split (on y), which I know looks like this:


# Train/Test Split
X = df.iloc[:, 0:-1]  # Columns A, B, and C
y = df.iloc[:, -1]    # Column D
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, test_size=0.30, stratify=y)

However, while I want the train and test sets to have the same (or similar enough) distribution of "D", I do not want the same "ID" to appear in both train and test.
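The overlap is easy to verify; a minimal sketch (rebuilding the example dataset above with a fixed seed so it is self-contained) shows that the plain stratified split routinely places the same ID on both sides:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# rebuild the example dataset deterministically
np.random.seed(0)
data = np.random.randint(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D']).set_index(['ID'])

X = df.iloc[:, 0:-1]  # Columns A, B, and C
y = df.iloc[:, -1]    # Column D
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, test_size=0.30, stratify=y, random_state=0)

# with only 20 distinct IDs spread over 300 rows, many IDs land in both sets
shared = set(X_train.index) & set(X_test.index)
print(f'{len(shared)} IDs appear in both train and test')
```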


How can I do this?


泛舟湖上清波郎朗
103 views · 1 Answer

饮歌长啸

Edit: One way to do (something like) what you ask could be to store the IDs by class, and then, for each class, take 70% of those IDs and put the samples carrying them into train, with the rest going into the test set. Note that if each ID occurs a different number of times, this still does not guarantee identical distributions. Moreover, since each ID can belong to several classes in D and must not be shared between train and test, seeking identical distributions becomes a complex optimization problem: each ID can go only into train or only into test, and it brings with it a variable number of classes, depending on which classes appear across all the rows where that ID occurs. A fairly simple way to split the data while approximating a balanced distribution is to iterate over the classes in random order and count each ID only for the first class it appears under, assigning all of its rows to train/test at that point and removing it from consideration for later classes. I found it helpful to treat ID as a regular column for this task, so I changed the code you provided as follows:

# Given snippet (modified)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.randint(0, 20, size=(300, 5))
df = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])

Proposed solution:

import random
from collections import defaultdict

classes = df.D.unique().tolist()  # get unique classes
random.shuffle(classes)           # shuffle to eliminate positional biases
ids_by_class = defaultdict(list)

# iterate over classes
temp_df = df.copy()
for c in classes:
    c_rows = temp_df.loc[temp_df['D'] == c]  # rows with the given class
    ids = c_rows.ID.unique().tolist()        # IDs occurring in these rows
    ids_by_class[c].extend(ids)
    # remove these IDs so they are not counted again for other classes
    temp_df = temp_df[~temp_df.ID.isin(ids)]

# now construct the ID split, class by class
train_ids, test_ids = [], []
for c, ids in ids_by_class.items():
    random.shuffle(ids)           # shuffling can eliminate positional biases
    split = int(len(ids) * 0.7)   # split at 70%
    train_ids.extend(ids[:split])
    test_ids.extend(ids[split:])

# finally use the IDs in train and test to get the
# data split from the original df
train = df.loc[df['ID'].isin(train_ids)]
test = df.loc[df['ID'].isin(test_ids)]

Let's test that the split ratio roughly matches 70/30, that the data is preserved, and that no ID is shared between the train and test dataframes:

# 1) check that elements in Train are roughly 70% and Test 30% of the original df
print(f'Numbers of elements in train: {len(train)}, test: {len(test)} | Perfect split would be train: {int(len(df)*0.7)}, test: {int(len(df)*0.3)}')

# 2) check that concatenating Train and Test gives back the original df
train_test = pd.concat([train, test]).sort_values(by=['ID', 'A', 'B', 'C', 'D'])  # concatenate dataframes into one, and sort
sorted_df = df.sort_values(by=['ID', 'A', 'B', 'C', 'D'])  # sort original df
assert train_test.equals(sorted_df)  # check equality

# 3) check that the IDs are not shared between train/test sets
train_id_set = set(train.ID.unique().tolist())
test_id_set = set(test.ID.unique().tolist())
assert len(train_id_set.intersection(test_id_set)) == 0

Example output (three runs):

Numbers of elements in train: 209, test: 91 | Perfect split would be train: 210, test: 90
Numbers of elements in train: 210, test: 90 | Perfect split would be train: 210, test: 90
Numbers of elements in train: 227, test: 73 | Perfect split would be train: 210, test: 90
