Pandas groupby sampling: handling a sample size larger than the number of elements in a group



I can sample values of a from each group of b, as shown below.

import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=3))

This gives:

b    a
0  3    40
   4    50
   5    60
1  0    10
   2    30
   1    20

However, if I want to sample n elements, n can be at most the number of elements in the smallest group (assuming we want replace=False).

Is there a clean way to sample n elements from each group, capped at the number of items the group actually has?

For example, in the DataFrame above there are three rows where b equals 1.

If I run df.groupby('b').apply(lambda x: x.sample(n=4)) (note n=4), it breaks, because the group where b equals 1 has only three rows.
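To make the failure concrete, here is a minimal sketch that reproduces it on the same DataFrame. The variable name raised is only for illustration; the sketch assumes the default replace=False, under which pandas raises a ValueError when n exceeds the group size.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

raised = False
try:
    # The group b == 0 has four rows, so it samples fine;
    # the group b == 1 has only three rows, so n=4 fails there.
    df.groupby('b')['a'].apply(lambda x: x.sample(n=4))
except ValueError as exc:
    raised = True
    print(exc)
```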

What is a clean way to sample up to n elements from each group?

蓝山帝景

Views: 1781  Answers: 2

2 Answers

慕标5832272

Wrapping n in min is one option:

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
n = 4
df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(n=min(n, len(x))))

Output:

0  3    40
   4    50
   6    70
   5    60
1  2    30
   1    20
   0    10
Name: a, dtype: int64

Or, if you always want to sample every element of each group (i.e. a random shuffle within each group), use frac:

df.groupby('b', as_index=False)['a'].apply(lambda x: x.sample(frac=1))

Output:

0  6    70
   4    50
   5    60
   3    40
1  2    30
   1    20
   0    10
Name: a, dtype: int64

Note that since pandas 1.1.0 you can call sample directly on the groupby object.
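As a rough sketch of how the groupby-level sample could be combined with a cap, one possibility is to shuffle within each group and then keep the first n rows per group. This assumes pandas 1.1.0 or later; the name capped and the random_state value are only illustrative, and this is one way to do it rather than the approach the answer prescribes.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

n = 4  # desired maximum number of rows per group

# Shuffle the rows inside each group (GroupBy.sample, pandas >= 1.1.0),
# then keep at most n rows from each group. The first n rows of a
# random shuffle form a random sample without replacement.
capped = (
    df.groupby('b')
      .sample(frac=1, random_state=0)
      .groupby('b')
      .head(n)
)

# Number of sampled rows per group: capped at n, or the group size if smaller.
sizes = capped.groupby('b').size()
print(sizes)
```

Because head(n) simply returns fewer rows when a group is smaller than n, no explicit min is needed here.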


SMILET

You can adapt the sample size by comparing a pre-specified maximum sample size with the size of each group:

max_sample = 4
df.groupby('b')['a'].apply(lambda x: x.sample(n=max_sample if len(x) > max_sample else len(x)))
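If the same logic is needed in several places, it could be pulled into a small named function. The sketch below is an illustration only: sample_up_to is a hypothetical helper name, and random_state is added just to make the result reproducible (the same seed is reused for every group).

```python
import pandas as pd

def sample_up_to(group, n, random_state=None):
    """Sample at most n items from group without replacement."""
    # Cap n at the group size so sample() never asks for more than exists.
    return group.sample(n=min(n, len(group)), random_state=random_state)

df = pd.DataFrame({'a': [10, 20, 30, 40, 50, 60, 70],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

max_sample = 4
# Extra keyword arguments to apply() are forwarded to sample_up_to.
result = df.groupby('b', group_keys=False)['a'].apply(
    sample_up_to, n=max_sample, random_state=0
)
print(result)
```

With max_sample = 4, both groups are returned in full here (four rows and three rows), since neither group has more than four elements.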

