叮当猫咪
This is one possible solution, in three steps:

1. Remove all sets that do not contain both True and False flags (here, C).
2. Count the number of rows needed for each set and flag combination.
3. Remove every row beyond that count.

This results in the following code:

    import pandas as pd

    df = pd.DataFrame(data={
        "data": [0, 30, -1, 20, 5, 19, 7, 8],
        "Flag": [True, True, False, True, False, False, False, False],
        "Set": ["A", "A", "A", "B", "B", "B", "C", "C"],
    })

    # 1. removing sets with only one of both flags
    reducer = df.groupby("Set")["Flag"].transform("nunique") > 1
    df_reduced = df.loc[reducer]

    # 2. counting the minimum number of rows per set
    counts = df_reduced.groupby(["Set", "Flag"]).count().groupby("Set").min()

    # 3. reducing each set and flag to the minimum number of rows
    #    (.iloc[0] picks the first column of counts by position)
    df_equal = df_reduced.groupby(["Set", "Flag"]) \
        .apply(lambda x: x.head(counts.loc[x["Set"].values[0]].iloc[0])) \
        .reset_index(drop=True)
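Step 3 above goes through `groupby(...).apply(...)`, whose handling of the grouping columns may depend on the pandas version. As a hedged alternative, here is a minimal sketch of the same three steps that replaces `apply` with `cumcount()` and `map()`. The names `both_flags`, `min_rows` and `position` are my own, not part of the answer, and unlike the original this version keeps the original row index:

```python
import pandas as pd

df = pd.DataFrame({
    "data": [0, 30, -1, 20, 5, 19, 7, 8],
    "Flag": [True, True, False, True, False, False, False, False],
    "Set": ["A", "A", "A", "B", "B", "B", "C", "C"],
})

# 1. keep only the sets that contain both True and False flags
both_flags = df.groupby("Set")["Flag"].transform("nunique") > 1
df_reduced = df[both_flags]

# 2. size of the smaller flag group within each set (Series indexed by Set)
min_rows = (
    df_reduced.groupby(["Set", "Flag"]).size()
    .groupby(level="Set").min()
)

# 3. number the rows inside each (Set, Flag) group and keep only
#    the first min_rows[Set] rows of every group
position = df_reduced.groupby(["Set", "Flag"]).cumcount()
df_equal = df_reduced[position < df_reduced["Set"].map(min_rows)]

print(df_equal.index.tolist())  # [0, 2, 3, 4]
```

For the sample data, sets A and B each end up with one True row and one False row, and set C is dropped entirely.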
ITMISS
Edit: I came up with an easy-to-understand, concise solution:

1. Take the .cumcount() grouped by Set and Flag.
2. Check whether each combination of Set and that cumcount result (cc in the code below) is duplicated. A combination with no duplicate means the row needs to be removed.

In[1]:

        data   Flag Set
    0      0   True   A
    1      8   True   A
    2     30   True   A
    3      0   True   A
    4      8   True   A
    5     30   True   A
    6     -1  False   A
    7    -14  False   A
    8     -1  False   A
    9    -14  False   A
    10    20   True   B
    11     5  False   B
    12    19  False   B
    13     7  False   C
    14     8  False   c

Edit 2: Per @Jezrael, the three lines of code further down can be simplified to:

    df = (df[df.assign(cc = df.groupby(['Set', 'Flag'])
               .cumcount()).duplicated(['Set','cc'], keep=False)])

The code below breaks this down further.

    df['cc'] = df.groupby(['Set', 'Flag']).cumcount()
    s = df.duplicated(['Set','cc'], keep=False)
    df = df[s].drop('cc', axis=1)
    df

Out[1]:

        data   Flag Set
    0      0   True   A
    1      8   True   A
    2     30   True   A
    3      0   True   A
    6     -1  False   A
    7    -14  False   A
    8     -1  False   A
    9    -14  False   A
    10    20   True   B
    11     5  False   B

Before the drop, the data looks like this:

    df['cc'] = df.groupby(['Set', 'Flag']).cumcount()
    df['s'] = df.duplicated(['Set','cc'], keep=False)
    # df = df[df['s']].drop('cc', axis=1)
    df

Out[1]:

        data   Flag Set  cc      s
    0      0   True   A   0   True
    1      8   True   A   1   True
    2     30   True   A   2   True
    3      0   True   A   3   True
    4      8   True   A   4  False
    5     30   True   A   5  False
    6     -1  False   A   0   True
    7    -14  False   A   1   True
    8     -1  False   A   2   True
    9    -14  False   A   3   True
    10    20   True   B   0   True
    11     5  False   B   0   True
    12    19  False   B   1  False
    13     7  False   C   0  False
    14     8  False   c   0  False

Then the rows where column s is False are dropped:

    df = df[df['s']]
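To try the cumcount/duplicated idea end to end, here is a small self-contained sketch that wraps it in a function and runs it on the sample data from this answer. The function name `balance_flags` and its parameters are hypothetical, chosen only for illustration. It assumes the flag column takes exactly two values: with three or more, a (Set, cc) pair could be duplicated even though one flag value is missing.

```python
import pandas as pd


def balance_flags(frame, set_col="Set", flag_col="Flag"):
    """Keep, per set, equally many rows of each of the two flag values."""
    # position of every row inside its (set, flag) group: 0, 1, 2, ...
    cc = frame.groupby([set_col, flag_col]).cumcount()
    # a (set, position) pair that occurs twice has a partner row
    # carrying the other flag value; unpaired rows are dropped
    paired = frame.assign(cc=cc).duplicated([set_col, "cc"], keep=False)
    return frame[paired]


df = pd.DataFrame({
    "data": [0, 8, 30, 0, 8, 30, -1, -14, -1, -14, 20, 5, 19, 7, 8],
    "Flag": [True] * 6 + [False] * 4 + [True] + [False] * 4,
    # the last label is a lowercase "c", as in the sample data above
    "Set": ["A"] * 10 + ["B"] * 3 + ["C", "c"],
})

result = balance_flags(df)
print(result.index.tolist())  # [0, 1, 2, 3, 6, 7, 8, 9, 10, 11]
```

Because `assign` returns a copy, the helper column `cc` never appears in `df` or in the result, so no `drop` is needed afterwards. Since "C" and "c" are different labels, each forms a one-row set and both are removed.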