Randomly split a DataFrame into groups with evenly distributed values

This uses a few tricks:

- pd.factorize() converts the categorical data into an integer value for each category
- compute a value / factor f that represents the group/subgroup pair
- randomise it a bit with np.random.uniform(), with the min and max close to 1
- once you have a value that represents the grouping, sort_values() and reset_index() give you a clean, ordered index
- finally, calculate the grouping from the integer remainder of the index

    import random

    import numpy as np
    import pandas as pd

    group = list("ABCD")
    subgroup = list("abcdef")
    df = pd.DataFrame(
        [
            {
                "group": group[random.randint(0, len(group) - 1)],
                "subgroup": subgroup[random.randint(0, len(subgroup) - 1)],
                "value": random.randint(1, 3),
            }
            for i in range(300)
        ]
    )

    bins = 6
    dfc = (
        df.assign(
            # take into account concentration of group and subgroup
            # randomise a bit....
            f=(
                (pd.factorize(df["group"])[0] + 1) * 10
                + (pd.factorize(df["subgroup"])[0] + 1)
                * np.random.uniform(0.99, 1.01, len(df))
            ),
        )
        .sort_values("f")
        .reset_index(drop=True)
        .assign(gc=lambda dfa: dfa.index % bins)
        .drop(columns="f")
    )

    # check distribution ... used plot for SO
    dfc.groupby(["gc", "group", "subgroup"]).count().unstack(0).plot(kind="barh")

    # every group same size...
    # dfc.groupby("gc").count()

    # now it's easy to get each of the cuts.... 0 through 5
    # dfcut0 = dfc.query("gc==0").drop(columns="gc").copy().reset_index(drop=True)
    # dfcut0
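If you would rather check the result numerically than with a plot, something like the sketch below should work. It wraps the same steps in a helper and counts how many rows of each group/subgroup pair land in each bin. The name assign_bins, the seed arguments and the use of np.random.default_rng are my own assumptions for reproducibility, not part of the answer above. The factor of 10 also assumes there are fewer than about nine subgroups, so that the keys of different pairs never overlap after the jitter is applied.

```python
import numpy as np
import pandas as pd


def assign_bins(df, bins, seed=0):
    """Hypothetical helper: label every row with a bin id in column 'gc'."""
    rng = np.random.default_rng(seed)
    # factorize maps each distinct label to an integer code 0, 1, 2, ...
    g = pd.factorize(df["group"])[0] + 1
    s = pd.factorize(df["subgroup"])[0] + 1
    # One sort key per (group, subgroup) pair; the small jitter shuffles
    # the rows inside a pair without mixing different pairs.
    f = g * 10 + s * rng.uniform(0.99, 1.01, len(df))
    return (
        df.assign(f=f)
        .sort_values("f")
        .reset_index(drop=True)
        # consecutive rows go to bins 0, 1, ..., bins-1, 0, 1, ...
        .assign(gc=lambda d: d.index % bins)
        .drop(columns="f")
    )


# Synthetic demo data with the same shape as in the answer (seeded).
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    {
        "group": rng.choice(list("ABCD"), 300),
        "subgroup": rng.choice(list("abcdef"), 300),
        "value": rng.integers(1, 4, 300),
    }
)
binned = assign_bins(demo, bins=6)

# Rows of each (group, subgroup) pair per bin, one column per bin.
per_pair = (
    binned.groupby(["group", "subgroup", "gc"]).size().unstack("gc", fill_value=0)
)
# Largest difference between the fullest and emptiest bin for any pair.
spread = int((per_pair.max(axis=1) - per_pair.min(axis=1)).max())
bin_sizes = binned["gc"].value_counts().sort_index()

print("max per-pair spread across bins:", spread)
print("bin sizes:", bin_sizes.tolist())
```

Because the rows of a pair sit next to each other after sorting and are dealt out round-robin, no pair should differ by more than one row between any two bins, and with 300 rows and 6 bins every bin ends up with 50 rows.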
