Map column values to "miscellaneous" if the value count is below a threshold - categorical column - Pandas DataFrame

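I have a pandas DataFrame of shape ~ [200K, 40]. The DataFrame has a categorical column (one of many) with over 1000 unique values. I can view the count of each unique value in that column using: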

df['column_name'].value_counts()

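How can I now take the values:

  • whose value_count is less than a threshold, say 100, and map them to, say, "miscellaneous"?

  • or select them based on cumulative row count %?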


牧羊人nacy
157 views, 3 answers
3 answers

至尊宝的传说

You can extract the values to mask from the index of value_counts, then use replace to map them to "miscellaneous":

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])

    frequencies = df['A'].value_counts()

    condition = frequencies < 200   # you can define it however you want
    mask_obs = frequencies[condition].index
    mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')

    df['A'] = df['A'].replace(mask_dict)  # or you could make a copy not to modify original data

Now value_counts groups all values below the threshold under miscellaneous:

    df['A'].value_counts()

    Out[18]:
    miscellaneous    947
    3                226
    1                221
    0                204
    7                201
    2                201

德玛西亚99

I think you need:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({ 'A': ['a','a','a','a','b','b','b','c','d']})
    s = df['A'].value_counts()
    print (s)
    a    4
    b    3
    d    1
    c    1
    Name: A, dtype: int64

If you need to sum all values below the threshold:

    threshold = 2
    m = s < threshold
    #filter values under threshold
    out = s[~m]
    #sum values under and create new values to Series
    out['misc'] = s[m].sum()
    print (out)
    a       4
    b       3
    misc    2
    Name: A, dtype: int64

But if you need to rename the index values below the threshold:

    out = s.rename(dict.fromkeys(s.index[s < threshold], 'misc'))
    print (out)
    a       4
    b       3
    misc    1
    misc    1
    Name: A, dtype: int64

If you need to replace the original column, use GroupBy.transform with numpy.where:

    df['A'] = np.where(df.groupby('A')['A'].transform('size') < threshold, 'misc', df['A'])
    print (df)
          A
    0     a
    1     a
    2     a
    3     a
    4     b
    5     b
    6     b
    7  misc
    8  misc
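The question calls the column categorical. If it is actually stored with pandas' `category` dtype (an assumption, the thread does not say), a label such as 'misc' that is not yet one of the categories generally cannot be written into the column directly. A minimal sketch of one way around that, going through object dtype and back; the sample data and the names `rare` and `lumped` are made up for illustration:

```python
import pandas as pd

# Hypothetical sample column stored with the category dtype.
s = pd.Series(list("aaaabbbcd"), dtype="category")
threshold = 2

counts = s.value_counts()
# Labels whose count is below the threshold.
rare = counts.index[counts < threshold]

# Convert to object so the new label is allowed, replace the rare labels,
# then convert back so the result is categorical again.
lumped = s.astype(object).where(~s.isin(rare), "misc").astype("category")

print(lumped.value_counts())
```

Converting back at the end rebuilds the categories from the values that remain, so the rare labels no longer appear as empty categories.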

白衣非少年

An alternative solution:

    import numpy as np

    cond = df['col'].value_counts()
    threshold = 100
    df['col'] = np.where(df['col'].isin(cond.index[cond >= threshold]), df['col'], 'miscellaneous')
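The snippets in this thread all use an absolute count threshold. For the second part of the question (cumulative row count %), here is a rough sketch under some assumptions: the column has object/string dtype, values are kept from most to least frequent until the kept ones cover the requested share of rows, and everything else is lumped together. The function name `lump_by_cumulative_share` and the parameter `keep_share` are made up for illustration and are not part of pandas.

```python
import pandas as pd

def lump_by_cumulative_share(col, keep_share=0.9, other="miscellaneous"):
    """Keep the most frequent values until they cover `keep_share` of the rows."""
    # Relative frequency of each value, sorted from most to least frequent.
    share = col.value_counts(normalize=True)
    # Share of rows already covered before each value is added.
    covered_before = share.cumsum() - share
    # A value is kept while the values ahead of it cover less than keep_share.
    keep = share.index[covered_before < keep_share]
    # Rows whose value is not kept (including missing values) get the `other` label.
    return col.where(col.isin(keep), other)

# Hypothetical sample data: a covers 50% of rows, b 30%, c and d 10% each.
df = pd.DataFrame({"col": list("aaaaabbbcd")})
df["col"] = lump_by_cumulative_share(df["col"], keep_share=0.75)
print(df["col"].value_counts())
```

With `keep_share=0.75`, a and b are kept because the values ahead of them cover 0% and 50% of the rows, while c and d are lumped because 80% is already covered by the time they are reached.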