How do I group by and get the three most frequent values?

I want to group by ID and get the three most frequent cities. For example, I have this original dataframe:


    ID    City
    1     London
    1     London
    1     New York
    1     London
    1     New York
    1     Berlin
    2     Shanghai
    2     Shanghai

The result I want looks like this:


    ID  first_frequent_city  second_frequent_city  third_frequent_city
    1   London               New York              Berlin
    2   Shanghai             NaN                   NaN
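
To make the example reproducible, the input table can be built as a DataFrame. This is a minimal sketch that assumes the ID column holds integers and the City column holds strings; the name `counts` is only illustrative. It also prints the per-ID frequencies that the desired result is derived from.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2],
    "City": ["London", "London", "New York", "London",
             "New York", "Berlin", "Shanghai", "Shanghai"],
})

# Frequency of each city per ID, most frequent first within each ID.
counts = df.groupby("ID")["City"].value_counts()
print(counts)
```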


汪汪一只猫
4 answers

MMMHUHU

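The first step is to use SeriesGroupBy.value_counts to count the values of City per ID; the advantage is that the values are already sorted. Then get a counter with GroupBy.cumcount, keep the first 3 values with loc, pivot with DataFrame.pivot, change the column names, and finally convert ID to a column with DataFrame.reset_index:

    df = (df.groupby('ID')['City'].value_counts()    # counts per ID, sorted descending
            .groupby(level=0).cumcount()             # position of each city within its ID
            .loc[lambda x: x < 3]                    # keep the first 3 positions
            .reset_index(name='c')
            # keyword arguments: recent pandas versions no longer accept these positionally
            .pivot(index='ID', columns='c', values='City')
            .rename(columns={0:'first_', 1:'second_', 2:'third_'})
            .add_suffix('frequent_city')
            .rename_axis(None, axis=1)
            .reset_index())
    print(df)

       ID first_frequent_city second_frequent_city third_frequent_city
    0   1              London             New York              Berlin
    1   2            Shanghai                  NaN                 NaN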
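
The same chain can be wrapped in a small helper so that the number of values is a parameter. This is a sketch, not part of the original answer: the function name `top_n_per_group` and the `top_1`, `top_2`, ... column names are hypothetical, and ties are resolved in whatever order value_counts returns them.

```python
import pandas as pd


def top_n_per_group(df, group_col, value_col, n=3):
    """Return one row per group with its n most frequent values as columns."""
    ranked = (
        df.groupby(group_col)[value_col]
        .value_counts()                 # counts, sorted descending within each group
        .groupby(level=0).cumcount()    # 0, 1, 2, ... position within each group
        .loc[lambda s: s < n]           # keep only the first n positions
        .reset_index(name="rank")
    )
    wide = ranked.pivot(index=group_col, columns="rank", values=value_col)
    # Make sure all n columns exist even if no group has n distinct values.
    wide = wide.reindex(columns=range(n))
    wide.columns = [f"top_{i + 1}" for i in range(n)]
    return wide.reset_index()


df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2],
    "City": ["London", "London", "New York", "London",
             "New York", "Berlin", "Shanghai", "Shanghai"],
})
result = top_n_per_group(df, "ID", "City", n=3)
print(result)
```

Reindexing the columns before renaming keeps the output shape stable, for example when every group has fewer than n distinct values.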

喵喔喔

Another way is to use the count as the reference for sorting, then recreate the dataframe by looping through the groupby object:

    import numpy as np
    import pandas as pd

    df = (df.assign(count=df.groupby(["ID","City"])["City"].transform("count"))
            .drop_duplicates(["ID","City"])
            .sort_values(["ID","count"], ascending=False))

    # np.nan replaces the np.NaN alias, which was removed in NumPy 2.0
    print(pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.nan))

              0         1       2
    0    London  New York  Berlin
    1  Shanghai       NaN     NaN
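
The loop over the groupby object can also be written out explicitly. The sketch below swaps in collections.Counter from the standard library for the counting step, which is not what the answer above uses; the `labels` and `rows` names are illustrative, and groups with fewer than three cities are padded with None.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2],
    "City": ["London", "London", "New York", "London",
             "New York", "Berlin", "Shanghai", "Shanghai"],
})

labels = ["first_frequent_city", "second_frequent_city", "third_frequent_city"]

rows = {}
for group_id, group in df.groupby("ID"):
    # most_common(3) returns up to three (city, count) pairs, highest count first.
    top = [city for city, _ in Counter(group["City"]).most_common(3)]
    # Pad short groups so every row has exactly three entries.
    top += [None] * (3 - len(top))
    rows[group_id] = top

result = (
    pd.DataFrame.from_dict(rows, orient="index", columns=labels)
    .rename_axis("ID")
    .reset_index()
)
print(result)
```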

湖上湖

Get the .count by ID and City, then use np.where() with .groupby() and max, median and min. Then set the index and unstack the rows into columns on the max column. Note that this fits the sample data but is not a general ranking: the min/median/max comparison only labels a city whose count equals one of those statistics exactly.

    import numpy as np

    df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
    df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('min')), 'third_frequent_city', np.nan)
    df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('median')), 'second_frequent_city', df['max'])
    df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('max')), 'first_frequent_city', df['max'])
    df = df.drop('count',axis=1).set_index(['ID', 'max']).unstack(1)

Output:

        City
    max first_frequent_city second_frequent_city third_frequent_city
    ID
    1                London             New York              Berlin
    2              Shanghai                  NaN                 NaN
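
When a group has more than three cities or tied counts, an explicit rank may be easier to reason about than min/median/max. The following sketch uses GroupBy.rank instead, which is a different technique from the answer above; the `rank` and `label` column names are assumptions, and ties are broken by row order via method="first".

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 2],
    "City": ["London", "London", "New York", "London",
             "New York", "Berlin", "Shanghai", "Shanghai"],
})

# One row per (ID, City) pair with its frequency.
counts = df.groupby(["ID", "City"]).size().reset_index(name="count")

# Rank cities within each ID by count: 1 = most frequent.
counts["rank"] = (
    counts.groupby("ID")["count"]
    .rank(method="first", ascending=False)
    .astype(int)
)

labels = {1: "first_frequent_city",
          2: "second_frequent_city",
          3: "third_frequent_city"}

top3 = counts[counts["rank"] <= 3]
result = (
    top3.assign(label=top3["rank"].map(labels))
    .pivot(index="ID", columns="label", values="City")
    .reindex(columns=list(labels.values()))   # fixed column order, all three present
)
print(result)
```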

qq_笑_17

It is a bit long; essentially you groupby twice. The first part works on the idea that grouping sorts the data in ascending order, and the second part lets us split the data into individual columns. Note that tail(3) takes the last three rows of each group rather than counting occurrences, so this matches the expected result only because of how the sample rows happen to be ordered.

    (df
     .groupby("ID")
     .tail(3)                       # last three rows of each ID
     .drop_duplicates()
     .groupby("ID")
     .agg(",".join)                 # join the remaining cities into one string per ID
     .City.str.split(",", expand=True)
     .set_axis(["first_frequent_city",
                "second_frequent_city",
                "third_frequent_city"],
               axis="columns")
    )

       first_frequent_city second_frequent_city third_frequent_city
    ID
    1               London             New York              Berlin
    2             Shanghai                 None                None