Tricky cascading grouping in pandas

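I have an odd problem that I want to solve in pandas. Suppose I have a bunch of objects that can be grouped in different ways. This is what our DataFrame looks like: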


import pandas as pd

df = pd.DataFrame([
    {'obj': 'Ball',    'group1_id': None, 'group2_id': '7' },
    {'obj': 'Balloon', 'group1_id': '92', 'group2_id': '7' },
    {'obj': 'Person',  'group1_id': '14', 'group2_id': '11'},
    {'obj': 'Bottle',  'group1_id': '3',  'group2_id': '7' },
    {'obj': 'Thought', 'group1_id': '3',  'group2_id': None},
])



obj       group1_id          group2_id
Ball      None               7
Balloon   92                 7
Person    14                 11
Bottle    3                  7
Thought   3                  None

I want to group things together based on any of the groups. Here it is annotated:


obj       group1_id          group2_id    # annotated
Ball      None               7            #                   group2_id = 7
Balloon   92                 7            # group1_id = 92 OR group2_id = 7
Person    14                 11           # group1_id = 14 OR group2_id = 11
Bottle    3                  7            # group1_id =  3 OR group2_id = 7
Thought   3                  None         # group1_id =  3

Once combined, our output should look like this:


count         objs                               composite_id
4             [Ball, Balloon, Bottle, Thought]   g1=3,92|g2=7
1             [Person]                           g1=14|g2=11

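Note that the first three objects can be collected through group2_id=7, and the fourth, Thought, joins them because its group1_id=3 matches Bottle, which in turn carries group2_id=7. Note: for this question, assume an item will only ever belong to one composite group (there will never be a situation where it could belong to two).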


How can I do this in pandas?


汪汪一只猫
2 Answers

郎朗坤

This is not strange at all; it is a network (graph) problem.

import networkx as nx

# Handle the missing values first: fill each one from the other column of the
# same row, so that rows are not classed into the wrong group
df['key1'] = df['group1_id'].fillna(df['group2_id'])
df['key2'] = df['group2_id'].fillna(df['group1_id'])

# Build the network: each row becomes an edge between its two keys
G = nx.from_pandas_edgelist(df, 'key1', 'key2')
l = list(nx.connected_components(G))
L = [dict.fromkeys(y, x) for x, y in enumerate(l)]
d = {k: v for d in L for k, v in d.items()}

# Use the dict above to map every key in the same component to the same label,
# then group by that label
out = df.groupby(df.key1.map(d)).agg(
    objs=('obj', list),
    Count=('obj', 'count'),
    g1=('group1_id', lambda x: set(x[x.notnull()].tolist())),
    g2=('group2_id', lambda x: set(x[x.notnull()].tolist())),
)

# Note that I did not convert the composite id to a string; I kept the two
# parts in separate columns, which is easier to read

Out[53]:
                                  objs  Count       g1    g2
key1
0     [Ball, Balloon, Bottle, Thought]      4  {92, 3}   {7}
1                             [Person]      1     {14}  {11}
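The answer above keeps the composite id as two set-valued columns, and it places values from both id columns into one graph, so a group1_id of '7' and a group2_id of '7' would become the same node. Below is a minimal sketch of one way to handle both points, assuming the question's df. The `('row', idx)` / `('g1', value)` node scheme, the `component` column, and the `join_ids` helper are illustrative names and not part of the original answer. The sort in `join_ids` assumes ids are non-negative integer strings without leading zeros.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame([
    {'obj': 'Ball',    'group1_id': None, 'group2_id': '7'},
    {'obj': 'Balloon', 'group1_id': '92', 'group2_id': '7'},
    {'obj': 'Person',  'group1_id': '14', 'group2_id': '11'},
    {'obj': 'Bottle',  'group1_id': '3',  'group2_id': '7'},
    {'obj': 'Thought', 'group1_id': '3',  'group2_id': None},
])

# Bipartite graph: one node per row, one node per (column prefix, id) pair.
# Prefixing keeps group1 and group2 ids in separate namespaces.
G = nx.Graph()
for idx, row in df.iterrows():
    row_node = ('row', idx)
    G.add_node(row_node)
    for col, prefix in (('group1_id', 'g1'), ('group2_id', 'g2')):
        if pd.notna(row[col]):
            G.add_edge(row_node, (prefix, row[col]))

# Label every row with the number of the connected component it belongs to.
component = {}
for label, nodes in enumerate(nx.connected_components(G)):
    for kind, value in nodes:
        if kind == 'row':
            component[value] = label
df['component'] = df.index.map(component)

def join_ids(series):
    """Join the distinct non-missing ids of one group, shortest first."""
    ids = sorted(series.dropna().unique(), key=lambda s: (len(s), s))
    return ','.join(ids)

out = df.groupby('component').agg(
    count=('obj', 'count'),
    objs=('obj', list),
    g1=('group1_id', join_ids),
    g2=('group2_id', join_ids),
)
out['composite_id'] = 'g1=' + out['g1'] + '|g2=' + out['g2']
out = out.drop(columns=['g1', 'g2']).reset_index(drop=True)
print(out)
```

Making each row a node of its own also means rows with a missing id need no `fillna` step: such a row simply has one edge instead of two.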

红糖糍粑

Here is a more verbose solution, in which I build a map of the "first key" for each grouping set:

from collections import defaultdict

import pandas as pd

# using four id fields instead of 2; every column listed here must exist in df
grouping_fields = ['group1_id', 'group2_id', 'group3_id', 'group4_id']
id_fields = df.loc[df[grouping_fields].notnull().any(axis=1), grouping_fields]

# build a set of all similarly-grouped items
# and use the 'first seen' key as the grouping key for that set
FIRST_SEEN_TO_ALL = defaultdict(set)
KEY_TO_FIRST_SEEN = {}
for row in id_fields.to_dict('records'):
    # NaN is truthy, so a plain boolean check does not drop it; use pd.notna
    keys = [key for key in row.values() if pd.notna(key)]
    # reuse the group of any key seen on an earlier row;
    # otherwise this row's first key starts a new group
    first_seen_key = next(
        (KEY_TO_FIRST_SEEN[key] for key in keys if key in KEY_TO_FIRST_SEEN),
        keys[0],
    )
    # note: a row that links two groups already seen separately is not merged here
    for key in keys:
        KEY_TO_FIRST_SEEN.setdefault(key, first_seen_key)
        FIRST_SEEN_TO_ALL[first_seen_key].add(key)

def fetch_group_id(row):
    # return the first-seen key for the first non-missing id in the row
    for key in row.dropna():
        first_seen_key = KEY_TO_FIRST_SEEN.get(key)
        if first_seen_key is not None:
            return first_seen_key

df['group_super'] = df[grouping_fields].apply(fetch_group_id, axis=1)
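The first-seen mapping idea can also be written as a union-find (disjoint-set) structure, which additionally copes with a later row bridging two groups that were first seen separately. This is a minimal sketch under the assumption that only the question's two id columns are used. `parent`, `find`, `union`, `row_keys`, and the `'column:value'` key format are illustrative names rather than anything from the answer above, and the resulting `group_super` label is just an arbitrary representative key of each group.

```python
import pandas as pd

df = pd.DataFrame([
    {'obj': 'Ball',    'group1_id': None, 'group2_id': '7'},
    {'obj': 'Balloon', 'group1_id': '92', 'group2_id': '7'},
    {'obj': 'Person',  'group1_id': '14', 'group2_id': '11'},
    {'obj': 'Bottle',  'group1_id': '3',  'group2_id': '7'},
    {'obj': 'Thought', 'group1_id': '3',  'group2_id': None},
])

grouping_fields = ['group1_id', 'group2_id']
parent = {}  # maps each key to its parent; a root is its own parent

def find(key):
    """Return the root of key's set, registering the key if it is new."""
    parent.setdefault(key, key)
    while parent[key] != key:
        parent[key] = parent[parent[key]]  # path halving keeps trees shallow
        key = parent[key]
    return key

def union(a, b):
    """Merge the sets containing a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

def row_keys(row):
    """Non-missing ids of a row, prefixed with their column name."""
    return [f'{col}:{row[col]}' for col in grouping_fields if pd.notna(row[col])]

# Every id that appears on the same row belongs to the same set.
for _, row in df.iterrows():
    keys = row_keys(row)
    for key in keys:
        find(key)
    for other in keys[1:]:
        union(keys[0], other)

def fetch_group_id(row):
    keys = row_keys(row)
    return find(keys[0]) if keys else None

df['group_super'] = [fetch_group_id(row) for _, row in df.iterrows()]
grouped = df.groupby('group_super')['obj'].apply(list)
print(grouped)
```

Because `union` merges whole sets, the order in which rows arrive does not change which objects end up together, only which key is chosen as the label.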