Trying to filter a column in a data frame by applying conditions on other columns

My csv file has 3 columns: account_id, game_variant and no_of_games. The table looks like this:



account_id    game_variant   no_of_games
130               a             2
145               c             1
130               b             4
130               c             1
142               a             3
140               c             2
145               b             5


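So I want to extract the number of games played in the variants a, b, c, a∩b, b∩c, a∩c and a∩b∩c.

I was able to extract the games played in a, b and c individually by grouping on game_variant and summing no_of_games, but I can't work out the logic for the intersections. Please help me with this.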


data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})

Thanks in advance.


梵蒂冈之花
1 answer

一只甜甜圈

The solution here returns the intersections on a per-player level. It also uses a defaultdict, which is very convenient for this case. I explain the code inline.

    from itertools import combinations
    from collections import defaultdict
    from pprint import pprint  # only needed for pretty printing of the dictionary

    import pandas

    # assuming the data frame is in a file df.csv
    df = pandas.read_csv('df.csv', sep=r'\s+')

    # group by account_id to get sub-frames which each refer to only one account
    data_agg2 = df.groupby('account_id')

    # A defaultdict is a dictionary which, when a key is missing, calls the given
    # function to create the element. This removes the need to check whether a key
    # is already present, or to set up all combinations in advance.
    games_played_2 = defaultdict(int)

    # iterate over all accounts
    for el in data_agg2.groups:
        # extract the sub-dataframe for this account from the grouped object
        tmp = data_agg2.get_group(el)
        # print(tmp)  # you can uncomment this to see each account

        # This is in principle the same loop as in the simpler version further
        # down. However, as not every player has played all variants, one only
        # has to create the combinations that are possible for that player.
        for i in range(len(tmp.loc[:, 'no_of_games'])):
            # As game_variant is now a column and not the index, the first part
            # of zip is slightly adapted. This loops over all combinations of
            # variants for the current account.
            for comb, combsum in zip(combinations(tmp.loc[:, 'game_variant'], i + 1),
                                     combinations(tmp.loc[:, 'no_of_games'].values, i + 1)):
                # Each variant combination gets a unique key. comb is sorted, as
                # the variants might not be in alphabetical order. The games this
                # player played in each variant of the combination are added to
                # the total accumulated over all previous players.
                games_played_2['_'.join(sorted(comb))] += sum(combsum)

    pprint(games_played_2)

This returns:

    defaultdict(<class 'int'>,
                {'a': 5,
                 'a_b': 6,
                 'a_b_c': 7,
                 'a_c': 3,
                 'b': 9,
                 'b_c': 11,
                 'c': 4})

Since you have already extracted the number of games played per variant, you can simply add them up. If you want to do this automatically, you can use itertools.combinations in a loop that iterates over all possible combination lengths:

    from itertools import combinations
    from pprint import pprint  # only needed for pretty printing of the dictionary

    import numpy as np
    import pandas

    # assuming the data frame is in a file df.csv
    df = pandas.read_csv('df.csv', sep=r'\s+')
    data_agg = df.groupby(['game_variant']).agg({'no_of_games': [np.sum]})

    games_played = {}
    for i in range(len(data_agg.loc[:, 'no_of_games'])):
        for comb, combsum in zip(combinations(data_agg.index, i + 1),
                                 combinations(data_agg.loc[:, 'no_of_games'].values, i + 1)):
            games_played['_'.join(comb)] = sum(combsum)

    pprint(games_played)

This returns:

    {'a': array([5], dtype=int64),
     'a_b': array([14], dtype=int64),
     'a_b_c': array([18], dtype=int64),
     'a_c': array([9], dtype=int64),
     'b': array([9], dtype=int64),
     'b_c': array([13], dtype=int64),
     'c': array([4], dtype=int64)}

combinations(sequence, number) returns an iterator over all combinations of number elements from sequence. So to get all possible combinations, you have to iterate over all numbers from 1 to len(sequence). That is what the first for loop does.

The next for loop consists of two iterators: one over the index of the aggregated data (combinations(data_agg.index, i+1)) and one over the number of games actually played in each variant (combinations(data_agg.loc[:, 'no_of_games'].values, i+1)). So comb should always be the list of variants, and combsum the list of the number of games played per variant. Note that to get all values you have to use .loc[:, 'no_of_games'] and not .loc['no_of_games'], because the latter looks for an index label named 'no_of_games', whereas it is a column name.

Then I set the key of the dictionary to the joined string of the variant list, and the value to the sum of the elements of the games-played list.
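As a cross-check of the per-player approach, the same totals can probably be obtained by pivoting the data so that each account is a row and each variant a column, then keeping only the accounts that have a value in every variant of a combination. The following is only a sketch under assumptions: the sample data is typed in by hand from the question instead of being read from df.csv, and the helper name games_per_intersection is hypothetical, not something from the original answer.

```python
from itertools import combinations

import pandas as pd

# Sample data copied by hand from the question (assumption: this replaces df.csv).
df = pd.DataFrame({
    "account_id": [130, 145, 130, 130, 142, 140, 145],
    "game_variant": ["a", "c", "b", "c", "a", "c", "b"],
    "no_of_games": [2, 1, 4, 1, 3, 2, 5],
})


def games_per_intersection(frame):
    """Hypothetical helper: total games per variant combination, counting
    only accounts that played every variant in that combination."""
    # One row per account, one column per variant; NaN means "never played".
    wide = frame.pivot_table(index="account_id", columns="game_variant",
                             values="no_of_games", aggfunc="sum")
    variants = sorted(wide.columns)
    totals = {}
    for size in range(1, len(variants) + 1):
        for comb in combinations(variants, size):
            # Drop accounts that are missing any variant of this combination.
            played_all = wide[list(comb)].dropna()
            totals["_".join(comb)] = int(played_all.to_numpy().sum())
    return totals


result = games_per_intersection(df)
print(result)
```

For the sample table this gives 6 for a_b, 3 for a_c, 11 for b_c and 7 for a_b_c, the same values as the defaultdict version. The pivot makes the "played all of these variants" condition explicit through dropna, at the cost of building a wide table with one column per variant.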