Apply vs transform on a group object

Consider the following dataframe:


     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922

The following commands work:


> df.groupby('A').apply(lambda x: (x['C'] - x['D']))

> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

But none of the following work:


> df.groupby('A').transform(lambda x: (x['C'] - x['D']))

ValueError: could not broadcast input array from shape (5) into shape (5,3)


> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

 TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise processing:


# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (one that does not aggregate). Where am I wrong?


For reference, here is the construction of the original dataframe above:


import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})


叮当猫咪
3 Answers

慕的地10843

I was similarly confused about the .transform operation versus .apply, and I found a few answers that shed some light on the issue. One of them, for example, was very helpful.

My takeaway so far is that .transform works on (or deals with) Series (columns) in isolation from each other. This means that in your last two calls:

    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

you asked .transform to take values from two columns, and "it" actually does not "see" both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a Series (or group of Series) "made" of scalars that are repeated len(input_column) times.

So the scalar that .transform uses to build the Series is the result of some reduction function applied to an input Series (and only to ONE series/column at a time).

Consider this example (on your dataframe):

    zscore = lambda x: (x - x.mean()) / x.std()  # Note that it does not reference anything outside of 'x', and for transform 'x' is one column.
    df.groupby('A').transform(zscore)

will yield:

           C      D
    0  0.989  0.128
    1 -0.478  0.489
    2  0.889 -0.589
    3 -0.671 -1.150
    4  0.034 -0.285
    5  1.149  0.662
    6 -1.404 -0.907
    7 -0.509  1.653

This is exactly the same as using it on only one column at a time:

    df.groupby('A')['C'].transform(zscore)

yielding:

    0    0.989
    1   -0.478
    2    0.889
    3   -0.671
    4    0.034
    5    1.149
    6   -1.404
    7   -0.509

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

    df.groupby('A').apply(zscore)

gives the error:

    ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is assigning the result of a reduction function back to the original dataframe.

    df['sum_C'] = df.groupby('A')['C'].transform(sum)
    df.sort_values('A')  # to clearly see that the scalar ('sum') applies to the whole column of the group

yielding:

         A      B      C      D  sum_C
    1  bar    one  1.998  0.593  3.973
    3  bar  three  1.287 -0.639  3.973
    5  bar    two  0.687 -1.027  3.973
    4  foo    two  0.205  1.274  4.373
    2  foo    two  0.128  0.924  4.373
    6  foo    one  2.113 -0.516  4.373
    7  foo  three  0.657 -1.179  4.373
    0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series, which it does not know how to broadcast back:

    df.groupby('A')['C'].apply(sum)

giving:

    A
    bar    3.973
    foo    4.373

There are also cases where .transform is used to filter the data:

    df[df.groupby(['B'])['D'].transform(sum) < -1]

         A      B      C      D
    3  bar  three  1.287 -0.639
    7  foo  three  0.657 -1.179

I hope this adds a bit more clarity.
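One way to get what the two failing calls in the question were after is to combine the columns before grouping, so that transform only ever receives a single Series. The following is a minimal sketch under that assumption; the small frame and the column names `C_minus_D` and `mean_diff` are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Small deterministic stand-in for the question's frame (values are illustrative).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [0.5, 1.0, 1.5, 2.0],
})

# Combine the two columns first; this is an ordinary row-wise operation
# and needs no groupby at all.
diff = df["C"] - df["D"]
df["C_minus_D"] = diff

# Now group the single Series by the key column and broadcast each
# group's mean back to every row of that group.
df["mean_diff"] = diff.groupby(df["A"]).transform("mean")

print(df)
```

Because `diff` is a single Series, transform never has to look at two columns at once, and the result keeps the original row index so it can be assigned straight back to the frame.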

慕标5832272

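Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

Input:
apply implicitly passes all the columns for each group as a DataFrame to the custom function,
while transform passes each column for each group individually as a Series to the custom function.

Output:
The custom function passed to apply can return a scalar, a Series, or a DataFrame (or a numpy array or even a list).
The custom function passed to transform must return a sequence (a one-dimensional Series, array, or list) of the same length as the group.

So, transform works on just one Series at a time, and apply works on the entire DataFrame at once.

Inspecting the custom function

It can help quite a bit to inspect the input to the custom function you pass to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],
                       'a':[4,5,1,3], 'b':[6,10,3,11]})
    df

Let's create a simple custom function that prints the type of the implicitly passed object and then raises an error so that execution stops.

    def inspect(x):
        print(type(x))
        raise  # bare raise with no active exception triggers a RuntimeError

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

    df.groupby('State').apply(inspect)

    <class 'pandas.core.frame.DataFrame'>
    <class 'pandas.core.frame.DataFrame'>
    RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed twice. pandas (in the version used here) runs the first group twice. It does this to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform:

    df.groupby('State').transform(inspect)

    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>
    RuntimeError

It is passed a Series, a totally different pandas object.

So, transform is only allowed to work with a single Series at a time. It cannot act on two columns at the same time. If we try to subtract column a from b inside our custom function, we get an error with transform. See below:

    def subtract_two(x):
        return x['a'] - x['b']

    df.groupby('State').transform(subtract_two)

    KeyError: ('a', 'occurred at index a')

We get a KeyError because pandas is attempting to find the Series index a, which does not exist. You can complete this operation with apply, as it has the entire DataFrame:

    df.groupby('State').apply(subtract_two)

    State
    Florida  2   -2
             3   -8
    Texas    0   -2
             1   -5
    dtype: int64

The output is a Series and is a little confusing because the original index is kept, but we have access to all columns.

Displaying the passed pandas object

It can help even more to display the entire pandas object inside the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are nicely rendered as HTML in a Jupyter notebook:

    from IPython.display import display

    def subtract_two(x):
        display(x)
        return x['a'] - x['b']

Transform must return a single-dimensional sequence the same size as the group

The other difference is that transform must return a single-dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised:

    def return_three(x):
        return np.array([1, 2, 3])

    df.groupby('State').transform(return_three)

    ValueError: transform must return a scalar value for each group

The error message does not really describe the problem. You must return a sequence the same length as the group. So, a function like this would work:

    def rand_group_len(x):
        return np.random.rand(len(x))

    df.groupby('State').transform(rand_group_len)

              a         b
    0  0.962070  0.151440
    1  0.440956  0.782176
    2  0.642218  0.483257
    3  0.056047  0.238208

Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

    def group_sum(x):
        return x.sum()

    df.groupby('State').transform(group_sum)

       a   b
    0  9  16
    1  9  16
    2  4  14
    3  4  14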
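The same inspection can be done without raising, by recording the type of every object the custom function receives. This is only a sketch: the helper names `record_apply` and `record_transform` are hypothetical. It also selects the columns explicitly, which is an assumption made here to keep the behavior stable across pandas versions. Some pandas versions may additionally call the transform function once with a whole group DataFrame while probing for a faster code path, so the sketch only checks that a Series was seen.

```python
import pandas as pd

df = pd.DataFrame({"State": ["Texas", "Texas", "Florida", "Florida"],
                   "a": [4, 5, 1, 3], "b": [6, 10, 3, 11]})

seen_by_apply = set()
seen_by_transform = set()

def record_apply(x):
    # Remember what kind of object apply handed us, then reduce to a scalar.
    seen_by_apply.add(type(x).__name__)
    return len(x)

def record_transform(x):
    # Remember what kind of object transform handed us, then return it unchanged.
    seen_by_transform.add(type(x).__name__)
    return x

grouped = df.groupby("State")[["a", "b"]]
grouped.apply(record_apply)
grouped.transform(record_transform)

print("apply received:", seen_by_apply)
print("transform received:", seen_by_transform)
```

Collecting the types in a set avoids depending on how many times pandas decides to call the function for a given group.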

暮色呼如

I am going to use a very simple snippet to illustrate the difference between the two:

    test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
    grouping = test.groupby('id')['price']

The DataFrame looks like this:

       id  price
    0   1      1
    1   2      2
    2   3      3
    3   1      2
    4   2      3
    5   3      1
    6   1      3
    7   2      1
    8   3      2

There are 3 customer IDs in this table. Each customer made three transactions and paid 1, 2, and 3 dollars, one amount per transaction.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

Using apply:

    grouping.min()

The return looks like this:

    id
    1    1
    2    1
    3    1
    Name: price, dtype: int64

    pandas.core.series.Series  # return type
    Int64Index([1, 2, 3], dtype='int64', name='id')  # the returned Series' index
    # length is 3

Using transform:

    grouping.transform(min)

The return looks like this:

    0    1
    1    1
    2    1
    3    1
    4    1
    5    1
    6    1
    7    1
    8    1
    Name: price, dtype: int64

    pandas.core.series.Series  # return type
    RangeIndex(start=0, stop=9, step=1)  # the returned Series' index
    # length is 9

Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer "What is the minimum price paid by each customer?", then the apply method is the more suitable choice.

If you want to answer "What is the difference between the amount paid for each transaction and the minimum payment?", then you want to use transform, because:

    test['minimum'] = grouping.transform(min)  # creates an extra column filled with the minimum payment
    test.price - test.minimum  # returns the difference for each row

Apply does not work here simply because it returns a Series of size 3, while the original df has length 9. You cannot easily integrate it back into the original df.
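To make the length difference concrete, the sketch below computes the per-customer minimum both ways and also shows one manual route for bringing the reduced result back to row level with `Series.map`. The `map` step and the column name `above_minimum` are illustrative additions, not part of the answer above; transform does the same broadcast in a single step.

```python
import pandas as pd

test = pd.DataFrame({"id": [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     "price": [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby("id")["price"]

# Reduction: one value per customer, indexed by id.
per_customer = grouping.min()

# Transform: the same minimum, repeated for every row of the original frame.
per_row = grouping.transform("min")

# Manual broadcast of the reduced result: look up each row's id in per_customer.
mapped = test["id"].map(per_customer)

# Difference between each payment and that customer's minimum payment.
test["above_minimum"] = test["price"] - per_row

print(len(per_customer), len(per_row))
print(test)
```

The transform result shares the original frame's index, which is why it can be assigned or subtracted directly, while the reduced result needs an explicit lookup on `id` first.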