Grouping functions (tapply, by, aggregate) and the *apply family


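Whenever I want to do something "map"py in R, I usually try to use a function from the apply family.

However, I have never quite understood the differences between them: how {sapply, lapply, etc.} apply the function to the input or grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.

Can someone explain which one to use when?

My current (probably incorrect/incomplete) understanding is...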

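  1. sapply(vec, f): input is a vector. Output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output.

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. Output is a vector, where element i is f(row/col i of the matrix).

  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names.

  5. by(dataframe, grouping, f): let g be a grouping. Apply f to each column of the group/data frame. Pretty-print the grouping and the value of f at each column.

  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty-printing the output, aggregate sticks everything into a data frame.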
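One way to check the shapes described in the list above is to run each function on a tiny input and inspect the result. The following is only a minimal sketch: the objects vec and m and the anonymous functions are made up for illustration and are not part of the original question.

```r
vec <- c(1, 2, 3)

# sapply simplifies: scalar results are collected into a vector ...
s1 <- sapply(vec, function(x) x^2)
# ... and fixed-length results into a matrix (one column per input element)
s2 <- sapply(vec, function(x) c(x, x^2))

# lapply always returns a list, one element per input element
l1 <- lapply(vec, function(x) x^2)

# apply works over the margins of a matrix: 1 = rows, 2 = columns
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

# tapply splits a vector by a grouping vector; group labels become the names
t1 <- tapply(c(10, 20, 30, 40), c("a", "b", "a", "b"), mean)

str(s1)
str(s2)
str(l1)
print(row_sums)
print(col_sums)
print(t1)
```

Calling str() on each result is usually the quickest way to see whether you got a vector, a matrix, a list or an array.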

Side question: I still haven't learned plyr or reshape. Would plyr or reshape replace all of these entirely?



UYOU

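Start with Joran's excellent answer; it is doubtful anything can better it.

The following mnemonics may then help you remember the distinctions between the functions. Some are obvious, others may be less so; for those, you will find the justification in Joran's discussion.

Mnemonics

  lapply is a list apply, which acts on a list or vector and returns a list.
  sapply is a simple lapply (the function defaults to returning a vector or matrix when possible).
  vapply is a verified apply (it allows the return object type to be prespecified).
  rapply is a recursive apply for nested lists, i.e. lists within lists.
  tapply is a tagged apply, where the tags identify the subsets.
  apply is generic: it applies a function to a matrix's rows or columns (or, more generally, to the dimensions of an array).

Building the right background

If using the apply family still feels a bit alien to you, it might be that you are missing a key point of view.

These two articles can help. They provide the background needed to motivate the functional programming techniques that the apply family of functions provides.

Users of Lisp will recognise the paradigm immediately. If you are not familiar with Lisp, once you get your head around FP you will have gained a powerful point of view for use in R, and apply will make a lot more sense.

  Advanced R: Functional Programming, by Hadley Wickham
  Simple Functional Programming in R, by Michael Barton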
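To make the mnemonics above concrete, here is a small sketch that runs lapply, sapply, vapply and rapply on toy inputs. The lists x and nested, and the string returned by the error handler, are assumptions made up for this illustration.

```r
x <- list(a = 1:3, b = 4:6)

# lapply: the result is always a list
l <- lapply(x, mean)

# sapply: the same computation, simplified to a named numeric vector
s <- sapply(x, mean)

# vapply: like sapply, but the type and length of each result are declared
v <- vapply(x, mean, FUN.VALUE = numeric(1))

# vapply raises an error if a result does not match the declared template
bad <- tryCatch(
  vapply(x, function(z) as.character(mean(z)), FUN.VALUE = numeric(1)),
  error = function(e) "type mismatch caught"
)

# rapply: recurse into a nested list and apply the function to every leaf
nested <- list(1, list(2, list(3)))
r <- rapply(nested, function(z) z * 10, how = "unlist")

str(l)
print(s)
print(v)
print(bad)
print(r)
```

The vapply check is the main reason to prefer it over sapply in non-interactive code: a surprising return type fails immediately instead of silently changing the shape of the result.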

回首忆惘然

I realized that the (very excellent) answers to this post lack explanations of by and aggregate, so here is my contribution.

by

The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:

    ct <- tapply(iris$Sepal.Width, iris$Species, summary)
    cb <- by(iris$Sepal.Width, iris$Species, summary)

    cb
    iris$Species: setosa
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.300   3.200   3.400   3.428   3.675   4.400
    --------------------------------------------------------------
    iris$Species: versicolor
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.000   2.525   2.800   2.770   3.000   3.400
    --------------------------------------------------------------
    iris$Species: virginica
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.200   2.800   3.000   2.974   3.175   3.800

    ct
    $setosa
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.300   3.200   3.400   3.428   3.675   4.400

    $versicolor
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.000   2.525   2.800   2.770   3.000   3.400

    $virginica
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      2.200   2.800   3.000   2.974   3.175   3.800

If we print these two objects, ct and cb, we "essentially" get the same results. The only differences are in how they are displayed and in their class attributes: by for cb and array for ct.

As I said, the power of by arises when we can't use tapply. The following code is one example:

    tapply(iris, iris$Species, summary)
    Error in tapply(iris, iris$Species, summary) :
      arguments must have same length

R says that the arguments must have the same length. We are asking it to "calculate the summary of all variables in iris along the factor Species", but R can't do that because tapply does not know how to handle a data frame as its first argument.

With the by function, R dispatches a specific method for the data frame class and then lets the summary function work, even though the length (and the type) of the first argument is different.

    bywork <- by(iris, iris$Species, summary)

    bywork
    iris$Species: setosa
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50
     1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0
     Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0
     Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246
     3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300
     Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600
    --------------------------------------------------------------
    iris$Species: versicolor
      Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species
     Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0
     1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50
     Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0
     Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326
     3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500
     Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800
    --------------------------------------------------------------
    iris$Species: virginica
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0
     1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0
     Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50
     Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026
     3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300
     Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

It does indeed work, and the result is very surprising. It is an object of class by that, along Species (that is, for each species), computes the summary of each variable.

Note that if the first argument is a data frame, the function being applied must have a method for that class of object. For example, if we use this code with the mean function, we get a result that makes no sense at all:

    by(iris, iris$Species, mean)
    iris$Species: setosa
    [1] NA
    -------------------------------------------
    iris$Species: versicolor
    [1] NA
    -------------------------------------------
    iris$Species: virginica
    [1] NA
    Warning messages:
    1: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
    2: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
    3: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA

aggregate

aggregate can be seen as another way of using tapply, if we use it like this:

    at <- tapply(iris$Sepal.Length, iris$Species, mean)
    ag <- aggregate(iris$Sepal.Length, list(iris$Species), mean)

    at
        setosa versicolor  virginica
         5.006      5.936      6.588

    ag
         Group.1     x
    1     setosa 5.006
    2 versicolor 5.936
    3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list, while for tapply it can be (but need not be) a list, and that the output of aggregate is a data frame, while that of tapply is an array.

The power of aggregate is that it can easily handle subsets of the data with the subset argument, and that it also has methods for ts objects and for formulas.

These features make aggregate easier to work with than tapply in some situations. Here are some examples (available in the documentation):

    ag <- aggregate(len ~ ., data = ToothGrowth, mean)

    ag
      supp dose   len
    1   OJ  0.5 13.23
    2   VC  0.5  7.98
    3   OJ  1.0 22.70
    4   VC  1.0 16.77
    5   OJ  2.0 26.06
    6   VC  2.0 26.14

We can achieve the same with tapply, but the syntax is slightly harder and the output is (in some circumstances) less readable:

    att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

    att
           OJ    VC
    0.5 13.23  7.98
    1   22.70 16.77
    2   26.06 26.14

There are other times when we can't use by or tapply and have to use aggregate:

    ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

    ag1
      Month    Ozone     Temp
    1     5 23.61538 66.73077
    2     6 29.44444 78.22222
    3     7 59.11538 83.88462
    4     8 59.96154 83.96154
    5     9 31.44828 76.89655

We cannot obtain the previous result with tapply in one call. Instead we have to calculate the mean along Month for each variable and then combine them. Also note that we have to pass na.rm = TRUE, because the formula method of aggregate uses na.action = na.omit by default:

    ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
    ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

    cbind(ta1, ta2)
           ta1      ta2
    5 23.61538 65.54839
    6 29.44444 79.10000
    7 59.11538 83.90323
    8 59.96154 83.96774
    9 31.44828 76.90000

The Temp column still differs slightly from ag1: na.omit drops every row in which either variable is missing, whereas na.rm = TRUE only drops the missing values of the column being averaged.

With by we can't achieve this directly. The following call does not give the group means (most likely because of the supplied function, mean, which has no method for data frames):

    by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)

Other times the results are the same and the differences are only in the class of the object (and therefore in how it is displayed or printed, and also, for example, in how to subset it):

    byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
    aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

The previous code achieves the same goal and results. At some point, which tool to use is just a matter of personal taste and needs; the two objects above have very different requirements when it comes to subsetting.
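As a hedged follow-up to the by(..., mean) call discussed above: one possible workaround, not taken from the original answer, is to supply a function that does accept a data frame, such as colMeans. The names by_means and combined are made up for this sketch.

```r
# colMeans accepts a data frame, so by() can apply it to each monthly subset;
# na.rm = TRUE is passed through to colMeans
by_means <- by(airquality[c("Ozone", "Temp")], airquality$Month,
               colMeans, na.rm = TRUE)

# Each group yields a named vector of length 2; stack them into a matrix
combined <- do.call(rbind, by_means)
print(combined)
```

Because colMeans removes missing values column by column, the numbers should agree with the two tapply calls (ta1 and ta2) rather than with the formula-based aggregate result, which drops incomplete rows first.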