Grouping functions (tapply, by, aggregate) and the *apply family

I noticed that the (very good) answers to this question lack an explanation of by and aggregate, so here is my contribution.

BY

As the documentation states, the by function can be thought of as a "wrapper" for tapply. The power of by shows when we want to compute something that tapply cannot handle. Start with a case where both work:

    ct <- tapply(iris$Sepal.Width , iris$Species , summary )
    cb <- by(iris$Sepal.Width , iris$Species , summary )

    cb
    iris$Species: setosa
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.300   3.200   3.400   3.428   3.675   4.400 
    -------------------------------------------------------------- 
    iris$Species: versicolor
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.000   2.525   2.800   2.770   3.000   3.400 
    -------------------------------------------------------------- 
    iris$Species: virginica
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.200   2.800   3.000   2.974   3.175   3.800 

    ct
    $setosa
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.300   3.200   3.400   3.428   3.675   4.400 

    $versicolor
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.000   2.525   2.800   2.770   3.000   3.400 

    $virginica
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.200   2.800   3.000   2.974   3.175   3.800 

If we print the two objects, ct and cb, we get "essentially" the same results. The only differences are how they are displayed and their class attributes: by for cb and array for ct.

As I said, the power of by shows when we cannot use tapply. The following code is one example:

    tapply(iris, iris$Species, summary )
    Error in tapply(iris, iris$Species, summary) : 
      arguments must have same length

R says that the arguments must have the same length. What we mean is "compute the summary of every variable in iris along the factor Species", but tapply cannot do that because it does not know how to handle a data frame as its first argument.

With by, R dispatches a specific method for the data frame class and then lets summary do its work, even though the first argument differs in length (and in type) from the grouping factor.

    bywork <- by(iris, iris$Species, summary )

    bywork
    iris$Species: setosa
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
     Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
     1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
     Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
     Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
     3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
     Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
    -------------------------------------------------------------- 
    iris$Species: versicolor
      Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
     Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
     1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
     Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
     Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
     3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
     Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
    -------------------------------------------------------------- 
    iris$Species: virginica
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
     Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
     1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
     Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
     Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
     3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
     Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500                  

It does work, and the result is quite striking: an object of class by that, for each level of Species, computes the summary of every variable.

Note that if the first argument is a data frame, the function being applied must have a method for objects of that class. For example, if we use the same code with mean, the result makes no sense at all:

    by(iris, iris$Species, mean)
    iris$Species: setosa
    [1] NA
    ------------------------------------------- 
    iris$Species: versicolor
    [1] NA
    ------------------------------------------- 
    iris$Species: virginica
    [1] NA
    Warning messages:
    1: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
    2: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA
    3: In mean.default(data[x, , drop = FALSE], ...) :
      argument is not numeric or logical: returning NA

AGGREGATE

aggregate can be seen as another way of using tapply, if we call it like this:

    at <- tapply(iris$Sepal.Length , iris$Species , mean)
    ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

    at
        setosa versicolor  virginica 
         5.006      5.936      6.588 

    ag
         Group.1     x
    1     setosa 5.006
    2 versicolor 5.936
    3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list, whereas for tapply it can be (but does not have to be) a list, and that the output of aggregate is a data frame, whereas the output of tapply is an array.

The power of aggregate is that it can easily handle subsets of the data through its subset argument, and that it also has methods for ts objects and for formulas.

These features make aggregate easier to work with than tapply in some situations. Here are some examples (available in the documentation):

    ag <- aggregate(len ~ ., data = ToothGrowth, mean)

    ag
      supp dose   len
    1   OJ  0.5 13.23
    2   VC  0.5  7.98
    3   OJ  1.0 22.70
    4   VC  1.0 16.77
    5   OJ  2.0 26.06
    6   VC  2.0 26.14

We can achieve the same with tapply, but the syntax is slightly harder and the output is (in some circumstances) less readable:

    att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

    att
           OJ    VC
    0.5 13.23  7.98
    1   22.70 16.77
    2   26.06 26.14

There are other times when by or tapply will not do and we have to use aggregate:

    ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

    ag1
      Month    Ozone     Temp
    1     5 23.61538 66.73077
    2     6 29.44444 78.22222
    3     7 59.11538 83.88462
    4     8 59.96154 83.96154
    5     9 31.44828 76.89655

We cannot obtain the previous result with tapply in a single call. Instead we have to compute the mean along Month for each variable and then combine the results. Note also that we have to pass na.rm = TRUE, because the formula method of aggregate uses na.action = na.omit by default:

    ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
    ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

    cbind(ta1, ta2)
           ta1      ta2
    5 23.61538 65.54839
    6 29.44444 79.10000
    7 59.11538 83.90323
    8 59.96154 83.96774
    9 31.44828 76.90000

The Temp column still differs slightly from ag1: na.omit drops every row in which either Ozone or Temp is missing, whereas na.rm = TRUE removes missing values one column at a time.

With by we cannot achieve this at all. The following call does not work, although that is most likely down to the supplied function: as in the iris example above, mean has no method for data frames.

    by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)

Other times the results are essentially the same and the difference lies only in the class of the object, which affects how it is printed and also, for example, how it is subset:

    byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
    aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

Both calls achieve the same goal, and to some extent the choice of tool is a matter of personal taste and needs. The two resulting objects, however, call for very different approaches when it comes to subsetting.
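As a possible workaround for the by call with mean on airquality, one could swap in colMeans, which does accept a data frame. This is only a sketch and not part of the original answer; the names vars, bymeans, bymat, may_by, aggmeans and may_agg are made up for illustration. It also hints at how differently the two kinds of result are subset.

```r
# Assumption: colMeans() is substituted for mean(); unlike mean(), it
# accepts a data frame, so by() can summarise each monthly subset.
vars <- airquality[c("Ozone", "Temp")]
bymeans <- by(vars, airquality$Month, colMeans, na.rm = TRUE)

# A "by" object behaves like a list indexed by group label.
may_by <- bymeans[["5"]]

# Stack the per-month vectors into a matrix, one row per month.
bymat <- do.call(rbind, bymeans)

# The aggregate() result is a data frame, so rows are selected
# with a logical condition on the grouping column instead.
aggmeans <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
may_agg <- aggmeans[aggmeans$Month == 5, ]
```

Because na.rm = TRUE drops missing values one column at a time, bymat should agree with the column-wise tapply results rather than with the aggregate formula output, which discards incomplete rows first.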
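The subset argument of the formula method of aggregate is mentioned above but not demonstrated. A minimal sketch, assuming we only want the ToothGrowth rows with a dose above 0.5 (the threshold and the names ag_sub, keep and tp_sub are arbitrary choices for illustration):

```r
# aggregate() filters the rows itself through `subset`, which is
# evaluated inside `data`, so `dose` needs no ToothGrowth$ prefix.
ag_sub <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean,
                    subset = dose > 0.5)

# The tapply() equivalent requires filtering both vectors by hand.
keep <- ToothGrowth$dose > 0.5
tp_sub <- tapply(ToothGrowth$len[keep], ToothGrowth$supp[keep], mean)
```

Both objects hold the same group means; ag_sub is a data frame with supp and len columns, while tp_sub is a named array.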
