How to sum a variable by group

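Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second contains numbers representing how many times I saw each category.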


For example:


Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum the Frequency values:


Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?


冉冉说
5 Answers

慕哥6287543

Using aggregate:

    aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
      Category  x
    1    First 30
    2   Second  5
    3    Third 34

In the example above, multiple grouping dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined via cbind:

    aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

(embedding @thelatemail's comment) aggregate also has a formula interface:

    aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you can use the . notation (it works for a single column too):

    aggregate(. ~ Category, x, sum)

Or tapply:

    tapply(x$Frequency, x$Category, FUN=sum)
     First Second  Third
        30      5     34

Using this data:

    x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")),
                    Frequency=c(10,15,5,2,14,20,3))
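The cbind call above is truncated in the answer, so here is a minimal runnable sketch of the same idea using the formula interface. The Metric2 column and its values are made up for illustration; they are not part of the original data.

```r
# Same data as in the answer, plus a hypothetical second numeric column (Metric2).
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)   # illustrative values only
)

# cbind() on the left-hand side of the formula sums both columns per Category.
res <- aggregate(cbind(Frequency, Metric2) ~ Category, data = x, FUN = sum)
print(res)
```

With the formula interface the result columns keep their original names, whereas the by=list(...) form names the summed column x unless you name it yourself.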

慕神8447489

More recently, you can also use the dplyr package for this purpose:

    library(dplyr)
    x %>%
      group_by(Category) %>%
      summarise(Frequency = sum(Frequency))

    #Source: local data frame [3 x 2]
    #
    #  Category Frequency
    #1    First        30
    #2   Second         5
    #3    Third        34

Or, for multiple summary columns (it works with one column too):

    x %>%
      group_by(Category) %>%
      summarise_each(funs(sum))

Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.

Or, if you have multiple columns to group by, you can list all of them in group_by, separated by commas:

    mtcars %>%
      group_by(cyl, gear) %>%                            # multiple group columns
      summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

For more information, including on the %>% operator, see the introduction to dplyr.
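As a side note that is not part of the original answer: in dplyr 1.0.0 and later, the usual way to summarise every non-grouping column is across(). A minimal sketch, assuming a recent dplyr is installed and using the same data as the question:

```r
library(dplyr)

x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# across(everything(), sum) applies sum() to every column except the grouping column.
res <- x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))

print(res)
```

summarise() returns one row per group, ordered by the grouping column, so the result also satisfies the "sort by category" part of the question.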

慕工程0101907

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need better performance, there is a faster alternative:

    library(data.table)
    data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
                      Frequency=c(10,15,5,2,14,20,3))
    data[, sum(Frequency), by = Category]
    #    Category V1
    # 1:    First 30
    # 2:   Second  5
    # 3:    Third 34
    system.time(data[, sum(Frequency), by = Category] )
    # user    system   elapsed
    # 0.008     0.001     0.009

Let's compare that with the same operation using a data.frame and the aggregate approach above:

    data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                      Frequency=c(10,15,5,2,14,20,3))
    system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
    # user    system   elapsed
    # 0.008     0.000     0.015

And if you want to keep the column name, this is the syntax:

    data[,list(Frequency=sum(Frequency)),by=Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34

The difference becomes more noticeable with larger datasets, as the code below demonstrates:

    data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                      Frequency=rnorm(100000))
    system.time( data[,sum(Frequency),by=Category] )
    # user    system   elapsed
    # 0.055     0.004     0.059
    data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
                      Frequency=rnorm(100000))
    system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
    # user    system   elapsed
    # 0.287     0.010     0.296

For multiple aggregations, you can combine lapply and .SD as follows:

    data[, lapply(.SD, sum), by = Category]
    #    Category Frequency
    # 1:    First        30
    # 2:   Second         5
    # 3:    Third        34
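Since the question also asks for the result sorted by category, here is a small sketch (an addition, not from the original answer) that uses keyby instead of by, which orders the result by the grouping column. The variable name dt and the extra N column are illustrative choices.

```r
library(data.table)

dt <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .() is an alias for list(); .N is the number of rows in each group.
# keyby groups like by, and additionally sorts the result by Category.
res <- dt[, .(Frequency = sum(Frequency), N = .N), keyby = Category]
print(res)
```

With plain by, groups appear in order of first occurrence, which happens to match the sorted order for this data but would not in general.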

明月笑刀无情

Several years later, just to add another simple base R solution that for some reason isn't present here: xtabs

    xtabs(Frequency ~ Category, df)
    # Category
    # First Second  Third
    #    30      5     34

Or, if you want a data.frame back:

    as.data.frame(xtabs(Frequency ~ Category, df))
    #   Category Freq
    # 1    First   30
    # 2   Second    5
    # 3    Third   34
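The answer does not define df, so the following self-contained sketch assumes it holds the data from the question. It uses base R only.

```r
# Assumption: df is the question's data; the answer itself does not define it.
df <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# xtabs() sums the left-hand side of the formula within each level of Category.
tab <- xtabs(Frequency ~ Category, df)

# Converting the table back to a data.frame names the summed column "Freq".
out <- as.data.frame(tab)
print(out)
```

Note that the summed column comes back as Freq rather than Frequency, so rename it if later code expects the original name.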