R: Count consecutive occurrences of values in a single column

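I would like to create a sequential number within each run of equal values, i.e. an occurrence counter that restarts as soon as the value in the current row differs from the value in the previous row.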


Please find below an example of the input and the expected output.


dataset <- data.frame(input = c("a","b","b","a","a","c","a","a","a","a","b","c"))
dataset$counter <- c(1,1,2,1,2,1,1,2,3,4,1,1)
dataset
#    input counter
# 1      a       1
# 2      b       1
# 3      b       2
# 4      a       1
# 5      a       2
# 6      c       1
# 7      a       1
# 8      a       2
# 9      a       3
# 10     a       4
# 11     b       1
# 12     c       1

My question is very similar to this one: Cumulative sequence of occurrences of values.


慕莱坞森
3 answers

扬帆大鱼

You need to use sequence and rle:

> sequence(rle(as.character(dataset$input))$lengths)
 [1] 1 1 2 1 2 1 1 2 3 4 1 1
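To see why this works, it can help to look at the intermediate pieces. The following is a minimal sketch in base R only; the variable names r and counter are chosen for illustration and do not come from the answer.

```r
# Example input from the question, as a plain character vector
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# rle() collapses the vector into runs of equal neighbouring values
r <- rle(x)
r$values   # the value of each run
r$lengths  # the length of each run

# sequence() builds 1..n for every n in lengths and concatenates the pieces,
# which is exactly a counter that restarts at the start of each run
counter <- sequence(r$lengths)
counter
```

The as.character() call in the answer matters when the column is a factor, because rle() only accepts atomic vectors and stops with an error on a factor.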

不负相思意

And from v1.9.8 (NEWS item 16), using rowid with rleid:

# setDT() converts the data.frame from the question to a data.table by reference
setDT(dataset)[, counter := rowid(rleid(input))]

Timing code:

set.seed(1L)
library(data.table)
DT <- data.table(input=sample(letters, 1e6, TRUE))
DT1 <- copy(DT)
bench::mark(DT[, counter := seq_len(.N), by=rleid(input)],
    DT1[, counter := rowid(rleid(input))])

Timings:

  expression                                             min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                                             <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 DT[, `:=`(counter, seq_len(.N)), by = rleid(input)]    613.8ms 613.8ms      1.63    18.8MB     8.15     1     5      614ms
2 DT1[, `:=`(counter, rowid(rleid(input)))]               60.5ms  71.4ms     12.7     26.4MB    14.5      7     8      553ms

An efficient and more direct version of the function written below is now available in the data.table package under the name rleid. Using it, this is just:

setDT(dataset)[, counter := seq_len(.N), by=rleid(input)]

See ?rleid for more on usage and examples. Thanks to @Henrik for the suggestion to update this post.

rle is definitely the most convenient way to do it (+1 @Ananda). On larger data, however, one can do better in terms of speed. You can use the duplist and vecseq functions (not exported) from data.table as follows:

require(data.table)
arun <- function(y) {
    w = data.table:::duplist(list(y))
    w = c(diff(w), length(y)-tail(w,1L)+1L)
    data.table:::vecseq(rep(1L, length(w)), w, length(y))
}

x <- c("a","b","b","a","a","c","a","a","a","a","b","c")
arun(x)
# [1] 1 1 2 1 2 1 1 2 3 4 1 1

Benchmarking on big data:

set.seed(1)
x <- sample(letters, 1e6, TRUE)
# rle solution
ananda <- function(y) {
    sequence(rle(y)$lengths)
}

require(microbenchmark)
microbenchmark(a1 <- arun(x), a2<-ananda(x), times=100)
Unit: milliseconds
            expr       min        lq    median       uq       max neval
   a1 <- arun(x)  123.2827  132.6777  163.3844  185.439  563.5825   100
 a2 <- ananda(x) 1382.1752 1899.2517 2066.4185 2247.233 3764.0040   100

identical(a1, a2) # [1] TRUE
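The grouping idea behind by=rleid(input) can also be imitated without data.table. The sketch below is a base R analogue, not data.table's implementation: is_start and run_id are hypothetical helper names, and the code assumes a non-empty vector without NA values.

```r
# Example input from the question
x <- c("a","b","b","a","a","c","a","a","a","a","b","c")

# TRUE wherever a value differs from its predecessor; the first element
# always starts a run (assumes length(x) >= 1 and no NA values)
is_start <- c(TRUE, x[-1] != x[-length(x)])

# The running total of run starts gives one integer id per run,
# playing the role that rleid(input) plays in the answer
run_id <- cumsum(is_start)

# Number the positions within each run, like seq_len(.N) by group
counter <- ave(seq_along(x), run_id, FUN = seq_along)
counter
```

This is mainly useful for understanding what the run ids look like; no claim is made here about how it compares in speed to the approaches benchmarked above.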

蝴蝶不菲

The runner package has a dedicated solution for exactly what is needed. streak_run is the fastest solution and accepts a vector as input.

library(microbenchmark); library(runner)
x <- sample(letters, 1e6, TRUE)
ananda <- function(y) sequence(rle(y)$lengths)

# Result name and number of repetitions match the recorded output below
microbenchmark(a2 <- ananda(x), run <- streak_run(x), times=10)
#Unit: milliseconds
#                 expr     min      lq     mean  median       uq      max neval
#      a2 <- ananda(x) 580.744 718.117 1059.676 944.073 1399.649 1699.293    10
# run <- streak_run(x)  37.682  39.568   42.277  40.591   43.947   52.917    10

identical(a2, run)
#[1] TRUE