Why is the parallel package slower than using apply?

I am trying to determine when to use the parallel package to speed up the time required to run some analyses. One of the things I need to do is build matrices comparing variables in two data frames with different numbers of rows. I asked a question about an efficient way of doing this on StackOverflow and wrote about the tests on my blog. Since I am comfortable with the best approach, I wanted to speed the process up further by running it in parallel. The results below are based on a 2 GHz i7 Mac with 8 GB of RAM. What surprises me is that parSapply from the parallel package is actually worse than using the apply function. The code to replicate this is below. Note that I am currently only using one of the two columns I create, but eventually want to use both.

Execution time plot: http://jason.bryer.org/images/ParalleVsApplyTiming.png

require(parallel)
require(ggplot2)
require(reshape2)
set.seed(2112)
results <- list()
sizes <- seq(1000, 30000, by=5000)
pb <- txtProgressBar(min=0, max=length(sizes), style=3)
for(cnt in 1:length(sizes)) {
    i <- sizes[cnt]
    df1 <- data.frame(row.names=1:i, 
                      var1=sample(c(TRUE,FALSE), i, replace=TRUE), 
                      var2=sample(1:10, i, replace=TRUE) )
    df2 <- data.frame(row.names=(i + 1):(i + i), 
                      var1=sample(c(TRUE,FALSE), i, replace=TRUE),
                      var2=sample(1:10, i, replace=TRUE))
    tm1 <- system.time({
        df6 <- sapply(df2$var1, FUN=function(x) { x == df1$var1 })
        dimnames(df6) <- list(row.names(df1), row.names(df2))
    })
    rm(df6)
    tm2 <- system.time({
        cl <- makeCluster(getOption('cl.cores', detectCores()))
        tm3 <- system.time({
            df7 <- parSapply(cl, df1$var1, FUN=function(x, df2) { x == df2$var1 }, df2=df2)
            dimnames(df7) <- list(row.names(df1), row.names(df2))
        })
        stopCluster(cl)
    })
    rm(df7)
    results[[cnt]] <- c(apply=tm1, parallel.total=tm2, parallel.exec=tm3)
    setTxtProgressBar(pb, cnt)
}
toplot <- as.data.frame(results)[,c('apply.user.self','parallel.total.user.self',
                                    'parallel.exec.user.self')]
toplot$size <- sizes
toplot <- melt(toplot, id='size')
ggplot(toplot, aes(x=size, y=value, colour=variable)) + geom_line() + 
    xlab('Vector Size') + ylab('Time (seconds)')
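
As an aside that is not part of the original post: the same comparison matrix can also be built with a single vectorized call to outer(), which is a useful sequential baseline because it shows how little work each individual comparison involves. The name df8 is introduced here only for illustration, and the sketch assumes df1 and df2 as created inside the loop above.

# Not from the original post: a vectorized sequential baseline for the same matrix.
# outer() applies "==" to every pair of elements from df1$var1 and df2$var1 at once.
df8 <- outer(df1$var1, df2$var1, "==")
dimnames(df8) <- list(row.names(df1), row.names(df2))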


紫衣仙女
3 Answers

翻翻过去那场雪

Running jobs in parallel incurs overhead. Parallelization only improves overall performance when the jobs you fire off to the worker nodes take a significant amount of time. When the individual jobs take only milliseconds, the overhead of constantly dispatching jobs degrades overall performance. The trick is to divide the work over the nodes in such a way that the jobs are sufficiently long, say at least a few seconds. I used this to run six Fortran models simultaneously, but those individual model runs took hours, which almost eliminated the effect of the overhead. Note that I haven't run your example, but the situation I describe above is usually the issue when parallelization takes longer than running sequentially.
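
To make that point concrete, here is a minimal sketch that is not part of the answer above, using only the parallel package. The vector x and the function slow() are made up for illustration, and the timings in the comments are rough expectations on a multi-core machine, not measurements.

library(parallel)
cl <- makeCluster(detectCores())

x <- 1:2000

# Sub-millisecond jobs: dispatching them costs far more than the work itself,
# so this is typically much slower than the plain sapply() below.
system.time(r1 <- clusterApply(cl, x, function(i) i^2))
system.time(r2 <- sapply(x, function(i) i^2))

# Jobs of a few seconds each, as the answer suggests: the overhead becomes
# negligible and the parallel version wins.
slow <- function(i) { Sys.sleep(2); i^2 }
system.time(r3 <- parLapply(cl, 1:8, slow))   # roughly 2-4 seconds with 4+ workers
system.time(r4 <- lapply(1:8, slow))          # roughly 16 seconds

stopCluster(cl)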

当年话下

These differences can be attributed to 1) communication overhead (especially if you run across nodes) and 2) performance overhead (for example, your job is not that intensive compared to the cost of initiating the parallelization). Usually, if the task you are parallelizing is not that time-consuming, you will mostly find that parallelization does NOT have much of an effect (which is hugely visible on large data sets). Even though this may not directly answer your benchmarking, I hope it is rather straightforward and can be related to. As an example, here I construct a data.frame with 1e6 rows, with 1e4 unique entries in column group and some values in column val. Then I run plyr in parallel using doMC and without parallelization.

df <- data.frame(group = as.factor(sample(1:1e4, 1e6, replace = T)),
                 val = sample(1:10, 1e6, replace = T))

> head(df)
#   group val
# 1  8498   8
# 2  5253   6
# 3  1495   1
# 4  7362   9
# 5  2344   6
# 6  5602   9

> dim(df)
# [1] 1000000       2

require(plyr)
require(doMC)
registerDoMC(20) # 20 processors

# parallelisation using doMC + plyr
P.PLYR <- function() {
    o1 <- ddply(df, .(group), function(x) sum(x$val), .parallel = TRUE)
}

# no parallelisation
PLYR <- function() {
    o2 <- ddply(df, .(group), function(x) sum(x$val), .parallel = FALSE)
}

require(rbenchmark)
benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")

      test replications elapsed relative user.self sys.self user.child sys.child
2   PLYR()            2   8.925    1.000     8.865    0.068      0.000     0.000
1 P.PLYR()            2  30.637    3.433    15.841   13.945      8.944    38.858

As you can see, the parallel version of plyr runs 3.5 times slower.

Now, let me use the same data.frame, but instead of computing sum, let me construct a more demanding function, say, median(.) * median(rnorm(1e4)) (meaningless, yes). You'll see the tide beginning to shift:

# parallelisation using doMC + plyr
P.PLYR <- function() {
    o1 <- ddply(df, .(group), function(x)
        median(x$val) * median(rnorm(1e4)), .parallel = TRUE)
}

# no parallelisation
PLYR <- function() {
    o2 <- ddply(df, .(group), function(x)
        median(x$val) * median(rnorm(1e4)), .parallel = FALSE)
}

> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")
      test replications elapsed relative user.self sys.self user.child sys.child
1 P.PLYR()            2  41.911    1.000    15.265   15.369    141.585    34.254
2   PLYR()            2  73.417    1.752    73.372    0.052      0.000     0.000

Here, the parallel version is 1.752 times faster than the non-parallel version.

Edit: Following @Paul's comment, I just implemented a small delay using Sys.sleep(). Of course the result is obvious, but just for the sake of completeness, here is the result on a 20 * 2 data.frame:

df <- data.frame(group=sample(letters[1:5], 20, replace=T), val=sample(20))

# parallelisation using doMC + plyr
P.PLYR <- function() {
    o1 <- ddply(df, .(group), function(x) {
        Sys.sleep(2)
        median(x$val)
    }, .parallel = TRUE)
}

# no parallelisation
PLYR <- function() {
    o2 <- ddply(df, .(group), function(x) {
        Sys.sleep(2)
        median(x$val)
    }, .parallel = FALSE)
}

> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")
#       test replications elapsed relative user.self sys.self user.child sys.child
# 1 P.PLYR()            2   4.116    1.000     0.056    0.056      0.024      0.04
# 2   PLYR()            2  20.050    4.871     0.028    0.000      0.000      0.00

The difference here is not surprising.
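
Relating point (1), communication overhead, back to the original parSapply benchmark: below is a minimal sketch that is not from this answer. The vectors a and b are stand-ins for df1$var1 and df2$var1. Both calls perform the same comparisons; they differ only in how much data each worker has to serialise back to the master, which for cheap element-wise work like this is typically what dominates the timing.

library(parallel)
a  <- sample(c(TRUE, FALSE), 5000, replace = TRUE)   # stands in for df1$var1
b  <- sample(c(TRUE, FALSE), 5000, replace = TRUE)   # stands in for df2$var1
cl <- makeCluster(detectCores())

# Each job returns a length-5000 logical vector, so the whole 5000 x 5000 result
# matrix has to travel back from the workers.
system.time(m <- parSapply(cl, a, function(x, v) x == v, v = b))

# The same comparisons, but each job returns a single number, so almost no data
# travels back and the call is usually much cheaper.
system.time(s <- parSapply(cl, a, function(x, v) sum(x == v), v = b))

stopCluster(cl)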