plyr speed in comparison to aggregate, doBy, sqldf

Learning R Blog

Feb 24, 2010, 5:49:40 AM
to manipulatr
Marek Janad posted a couple of test cases on the R-help mailing list
(https://stat.ethz.ch/pipermail/r-help/2009-December/221513.html)
comparing the performance of sqldf, doBy, and aggregate. In his
scenario, sqldf was the fastest.

I was interested to see how plyr (ddply) would perform in comparison,
and was surprised to find that it was considerably slower than any of
the other approaches (see below for the code).

> rsqldf
user system elapsed
2.03 0.92 2.98
> rdoby
user system elapsed
14.42 0.05 14.50
> raggregate
user system elapsed
75.52 0.61 76.30
> rplyr
user system elapsed
261.44 1.31 264.13

Now I am wondering: is there anything that can or should be done to
the data before passing it to ddply that would speed up the
aggregation?

--
http://learnr.wordpress.com

library(sqldf)
library(doBy)
library(plyr)

n <- 100000
grp1 <- sample(1:750, n, replace = TRUE)
grp2 <- sample(1:750, n, replace = TRUE)
d <- data.frame(x = rnorm(n), y = rnorm(n), grp1 = grp1, grp2 = grp2)

# sqldf
rsqldf <- system.time(sqldf("select grp1, grp2, avg(x), avg(y) from d
group by grp1, grp2"))

# doBy
rdoby <- system.time(summaryBy(x + y ~ grp1 + grp2, data = d, FUN = mean))

# aggregate
raggregate <- system.time(aggregate(d, list(d$grp1, d$grp2), mean))

# plyr
rplyr <- system.time(ddply(d, .(grp1, grp2), summarise,
  avx = mean(x), avy = mean(y)))


rsqldf
rdoby
raggregate
rplyr
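
As a rough point of comparison (a sketch of my own, not part of
Marek's original benchmark), the two grouping variables can be
collapsed into a single key and aggregated with base tapply, which
works on plain vectors:

# Sketch, not part of the original benchmark: build one combined
# grouping key, then aggregate each column with tapply.
rtapply <- system.time({
  key <- interaction(d$grp1, d$grp2, drop = TRUE)
  cbind(avx = tapply(d$x, key, mean), avy = tapply(d$y, key, mean))
})
rtapply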

hadley wickham

Feb 24, 2010, 10:30:49 AM
to Learning R Blog, manipulatr
In this case, a lot of the overhead is in the data frames that
summarise is creating:

rplyr2 <- system.time(ddply(d, .(grp1, grp2), function(df) c(avx = mean(df$x),
  avy = mean(df$y))))

This is about 40% faster on my computer.

Looking at where the time is being spent:

library(profr)
p <- profr(ddply(d, .(grp1, grp2), function(df) c(avx = mean(df$x),
  avy = mean(df$y))))
plot(p)

This indicated two things. First, I'd made a silly mistake and was
forcing computation of all the indices up front (now fixed). Beyond
that, it looks like the majority of the time is being spent in
[.data.frame. I think aggregate and doBy avoid this overhead by only
indexing vectors, but I don't see how to do that in the context of
ddply (not without some kind of compilation step, which would require
a lot more information about what the summary function does).
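
To make that concrete, here is a minimal sketch (an illustration, not
a measurement from this thread) of the gap between repeated
data-frame subsetting and subsetting the underlying vectors directly:

# Sketch: [.data.frame does method dispatch and attribute handling on
# every call, which subsetting plain vectors avoids.
i <- sample(nrow(d), 100)
system.time(for (k in 1:1000) d[i, ])                        # via [.data.frame
system.time(for (k in 1:1000) list(x = d$x[i], y = d$y[i]))  # plain vectors

The vector version is typically far faster, which is roughly the
overhead being described above.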

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
