# Speed of ddply in comparison to ave for factors with a lot of levels


### Paul Hiemstra

Aug 4, 2011, 4:03:57 AM
to manipulatr
Dear list,

The piece of code at the end of this mail constructs a ggplot2 plot
showing that for factors with many levels (> 200) the performance of
ddply is much worse than that of ave. Beyond roughly 1000 factor
levels the difference becomes very large. My question is what causes
this, and whether ddply could be improved based on how ave works.
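For context, base R's `ave` is essentially a thin wrapper around split/lapply that replaces each group's values in place and returns a vector the same length as its input (unlike ddply, which returns one row per group). A simplified sketch, paraphrased from the base R source:

```r
# Roughly what ave(value, cat) does (simplified from base R's ave):
ave_sketch <- function(x, g, FUN = mean) {
  # replace each group's values by FUN applied to that group
  split(x, g) <- lapply(split(x, g), FUN)
  x  # same length as the input, one split/replace pass overall
}
```

Because the whole operation is one split and one replacement, its cost grows only slowly with the number of levels.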

regards,
Paul

PS: My sessionInfo() is listed at the end of this mail.

Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494

http://intamap.geo.uu.nl/~paul

```r
library(plyr)
library(reshape)
library(ggplot2)
theme_set(theme_bw())
datsize = c(10e4, 10e5)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))

  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
  return(data.frame(tave = t1[3], tddply = t2[3]))
}, .progress = 'text')

ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) +
  geom_line()
```

```
> res
   datsize noClasses  tave tddply
1    1e+05        10 0.088  0.028
2    1e+05        50 0.098  0.040
3    1e+05       100 0.101  0.055
4    1e+05       200 0.105  0.091
5    1e+05       500 0.109  0.189
6    1e+05      1000 0.117  0.352
7    1e+05      2500 0.141  0.843
8    1e+05     10000 0.252  3.331
9    1e+06        10 0.878  0.260
10   1e+06        50 0.986  0.377
11   1e+06       100 1.009  0.444
12   1e+06       200 1.049  0.667
13   1e+06       500 1.056  1.257
14   1e+06      1000 1.081  2.276
15   1e+06      2500 1.137  5.246
16   1e+06     10000 1.286 20.367
```

```
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ggplot2_0.8.9  proto_0.3-8    reshape_0.8.4  plyr_1.5.2     gstat_0.9-76
[6] sp_0.9-77      fortunes_1.4-1

loaded via a namespace (and not attached):
[1] digest_0.4.2    lattice_0.19-23 tools_2.13.0
```

### Hadley Wickham

Aug 4, 2011, 8:22:41 AM
to Paul Hiemstra, manipulatr
This is a drawback of the way that ddply always works with data
frames. It will be a bit faster if you use summarise instead of
data.frame (because data.frame is very slow), but I'm still thinking
about how to overcome this fundamental limitation of the ddply
approach.
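A sketch of the summarise variant on a small synthetic data frame (the sample data and output column name `value` are my choices, not from the thread):

```r
library(plyr)

# synthetic data in the same shape as Paul's expdata
expdata <- data.frame(value = runif(1e5),
                      cat = round(runif(1e5, min = 0, max = 1000)))

# summarise builds each per-group result row directly, avoiding the
# comparatively slow data.frame() constructor inside every group
res2 <- ddply(expdata, .(cat), summarise, value = mean(value))
```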

I have submitted a grant to google to fund "plyr2", which will take a
somewhat different approach and push off all computation to other,
faster, data stores like data.table and SQL databases.


--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University

### Paul Hiemstra

Aug 4, 2011, 8:44:00 AM
to manipulatr
I looked into data.table just now and added it to the list in my code
example:

```r
library(plyr)
library(reshape)
library(ggplot2)
library(data.table)
theme_set(theme_bw())
datsize = c(10e4, 10e5)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)  # was missing from the original snippet

  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
  # note: sum() here, where ave and ddply compute mean()
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res

ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) +
  geom_line()
```

It is easily the fastest option:

```
> res
   datsize noClasses  tave tddply tdata.table
1    1e+05        10 0.091  0.035       0.011
2    1e+05        50 0.102  0.050       0.012
3    1e+05       100 0.105  0.065       0.012
4    1e+05       200 0.109  0.101       0.010
5    1e+05       500 0.113  0.248       0.012
6    1e+05      1000 0.123  0.438       0.012
7    1e+05      2500 0.146  0.956       0.013
8    1e+05     10000 0.251  3.525       0.020
9    1e+06        10 0.905  0.393       0.101
10   1e+06        50 1.003  0.473       0.100
11   1e+06       100 1.036  0.579       0.105
12   1e+06       200 1.052  0.826       0.106
13   1e+06       500 1.079  1.508       0.109
14   1e+06      1000 1.092  2.652       0.111
15   1e+06      2500 1.167  6.051       0.117
16   1e+06     10000 1.338 23.224       0.132
```

Having plyr use these kinds of packages would be awesome!
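As a side note, grouping may be faster still when the table is keyed, since data.table can then use its sorted index rather than ordering the groups on the fly. A minimal sketch (setkey is part of the data.table API; the sample data is my own):

```r
library(data.table)

expdataDT <- data.table(value = runif(1e5),
                        cat = round(runif(1e5, min = 0, max = 1000)))
setkey(expdataDT, cat)  # sort and index the table by the grouping column

# grouped aggregation over the keyed column
res <- expdataDT[, mean(value), by = cat]
```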

Paul

### Stavros Macrakis

Aug 4, 2011, 2:52:20 PM
On Thu, Aug 4, 2011 at 08:22, Hadley Wickham wrote:

> ...I have submitted a grant to google to fund "plyr2", which will take a
> somewhat different approach and push off all computation to other,
> faster, data stores like data.table and SQL databases.

Is there anything we on the mailing list can do to support your grant application?

-s

### Hadley Wickham

Aug 4, 2011, 3:03:03 PM
to Stavros Macrakis, manipulatr
>> ...I have submitted a grant to google to fund "plyr2", which will take a
>> somewhat different approach and push off all computation to other,
>> faster, data stores like data.table and SQL databases.
>
> Is there anything we on the mailing list can do to support your grant
> application?

Unless you work at Google, I don't think so at the moment.