Dear list,
The code at the end of this mail builds a ggplot2 plot of timings
for ddply and ave. For factors with few levels ddply is the faster of
the two, but from roughly 200 levels upwards it becomes slower than
ave, and beyond 1000 levels the gap is very large: with 1e6 rows and
10000 levels, ddply took about 20 s against about 1.3 s for ave. My
question is what causes this, and whether ddply could be improved
based on how ave works.
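To make explicit what is being compared, here is a minimal sketch in
base R only. The toy data size and the names group_mean_ave and
group_mean_manual are made up for the illustration, and the
split/lapply/unsplit version is only my reading of what ave computes,
not a statement about how it is implemented. Note also that ave
returns one value per input row, whereas ddply(expdata, .(cat), mean)
returns one row per group, so the two timed calls do not produce the
same shape of result.

```r
# Small illustrative data set (size and names are assumptions for this sketch)
set.seed(1)
expdata <- data.frame(value = runif(1000),
                      cat   = sample(1:20, 1000, replace = TRUE))

# ave() returns one value per input row: the group mean, repeated
# for every row that belongs to that group
group_mean_ave <- with(expdata, ave(value, cat))

# Roughly the same computation written out by hand: split the vector
# by group, replace each piece by its mean, and put the pieces back
# in the original row order
pieces <- split(expdata$value, expdata$cat)
pieces <- lapply(pieces, function(v) rep(mean(v), length(v)))
group_mean_manual <- unsplit(pieces, expdata$cat)

# Both approaches should agree
stopifnot(isTRUE(all.equal(group_mean_ave, group_mean_manual)))

# A plyr call with the same one-value-per-row output would be something
# like the following (requires plyr, not run here; rows come back
# ordered by group rather than in the original order):
#   ddply(expdata, .(cat), transform, group_mean = mean(value))
```

The hand-written version works on a plain numeric vector, while ddply
hands a data frame to the function for every group, which may be part
of the difference, although I have not verified this.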
regards,
Paul
PS: My sessionInfo() output is listed at the end of this mail.
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
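library(ggplot2)
library(plyr)     # ddply
library(reshape)  # melt
theme_set(theme_bw())

# Data sizes (1e5 and 1e6 rows) and numbers of factor levels to test
datsize = c(1e5, 1e6)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 1e4)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# For each combination, time ave and ddply on the same random data
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  # Relies on mean() accepting a data frame, as it does in R 2.13
  t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
  # Element 3 of system.time() is the elapsed time in seconds
  return(data.frame(tave = t1[3], tddply = t2[3]))
}, .progress = 'text')

# Log elapsed time against number of levels, one panel per data size
ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize", "noClasses"))) +
  facet_wrap(~ datsize) +
  geom_line()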
> res
   datsize noClasses  tave tddply
1    1e+05        10 0.088  0.028
2    1e+05        50 0.098  0.040
3    1e+05       100 0.101  0.055
4    1e+05       200 0.105  0.091
5    1e+05       500 0.109  0.189
6    1e+05      1000 0.117  0.352
7    1e+05      2500 0.141  0.843
8    1e+05     10000 0.252  3.331
9    1e+06        10 0.878  0.260
10   1e+06        50 0.986  0.377
11   1e+06       100 1.009  0.444
12   1e+06       200 1.049  0.667
13   1e+06       500 1.056  1.257
14   1e+06      1000 1.081  2.276
15   1e+06      2500 1.137  5.246
16   1e+06     10000 1.286 20.367
> sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggplot2_0.8.9 proto_0.3-8 reshape_0.8.4 plyr_1.5.2 gstat_0.9-76
[6] sp_0.9-77 fortunes_1.4-1
loaded via a namespace (and not attached):
[1] digest_0.4.2 lattice_0.19-23 tools_2.13.0