This is a drawback of the way that ddply always works with data
frames, and the overhead grows with the number of groups. It will be
a bit faster if you use summarise instead of data.frame (because
data.frame is very slow), but I'm still thinking about how to
overcome this fundamental limitation of the ddply approach.
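As a rough illustration of the summarise suggestion, here is a minimal sketch. It assumes plyr is installed and uses a smaller version of the expdata frame from the quoted benchmark. The sizes, the seed, and the names by_summarise, by_dataframe and base_means are made up for the example. It only shows that the two ddply calls give the same per-group means; it does not claim any particular speed-up.

```r
# Sketch only: expdata mirrors the data frame in the quoted benchmark,
# but is kept small so the example runs quickly.
library(plyr)

set.seed(1)
n <- 1e4
expdata <- data.frame(value = runif(n),
                      cat = round(runif(n, min = 0, max = 100)))

# Per-group mean via ddply + summarise: one row per level of cat.
by_summarise <- ddply(expdata, .(cat), summarise, value = mean(value))

# Same quantity via a function that builds a data.frame for every group,
# which is the pattern described above as slow.
by_dataframe <- ddply(expdata, .(cat),
                      function(d) data.frame(value = mean(d$value)))

# Base R reference for the same group means.
base_means <- with(expdata, tapply(value, cat, mean))
```

To compare timings, wrap either ddply call in system.time(), as the quoted benchmark does.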
I have submitted a grant proposal to Google to fund "plyr2", which
will take a somewhat different approach and push all computation off
to other, faster data stores such as data.table and SQL databases.
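For comparison, this is roughly what the same per-group mean looks like when computed directly in data.table. It is only a sketch of the kind of data store mentioned above, not of any plyr2 interface. It assumes the data.table package is installed, and the sizes, the seed, and the name group_means are invented for the example.

```r
# Sketch only: grouped mean computed directly in data.table.
library(data.table)

set.seed(1)
n <- 1e4
expdata <- data.table(value = runif(n),
                      cat = round(runif(n, min = 0, max = 100)))

# One row per level of cat, holding the mean of value in that group.
group_means <- expdata[, list(value = mean(value)), by = cat]

# Groups come back in order of first appearance, so sort by cat
# before comparing with other results.
group_means <- group_means[order(cat)]
```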
Hadley
On Thu, Aug 4, 2011 at 4:03 AM, Paul Hiemstra <p.h.hi...@gmail.com> wrote:
> Dear list,
>
> The piece of code at the end of this mail constructs a ggplot2 plot
> showing that for factors with many levels (>200), ddply performs
> much worse than ave. Beyond 1000 factor levels the difference becomes
> very large. My question is what causes this, and whether ddply could
> be improved based on how ave works.
>
> regards,
> Paul
>
> ps My sessionInfo() is listed at the end of this mail
>
> Paul Hiemstra, Ph.D.
> Global Climate Division
> Royal Netherlands Meteorological Institute (KNMI)
> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> P.O. Box 201 | 3730 AE | De Bilt
> tel: +31 30 2206 494
>
> http://intamap.geo.uu.nl/~paul
> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
>
>
> library(ggplot2)
> theme_set(theme_bw())
> datsize = c(10e4, 10e5)
> noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
> comb = expand.grid(datsize = datsize, noClasses = noClasses)
> res = ddply(comb, .(datsize, noClasses), function(x) {
>   expdata = data.frame(value = runif(x$datsize),
>     cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
>
>   # elapsed time (third element of system.time) for ave and for ddply
>   t1 = system.time(res1 <- with(expdata, ave(value, cat)))
>   t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
>   return(data.frame(tave = t1[3], tddply = t2[3]))
> }, .progress = 'text')
>
> ggplot(aes(x = noClasses, y = log(value), color = variable),
>     data = melt(res, id.vars = c("datsize", "noClasses"))) +
>   facet_wrap(~ datsize) +
>   geom_line()
>
> > res
>    datsize noClasses  tave tddply
> 1    1e+05        10 0.088  0.028
> 2    1e+05        50 0.098  0.040
> 3    1e+05       100 0.101  0.055
> 4    1e+05       200 0.105  0.091
> 5    1e+05       500 0.109  0.189
> 6    1e+05      1000 0.117  0.352
> 7    1e+05      2500 0.141  0.843
> 8    1e+05     10000 0.252  3.331
> 9    1e+06        10 0.878  0.260
> 10   1e+06        50 0.986  0.377
> 11   1e+06       100 1.009  0.444
> 12   1e+06       200 1.049  0.667
> 13   1e+06       500 1.056  1.257
> 14   1e+06      1000 1.081  2.276
> 15   1e+06      2500 1.137  5.246
> 16   1e+06     10000 1.286 20.367
>
> > sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] ggplot2_0.8.9 proto_0.3-8 reshape_0.8.4 plyr_1.5.2 gstat_0.9-76
> [6] sp_0.9-77 fortunes_1.4-1
>
> loaded via a namespace (and not attached):
> [1] digest_0.4.2 lattice_0.19-23 tools_2.13.0
>
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/