Re: Speed of ddply in comparison to ave for factors with a lot of levels


Hadley Wickham Aug 4, 2011 5:22 AM
Posted in group: manipulatr
This is a drawback of the way that ddply always works with data
frames.  It will be a bit faster if you use summarise instead of
data.frame (because data.frame is very slow), but I'm still thinking
about how to overcome this fundamental limitation of the ddply
approach.
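
A rough sketch of the summarise suggestion, for anyone who wants to try it. The data size, the number of groups, and the names by_dataframe and by_summarise are assumptions chosen so that it runs quickly; this is not Paul's exact benchmark, and timings will vary by machine and plyr version.

```r
library(plyr)

set.seed(1)
n <- 1e4          # assumed number of rows (kept small on purpose)
n_classes <- 500  # assumed number of factor levels
expdata <- data.frame(value = runif(n),
                      cat = sample(n_classes, n, replace = TRUE))

# Variant 1: build a one-row data.frame for every group
t_dataframe <- system.time(
  by_dataframe <- ddply(expdata, .(cat),
                        function(d) data.frame(value = mean(d$value)))
)

# Variant 2: let summarise compute the per-group mean
t_summarise <- system.time(
  by_summarise <- ddply(expdata, .(cat), summarise, value = mean(value))
)

# Elapsed seconds for both variants
print(c(data.frame = t_dataframe[["elapsed"]],
        summarise  = t_summarise[["elapsed"]]))
```

Both calls return one row per level of cat with the same means, so only the timings should differ.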

I have submitted a grant proposal to Google to fund "plyr2", which will
take a somewhat different approach and push all computation off to
other, faster data stores such as data.table and SQL databases.
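
To give an idea of what per-group means look like when computed directly in data.table, here is a minimal sketch. It only illustrates the existing data.table syntax and says nothing about how plyr2 would be designed; the data size and the name group_means are assumptions.

```r
library(data.table)

set.seed(1)
n <- 1e5           # assumed number of rows
n_classes <- 1000  # assumed number of groups
dt <- data.table(value = runif(n),
                 cat = sample(n_classes, n, replace = TRUE))

# One mean per group; keyby also sorts the result by cat
group_means <- dt[, list(value = mean(value)), keyby = cat]

head(group_means)
```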

Hadley

On Thu, Aug 4, 2011 at 4:03 AM, Paul Hiemstra <p.h.hi...@gmail.com> wrote:
> Dear list,
>
> The code at the end of this mail produces a ggplot2 plot showing
> that, for factors with many levels (>200), ddply performs much worse
> than ave. Beyond 1000 factor levels the difference becomes very
> large. My question is what causes this, and whether ddply could be
> improved based on how ave works.
>
> regards,
> Paul
>
> PS: My sessionInfo() is listed at the end of this mail.
>
> Paul Hiemstra, Ph.D.
> Global Climate Division
> Royal Netherlands Meteorological Institute (KNMI)
> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> P.O. Box 201 | 3730 AE | De Bilt
> tel: +31 30 2206 494
>
> http://intamap.geo.uu.nl/~paul
> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
>
>
> library(ggplot2)
> library(plyr)     # ddply
> library(reshape)  # melt
> theme_set(theme_bw())
> datsize = c(10e4, 10e5)  # i.e. 1e5 and 1e6 rows
> noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
> comb = expand.grid(datsize = datsize, noClasses = noClasses)
> # For each combination, time ave and ddply on the same random data
> res = ddply(comb, .(datsize, noClasses), function(x) {
>  expdata = data.frame(value = runif(x$datsize),
>                      cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
>
>  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
>  t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
>  return(data.frame(tave = t1[3], tddply = t2[3]))
> }, .progress = 'text')
>
> ggplot(aes(x = noClasses, y = log(value), color = variable),
>        data = melt(res, id.vars = c("datsize", "noClasses"))) +
>   facet_wrap(~ datsize) +
>   geom_line()
>
>> res
>   datsize noClasses  tave tddply
> 1    1e+05        10 0.088  0.028
> 2    1e+05        50 0.098  0.040
> 3    1e+05       100 0.101  0.055
> 4    1e+05       200 0.105  0.091
> 5    1e+05       500 0.109  0.189
> 6    1e+05      1000 0.117  0.352
> 7    1e+05      2500 0.141  0.843
> 8    1e+05     10000 0.252  3.331
> 9    1e+06        10 0.878  0.260
> 10   1e+06        50 0.986  0.377
> 11   1e+06       100 1.009  0.444
> 12   1e+06       200 1.049  0.667
> 13   1e+06       500 1.056  1.257
> 14   1e+06      1000 1.081  2.276
> 15   1e+06      2500 1.137  5.246
> 16   1e+06     10000 1.286 20.367
>
>> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] ggplot2_0.8.9  proto_0.3-8    reshape_0.8.4  plyr_1.5.2     gstat_0.9-76
> [6] sp_0.9-77      fortunes_1.4-1
>
> loaded via a namespace (and not attached):
> [1] digest_0.4.2    lattice_0.19-23 tools_2.13.0
>

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/