Date: Thu, 4 Aug 2011 05:44:00 -0700 (PDT)
Subject: Re: Speed of ddply in comparison to ave for factors with a lot of levels
From: Paul Hiemstra
To: manipulatr

I looked into data.table just now and added it to the list in my code
example:

library(ggplot2)
library(data.table)
theme_set(theme_bw())
datsize = c(10e4, 10e5)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)
  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
  t3 = system.time(res3 <- expdataDT[, sum(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')
res

ggplot(aes(x = noClasses, y = log(value), color = variable),
       data = melt(res, id.vars = c("datsize","noClasses"))) +
  facet_wrap(~ datsize) + geom_line()

It is easily the
fastest option:

> res
   datsize noClasses  tave tddply tdata.table
1    1e+05        10 0.091  0.035       0.011
2    1e+05        50 0.102  0.050       0.012
3    1e+05       100 0.105  0.065       0.012
4    1e+05       200 0.109  0.101       0.010
5    1e+05       500 0.113  0.248       0.012
6    1e+05      1000 0.123  0.438       0.012
7    1e+05      2500 0.146  0.956       0.013
8    1e+05     10000 0.251  3.525       0.020
9    1e+06        10 0.905  0.393       0.101
10   1e+06        50 1.003  0.473       0.100
11   1e+06       100 1.036  0.579       0.105
12   1e+06       200 1.052  0.826       0.106
13   1e+06       500 1.079  1.508       0.109
14   1e+06      1000 1.092  2.652       0.111
15   1e+06      2500 1.167  6.051       0.117
16   1e+06     10000 1.338 23.224       0.132

Having plyr use these kinds of packages would be awesome!

Paul

On Aug 4, 12:22 pm, Hadley Wickham wrote:
> This is a drawback of the way that ddply always works with data
> frames.  It will be a bit faster if you use summarise instead of
> data.frame (because data.frame is very slow), but I'm still thinking
> about how to overcome this fundamental limitation of the ddply
> approach.
>
> I have submitted a grant to Google to fund "plyr2", which will take a
> somewhat different approach and push off all computation to other,
> faster, data stores like data.table and SQL databases.
>
> Hadley
>
> On Thu, Aug 4, 2011 at 4:03 AM, Paul Hiemstra wrote:
> > Dear list,
> >
> > The piece of code at the end of this mail constructs a ggplot2 plot
> > that shows that for factors with a lot of levels (>200) the
> > performance of ddply is a lot worse than that of ave. Especially when
> > going to more than 1000 factor levels, the difference becomes very
> > large. My question is what causes this, and whether ddply could be
> > improved based on how ave works.
> >
> > regards,
> > Paul
> >
> > PS: My sessionInfo() is listed at the end of this mail
> >
> > Paul Hiemstra, Ph.D.
> > Global Climate Division
> > Royal Netherlands Meteorological Institute (KNMI)
> > Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> > P.O.
Box 201 | 3730 AE | De Bilt
> > tel: +31 30 2206 494
> >
> > http://intamap.geo.uu.nl/~paul
> > http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
> >
> > library(ggplot2)
> > theme_set(theme_bw())
> > datsize = c(10e4, 10e5)
> > noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
> > comb = expand.grid(datsize = datsize, noClasses = noClasses)
> > res = ddply(comb, .(datsize, noClasses), function(x) {
> >   expdata = data.frame(value = runif(x$datsize),
> >                        cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
> >
> >   t1 = system.time(res1 <- with(expdata, ave(value, cat)))
> >   t2 = system.time(res2 <- ddply(expdata, .(cat), mean))
> >   return(data.frame(tave = t1[3], tddply = t2[3]))
> > }, .progress = 'text')
> >
> > ggplot(aes(x = noClasses, y = log(value), color = variable),
> >        data = melt(res, id.vars = c("datsize","noClasses"))) +
> >   facet_wrap(~ datsize) + geom_line()
> >
> > > res
> >    datsize noClasses  tave tddply
> > 1    1e+05        10 0.088  0.028
> > 2    1e+05        50 0.098  0.040
> > 3    1e+05       100 0.101  0.055
> > 4    1e+05       200 0.105  0.091
> > 5    1e+05       500 0.109  0.189
> > 6    1e+05      1000 0.117  0.352
> > 7    1e+05      2500 0.141  0.843
> > 8    1e+05     10000 0.252  3.331
> > 9    1e+06        10 0.878  0.260
> > 10   1e+06        50 0.986  0.377
> > 11   1e+06       100 1.009  0.444
> > 12   1e+06       200 1.049  0.667
> > 13   1e+06       500 1.056  1.257
> > 14   1e+06      1000 1.081  2.276
> > 15   1e+06      2500 1.137  5.246
> > 16   1e+06     10000 1.286 20.367
> >
> > > sessionInfo()
> > R version 2.13.0 (2011-04-13)
> > Platform: i686-pc-linux-gnu (32-bit)
> >
> > locale:
> >  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] grid      stats     graphics  grDevices utils     datasets  methods
> > [8] base
> >
> > other attached packages:
> > [1] ggplot2_0.8.9  proto_0.3-8    reshape_0.8.4  plyr_1.5.2     gstat_0.9-76
> > [6] sp_0.9-77      fortunes_1.4-1
> >
> > loaded via a namespace (and not attached):
> > [1] digest_0.4.2    lattice_0.19-23 tools_2.13.0
>
> --
> Assistant Professor / Dobelman Family Junior Chair