The piece of code at the end of this mail constructs a ggplot2 plot
showing that for factors with many levels (>200), ddply performs
much worse than ave. Beyond 1000 factor levels in particular, the
difference becomes very large. My question is what causes this, and
whether ddply could be improved based on how ave works.
regards,
Paul
PS: My sessionInfo() is listed at the end of this mail.
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494
This is a drawback of the fact that ddply always works with data frames. It will be a bit faster if you use summarise instead of data.frame (because data.frame is very slow), but I'm still thinking about how to overcome this fundamental limitation of the ddply approach.
I have submitted a grant application to Google to fund "plyr2", which will take a somewhat different approach and push all computation off to other, faster data stores like data.table and SQL databases.
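For readers who want to try the comparison themselves, here is a minimal sketch. It is not Paul's original benchmark, which is not reproduced in this thread. The data frame `dat`, the columns `grp` and `x`, and the sizes are assumptions chosen for illustration. Timings will vary with machine and plyr version, so none are quoted here.

```r
library(plyr)  # provides ddply(), summarise() and .()

# Hypothetical test data: 1000 factor levels, 100 rows per level
n_levels <- 1000
n_rows   <- 1e5
set.seed(1)
dat <- data.frame(
  grp = factor(rep(seq_len(n_levels), length.out = n_rows)),
  x   = runif(n_rows)
)

# Base R: ave() returns the group mean repeated for every row
t_ave <- system.time(
  m_ave <- ave(dat$x, dat$grp, FUN = mean)
)

# ddply calling data.frame() once per group (the slow variant)
t_df <- system.time(
  res_df <- ddply(dat, .(grp), function(d) data.frame(m = mean(d$x)))
)

# ddply with summarise, which avoids the per-group data.frame() call
t_sum <- system.time(
  res_sum <- ddply(dat, .(grp), summarise, m = mean(x))
)

# Elapsed seconds for the three approaches
timings <- rbind(ave = t_ave,
                 ddply_data.frame = t_df,
                 ddply_summarise = t_sum)[, "elapsed"]
print(timings)
```

Note that ave returns one value per row while the two ddply calls return one row per group, so the results are only comparable after aggregating. Increasing `n_levels` should show how each approach scales with the number of groups.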
On Thu, Aug 4, 2011 at 4:03 AM, Paul Hiemstra <p.h.hiems...@gmail.com> wrote:
> Dear list,
>
> The piece of code at the end of this mail constructs a ggplot2 plot
> that shows that for factors with a lot of levels (>200) the
> performance of ddply is a lot worse than that of ave. Especially when
> going to more than a 1000 factor levels, the difference becomes very
> large. My question is how this is caused, and if ddply could be
> improved based on how ave works.
>
> regards,
> Paul
>
> ps My sessionInfo() is listed at the end of this mail
>
> Paul Hiemstra, Ph.D.
> Global Climate Division
> Royal Netherlands Meteorological Institute (KNMI)
> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> P.O. Box 201 | 3730 AE | De Bilt
> tel: +31 30 2206 494
>
> loaded via a namespace (and not attached):
> [1] digest_0.4.2 lattice_0.19-23 tools_2.13.0
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
On Thu, Aug 4, 2011 at 08:22, Hadley Wickham <had...@rice.edu> wrote:
> ...I have submitted a grant to google to fund "plyr2", which will take a
> somewhat different approach and push off all computation to other,
> faster, data stores like data.table and SQL databases.
Is there anything we on the mailing list can do to support your grant application?
>> ...I have submitted a grant to google to fund "plyr2", which will take a
>> somewhat different approach and push off all computation to other,
>> faster, data stores like data.table and SQL databases.
>
> Is there anything we on the mailing list can do to support your grant
> application?
Unless you work at Google, I don't think so at the moment.
Hadley