Speed of ddply in comparison to ave for factors with a lot of levels
From: Paul Hiemstra <p.h.hiems...@gmail.com>
Date: Thu, 4 Aug 2011 01:03:57 -0700 (PDT)
Subject: Speed of ddply in comparison to ave for factors with a lot of levels
Dear list,

The code at the end of this mail constructs a ggplot2 plot showing
that, for factors with many levels (>200), ddply performs much worse
than ave. Beyond 1000 factor levels the difference becomes very
large. My question is what causes this, and whether ddply could be
improved based on how ave works.

regards,
Paul

ps My sessionInfo() is listed at the end of this mail

Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494

http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770

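library(plyr)     # ddply
library(reshape)  # melt
library(ggplot2)
theme_set(theme_bw())

# Data sizes and numbers of factor levels to benchmark
datsize = c(10e4, 10e5)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# For each combination, time ave and ddply on the same random data
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))

  # Element 3 of system.time() is the elapsed time in seconds
  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  # mean() on a whole data frame returns NA in current R, so average the value column
  t2 = system.time(res2 <- ddply(expdata, .(cat), function(d) mean(d$value)))
  return(data.frame(tave = t1[3], tddply = t2[3]))
}, .progress = 'text')

# Log elapsed time against number of levels, one panel per data size
ggplot(melt(res, id.vars = c("datsize", "noClasses")),
       aes(x = noClasses, y = log(value), color = variable)) +
  facet_wrap(~ datsize) +
  geom_line()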

> res

   datsize noClasses  tave tddply
1    1e+05        10 0.088  0.028
2    1e+05        50 0.098  0.040
3    1e+05       100 0.101  0.055
4    1e+05       200 0.105  0.091
5    1e+05       500 0.109  0.189
6    1e+05      1000 0.117  0.352
7    1e+05      2500 0.141  0.843
8    1e+05     10000 0.252  3.331
9    1e+06        10 0.878  0.260
10   1e+06        50 0.986  0.377
11   1e+06       100 1.009  0.444
12   1e+06       200 1.049  0.667
13   1e+06       500 1.056  1.257
14   1e+06      1000 1.081  2.276
15   1e+06      2500 1.137  5.246
16   1e+06     10000 1.286 20.367
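
The size of the gap can be read off the table directly. As a small sketch (the timings are retyped by hand from the table above, and the `ratio` column is a name introduced here for illustration), dividing the ddply time by the ave time shows ddply ahead up to 200 levels, behind from 500 levels on, and roughly 13 to 16 times slower at 10000 levels:

```r
# Elapsed times in seconds, copied by hand from the table above
timings <- data.frame(
  datsize   = rep(c(1e5, 1e6), each = 8),
  noClasses = rep(c(10, 50, 100, 200, 500, 1000, 2500, 10000), times = 2),
  tave      = c(0.088, 0.098, 0.101, 0.105, 0.109, 0.117, 0.141, 0.252,
                0.878, 0.986, 1.009, 1.049, 1.056, 1.081, 1.137, 1.286),
  tddply    = c(0.028, 0.040, 0.055, 0.091, 0.189, 0.352, 0.843, 3.331,
                0.260, 0.377, 0.444, 0.667, 1.257, 2.276, 5.246, 20.367)
)

# ratio > 1 means ddply took longer than ave for that row
timings$ratio <- timings$tddply / timings$tave

print(timings[, c("datsize", "noClasses", "ratio")], digits = 3)
```

These are single runs on one machine, so the ratios are indicative only.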

> sessionInfo()

R version 2.13.0 (2011-04-13)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ggplot2_0.8.9  proto_0.3-8    reshape_0.8.4  plyr_1.5.2     gstat_0.9-76
[6] sp_0.9-77      fortunes_1.4-1

loaded via a namespace (and not attached):
[1] digest_0.4.2    lattice_0.19-23 tools_2.13.0


 
From: Hadley Wickham <had...@rice.edu>
Date: Thu, 4 Aug 2011 08:22:41 -0400
Subject: Re: Speed of ddply in comparison to ave for factors with a lot of levels
This is a drawback of the way that ddply always works with data
frames.  It will be a bit faster if you use summarise instead of
data.frame (because data.frame is very slow), but I'm still thinking
about how to overcome this fundamental limitation of the ddply
approach.
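
As a rough illustration of the summarise suggestion (a sketch only: the data size, the seed and the names `res_df`, `res_sum` and `res_base` are assumptions made for this example, not taken from the benchmark above), the following compares a per-group function that builds a data frame with the summarise form, and checks both against a base R tapply call that never creates per-group data frames:

```r
library(plyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
expdata <- data.frame(value = runif(1e4),
                      cat = round(runif(1e4, min = 0, max = 100)))

# Per-group function that constructs a data frame for every level
res_df <- ddply(expdata, .(cat), function(d) data.frame(value = mean(d$value)))

# Same computation expressed with summarise
res_sum <- ddply(expdata, .(cat), summarise, value = mean(value))

# Base R reference: group means without any per-group data frames
res_base <- tapply(expdata$value, expdata$cat, mean)

# All three approaches should agree on the group means
stopifnot(isTRUE(all.equal(res_df$value, res_sum$value)),
          isTRUE(all.equal(res_sum$value, as.vector(res_base))))
```

Wrapping each call in system.time() would show how much summarise helps on a given plyr version; that is something to measure rather than assume.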

I have submitted a grant to Google to fund "plyr2", which will take a
somewhat different approach and push off all computation to other,
faster, data stores like data.table and SQL databases.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

 
From: Paul Hiemstra <p.h.hiems...@gmail.com>
Date: Thu, 4 Aug 2011 05:44:00 -0700 (PDT)
Subject: Re: Speed of ddply in comparison to ave for factors with a lot of levels
I looked into data.table just now and added it to the list in my code
example:

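library(plyr)        # ddply
library(reshape)     # melt
library(ggplot2)
library(data.table)
theme_set(theme_bw())

# Data sizes and numbers of factor levels to benchmark
datsize = c(10e4, 10e5)
noClasses = c(10, 50, 100, 200, 500, 1000, 2500, 10e3)
comb = expand.grid(datsize = datsize, noClasses = noClasses)

# For each combination, time ave, ddply and data.table on the same random data
res = ddply(comb, .(datsize, noClasses), function(x) {
  expdata = data.frame(value = runif(x$datsize),
                       cat = round(runif(x$datsize, min = 0, max = x$noClasses)))
  expdataDT = data.table(expdata)

  # Element 3 of system.time() is the elapsed time in seconds
  t1 = system.time(res1 <- with(expdata, ave(value, cat)))
  # mean() on a whole data frame returns NA in current R, so average the value column
  t2 = system.time(res2 <- ddply(expdata, .(cat), function(d) mean(d$value)))
  # Group means, matching the ave and ddply calls
  t3 = system.time(res3 <- expdataDT[, mean(value), by = cat])
  return(data.frame(tave = t1[3], tddply = t2[3], tdata.table = t3[3]))
}, .progress = 'text')

res

# Log elapsed time against number of levels, one panel per data size
ggplot(melt(res, id.vars = c("datsize", "noClasses")),
       aes(x = noClasses, y = log(value), color = variable)) +
  facet_wrap(~ datsize) +
  geom_line()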

It is easily the fastest option:

> res

   datsize noClasses  tave tddply tdata.table
1    1e+05        10 0.091  0.035       0.011
2    1e+05        50 0.102  0.050       0.012
3    1e+05       100 0.105  0.065       0.012
4    1e+05       200 0.109  0.101       0.010
5    1e+05       500 0.113  0.248       0.012
6    1e+05      1000 0.123  0.438       0.012
7    1e+05      2500 0.146  0.956       0.013
8    1e+05     10000 0.251  3.525       0.020
9    1e+06        10 0.905  0.393       0.101
10   1e+06        50 1.003  0.473       0.100
11   1e+06       100 1.036  0.579       0.105
12   1e+06       200 1.052  0.826       0.106
13   1e+06       500 1.079  1.508       0.109
14   1e+06      1000 1.092  2.652       0.111
15   1e+06      2500 1.167  6.051       0.117
16   1e+06     10000 1.338 23.224       0.132

Having plyr use these kinds of packages would be awesome!
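
For anyone who wants the ave-style result (the group mean repeated on every row) from data.table rather than one row per group, a minimal sketch follows. The data size, the seed and the names `group_means`, `mean_value` and `group_mean` are assumptions made for this example:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
expdata <- data.frame(value = runif(1e4),
                      cat = round(runif(1e4, min = 0, max = 100)))
expdataDT <- data.table(expdata)

# One row per group, comparable to the ddply result
group_means <- expdataDT[, list(mean_value = mean(value)), by = cat]

# ave-style: add the group mean to every row, by reference
expdataDT[, group_mean := mean(value), by = cat]

# The new column should match what ave returns for the same data
stopifnot(isTRUE(all.equal(expdataDT$group_mean,
                           with(expdata, ave(value, cat)))))
```

The `:=` form modifies expdataDT in place, so no copy of the table is made.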

Paul



 
From: Stavros Macrakis <macra...@alum.mit.edu>
Date: Thu, 4 Aug 2011 14:52:20 -0400
Subject: Re: Speed of ddply in comparison to ave for factors with a lot of levels

On Thu, Aug 4, 2011 at 08:22, Hadley Wickham <had...@rice.edu> wrote:
> ...I have submitted a grant to Google to fund "plyr2", which will take a
> somewhat different approach and push off all computation to other,
> faster, data stores like data.table and SQL databases.

Is there anything we on the mailing list can do to support your grant
application?

             -s


 
From: Hadley Wickham <had...@rice.edu>
Date: Thu, 4 Aug 2011 15:03:03 -0400
Subject: Re: Speed of ddply in comparison to ave for factors with a lot of levels

>> ...I have submitted a grant to Google to fund "plyr2", which will take a
>> somewhat different approach and push off all computation to other,
>> faster, data stores like data.table and SQL databases.

> Is there anything we on the mailing list can do to support your grant
> application?

Unless you work at Google, I don't think so at the moment.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/


 