dplyr & conditional sums


Robbie Edwards

Nov 25, 2014, 2:11:10 PM
to manip...@googlegroups.com
Continuing my conversion to dplyr, I have the following in plyr (which is probably suboptimal):

set.seed( 1 )
df <- data.frame( group = sample( letters[1:4], 30, replace = T )
                , c1 = sample( letters[20:22], 30, replace = T )
                , v1 = rbinom( 30, 1, .5 )
                , c2 = sample( letters[20:22], 30, replace = T )
                , v2 = rbinom( 30, 1, .5 )
                )

ddply( df, .(group), summarize
     , t1t = sum( ifelse( c1 == "t", v1, 0 ))
     , t1u = sum( ifelse( c1 == "u", v1, 0 ))
     , t1v = sum( ifelse( c1 == "v", v1, 0 ))
     )

  group t1t t1u t1v
1     a   1   2   2
2     b   1   1   2
3     c   2   0   0
4     d   0   0   1

As is hopefully evident, I'm trying to calculate certain values based on the values stored in other columns.  I suppose I could do it in multiple steps and merge.  Is there a better way?

robbie



Robbie Edwards

Nov 25, 2014, 2:28:43 PM
to manip...@googlegroups.com
So I discovered this solution. I guess it's operational; still, there is certainly a better way.

df %>%
  mutate(
      t1t = ifelse( c1 == "t", v1, 0 )
    , t1u = ifelse( c1 == "u", v1, 0 )
    , t1v = ifelse( c1 == "v", v1, 0 )
  ) %>%
  group_by( group ) %>%
  summarise(
      t1t = sum( t1t )
    , t1u = sum( t1u )
    , t1v = sum( t1v )
  )



Robbie Edwards

Nov 25, 2014, 2:40:55 PM
to manip...@googlegroups.com
Alright, this works. Guess it wasn't so hard after all. I'm still curious whether this is the fastest way.

df %>%
  group_by( group ) %>%
  summarise(
      t1t = sum( ifelse( c1 == "t", v1, 0 ) )
    , t1u = sum( ifelse( c1 == "u", v1, 0 ) )
    , t1v = sum( ifelse( c1 == "v", v1, 0 ) )
  )


Hadley Wickham

Nov 25, 2014, 2:52:55 PM
to Robbie Edwards, manipulatr
Two approaches:

df %>%
  group_by(group) %>%
  summarise(
    t1t = sum(v1[c1 == "t"]),
    t1u = sum(v1[c1 == "u"]),
    t1v = sum(v1[c1 == "v"]))

# Or
library("tidyr")

df %>%
  group_by(group, c1) %>%
  summarise(t1 = sum(v1)) %>%
  spread(c1, t1)

Hadley



--
http://had.co.nz/

Robbie Edwards

Nov 25, 2014, 2:58:11 PM
to manip...@googlegroups.com
This doesn't seem to work when the table is a MySQL table.

Returns the error:
Error in mysqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: FUNCTION <tablename>.IFELSE does not exist)
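
My guess is that ifelse() is the piece dplyr can't translate into MySQL. One thing I may try is pushing only the grouped sum to the server (GROUP BY and SUM should translate) and doing the reshape locally after collect(), along the lines of Hadley's tidyr version. Untested sketch, with made-up connection and table names:

library(dplyr)
library(tidyr)

## hypothetical connection and table names; substitute your own
my_db  <- src_mysql( dbname = "mydb", user = "user", password = "pass" )
df_tbl <- tbl( my_db, "mytable" )

## aggregate on the database side, then pull the small grouped result
## into R and spread it to wide form
df_tbl %>%
  group_by( group, c1 ) %>%
  summarise( t1 = sum( v1 ) ) %>%
  collect() %>%
  spread( c1, t1 )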


Jaime Ashander

Nov 25, 2014, 3:30:36 PM
to Robbie Edwards, manipulatr
I had written a suggestion using reshape2 when Hadley sent his similar approach with tidyr (another cool package! I hadn't used it yet).

To add some value, I put our 3 suggestions up against each other with microbenchmark. (I wanted to include Hadley's tidyr version, but I get an "Error: index out of bounds" when running his code.)

The upshot: Robbie's and Hadley's approaches outperform the reshape version on this small example dataset, but for a similar structure with 3 million rows the reshape strategy seems much faster. See below for the details.

- Jaime


library(dplyr)
## robbie
Robbie <- function(df) {
    df %>% group_by( group ) %>% 
        summarise( 
            t1t = sum( ifelse( c1 == "t", v1, 0 ) )
            , t1u = sum( ifelse( c1 == "u", v1, 0 ) ) 
            , t1v = sum( ifelse( c1 == "v", v1, 0 ) ) 
            )
}

## hadley 1
Hadley1 <- function(df) {
    df %>%
        group_by(group) %>%
            summarise(
                t1t = sum(v1[c1 == "t"]),
                t1u = sum(v1[c1 == "u"]),
                t1v = sum(v1[c1 == "v"]))
}

## hadley 2
library("tidyr")
Hadley2<- function(df) {
    ## require tidyr
    df %>%
        group_by(group, c1) %>%
            summarise(t1 = sum(v1)) %>%
                spread(c1, t1)
}

## mine
library(reshape2)
Jaime <- function(df) {
    ## require reshape2
    df %>% group_by(group, c1) %>%
        summarize(condsum = sum(v1)) %>%
            melt(value.name='condsum', id.vars=c('group', 'c1')) %>%
                dcast(group ~ c1, value.var='condsum')
}

## let's do some benchmarking
library(microbenchmark)

set.seed( 1 )
N <- 30
df <- data.frame( group=sample( letters[1:4], N, replace=T)
                 , c1=sample( letters[20:22], N, replace=T)
                 , v1 = rbinom( N, 1, .5 )
                 , c2=sample( letters[20:22], N, replace=T)
                 , v2 = rbinom( N, 1, .5 )
                 )
## tidyr solution throws error so don't include :(
##> Hadley2(df)
##Error: index out of bounds

microbenchmark(
    Robbie(df),
    Hadley1(df),
    Jaime(df), times=1000)
## Unit: milliseconds
##         expr      min       lq     mean   median       uq      max neval
##   Robbie(df) 2.149660 2.405781 2.740472 2.571891 2.821945 11.86290  1000
##  Hadley1(df) 1.851988 2.076746 2.410784 2.221399 2.452184 27.91905  1000
##    Jaime(df) 6.018156 6.547313 7.356067 6.890498 7.415158 18.21118  1000

## now on a much larger (still in memory) example dataset
## note-- these take a bit of time to run!
set.seed( 1 )
N <- 3e6
df <- data.frame( group=sample( letters[1:4], N, replace=T)
                 , c1=sample( letters[20:22], N, replace=T)
                 , v1 = rbinom( N, 1, .5 )
                 , c2=sample( letters[20:22], N, replace=T)
                 , v2 = rbinom( N, 1, .5 )
                 )

microbenchmark(
    Robbie(df),
    Hadley1(df),
    Jaime(df), times=50)
## Unit: milliseconds
##         expr       min        lq      mean    median       uq       max neval
##   Robbie(df) 5709.4797 5791.7142 5903.6788 5865.5936 5966.040 6319.9855    50
##  Hadley1(df) 1762.8244 1828.1561 1875.0274 1863.7192 1906.411 2122.2618    50
##    Jaime(df)  219.5084  225.2723  244.2125  229.4026  242.537  458.7869    50

Robbie Edwards

Nov 25, 2014, 4:10:35 PM
to manip...@googlegroups.com, robbie....@gmail.com
Thanks for the suggestion Jaime.

Do you think that performance would hold up across a more complex data set?  The data set right now is about 1e6 rows with combinations of 4 sets of variables to calculate across.  
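
For concreteness, here's a rough sketch (against the toy data above) of extending the reshape approach to a second condition/value pair; the real data just has more of these pairs and many more rows:

library(dplyr)
library(reshape2)

## aggregate each (condition, value) pair separately, then merge the wide tables
## (combinations that never occur come out as NA rather than 0)
wide1 <- df %>% group_by( group, c1 ) %>% summarise( s = sum( v1 ) ) %>%
    as.data.frame() %>% dcast( group ~ c1, value.var = "s" )
wide2 <- df %>% group_by( group, c2 ) %>% summarise( s = sum( v2 ) ) %>%
    as.data.frame() %>% dcast( group ~ c2, value.var = "s" )
merge( wide1, wide2, by = "group", suffixes = c( ".v1", ".v2" ) )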

