dplyr & conditional sums


Robbie Edwards

Nov 25, 2014, 2:11:10 PM
to manip...@googlegroups.com
Continuing my conversion to dplyr, I have the following in plyr (which is probably suboptimal):

set.seed( 1 )
df <- data.frame( group = sample( letters[1:4], 30, replace = T )
                , c1 = sample( letters[20:22], 30, replace = T )
                , v1 = rbinom( 30, 1, .5 )
                , c2 = sample( letters[20:22], 30, replace = T )
                , v2 = rbinom( 30, 1, .5 )
                )

ddply( df, .(group), summarize
     , t1t = sum( ifelse( c1 == "t", v1, 0 ))
     , t1u = sum( ifelse( c1 == "u", v1, 0 ))
     , t1v = sum( ifelse( c1 == "v", v1, 0 ))
     )

  group t1t t1u t1v
1     a   1   2   2
2     b   1   1   2
3     c   2   0   0
4     d   0   0   1

As is hopefully evident, I'm trying to calculate certain values based on the values stored in other columns.  I suppose I could do it in multiple steps and merge.  Is there a better way?

robbie



Robbie Edwards

Nov 25, 2014, 2:28:43 PM
to manip...@googlegroups.com
So I discovered this solution. I guess it's operational; still, there is certainly a better way.

df %>%
  mutate(
      t1t = ifelse( c1 == "t", v1, 0 )
    , t1u = ifelse( c1 == "u", v1, 0 )
    , t1v = ifelse( c1 == "v", v1, 0 )
  ) %>%
  group_by( group ) %>%
  summarise(
      t1t = sum( t1t )
    , t1u = sum( t1u )
    , t1v = sum( t1v )
  )



Robbie Edwards

Nov 25, 2014, 2:40:55 PM
to manip...@googlegroups.com
Alright, this works. Guess it wasn't so hard after all. I'm still curious whether this is the fastest way.

df %>%
  group_by( group ) %>%
  summarise(
      t1t = sum( ifelse( c1 == "t", v1, 0 ) )
    , t1u = sum( ifelse( c1 == "u", v1, 0 ) )
    , t1v = sum( ifelse( c1 == "v", v1, 0 ) )
  )


Hadley Wickham

Nov 25, 2014, 2:52:55 PM
to Robbie Edwards, manipulatr
Two approaches:

df %>%
  group_by(group) %>%
  summarise(
    t1t = sum(v1[c1 == "t"]),
    t1u = sum(v1[c1 == "u"]),
    t1v = sum(v1[c1 == "v"]))

# Or
library("tidyr")

df %>%
  group_by(group, c1) %>%
  summarise(t1 = sum(v1)) %>%
  spread(c1, t1)

Hadley



--
http://had.co.nz/

Robbie Edwards

Nov 25, 2014, 2:58:11 PM
to manip...@googlegroups.com
This doesn't seem to work when the table is a MySQL table.

Returns the error:
Error in mysqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not run statement: FUNCTION <tablename>.IFELSE does not exist)
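
My guess is that ifelse() is the piece dplyr can't translate into MySQL. One thing I may try is pushing only the grouped sum to the server (GROUP BY and SUM should translate) and doing the reshape locally after collect(), along the lines of Hadley's tidyr version. Untested sketch, with made-up connection and table names:

library(dplyr)
library(tidyr)

## hypothetical connection and table names; substitute your own
my_db  <- src_mysql( dbname = "mydb", user = "user", password = "pass" )
df_tbl <- tbl( my_db, "mytable" )

## aggregate on the database side, then pull the small grouped result
## into R and spread it to wide form
df_tbl %>%
  group_by( group, c1 ) %>%
  summarise( t1 = sum( v1 ) ) %>%
  collect() %>%
  spread( c1, t1 )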


Jaime Ashander

Nov 25, 2014, 3:30:36 PM
to Robbie Edwards, manipulatr
I had written a suggestion using reshape2 when Hadley sent his similar approach with tidyr (another cool package! I hadn't used it yet).

To add some value, I put our 3 suggestions up against each other with microbenchmark. (I wanted to include Hadley's tidyr version, but I get an "Error: index out of bounds" when running his code.)

The upshot: Robbie's and Hadley's approaches outperform the reshape version on this small example dataset, but for a similar structure with 3 million rows the reshape strategy seems much faster. See below for the details.

- Jaime


library(dplyr)
## robbie
Robbie <- function(df) {
    df %>% group_by( group ) %>% 
        summarise( 
            t1t = sum( ifelse( c1 == "t", v1, 0 ) )
            , t1u = sum( ifelse( c1 == "u", v1, 0 ) ) 
            , t1v = sum( ifelse( c1 == "v", v1, 0 ) ) 
            )
}

## hadley 1
Hadley1 <- function(df) {
    df %>%
        group_by(group) %>%
            summarise(
                t1t = sum(v1[c1 == "t"]),
                t1u = sum(v1[c1 == "u"]),
                t1v = sum(v1[c1 == "v"]))
}

## hadley 2
library("tidyr")
Hadley2<- function(df) {
    ## require tidyr
    df %>%
        group_by(group, c1) %>%
            summarise(t1 = sum(v1)) %>%
                spread(c1, t1)
}

## mine
library(reshape2)
Jaime <- function(df) {
    ## require reshape2
    df %>% group_by(group, c1) %>%
        summarize(condsum = sum(v1)) %>%
            melt(value.name='condsum', id.vars=c('group', 'c1')) %>%
                dcast(group ~ c1, value.var='condsum')
}

## let's do some benchmarking
library(microbenchmark)

set.seed( 1 )
N <- 30
df <- data.frame( group=sample( letters[1:4], N, replace=T)
                 , c1=sample( letters[20:22], N, replace=T)
                 , v1 = rbinom( N, 1, .5 )
                 , c2=sample( letters[20:22], N, replace=T)
                 , v2 = rbinom( N, 1, .5 )
                 )
## tidyr solution throws error so don't include :(
##> Hadley2(df)
##Error: index out of bounds

microbenchmark(
    Robbie(df),
    Hadley1(df),
    Jaime(df), times=1000)
## Unit: milliseconds
##         expr      min       lq     mean   median       uq      max neval
##   Robbie(df) 2.149660 2.405781 2.740472 2.571891 2.821945 11.86290  1000
##  Hadley1(df) 1.851988 2.076746 2.410784 2.221399 2.452184 27.91905  1000
##    Jaime(df) 6.018156 6.547313 7.356067 6.890498 7.415158 18.21118  1000

## now on a much larger (still in memory) example dataset
## note-- these take a bit of time to run!
set.seed( 1 )
N <- 3e6
df <- data.frame( group=sample( letters[1:4], N, replace=T)
                 , c1=sample( letters[20:22], N, replace=T)
                 , v1 = rbinom( N, 1, .5 )
                 , c2=sample( letters[20:22], N, replace=T)
                 , v2 = rbinom( N, 1, .5 )
                 )

microbenchmark(
    Robbie(df),
    Hadley1(df),
    Jaime(df), times=50)
## Unit: milliseconds
##         expr       min        lq      mean    median       uq       max neval
##   Robbie(df) 5709.4797 5791.7142 5903.6788 5865.5936 5966.040 6319.9855    50
##  Hadley1(df) 1762.8244 1828.1561 1875.0274 1863.7192 1906.411 2122.2618    50
##    Jaime(df)  219.5084  225.2723  244.2125  229.4026  242.537  458.7869    50

Robbie Edwards

Nov 25, 2014, 4:10:35 PM
to manip...@googlegroups.com, robbie....@gmail.com
Thanks for the suggestion Jaime.

Do you think that performance would hold up across a more complex data set?  The data set right now is about 1e6 rows with combinations of 4 sets of variables to calculate across.  
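
For concreteness, here's a rough sketch (against the toy data above) of extending the reshape approach to a second condition/value pair; the real data just has more of these pairs and many more rows:

library(dplyr)
library(reshape2)

## aggregate each (condition, value) pair separately, then merge the wide tables
## (combinations that never occur come out as NA rather than 0)
wide1 <- df %>% group_by( group, c1 ) %>% summarise( s = sum( v1 ) ) %>%
    as.data.frame() %>% dcast( group ~ c1, value.var = "s" )
wide2 <- df %>% group_by( group, c2 ) %>% summarise( s = sum( v2 ) ) %>%
    as.data.frame() %>% dcast( group ~ c2, value.var = "s" )
merge( wide1, wide2, by = "group", suffixes = c( ".v1", ".v2" ) )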

