maply efficiency?

37 views

Skip to first unread message

philip

unread,

Sep 2, 2012, 1:05:09 PM9/2/12

to manip...@googlegroups.com

Dear list members,

Any help on this efficiency issue would be greatly appreciated.

I would like to find the most efficient way to run a non-vectorized function (here: fisher exact test p-value) iteratively using 4 matrices with identical dimensions. And as a result I aim for an array with identical dimensions containing the corresponding p-values. Please consider some code using a trivial example with 3x4 arrays below. Eventually I would like to run code on 2e3 x 7e6 arrays, for which someone suggested Amazon EC2 already...

Q1: would you agree that fisher.test() is not vectorizable? e.g. fisher.test( matrix(c(Ax,Ay,Bx,By),ncol=2) ) does not work

Q2: direct use of Ax, Ay, Bx, By as input instead of a (list) transform for the input would seem beneficial for speed

Q3: parallelization of the iterative process seems to make sense.

Q4: a progress bar seems to save peace of mind having no clue of the runtime.

Q5: avoidance of an output transform to get array from vector

Q6: for Q2/3/4/5 plyr seems to be ideal (e.g. maply)

Please also find some solutions below.

solution 1: using mapply

solution 2: using lapply

solution 3: using mclapply

attempt 4: stuck on plyr implementation

--Philip

### CODE START ###

Ax <- matrix(c(2,3,5,6,

3,7,8,9,

8,2,1,3), ncol = 4)

Ay <- matrix(c(9,8,5,7,

4,9,9,9,

8,7,5,4), ncol = 4)

Bx <- matrix(c(1,5,9,8,

4,7,8,9,

2,3,2,1), ncol = 4)

By <- matrix(c(5,5,9,9,

9,8,8,9,

5,5,3,2), ncol = 4)

### solution 1 using mapply

# proper answer, no input transform, output transform, no parallelization, no progress update

sol1 <- function() {

res1 <- mapply(

function(i,j,k,l) { fisher.test( matrix(c(i,j,k,l), ncol=2), conf.int=F)$p.value },

i=Ax, j=Ay, k=Bx, l=By,

SIMPLIFY=TRUE)

ans1 <- matrix(res1,ncol=4)

return(ans1)

}

s1 <- sol1()

### solution 2 using lapply

# proper answer, input transform, output transform, no parallelization, no progress update

sol2 <- function() {

tmp.list <- as.data.frame(rbind(as.numeric(Ax), as.numeric(Ay), as.numeric(Bx), as.numeric(By)))

# determine fisher.test p-values as list

res2 <- lapply(tmp.list,

function(x) { fisher.test( matrix(unlist(x), ncol=2), conf.int=F)$p.value })

ans2 <- matrix(unlist(res2),ncol=4)

return(ans2)

}

s2 <- sol2()

### solution 3 using mclapply

# proper answer using input transform, output transform, parallelization, no progress update

library(multicore)

sol3 <- function() {

tmp.list <- as.data.frame(rbind(as.numeric(Ax), as.numeric(Ay), as.numeric(Bx), as.numeric(By)))

# determine fisher.test p-values as list

res3 <- mclapply(tmp.list,

function(x) { fisher.test( matrix(unlist(x), ncol=2), conf.int=F)$p.value },

mc.cores=4)

ans3 <- matrix(unlist(res3),ncol=4)

}

s3 <- sol3()

### solution 4 using plyr::maply

# difficulty finding equivalent code

# benefit could be: no input transform, no output transform, parallelization, and progress update

library(plyr)

library(abind)

library(doMC)

registerDoMC(cores=4)

sol4 <- function() {

ans4 <- maply(

#.data = abind(i=Ax,j=Ay,k=Bx,l=By,along=0),

#.data = abind(Ax,Ay,Bx,By,along=3),

#.data = data.frame(i=Ax, j=Ay, k=Bx, l=By),

#.data = cbind(i=as.vector(Ax), j=as.vector(Ay), k=as.vector(Bx), l=as.vector(By)),

#.data = list(i=Ax, j=Ay, k=Bx, l=By),

.data = abind(i=Ax, j=Ay, k=Bx, l=By, along=3),

.fun = function(i,j,k,l) { fisher.test( matrix(c(i,j,k,l), ncol=2), conf.int=F)$p.value },

.progress = "text",

.parralel = TRUE

)

return(ans4)

}

all.equal(s1,s2) # TRUE

all.equal(s1,s3) # TRUE

library(microbenchmark)

microbenchmark(sol1, times=1000)

microbenchmark(sol2, times=1000)

microbenchmark(sol3, times=1000)

microbenchmark(sol4, times=1000)

Hadley Wickham

unread,

Sep 10, 2012, 5:56:28 PM9/10/12

to philip, manip...@googlegroups.com

Hi Philip,

I think you will only get marginal gains through tweaking how you run
the non-vectorised code. If you really want performance improvements,
I think you need to bite the bullet and vectorise fisher.test.

Hadley

> --
> You received this message because you are subscribed to the Google Groups
> "manipulatr" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/manipulatr/-/MRvbT1kwDmMJ.
> To post to this group, send email to manip...@googlegroups.com.
> To unsubscribe from this group, send email to
> manipulatr+...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/manipulatr?hl=en.

--
RStudio / Rice University
http://had.co.nz/

Reply all

Reply to author

Forward

0 new messages