maply efficiency?

37 views
Skip to first unread message

philip

unread,
Sep 2, 2012, 1:05:09 PM9/2/12
to manip...@googlegroups.com
Dear list members,

Any help on this efficiency issue would be greatly appreciated.

I would like to find the most efficient way to run a non-vectorized function (here: fisher exact test p-value) iteratively using 4 matrices with identical dimensions. And as a result I aim for an array with identical dimensions containing the corresponding p-values. Please consider some code using a trivial example with 3x4 arrays below. Eventually I would like to run code on 2e3 x 7e6 arrays, for which someone suggested Amazon EC2 already...

Q1: would you agree that fisher.test() is not vectorizable? e.g. fisher.test( matrix(c(Ax,Ay,Bx,By),ncol=2) ) does not work
Q2: direct use of Ax, Ay, Bx, By as input instead of a (list) transform for the input would seem beneficial for speed
Q3: parallelization of the iterative process seems to make sense.
Q4: a progress bar seems to save peace of mind having no clue of the runtime.
Q5: avoidance of an output transform to get array from vector  
Q6: for Q2/3/4/5 plyr seems to be ideal (e.g. maply) 

Please also find some solutions below.  
solution 1: using mapply
solution 2: using lapply
solution 3: using mclapply
attempt 4: stuck on plyr implementation

--Philip

### CODE START ###

Ax <- matrix(c(2,3,5,6,
               3,7,8,9,
               8,2,1,3), ncol = 4)
Ay <- matrix(c(9,8,5,7,
               4,9,9,9,
               8,7,5,4), ncol = 4)
Bx <- matrix(c(1,5,9,8,
               4,7,8,9,
               2,3,2,1), ncol = 4)
By <- matrix(c(5,5,9,9,
               9,8,8,9,
               5,5,3,2), ncol = 4)

### solution 1 using mapply
# proper answer, no input transform, output transform, no parallelization, no progress update
sol1 <- function() {
 res1 <- mapply(
   function(i,j,k,l) { fisher.test( matrix(c(i,j,k,l), ncol=2), conf.int=F)$p.value },
   i=Axj=Ayk=Bxl=By
   SIMPLIFY=TRUE)
 ans1 <- matrix(res1,ncol=4)
 return(ans1)
}
s1 <- sol1()

### solution 2 using lapply
# proper answer, input transform, output transform, no parallelization, no progress update
sol2 <- function() {
 tmp.list <- as.data.frame(rbind(as.numeric(Ax), as.numeric(Ay), as.numeric(Bx), as.numeric(By)))
 # determine fisher.test p-values as list
 res2 <- lapply(tmp.list
    function(x) { fisher.test( matrix(unlist(x), ncol=2), conf.int=F)$p.value })  
 ans2 <- matrix(unlist(res2),ncol=4)
 return(ans2)
}
s2 <- sol2()

### solution 3 using mclapply
# proper answer using input transform, output transform, parallelization, no progress update
library(multicore)
sol3 <- function() {
  tmp.list <- as.data.frame(rbind(as.numeric(Ax), as.numeric(Ay), as.numeric(Bx), as.numeric(By)))
  # determine fisher.test p-values as list
  res3 <- mclapply(tmp.list
    function(x) { fisher.test( matrix(unlist(x), ncol=2), conf.int=F)$p.value },
    mc.cores=4)  
  ans3 <- matrix(unlist(res3),ncol=4)
}
s3 <- sol3()

### solution 4 using plyr::maply
# difficulty finding equivalent code
# benefit could be: no input transform, no output transform, parallelization, and progress update
 library(plyr)
 library(abind)
 library(doMC)
 registerDoMC(cores=4)
 sol4 <- function() {
  ans4 <- maply(
   #.data = abind(i=Ax,j=Ay,k=Bx,l=By,along=0),
   #.data = abind(Ax,Ay,Bx,By,along=3),
   #.data = data.frame(i=Ax, j=Ay, k=Bx, l=By),
   #.data = cbind(i=as.vector(Ax), j=as.vector(Ay), k=as.vector(Bx), l=as.vector(By)),
   #.data = list(i=Ax, j=Ay, k=Bx, l=By),
   .data = abind(i=Axj=Ayk=Bxl=Byalong=3),
   .fun = function(i,j,k,l) { fisher.test( matrix(c(i,j,k,l), ncol=2), conf.int=F)$p.value },
   .progress "text",
   .parralel TRUE
  )
  return(ans4)
}

all.equal(s1,s2) # TRUE
all.equal(s1,s3) # TRUE

library(microbenchmark)
microbenchmark(sol1times=1000)
microbenchmark(sol2times=1000)
microbenchmark(sol3times=1000)
microbenchmark(sol4times=1000)

Hadley Wickham

unread,
Sep 10, 2012, 5:56:28 PM9/10/12
to philip, manip...@googlegroups.com
Hi Philip,

I think you will only get marginal gains through tweaking how you run
the non-vectorised code. If you really want performance improvements,
I think you need to bite the bullet and vectorise fisher.test.

Hadley
> --
> You received this message because you are subscribed to the Google Groups
> "manipulatr" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/manipulatr/-/MRvbT1kwDmMJ.
> To post to this group, send email to manip...@googlegroups.com.
> To unsubscribe from this group, send email to
> manipulatr+...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/manipulatr?hl=en.



--
RStudio / Rice University
http://had.co.nz/
Reply all
Reply to author
Forward
0 new messages