maply efficiency?
From: philip <pcv...@gmail.com>
Date: Sun, 2 Sep 2012 10:05:09 -0700 (PDT)
Local: Sun, Sep 2 2012 1:05 pm
Subject: maply efficiency?

Dear list members,

Any help on this efficiency issue would be greatly appreciated.

I would like to find the most efficient way to run a non-vectorized
function (here: the Fisher exact test p-value) element by element over
4 matrices with identical dimensions. The result should be an array with
the same dimensions containing the corresponding p-values. Please consider
the code below, which uses a trivial example with 3x4 arrays. Eventually I
would like to run the code on 2e3 x 7e6 arrays, for which someone has
already suggested Amazon EC2...
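
For a rough sense of that scale, here is a back-of-the-envelope sketch. It assumes dense double-precision storage at 8 bytes per cell (what R uses for numeric matrices) and counts four input matrices plus one output matrix; the variable names are illustrative only.

```r
# Number of cells in one 2e3 x 7e6 matrix.
n_cells <- 2e3 * 7e6

# Assumption: dense numeric (double) storage, 8 bytes per cell.
bytes_per_matrix <- n_cells * 8
gb_per_matrix    <- bytes_per_matrix / 1e9

# Four inputs (Ax, Ay, Bx, By) plus one matrix of p-values.
gb_total <- 5 * gb_per_matrix

# A matrix this large has more cells than a 32-bit integer index can
# address, so it needs long-vector support (R >= 3.0.0).
exceeds_int_index <- n_cells > .Machine$integer.max

cat(sprintf("cells per matrix: %.3g\n", n_cells))
cat(sprintf("GB per matrix:    %.0f\n", gb_per_matrix))
cat(sprintf("GB for 5 arrays:  %.0f\n", gb_total))
cat(sprintf("exceeds integer index range: %s\n", exceeds_int_index))
```

Under these assumptions the full problem does not fit in ordinary memory at once, so processing the matrices in blocks of columns is likely to be necessary whichever looping construct is used.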

Q1: Would you agree that fisher.test() is not vectorizable? E.g.
fisher.test( matrix(c(Ax,Ay,Bx,By),ncol=2) ) does not work.
Q2: Using Ax, Ay, Bx, By directly as input, instead of first transforming
them into a list, would seem beneficial for speed.
Q3: Parallelizing the iterative process seems to make sense.
Q4: A progress bar would give some peace of mind, since I have no clue of
the runtime.
Q5: Avoiding an output transform to get an array back from a vector.
Q6: For Q2/3/4/5 plyr seems to be ideal (e.g. maply).
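
As an illustration of Q2, Q4 and Q5 in base R only (a minimal sketch, not a plyr solution; the helper name fisher_p_elementwise is hypothetical): indexing the four matrices by linear position avoids the list transform, txtProgressBar() provides a progress display, and assigning into a copy of Ax with `out[]` keeps the array shape without a separate output transform.

```r
# Toy inputs copied from the post: four 3x4 matrices of counts.
Ax <- matrix(c(2,3,5,6, 3,7,8,9, 8,2,1,3), ncol = 4)
Ay <- matrix(c(9,8,5,7, 4,9,9,9, 8,7,5,4), ncol = 4)
Bx <- matrix(c(1,5,9,8, 4,7,8,9, 2,3,2,1), ncol = 4)
By <- matrix(c(5,5,9,9, 9,8,8,9, 5,5,3,2), ncol = 4)

# Hypothetical helper: one fisher.test() call per cell, same shape out as in.
fisher_p_elementwise <- function(Ax, Ay, Bx, By) {
  stopifnot(identical(dim(Ax), dim(Ay)),
            identical(dim(Ax), dim(Bx)),
            identical(dim(Ax), dim(By)))
  n  <- length(Ax)
  pb <- txtProgressBar(min = 0, max = n, style = 3)
  on.exit(close(pb))

  out <- Ax  # copy, so the result inherits the dimensions of the input
  # out[] <- ... fills the existing matrix in place of reshaping a vector.
  out[] <- vapply(seq_len(n), function(idx) {
    setTxtProgressBar(pb, idx)
    # Same 2x2 layout as in the post: column 1 = (Ax, Ay), column 2 = (Bx, By).
    tab <- matrix(c(Ax[idx], Ay[idx], Bx[idx], By[idx]), ncol = 2)
    fisher.test(tab, conf.int = FALSE)$p.value
  }, numeric(1))
  out
}

p_idx <- fisher_p_elementwise(Ax, Ay, Bx, By)
```

This still calls fisher.test() once per cell, so it addresses convenience rather than the per-call cost.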

Please also find some solutions below.  
solution 1: using mapply
solution 2: using lapply
solution 3: using mclapply
attempt 4: stuck on plyr implementation

--Philip

### CODE START ###

Ax <- matrix(c(2,3,5,6,
               3,7,8,9,
               8,2,1,3), ncol = 4)
Ay <- matrix(c(9,8,5,7,
               4,9,9,9,
               8,7,5,4), ncol = 4)
Bx <- matrix(c(1,5,9,8,
               4,7,8,9,
               2,3,2,1), ncol = 4)
By <- matrix(c(5,5,9,9,
               9,8,8,9,
               5,5,3,2), ncol = 4)

### solution 1 using mapply
# proper answer, no input transform, output transform, no parallelization,
# no progress update
sol1 <- function() {
  res1 <- mapply(
    function(i, j, k, l) {
      fisher.test(matrix(c(i, j, k, l), ncol = 2), conf.int = FALSE)$p.value
    },
    i = Ax, j = Ay, k = Bx, l = By,
    SIMPLIFY = TRUE)
  # reshape the vector of p-values back into a 3x4 matrix
  ans1 <- matrix(res1, ncol = 4)
  return(ans1)
}

s1 <- sol1()

### solution 2 using lapply
# proper answer, input transform, output transform, no parallelization,
# no progress update
sol2 <- function() {
  # one column per cell; the rows hold the Ax, Ay, Bx, By counts
  tmp.list <- as.data.frame(rbind(as.numeric(Ax), as.numeric(Ay),
                                  as.numeric(Bx), as.numeric(By)))
  # determine fisher.test p-values as list
  res2 <- lapply(tmp.list, function(x) {
    fisher.test(matrix(unlist(x), ncol = 2), conf.int = FALSE)$p.value
  })
  ans2 <- matrix(unlist(res2), ncol = 4)
  return(ans2)
}

s2 <- sol2()

### solution 3 using mclapply
# proper answer using input transform, output transform, parallelization,
# no progress update
# mclapply() is provided by the base 'parallel' package, which superseded
# the 'multicore' package originally loaded here
library(parallel)
sol3 <- function() {
  tmp.list <- as.data.frame(rbind(as.numeric(Ax), as.numeric(Ay),
                                  as.numeric(Bx), as.numeric(By)))
  # determine fisher.test p-values as list
  res3 <- mclapply(tmp.list, function(x) {
    fisher.test(matrix(unlist(x), ncol = 2), conf.int = FALSE)$p.value
  }, mc.cores = 4)
  ans3 <- matrix(unlist(res3), ncol = 4)
  return(ans3)
}

s3 <- sol3()

### solution 4 using plyr::maply
# difficulty finding equivalent code
# benefit could be: no input transform, no output transform,
# parallelization, and progress update
library(plyr)
library(abind)
library(doMC)
registerDoMC(cores = 4)
sol4 <- function() {
  ans4 <- maply(
    #.data = abind(i=Ax,j=Ay,k=Bx,l=By,along=0),
    #.data = abind(Ax,Ay,Bx,By,along=3),
    #.data = data.frame(i=Ax, j=Ay, k=Bx, l=By),
    #.data = cbind(i=as.vector(Ax), j=as.vector(Ay), k=as.vector(Bx), l=as.vector(By)),
    #.data = list(i=Ax, j=Ay, k=Bx, l=By),
    .data = abind(i=Ax, j=Ay, k=Bx, l=By, along=3),
    .fun = function(i, j, k, l) {
      fisher.test(matrix(c(i, j, k, l), ncol = 2), conf.int = FALSE)$p.value
    },
    .progress = "text",
    .parallel = TRUE
  )
  return(ans4)
}

all.equal(s1,s2) # TRUE
all.equal(s1,s3) # TRUE

library(microbenchmark)
# call the functions; a bare name such as sol1 only times looking up the object
microbenchmark(sol1(), times=1000)
microbenchmark(sol2(), times=1000)
microbenchmark(sol3(), times=1000)
microbenchmark(sol4(), times=1000)


 
From: Hadley Wickham <h.wick...@gmail.com>
Date: Mon, 10 Sep 2012 16:56:28 -0500
Local: Mon, Sep 10 2012 5:56 pm
Subject: Re: maply efficiency?
Hi Philip,

I think you will only get marginal gains through tweaking how you run
the non-vectorised code.  If you really want performance improvements,
I think you need to bite the bullet and vectorise fisher.test.

Hadley

--
RStudio / Rice University
http://had.co.nz/
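
One way to read the advice above, offered as a sketch rather than a drop-in replacement: for a 2x2 table, the two-sided p-value is the sum of the hypergeometric probabilities of all tables that are no more likely than the observed one, and dhyper() is vectorised over its parameters. The function name fisher_p_vectorised is hypothetical. The code assumes non-negative integer counts, the default two-sided alternative, a null odds ratio of 1, and the same table layout as in the original post.

```r
# Toy inputs copied from the original post.
Ax <- matrix(c(2,3,5,6, 3,7,8,9, 8,2,1,3), ncol = 4)
Ay <- matrix(c(9,8,5,7, 4,9,9,9, 8,7,5,4), ncol = 4)
Bx <- matrix(c(1,5,9,8, 4,7,8,9, 2,3,2,1), ncol = 4)
By <- matrix(c(5,5,9,9, 9,8,8,9, 5,5,3,2), ncol = 4)

# Hypothetical helper: two-sided Fisher exact p-values for many 2x2 tables.
# Each table is matrix(c(a, b, c, d), ncol = 2), i.e. column 1 = (a, b),
# column 2 = (c, d), matching matrix(c(i, j, k, l), ncol = 2) in the post.
fisher_p_vectorised <- function(a, b, c, d) {
  out_dim <- dim(a)
  a <- as.vector(a); b <- as.vector(b); c <- as.vector(c); d <- as.vector(d)

  m <- a + b  # total of column 1
  n <- c + d  # total of column 2
  k <- a + c  # total of row 1

  # Probability of the observed top-left count under the null.
  p_obs <- dhyper(a, m, n, k)
  # Small relative tolerance so that ties with p_obs are counted.
  rel_err <- 1 + 1e-7

  p <- numeric(length(a))
  # One vectorised pass per candidate top-left count; dhyper() returns 0
  # where that count is impossible for a given table's margins.
  for (x in 0:max(k)) {
    px <- dhyper(x, m, n, k)
    p  <- p + px * (px <= p_obs * rel_err)
  }
  dim(p) <- out_dim
  p
}

p_vec <- fisher_p_vectorised(Ax, Ay, Bx, By)
```

The loop runs once per possible count value rather than once per cell, so its length depends on the largest row total instead of on the number of tables. Whether that is faster in practice depends on the size of the counts, and would need benchmarking on real data.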

 