parboot() and parallel processing


Ted Jones

Jun 21, 2012, 8:16:10 PM
to unmarked
Dear unmarked users,

I am curious if any among you have experience or tips you might be
able to share with respect to distributing a parboot() run across
multiple processor cores, particularly on a Windows-based machine?

I have a double-observer pcount() model with 13 primary occasions in
which approximately 150 locations were sampled each time. The 150
sites are grouped into 7 stream segments, and by modeling abundance
as time dependent (primary occasion) with an interaction with stream
segment, plus four additional habitat covariates, I end up with
more than 100 parameter estimates for the full model. The data are
overdispersed, and I want to use parboot() to estimate the
dispersion so I can then rank models by QAIC.
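The c-hat calculation itself is simple once the bootstrap is done: the observed fit statistic divided by the mean of its simulated values. A minimal sketch (assuming a chi-square statistic named "Chisq" is returned by the fitstats function passed to parboot(); 'pb' below is a hypothetical parboot object):

```r
# Sketch: c-hat as observed fit statistic / mean of simulated statistics.
# 'observed' would come from pb@t0["Chisq"] and 'simulated' from
# pb@t.star[, "Chisq"] for a parboot object 'pb' (hypothetical names).
estimate_chat <- function(observed, simulated) observed / mean(simulated)

# e.g. chat <- estimate_chat(pb@t0["Chisq"], pb@t.star[, "Chisq"])
```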

The problem is that the full pcount() model takes about 10 hrs to fit
on my Core2Duo desktop (3.2GB RAM), and experience has shown (with
simpler models) that the length of time to complete a single parboot()
iteration can vary considerably depending on the particular dataset
generated for a given simulation run. I've benchmarked some pcount()
model-fitting run times on various machines, including one in my
laboratory with dual Xeon processors and 16GB of RAM, as well as a
20-core (7GB RAM) Amazon EC2 machine in the "cloud"; both decreased
the processing time by about 20%.

I've tested the boot() example from the "parallel" R package
documentation, and I am able to generate 1000 bootstrap replicates by
sending 500 simulations to each of the two cores on my desktop. The
example I used from the documentation is this:

run1 <- function(...) {
    library(boot)
    cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
    cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
    boot(cd4, corr, R = 500, sim = "parametric",
         ran.gen = cd4.rg, mle = cd4.mle)
}
library(parallel)  # provides makeCluster() etc. on the master
mc <- 2
cl <- makeCluster(mc)
## make this reproducible
clusterSetRNGStream(cl, 123)
library(boot)  # needed for the c() method on the master
cd4.boot <- do.call(c, parLapply(cl, seq_len(mc), run1))
boot.ci(cd4.boot, type = c("norm", "basic", "perc"),
        conf = 0.9, h = atanh, hinv = tanh)
stopCluster(cl)

I've tried to expand upon this example to get parboot() to run
similarly on multiple cores. Ideally, for example, I'd like to send
one parboot simulation run for my full model to each of 20 cores on an
Amazon EC2 machine which might then yield 20 parboot() simulation runs
in about 10 hours (plus or minus). If I repeated this on n machines I
could theoretically obtain 20n parboot() runs in about a 10 hour
period.

Thus far, none of my implementations have worked. I think I was close
on a recent attempt, but I believe the two cores on my desktop might
have been stomping on each other with simultaneous requests to access
the stored pcount() model object.

Apologies in advance for my lack of tech-savvy computer lingo, or my
poor modeling jargon -- I am still learning much in both departments.
That said, I am grateful for the informative discussions I've learned
from here.

Any ideas greatly appreciated.

Kind regards, Ted Jones




Richard Chandler

Jun 22, 2012, 9:34:01 AM
to unma...@googlegroups.com
Hi Ted,

The Royle (Biometrics 2004) N-mixture model assumes population closure (i.e., one primary period), so I don't see how you could model abundance as time-dependent. Have you looked at pcountOpen, which fits the open-population model of Dail and Madsen (Biometrics 2011)? Also, have you considered directly modeling overdispersion using the negative binomial or ZIP distributions?

To run parboot on multiple cores, you could do something like this:

library(unmarked)
library(parallel)
example(parboot)

cl <- makeCluster(getOption("cl.cores", 2))
clusterExport(cl, c("fm", "fitstats"))
clusterSetRNGStream(cl, 123)

# clusterEvalQ() runs the expression once on each worker, so with 2
# workers and nsim=20 this returns a list of 2 parboot objects
# (40 simulations in total)
sims <- clusterEvalQ(cl, {
    library(unmarked)
    parboot(fm, fitstats, nsim=20)
})
str(sims)

stopCluster(cl)
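Since clusterEvalQ() returns one parboot object per worker, the simulated fit statistics can be pooled afterwards. A sketch (relying on the t.star slot, where parboot stores one row of fit statistics per simulation):

```r
# Pool simulated fit statistics from a list of parboot objects into a
# single matrix, one row per simulation (t.star is parboot's slot of
# simulated statistics).
pool_tstar <- function(pb_list) {
  do.call(rbind, lapply(pb_list, function(pb) pb@t.star))
}

# e.g. t_star_all <- pool_tstar(sims)  # 2 workers x 20 sims = 40 rows
```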


Hope this helps.

Richard


_____________________________________
Richard Chandler, post-doc
USGS Patuxent Wildlife Research Center
301-497-5696





Ted Jones

Jun 22, 2012, 7:45:09 PM
to unmarked
Many thanks for the reply, Richard.

Your easy-to-follow example appears to be working!

Looking forward to testing it on the Amazon Cloud (and possibly seeing
a parboot run of 100 or more sometime before next year).

Regarding the N-mixture model, I am aware of the closure assumption,
and considered the Dail-Madsen open model. However, to keep things
simple, I used the Royle (2004) pcount model by assigning primary
occasion (bi-monthly sample) as a factor, and I assume the change in
abundance between occasions represents the balance of recruitment/
mortality/movement that occurs in the bi-monthly interim between
primary occasions.

On a side note, the assumption of closure for populations at the site
level is not perfectly met at the secondary-occasion level for my
data. There is some small amount of movement by individuals among
trap sites during a double-sample, although movement between stream
segments is unlikely during the two consecutive samples which occur on
back-to-back days. I believe the movement among trap sites may be
partially responsible for the excessive dispersion (2 < c-hat < 4,
likely).

I tried both NB and ZIP distributions as well. For the NB models, I
observed the same phenomenon of increasing lambda and decreasing p
when walking K upward, to the point where the estimates become very
unrealistic. I am not entirely interested in estimating total
abundance per se, because it is not entirely clear what the sampling
universe is (ie, the universe defined by the range of trap influence),
especially in light of the potential for movement between trap sites
during a double-sample. Instead, I'm using the model to more
generally infer about the relationship between abundance and habitat.

The webinar, this discussion board, and several great papers like the
Royle (2004) and the Dail and Madsen (2011) papers you mentioned, have
helped make the learning curve much less steep.

Thanks again,

Ted

Ted Jones

Jul 5, 2012, 3:09:31 PM
to unmarked
In follow up...

I was able to apply the logic from Chandler's code snippet on a
high-CPU Amazon EC2 machine with 20 virtual processor cores, and
succeeded in distributing the parboot() simulations across the 20 cores. I
accomplished 100 parboot() simulations in about 1.5 days, which is
impressive considering the pcount() model takes about 10 hrs to fit on
my desktop.

Thanks, again!

MP

Aug 1, 2014, 1:43:04 PM
to unma...@googlegroups.com
I am able to get parboot to run in parallel instances, in my case on 8 cluster nodes. I was wondering if there's a way to combine the resulting outputs into a single object of class parboot?
I got as far as creating a data frame combining the SSE, Chisq, and freemanTukey estimates from the 8 parboot instances, but I'd like to have it all together in a single handy object.
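One way to do that, sketched here as a hacky workaround rather than a supported unmarked API: keep one of the parboot objects and replace its t.star slot with the row-bound simulation matrices from all of them, so the usual parboot show/plot methods see a single large run.

```r
# Workaround sketch: merge a list of parboot objects into one by
# pooling their t.star simulation matrices; the first object's call
# and observed statistics (t0) are retained.
combine_parboot <- function(pb_list) {
  combined <- pb_list[[1]]
  combined@t.star <- do.call(rbind, lapply(pb_list, function(pb) pb@t.star))
  combined
}

# e.g. big_pb <- combine_parboot(list_of_8_parboots)  # hypothetical name
```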

Carly Scott

May 16, 2019, 4:27:27 PM
to unmarked
To whomever might be able to help!

I am trying to use this same logic to run mb.gof.test in parallel on a remote cluster, after also running my models in parallel on the cluster. However, I keep getting the following error:

Error in checkForRemoteErrors(lapply(cl, recvResult)) :
  10 nodes produced errors; first error: object 'x' not found

I don't even know what object 'x' is here!!!

This is my code:

bear3<-unmarkedFrameOccu(y=y3, siteCovs=site3, obsCovs=list(month=month3[,2:289]))

run_um<-function(x) {unmarked::occu(as.formula(x[[1]]),x[[2]])}

clusterExport(cl, c("bear3", "run_um"))

mods<-list(bearglobdet3<-list(formula='~cat+hum+elev+topo+tree+month~1', data=bear3),
              bearglobabu3<-list(formula='~1~cat+hum+elev+topo+tree', data=bear3))

modsout<-parLapply(cl=cl, X=mods, fun=run_um) 

#now to do goodness of fit
clusterExport(cl, "modsout")

clusterSetRNGStream(cl, 123)

gofout<-clusterEvalQ(cl, {
  library(AICcmodavg)
  mb.gof.test(modsout[[1]], nsim=1000, plot.hist=TRUE, report=NULL, parallel=TRUE)
}) # which results in the error

Any advice you have would be very useful for me. I am still relatively new to R, so it's probably a simple mistake somewhere! I have a ton of goodness-of-fit tests to run on my very small lab computer, which is why I am trying to make use of the remote server!

Thanks in advance. 

Ken Kellner

May 16, 2019, 9:16:37 PM
to unmarked
This error is due to the way unmarked fits updated models with new formulas or datasets, which is required for parboot() and thus mb.gof.test(). Some unmarked helper functions can't always handle situations where the original fitted model had expressions passed as arguments, as you have with 
unmarked::occu(as.formula(x[[1]]),x[[2]])

You can work around this issue in an extremely hacky way by modifying the saved `call` attribute of the unmarked fit object to trick unmarked into using the right formula and dataset. I don't have your full dataset to demonstrate but here's an example with the built in frog dataset:

data(frogs)
pferUMF <- unmarkedFrameOccu(pfer.bin)
# add some fake covariates for illustration
siteCovs(pferUMF) <- data.frame(sitevar1 = rnorm(numSites(pferUMF)))
# observation covariates are in site-major, observation-minor order
obsCovs(pferUMF) <- data.frame(obsvar1 = rnorm(numSites(pferUMF) * obsNum(pferUMF)))

mods <- list(
             list(formula='~1~1',data=pferUMF),
             list(formula='~obsvar1~1', data=pferUMF)
             )

run_um <- function(x){unmarked::occu(as.formula(x[[1]]),x[[2]])} #here is where the issue comes in

cl <- makeCluster(detectCores()-1)

mods_out <- parLapply(cl=cl, X=mods, run_um)

#GOF

#Would need to change pferUMF here to your dataset name
clusterExport(cl, c('pferUMF', 'mods', 'mods_out'))

run_gof <- function(i){
  
  #Modify call attribute of the model to work around issue
  new_call <- call('occu',as.formula(mods[[i]][[1]]),mods[[i]][[2]])

  this_mod <- mods_out[[i]]
  this_mod@call[2] <- new_call[2]
  this_mod@call[3] <- new_call[3]
   
  #Don't want to set this to parallel also, as you are
  #already in a parallel loop
  #note change in nsim vs your code
  AICcmodavg::mb.gof.test(this_mod, nsim=10, plot.hist=FALSE, 
                          report=NULL, parallel=FALSE)
}

#List of GOF output in same order as mods_out
gof_out <- parLapply(cl=cl, X=1:length(mods_out), run_gof)


stopCluster(cl)


However, I think you may be better off not running your GOF tests in parallel this way. If you run each GOF test on a separate core, as you are attempting to do, you can't also have mb.gof.test() run its simulations in parallel (since each parallelized GOF test is, by definition, only able to use a single core). The number of simulations running in each mb.gof.test() in your code (1000) is likely much higher than the total number of models you have. Thus it makes more sense to me to run the GOF tests in a regular non-parallel for loop (or similar) and let mb.gof.test() itself run in parallel. Unfortunately, you still have to correct for the problem above:

gof_out <- vector("list", length=length(mods_out))
for (i in seq_along(mods_out)){

  new_call <- call('occu',as.formula(mods[[i]][[1]]),mods[[i]][[2]])

  this_mod <- mods_out[[i]]
  this_mod@call[2] <- new_call[2]
  this_mod@call[3] <- new_call[3]

  #Note change in nsim vs your code
  gof_out[[i]] <- AICcmodavg::mb.gof.test(this_mod, nsim=10, plot.hist=FALSE,
                                          report=NULL, parallel=TRUE)
}



Carly Scott

Jun 4, 2019, 8:47:36 PM
to unmarked
Thanks so much Ken! I was able to get it all figured out using the code you provided! Worked perfectly!