Sampling chains in parallel with rstan under windows


Jonathan Gilligan

Jul 23, 2014, 5:19:59 PM
to stan-...@googlegroups.com
I've been grateful for everyone (especially Bob Carpenter) who's patiently answered my stan questions. I hope sharing my experiences getting multiple chains running in parallel under windows will return some of those favors and be useful.

I have been experimenting with sampling stan chains in parallel under Windows for long computations. There is some documentation on the rstan site about running parallel chains, but because Windows doesn't allow unix-style process forking, things are more complicated. I have tried several parallel R packages with varying results. The snow package works all right and is simple to invoke, but its default "SOCK" communication produces intermittent but annoyingly frequent interprocess communication glitches that waste time when I have to restart runs. snow doesn't work with MPI under Windows, at least not for me, even when Rmpi is installed and working. So recently I tried pbdMPI (pbd = "programming with big data"; see http://r-pbd.org/ and http://cran.r-project.org/web/packages/pbdMPI/index.html ). It is harder to code with, but once you have the hang of it, it's pretty straightforward to get multiple stan chains running in parallel.

I'm using the R pbdMPI library with the Microsoft HPC Pack MPI implementation ( http://technet.microsoft.com/en-us/library/cc514029.aspx ). The Microsoft HPC Pack is free and will run multiple MPI processes on multiple processors on a single computer. To get a cluster going across multiple computers, you would need Microsoft's HPC Server (but I can't see why anyone would want to run huge numbers of chains). Alternatively, all of this should work with OpenMPI under Linux and Mac OS, either on a single computer or in a cluster.

I am attaching an example of a toy R script that runs the 8schools model with either 4 parallel chains (if use.mpi is undefined or set to TRUE when the script is loaded) or 4 sequential chains (if use.mpi is set to FALSE). It should serve as a template in which you can substitute your own model for the 8schools one. Running the 8schools model with 4 chains of 1 million iterations (absurdly large, but long processing times reduce the noise from random overhead at initialization and termination) on a rather modest four-core workstation takes 115 seconds in normal single-process mode and 32 seconds with parallel processing (this is the time for sampling only, not including the time to compile the model with stanc and g++).

For most small models, there is no benefit to running chains in parallel: compiling the stan model takes up the lion's share of the processor time. But for big models that take hours to converge, cutting the sampling time by a factor of three or more may make the extra hassle of parallel chains worthwhile.

The key functions here are pbdMPI's bcast() and task.pull(). bcast() broadcasts a value from one process (process #0, the master, by default) to all the others; this is how you initialize variables and compile the stan model once, then pass the R stanmodel object to all the processes. task.pull() tells the worker processes to start executing a function N times. If N <= the number of workers, all the tasks run simultaneously; if N > the number of workers, each worker executes one task and, when it's done, pulls a new task from the master until all the tasks are done. The master collects the results and returns them as a list, which can be assembled into a stanfit object using rstan's sflist2stanfit() function. It is possible to pass parameters to the worker tasks through the function's ellipsis (...) arguments, but I found it easier to pass them as global variables that I initialize in each process using bcast().
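To make the flow concrete, here is a minimal sketch of how these pieces might fit together. This is not the attached MPI_schools.R; the file name, iteration counts, and chain count are my own illustrations, and the data are the classic 8schools numbers.

```r
# Sketch only.  Launch with, e.g.:  mpiexec -np 5 Rscript.exe sketch.R
library (pbdMPI)
library (rstan)

init ()  # start MPI; every process executes this whole script

if (comm.rank () == 0) {
    # only the master compiles the model
    sm <- stan_model (file = "8schools.stan")  # illustrative file name
} else {
    sm <- NULL
}

# the classic 8schools data (Rubin 1981)
schools.data <- list (J = 8,
    y = c (28, 8, -3, 7, -1, 1, 18, 12),
    sigma = c (15, 10, 16, 11, 9, 11, 4, 6))

# broadcast the compiled model from rank 0 to all the workers
sm <- bcast (sm)

run.chain <- function (chain) {
    # each task samples a single chain
    sampling (sm, data = schools.data, chains = 1, chain_id = chain,
        iter = 2000, warmup = 1000)
}

# workers pull chain numbers one at a time; the master collects the results
fit.list <- task.pull (1 : 4, run.chain)

if (comm.rank () == 0) {
    fit <- sflist2stanfit (fit.list)
    save (fit, file = "fit.RData")  # load() this in an interactive session
}

finalize ()
```

Since this needs to run under mpiexec, it can't be sourced from an interactive session; save the result to disk and analyze it separately, as described below.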

pbdMPI does not want to run in interactive mode (e.g., RStudio), so you need to invoke it via the mpiexec command from a command prompt or batch file/shell script. The best approach is to run the stan computation under mpiexec and have the script save the stanfit object as a .RData file, which you can then read into an interactive session to analyze.

Some pbdMPI terminology that may be confusing: the process number (0, 1, 2, ...) is referred to throughout the documentation as the "rank" of the process, so you get the process number by calling comm.rank(). The rank-zero process is, by default, the master, and the higher-rank processes are workers. If you have code (such as compiling a stan model) that you want only one process to execute, enclose it in an if (comm.rank() == 0) block.

To get this going from scratch (assuming you're running windows and have already got R and rstan installed):

1. Download and install the Microsoft HPC pack
2. Install pbdMPI either from CRAN or from the r-pbd web site
3. From the command line, in the directory where the attached script lives, run (the -np option sets the number of processes to start):
mpiexec -np 5 Rscript.exe MPI_schools.R

To run in single-process mode, invoke the script thus:
Rscript -e "use.mpi <- FALSE; source('MPI_schools.R')"

One thing that's not obvious: if you want to run flat out with one chain per core, the number of processes (the -np option to mpiexec) should be one greater than min(# processors, # chains), because one process will be the master and the rest will be workers that execute the chains. You need n+1 processes to run n chains flat-out on n cores while the master waits for the workers to finish. Thus, to run 4 chains on 4 cores, you would use -np 5; to run 3 chains on 4 cores, you'd use -np 4.
MPI_schools.R

Linas Mockus

Jul 24, 2014, 3:18:43 PM
to stan-...@googlegroups.com
Hi,

In case anybody needs it, below is R code to run chains in parallel (in the simplest sense, where the chains don't talk to each other) on Windows. I am using this code to check that convergence is reached across multiple chains.

library (parallel)

stanmod <- function (chain, file, sm, data, warmup, n.draws, pars, init) {
    # Run a single chain on one worker.  Unless init is the string
    # "random", pull out this chain's initial values and wrap them in a
    # one-element list, as sampling() expects for a single chain.  (The
    # is.character() check avoids comparing a list against "random" with
    # ==, which is an error in R.)
    if (!(is.character (init) && length (init) == 1 && init == "random")) {
        init <- list (init [[chain]])
    }
    sampling (sm, data = data, pars = pars, chains = 1,
        iter = warmup + n.draws, warmup = warmup,
        init = init,
        diagnostic_file = paste (file, chain, "d.txt", sep = ""),
        sample_file = paste (file, chain, "s.txt", sep = ""),
        verbose = TRUE, refresh = max ((warmup + n.draws) / 10, 1),
        chain_id = chain)
}
par.stan <- function (file, code, data, warmup, n.draws, chains, pars,
    init)
{
    # translate the model to C++
    rt <- stanc (file, model_code = code, model_name = file)
    if (rt$status == FALSE) return (NULL)
    # compile the C++ code once, before starting any workers
    sm <- stan_model (stanc_ret = rt, verbose = FALSE)
    cat ("COMPILED\n")
    # one socket worker per chain; workers' output goes to <file>out.txt
    cl <- makePSOCKcluster (chains,
        outfile = paste (file, "out.txt", sep = ""))
    setDefaultCluster (cl)
    clusterExport (NULL, c ('stanmod'))
    clusterEvalQ (NULL, library (rstan))
    # run stanmod once per chain, in parallel; extra arguments after
    # stanmod are passed through to it by parLapply
    posterior <- parLapply (NULL, 1 : chains, stanmod, file, sm, data, warmup,
        n.draws, pars, init)
    stopCluster (cl)
    return (posterior)
}
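For example, a hypothetical call might look like the following (the model code, data, and parameter names here are placeholders, not from my actual model); the returned list of single-chain stanfit objects can then be merged with rstan's sflist2stanfit():

```r
library (rstan)

# illustrative inputs -- substitute your own model code and data
fits <- par.stan (file = "mymodel", code = my.model.code, data = my.data,
    warmup = 1000, n.draws = 1000, chains = 4,
    pars = c ("mu", "tau"), init = "random")

fit <- sflist2stanfit (fits)  # combine the chains into one stanfit
print (fit)                   # check Rhat across the parallel chains
```

Since each chain writes its own sample_file and diagnostic_file, you can also inspect the per-chain output on disk if a worker dies.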

Linas