Why is rstan so much slower in a docker container


James Camac

Nov 29, 2016, 3:27:33 AM
to Stan users mailing list
I know I've asked a variant of this before, but I have now worked on a couple of projects that run rstan in docker containers, and every time I've noticed that runs inside a docker container are about 4 times slower than what I get on the host machine.

For example, the current project I am working on has the following dockerfile which loads the environment and data for an entire project:

FROM rocker/tidyverse:3.3.2
MAINTAINER James Camac <james...@gmail.com>

RUN    apt-get update \
    && apt-get install -y --no-install-recommends \
         libcurl4-openssl-dev \
         texlive-latex-recommended \
         texlive-latex-extra \
         texlive-humanities \
         texlive-fonts-recommended \
         texlive-science \
         lmodern \
         git \
    && apt-get clean \
    && apt-get autoremove \
    && rm -rf /var/lib/apt/lists/*

RUN install2.r \
    --deps "TRUE" \
    rstan reshape2 cowplot lubridate

RUN installGithub.r \
    --deps "TRUE" \
    richfitz/remake

RUN rm -rf /tmp/downloaded_packages/ /tmp/*.rds

RUN git clone https://github.com/jscamac/Alpine_Shrub_Experiment /home/Alpine_Shrub_Experiment

This project can then be made reproducible (from data analysis through to compiling the PDF of the manuscript) by building the image and running an interactive container: inside the container you move to /home/Alpine_Shrub_Experiment and run `remake::make()`.
On my host computer (MacBook Pro, OS X 10.11.6, 16 GB RAM, 2.8 GHz i7) the whole process takes about 50 minutes (it compiles the data, runs 11 Stan models, builds figures and compiles the manuscript). However, in a docker container with full use of my machine it takes 195 minutes.

Am I missing something with linux and some stan optimisation?



James Camac

Nov 29, 2016, 3:29:51 AM
to Stan users mailing list
I should have added: Stan is definitely the bottleneck in the docker container. The processing of the data, building of figures and compiling of the manuscript take only a fraction of the time (and I haven't noticed any difference between host and container for these components).

Matt Espe

Nov 29, 2016, 12:09:53 PM
to Stan users mailing list
Hi James,

I have found through experimentation (and Bob has mentioned earlier) that Stan is often bottlenecked by memory. I have seen dramatic decreases in run times after moving to machines with faster memory, even with slower CPUs. There are also a handful of reports of slow I/O and memory performance on Docker images.

I am no expert, but I would guess that could be the culprit.

Matt

Bob Carpenter

Nov 29, 2016, 1:43:42 PM
to stan-...@googlegroups.com
Thanks for letting us know. Have other people experienced
the same slowdown? What platform are you running docker on?

This stuff's very tricky to profile at high optimization
levels because the optimizer and profilers don't play
nicely together.

Memory's always a likely culprit. Disk I/O can also be
an issue with RStan, which flushes rather aggressively,
which isn't very nice if you have slow I/O (we tend to mask
that pain running on solid state drives in all of our development
environments---Daniel and I were just discussing this yesterday).

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

James Camac

Nov 29, 2016, 5:21:06 PM
to stan-...@googlegroups.com
Thanks Matt and Bob,
Just so you know, running the models uses about 1.5 GB of RAM, and on my machine the docker container has access to the entire 16 GB available.
The only differences between the docker container and the host machine are the OS (debian:jessie in the container vs OS X 10.11.6 on the host) and, probably, the compilers (gcc on debian, clang on OS X).

I'd be really interested to see if others are having similar issues and whether they have found any solutions.


J

James Camac

Nov 29, 2016, 5:30:47 PM
to stan-...@googlegroups.com
If I get time today I will make a minimum working example.

J


Bob Carpenter

Nov 29, 2016, 5:58:29 PM
to stan-...@googlegroups.com
Have you compared speeds on those machines without Docker and
with the same compilers that docker's using? Do they both have
solid-state drives? Are you running multiple chains in parallel
inside and outside docker?

You really need an apples-to-apples comparison here.

And it's not memory size so much as whether docker inserts
a layer of processing somewhere.

- Bob

Daniel Lee

Nov 29, 2016, 6:19:18 PM
to stan-...@googlegroups.com
I was about to ask the same thing. What's the performance hit if you use the same OS as the host OS with the same compiler and compiler options?

James Camac

Nov 30, 2016, 7:02:49 AM
to Stan users mailing list
I don't have access to a linux machine at the moment so I can't compare apples with apples so to speak.
I'll see if I can find one tomorrow.

Bob, some answers to your questions below:

Have you compared speeds on those machines without Docker and with the same compilers that docker's using? No, not yet. As I said in my previous email, the docker container is using gcc and the iMac is using clang.
Do they both have solid-state drives? I'm running the docker container on the same machine (so I'm not sure what you mean by "both"). I'm using an iMac which has a 1 TB "fusion drive".
Are you running multiple chains in parallel inside and outside docker? 3 chains, and yes, in parallel.

While I find a linux machine to test this on, I've produced a more manageable working example that others can try if they want.
If you have docker installed, you can run the following from a terminal:
docker pull jscamac/rstan_docker_test   # downloads the docker image

docker run -it jscamac/rstan_docker_test   # runs an interactive container that opens R

# Once in R, just run the following:
system.time(remake::make('greaus_seedling_density_model'))


The same model can be compared on a local machine by doing the following:

# Using the terminal:
git clone https://github.com/jscamac/Alpine_Shrub_Experiment /home/Alpine_Shrub_Experiment
# Open R in the repository just downloaded, then install remake:
devtools::install_github("richfitz/remake", dependencies = TRUE)
# Install any missing packages by running:
remake::install_missing_packages()
# and then run the model using:
system.time(remake::make('greaus_seedling_density_model'))

Using an iMac (4 GHz i7, 16 GB RAM, OS X 10.11.6):
Docker container:

   user  system elapsed 
 16.810   1.310 834.164 

Chain times:

 Elapsed Time: 347.746 seconds (Warm-up)
               298.067 seconds (Sampling)
               645.813 seconds (Total)

 Elapsed Time: 355.069 seconds (Warm-up)
               384.621 seconds (Sampling)
               739.69 seconds (Total)

 Elapsed Time: 521.517 seconds (Warm-up)
               292.196 seconds (Sampling)
               813.713 seconds (Total)


Locally on the iMac:

   user  system elapsed 
 22.487   1.096 116.995 

Chain times:

 Elapsed Time: 39.6557 seconds (Warm-up)
               30.5696 seconds (Sampling)
               70.2254 seconds (Total)

 Elapsed Time: 39.9883 seconds (Warm-up)
               32.2393 seconds (Sampling)
               72.2276 seconds (Total)

 Elapsed Time: 42.8292 seconds (Warm-up)
               44.0682 seconds (Sampling)
               86.8974 seconds (Total)


Daniel Lee

Nov 30, 2016, 6:40:51 PM
to stan-users mailing list
Hi James,

That output looks really suspect.

If I'm reading the output of the time command correctly, it looks like running in Docker is faster than locally:
Docker, ~18 s of CPU:

   user  system elapsed 
 16.810   1.310 834.164 

Local, ~23 s of CPU:

   user  system elapsed 
 22.487   1.096 116.995 


See this thread on StackOverflow:


At this point, you've got to figure out where all the time is going. I just took a look at your repo: there's a lot more than just rstan in there. Can you boil that down to simple rstan commands so I can run this locally?

Just to give you a point of reference, when I run this on the example-models/bugs_examples/vol1/blocker/blocker.stan, this is my result:

library(rstan)
stan_data <- read_rdump("blocker.data.R")
fit <- stan("blocker.stan", data = stan_data)   # first call compiles the model
system.time(fit <- stan(fit = fit, data = stan_data))

   user  system elapsed 
  2.298   0.075   2.555 


When I enable parallel, this is my result:

options(mc.cores = parallel::detectCores())
system.time(fit <- stan(fit = fit, data = stan_data))

   user  system elapsed 
  0.191   0.085   4.382 


Notice: my elapsed time goes up. My user time + system time drops a bit.

If that's how you're running it, try running it in serial and see what results you get.
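The user/elapsed pattern above is easier to read once you remember that the `user` figure is CPU time charged to the parent process only: when the chains run in separate subprocesses, the CPU they burn never shows up there, while wall-clock (elapsed) time does grow. A minimal sketch of this OS-level behaviour, in Python rather than R purely for illustration (the child workload is made up):

```python
import subprocess
import sys
import time

# Burn CPU in a *child* process while the parent just waits on it,
# mimicking how parallel chains run outside the parent process.
child_code = "t = 0\nfor i in range(3_000_000):\n    t += i * i"

wall0 = time.perf_counter()
cpu0 = time.process_time()   # CPU clock of this (parent) process only
subprocess.run([sys.executable, "-c", child_code], check=True)
wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0   # the child's CPU never shows up here

# The parent looks almost idle: tiny CPU time, large elapsed time.
print(cpu < wall)
```

The same asymmetry explains why system.time's user column can shrink while elapsed grows when parallel chains are enabled.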



Daniel






Daniel Falster

Nov 30, 2016, 9:15:44 PM
to stan-...@googlegroups.com

Hi all,

I have experienced similar slowdown using rstan via docker containers in another project so am keen to help figure this out.

Thanks for the example James. I have run your examples natively on a Mac (OS X Yosemite 10.10.5), natively on Linux (Ubuntu 14.04.3 LTS), and in docker containers (Debian GNU/Linux 8 (jessie)) on both the Mac and the Linux machine. From everything I can see there is a big slowdown, exactly as James reported.

Number of CPUs: When I first read James's results I was concerned that perhaps the docker container was only accessing 1 CPU, but that seems not to be the case. The results below show that when I ensure a docker container is using 3 CPUs (we're running 3 chains), I get a similar elapsed time to James (~800 s). When I run in a docker container using 1 CPU, the elapsed time is about 3 times longer (~2243 s), while the time rstan reports to sample each chain is similar to the 3-CPU case (~700-800 s). The attached pictures show evidence of using 1 vs 3 CPUs.

Native Mac vs native Linux: I get similar speeds running the code natively on the Mac and on the Linux box running Ubuntu 14.04.3 LTS.

In a docker container on Mac or Linux: I get similar speeds running the code in the supplied docker container on either machine. In both cases the docker runs are 5-10 times slower than the native runs, and I verified that the containers were accessing all 3 CPUs.

Elapsed vs user time: I'm confused about how to interpret user vs elapsed time. In all my runs the code had 100% access to the CPUs, yet running in docker took far longer than running natively. The elapsed time returned by `system.time` in R is very close to the per-chain times reported by rstan.


Results supporting the above:

On my mac, run natively using 3 CPUs:

Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 60.0834 seconds (Warm-up)
               49.3 seconds (Sampling)
               109.383 seconds (Total)


Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 61.2126 seconds (Warm-up)
               48.9942 seconds (Sampling)
               110.207 seconds (Total)


Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 64.7741 seconds (Warm-up)
               53.9386 seconds (Sampling)
               118.713 seconds (Total)

   user  system elapsed
 19.419   1.295 152.333

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

On linux, run natively using 3 CPUs:

Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 37.8166 seconds (Warm-up)
               32.2405 seconds (Sampling)
               70.0571 seconds (Total)


Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 41.5698 seconds (Warm-up)
               31.7675 seconds (Sampling)
               73.3373 seconds (Total)


Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 42.8351 seconds (Warm-up)
               32.1626 seconds (Sampling)
               74.9977 seconds (Total)

   user  system elapsed 
 32.726   0.876 110.121 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

In a docker container on a linux machine, using 3 CPUs:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)


Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 395.384 seconds (Warm-up)
               328.645 seconds (Sampling)
               724.03 seconds (Total)


Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 437.642 seconds (Warm-up)
               318.54 seconds (Sampling)
               756.182 seconds (Total)

Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 547.706 seconds (Warm-up)
               355.394 seconds (Sampling)
               903.1 seconds (Total)

   user  system elapsed 
 19.160   1.040 926.962 


In a docker container on my mac, using 3 CPUs:


Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 394.128 seconds (Warm-up)
               308.669 seconds (Sampling)
               702.796 seconds (Total)


Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 386.403 seconds (Warm-up)
               324.391 seconds (Sampling)
               710.795 seconds (Total)


Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 430.622 seconds (Warm-up)
               332.088 seconds (Sampling)
               762.71 seconds (Total)

   user  system elapsed
 25.650   4.070 821.169

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

In a docker container on my mac, using 1 CPU:

Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 395.79 seconds (Warm-up)
               318.767 seconds (Sampling)
               714.557 seconds (Total)

Chain 2, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 386.743 seconds (Warm-up)
               361.51 seconds (Sampling)
               748.253 seconds (Total)

Chain 3, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 376.643 seconds (Warm-up)
               375.721 seconds (Sampling)
               752.365 seconds (Total)

    user   system  elapsed
2237.390    1.270 2243.029


> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

docker-1cpu.png
docker-3cpu.png

Daniel Falster

Dec 1, 2016, 12:50:41 AM
to Stan users mailing list
Some time ago I mentioned the slowdown in rstan performance under docker to some experienced colleagues, Dirk Eddelbuettel and Colin Gillespie. Both wondered whether changing the BLAS library would help. This is beyond my expertise, but I'll mention it FWIW. Apparently an optimised BLAS library can give a big speedup when using R on any machine. More worryingly, Dirk wondered whether there were two layers both trying to multithread and use all the cores, which could explain the big slowdown we're seeing. He suggested checking what BLAS library Stan uses and what is installed in the docker image, and then trying ATLAS if it isn't already the default.

James's docker container builds off rocker/tidyverse, which builds off rocker/rstudio-stable, which builds off rocker/r-ver, which builds off debian:jessie. The installation of R when constructing rocker/r-ver uses the option "--without-blas", which suggests it is using R's default BLAS library.

Perhaps one of the Stan team can comment on whether this might indeed make a difference and be worth pursuing further?

 

Ben Goodrich

Dec 1, 2016, 1:15:52 AM
to Stan users mailing list
On Thursday, December 1, 2016 at 12:50:41 AM UTC-5, Daniel Falster wrote:
Perhaps one of the stan team can comment on whether this might indeed make any difference and be worth pursuing further? 

I don't think Stan uses BLAS. There are options to use BLAS with Eigen

https://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html

but we don't set any of those flags by default.

James Camac

Dec 1, 2016, 2:48:56 AM
to Stan users mailing list
Thanks Daniel Lee,
Yes I can produce a MWE. I just wanted to make sure others could replicate my problem (which Daniel Falster has done nicely on multiple computers).
Yes the repo has a lot more stuff on it. But that's because it is designed to run a series of models. I just selected one for this case.
But now that I know this seems to be more than an iMac vs Linux issue, I'll focus on making a simpler example in the next day or so.

J

Matt Espe

Dec 2, 2016, 12:09:44 PM
to stan-...@googlegroups.com
One issue is that the read_rdump function is stupid slow. I put up a pull request with a fix on the rstan-dev repo. Just this little change gives something like a 140x speedup in reading the data into R. It doesn't help with the rest of the model, but it cut 15 min off my analysis.

Matt

Bob Carpenter

Dec 2, 2016, 12:34:58 PM
to stan-...@googlegroups.com

> On Dec 2, 2016, at 12:09 PM, Matt Espe <lck...@gmail.com> wrote:
>
> One issue is the read_rdump function is stupid slow. I put up a pull request for a fix on the rstan-dev repo. Just this little change results in something like a 140X speedup in reading the data into R.

Wow. And it looks like just this:

Altered read_rdump() to pass in keep.source = FALSE to source().

Hope that's OK and can be merged.

- Bob

James Camac

Dec 5, 2016, 4:34:01 AM
to stan-...@googlegroups.com
Hi guys,
I finally managed to find some time to develop some simpler examples without the `remake` workflow and other stuff.

Basically I set up a repository with 4 models (3 of my own + the 8schools example) to test.

Example 1: 8schools
8schools is a simple example taken directly from the stan repository. In total the model estimates 10 parameters.

Example 2: recruits 
recruits is a simple count model that also extracts four categorical generated-quantity parameters. In total the model estimates 5 parameters + 32 plot random effects.

Example 3: shrub density
density is another count model, but with many more parameters and random effects. In total this model has 209 parameters (200 of which are random-effect parameters). It also contains a range of generated quantities.

Example 4: shrub density minus generated quantities
The same as example 3, but with no generated quantities.

You can see the actual models by going to: https://github.com/jscamac/mwe_stan_docker
That link also provides instructions on how these models can be run in a docker container using an image I've already prebuilt (for details see the Dockerfile in the repository).
But the basics:
On docker container:
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)      

On host/local machine
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

Both are running rstan 2.12.1.

Below are the time results I got from running these models both locally on my iMac  (4 GHz i7, 16 GB RAM, OSX = 10.11.6) and within the docker container on the same machine.

The main thing to notice is that the chain times for ALL models are substantially faster locally than in the docker container (see attached), even in the 8schools example. Like Daniel Falster, I found all models completed substantially faster on the local machine, so the system.time estimates (shown under TOTAL) seem a little odd. As such I'd focus more on Stan's chain times: for pretty much all models, sampling takes 10x as long inside the docker container.


Screen Shot 2016-12-05 at 8.41.24 PM.png

James Camac

Dec 5, 2016, 4:55:59 AM
to Stan users mailing list
Sorry I just realised my table didn't compile correctly. So I've edited my previous post and added it as an attachment.

Daniel Lee

Dec 5, 2016, 9:04:49 AM
to stan-...@googlegroups.com
Hi James,

The work you put in for the table is on the right path, but it isn't enough to distinguish causes. We're trying to find an order-of-magnitude difference.

I think I remember Bob outlining a few things that could differ between a docker container and a local run. Here are some I can think of:
- cpu runtime
- parallel not implemented correctly
- I/O

If you want to see if it's the first or the second, please rerun with 1 core only. It looks like when you run in parallel, time doesn't report the subprocesses. 

For the I/O, one way to test it is to compare with CmdStan.


Daniel


On Dec 5, 2016, at 4:55 AM, James Camac <james...@gmail.com> wrote:

Sorry I just realised my table didn't compile correctly. So I've edited my previous post and added it as an attachment.


Bob Carpenter

Dec 5, 2016, 11:42:14 AM
to stan-...@googlegroups.com
Someone also brought up memory---I haven't been the only one
chiming in. These are all generic and not Stan-specific, so
I'd think there'd be a lot of information out there on this if
this kind of slowdown can happen.

- Bob

Daniel Falster

Dec 5, 2016, 8:01:35 PM
to stan-...@googlegroups.com
Hi. To determine whether the slowdown James and I have experienced is specific to Stan or comes from docker more generally, I benchmarked R running natively vs R running inside docker. (I get similar results using the rocker/tidyverse container or James's jscamac/rstan_docker_test container.)

In summary, running R inside the rocker/tidyverse container took 1.28 times as long as running natively. That is a slowdown, but nothing like the 10x slowdown we're seeing in the Stan sampling.

I used the package benchmarkme to run a standardized set of tests: https://github.com/csgillespie/benchmarkme
In docker, individual tasks take 0.92-2.5 times as long as native; summed across all tests, docker takes 1.28 times as long as native.

Code for running the tests is below. Rds files with my results are attached.

To start the docker container: docker run -v ${PWD}:/root -it rocker/tidyverse R

# Running tests in docker
install.packages("benchmarkme")
library(benchmarkme)
benchmarkme::get_ram()
res = benchmark_std()
saveRDS(res, "/root/docker-6GB.rds")

# Running tests natively
install.packages("benchmarkme")
library(benchmarkme)
benchmarkme::get_ram()
res = benchmark_std()
saveRDS(res, "native.rds")


# Comparing results
library(tidyverse)
res1 <-  readRDS("native.rds") %>%
  mutate(grp= "native")
res2 <-  readRDS("docker-6GB.rds") %>%
  mutate(grp= "docker")
results <- bind_rows(res1, res2)

# plot docker v native
ggplot(results, aes(test, user)) +
  geom_point(aes(col=grp)) +
  scale_y_continuous(name = "User (sec)")

# ratio of docker to native by task
results %>% group_by(grp, test) %>% summarise(mn=mean(user)) %>% arrange(test) %>% ungroup() %>%
  group_by(test) %>%
    summarise(native = mn[grp=="native"], docker = mn[grp=="docker"], ratio = mn[grp=="docker"]/mn[grp=="native"])

# ratio of docker to native summed over all tasks
results %>% group_by(grp) %>% summarise(mn=sum(user)) %>% ungroup() %>%
    summarise(native = mn[grp=="native"], docker = mn[grp=="docker"], ratio = mn[grp=="docker"]/mn[grp=="native"])



 
docker-6GB.rds
native.rds
benchmark.png

Bob Carpenter

Dec 5, 2016, 8:33:33 PM
to stan-...@googlegroups.com

> On Dec 5, 2016, at 8:01 PM, Daniel Falster <adaptiv...@gmail.com> wrote:
>
> Hi, To determine whether the slowdown James and I have experienced is specific to stan or more generally because of using docker, I benchmarked R running natively vs R running inside docker. (I get similar results using the rocker/tidyverse docker container or james's jscamac/rstan_docker_test container.)
>
> In summary, running R inside the rocker/tidyverse container results took 1.28 times longer than running natively. This is a slowdown but nothing like the 10x slowdown we're seeing in the stan sampling.

> I used the package benchmarkme to run a standardized set of tests https://github.com/csgillespie/benchmarkme
> In docker, individual tasks take from 0.92-2.5 times the time take when run natively. Summed across all tests docker takes 1.28 times running natively.

I know nothing at all about docker or how it runs, but
if there is a slowdown somewhere in something like memory or
disk access, then the tighter the code is, the larger the
slowdown will appear. I don't know what those R benchmarks
are---is that mostly R code or is it things like matrix operations
that call Fortran on the back end? If it's the latter, then
that's more like Stan.

- Bob

Daniel Falster

Dec 5, 2016, 8:58:09 PM
to Stan users mailing list
Sorry if this was a diversion; I was hoping it would help isolate where the slowdown is happening. I don't know much about this stuff.

As far as I can tell, most of the tests -- while implemented in R -- call backend operations, i.e. they test R's basic linear algebra libraries (BLAS and LAPACK). Here is some more info on the types of operations being tested (executed via this code https://github.com/csgillespie/benchmarkme/tree/master/R):

# Programming benchmarks (5 tests):
3,500,000 Fibonacci numbers calculation (vector calc): 0.608 (sec).
Grand common divisors of 1,000,000 pairs (recursion): 1.23 (sec).
Creation of a 3500x3500 Hilbert matrix (matrix calc): 0.438 (sec).
Creation of a 3000x3000 Toeplitz matrix (loops): 18.8 (sec).
Escoufier's method on a 60x60 matrix (mixed): 2.67 (sec).

# Matrix calculation benchmarks (5 tests):
Creation, transp., deformation of a 5000x5000 matrix: 1.15 (sec).
2500x2500 normal distributed random matrix ^1000: 0.54 (sec).
Sorting of 7,000,000 random values: 1.11 (sec).
2500x2500 cross-product matrix (b = a' * a): 7.42 (sec).
Linear regr. over a 3000x3000 matrix (c = a \ b'): 4.94 (sec).

# Matrix function benchmarks (5 tests):
Cholesky decomposition of a 3000x3000 matrix: 3.73 (sec).
Determinant of a 2500x2500 random matrix: 2.97 (sec).
Eigenvalues of a 640x640 random matrix: 0.658 (sec).
FFT over 2,500,000 random values: 0.629 (sec).
Inverse of a 1600x1600 random matrix: 2.66 (sec).

James Camac

Dec 5, 2016, 10:49:00 PM
to Stan users mailing list
Hi Daniel Lee,
In my rush to post last night I forgot to include the tests without parallel processing. The short answer is that parallelism doesn't seem to be the problem (see attached).

I haven't used CmdStan before. I managed to create a docker container with it installed, but I haven't figured out how to run it yet; I'll need to spend some time with the manual. (The parallel comparison is in the table in the previous post.)

I'll see what I can come up with.


Daniel Lee

Dec 5, 2016, 11:15:31 PM
to stan-users mailing list
Hi James,

That output illuminates a lot! And it opens up a lot more questions.

How does Ex 1 work? On docker, it's ~0.5 sec per chain and total time is 17 sec. On local, it's ~0.05 sec per chain (that's an order of magnitude) and the total time is ~27 sec. What's going on? How many chains are you running? If that's because you've enabled parallelization through RStan, turn that off (don't do this: options(mc.cores = parallel::detectCores()) -- I don't know how to turn it off once it's on).

Regarding those benchmarks you ran: how many times did you run them? There's a lot of variation run-to-run. Locally, I see something like a 2x difference on small runs, even when no other processes are active. I don't know what the variation will be through a docker container. You'll want to know how much variation there is when you're running the same bits (make sure seeds are the same, everything is the same).
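One way to get a handle on that run-to-run variation is to repeat the timing and report the spread rather than a single number. A small illustrative sketch (Python rather than the thread's R workflow; the toy workload is made up):

```python
import statistics
import time

def time_reps(f, reps=5):
    # Wall-time f() several times; on small runs the spread can easily be 2x.
    ts = []
    for _ in range(reps):
        t0 = time.perf_counter()
        f()
        ts.append(time.perf_counter() - t0)
    return min(ts), statistics.median(ts), max(ts)

lo, med, hi = time_reps(lambda: sum(i * i for i in range(200_000)))
# Compare the range across repetitions, not a single draw.
print(lo <= med <= hi)
```

If the docker-vs-native gap is much larger than the within-machine spread, the comparison is meaningful.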

I was telling Bob earlier, I wish I knew more about docker. If I did, I would probably run through these things out of curiosity.


Daniel





James Camac

Dec 5, 2016, 11:43:19 PM
to Stan users mailing list
Hi Daniel,
It isn't running chains in parallel; I made sure of that. It is running 4 chains using rstan's default settings (a single process), which means the chains run sequentially (one chain at a time).
I have also run a single-chain example, and the same 10-fold difference remains.

I think the bulk of the total time (i.e. system.time()) is taken by compiling the model rather than the run itself. I'm less worried about the compile times and more worried about the sampling times (i.e. chain times).
Also I have set the same seed for all models (12345).

I'm not sure what benchmark runs you are referring to? The tables are the time results for each chain (4 chains in total) for each model, plus the TOTAL, which is R's system.time() return.

J

Daniel Lee

Dec 5, 2016, 11:52:26 PM
to stan-users mailing list
On Mon, Dec 5, 2016 at 11:43 PM, James Camac <james...@gmail.com> wrote:
Hi Daniel,
It isn't running parallel processors. I made sure of that. It is running 4 chains using rstan's default settings (i.e. a single processor). This means the chains run sequentially (i.e. one chain at a time).
I have run a single chain example and the same 10-fold difference doesn't change.

Great. That's good to know. For future timing, can you separate the compile time from the run time?

And... are you getting similar n_effs? (We've already been bitten by that across different compilers.) I just want to make sure we're comparing apples to apples as much as we can.

Speaking of... what versions of compilers are you using? Sorry for being so pedantic, but these things matter if you want to time.

And can you verify that this is with optimization -O3 on both? If the total time (compile + runtime) is shorter for example 1 on docker, that's fishy; maybe the compiler options aren't actually set properly.
 
I think the bulk of the total time (i.e. system.time() ) is captured by the compiling of the model rather then the run itself. I'm less worried about the compiling times and more worried about the sampling times (i.e. chain times).

That's fine, but when there's a difference where the compile time is faster in the docker image, that should have thrown a red flag.
 
Also I have set the same seed for all models (12345).

I don't think this will hold, but can you see if the draws are identical? If not, then you'll have to compare time / n_eff, not total time (or time / iterations).
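The time / n_eff comparison suggested above can be sketched like this (Python for illustration only; the numbers are hypothetical, loosely based on elapsed times reported earlier in the thread):

```python
# Hypothetical elapsed times and effective sample sizes for one model,
# run in docker and locally; compare cost per effective draw, not raw time.
runs = {
    "docker": {"elapsed_s": 834.2, "n_eff": 1074},
    "local":  {"elapsed_s": 117.0, "n_eff": 1074},
}

def sec_per_eff_draw(run):
    # seconds of sampling per effective (roughly independent) draw
    return run["elapsed_s"] / run["n_eff"]

slowdown = sec_per_eff_draw(runs["docker"]) / sec_per_eff_draw(runs["local"])
print(round(slowdown, 1))
```

When n_eff matches across runs this reduces to the raw time ratio; when it doesn't, time per effective draw is the fair comparison.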
 
I'm not sure what benchmark runs you are referring too? The tables are the time results for each chain (total of 4 chains) for each model. Plus the TOTAL which refers to R's system.time() return.

The benchmarks you ran for R. I couldn't tell if you ran that once and reported a single set of numbers or if you ran it N times and averaged it. Here, the variation is also important.



Daniel

 

J

James Camac

unread,
Dec 6, 2016, 12:37:41 AM12/6/16
to Stan users mailing list
The docker container is using compiler: g++ (Debian 4.9.2-10) 4.9.2
My iMac uses: Apple LLVM version 8.0.0 (clang-800.0.42.1).
Daniel Falster has a Linux machine and did a more comparable test above, running the models both inside a docker container on his Linux machine and outside the container. He found the same performance problem, so this doesn't appear to be an operating-system issue.

re: effective sample size comparison

DOCKER
Inference for Stan model: 8schools.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu       7.95    0.15 5.02  -1.78   4.81   7.76  11.10  17.78  1074    1
tau      6.23    0.16 5.38   0.20   2.22   4.87   8.72  20.65  1170    1
eta[1]   0.35    0.02 0.95  -1.58  -0.23   0.37   0.99   2.13  2692    1
eta[2]   0.01    0.02 0.88  -1.69  -0.58   0.00   0.59   1.74  2650    1
eta[3]  -0.19    0.02 0.93  -2.01  -0.82  -0.21   0.42   1.64  2805    1
eta[4]  -0.02    0.02 0.87  -1.73  -0.60  -0.03   0.55   1.69  3127    1
eta[5]  -0.34    0.02 0.87  -2.00  -0.91  -0.36   0.23   1.39  2699    1
eta[6]  -0.18    0.02 0.90  -1.94  -0.79  -0.18   0.41   1.62  2557    1
eta[7]   0.33    0.02 0.90  -1.44  -0.25   0.33   0.91   2.18  2315    1
eta[8]   0.04    0.02 0.96  -1.85  -0.61   0.05   0.67   1.90  2932    1
lp__   -39.63    0.07 2.67 -45.59 -41.23 -39.40 -37.76 -35.03  1283    1

Samples were drawn using NUTS(diag_e) at Tue Dec  6 04:56:24 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

LOCAL

Inference for Stan model: 8schools.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu       8.04    0.12 4.92  -1.80   4.87   8.05  11.23  17.97  1682 1.00
tau      6.63    0.15 5.65   0.24   2.47   5.27   9.25  20.92  1476 1.01
eta[1]   0.38    0.02 0.94  -1.53  -0.25   0.40   1.04   2.12  2823 1.00
eta[2]   0.01    0.02 0.87  -1.73  -0.55   0.02   0.59   1.70  2919 1.00
eta[3]  -0.18    0.02 0.90  -1.93  -0.77  -0.19   0.40   1.65  2776 1.00
eta[4]  -0.01    0.02 0.83  -1.62  -0.56  -0.02   0.55   1.65  2426 1.00
eta[5]  -0.33    0.02 0.87  -2.01  -0.90  -0.34   0.23   1.39  2844 1.00
eta[6]  -0.23    0.02 0.87  -1.95  -0.80  -0.23   0.35   1.51  3293 1.00
eta[7]   0.34    0.02 0.84  -1.38  -0.21   0.35   0.89   1.98  2775 1.00
eta[8]   0.07    0.02 0.92  -1.73  -0.54   0.09   0.69   1.92  3580 1.00
lp__   -39.39    0.10 2.58 -45.21 -40.94 -39.13 -37.53 -35.06   689 1.01

Samples were drawn using NUTS(diag_e) at Tue Dec  6 16:03:27 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).


I'll need to do some tweaking to the code to get the sampling time (but isn't that, in essence, the chain times?)
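For what it's worth, rstan can report sampling time separately from compilation: compile once with `stan_model()`, then draw with `sampling()`, and `get_elapsed_time()` gives per-chain warmup and sampling seconds. A sketch (the file and data names are illustrative):

```r
library(rstan)

# Time compilation on its own (file name is hypothetical)
compile_time <- system.time(
  mod <- stan_model("8schools.stan")
)["elapsed"]

# Sampling reuses the compiled model, so this captures only the chains
fit <- sampling(mod, data = schools_dat, chains = 4, iter = 2000)

# Per-chain warmup and sampling seconds, excluding compilation
get_elapsed_time(fit)
```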

I didn't run the benchmarks from R. That was Daniel Falster.

On the iMac I'm using -O3 optimisation. I couldn't find ~/.R/Makevars on the docker container... interesting.
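A minimal sketch of the fix for the container, assuming a Debian-based image like rocker's; `-O3` is the important flag, the others are judgment calls you may want to adjust:

```shell
# Create a user-level Makevars so R compiles Stan models with -O3.
# Flags beyond -O3 are assumptions; tune them for your base image.
mkdir -p ~/.R
cat > ~/.R/Makevars <<'EOF'
CXXFLAGS = -O3 -pipe -Wno-unused
EOF
cat ~/.R/Makevars
```

In a Dockerfile the same lines can be baked in with a single RUN instruction, so every rstan compile inside the container picks up the flags.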

Daniel Lee

unread,
Dec 6, 2016, 12:42:57 AM12/6/16
to stan-users mailing list
Yup.

Looks like there are differences between the runs. (That's not a statement about whether it's good or bad, just a fact.)
 

I didn't run the benchmarks from R. That was Daniel Falster.

Ah. Sorry about that -- too much activity on our mailing list. It makes it difficult to keep everything together.

 

On the iMac I'm using -O3 optimisation. I couldn't find ~/.R/Makevars on the docker container... interesting.

That could play a large part if you're not running O3.



Daniel

Ben Goodrich

unread,
Dec 6, 2016, 12:45:08 AM12/6/16
to Stan users mailing list
On Tuesday, December 6, 2016 at 12:37:41 AM UTC-5, James Camac wrote:
On the iMac I'm using -O3 optimisation. I couldn't find ~/.R/Makevars on the docker container... interesting.

James Camac

unread,
Dec 6, 2016, 1:12:41 AM12/6/16
to Stan users mailing list
This is looking very, very promising.
Docker chain speeds for 8 schools are now similar to what I've been getting locally.

 Elapsed Time: 0.019911 seconds (Warm-up)
               0.0188 seconds (Sampling)
               0.038711 seconds (Total)

 Elapsed Time: 0.020201 seconds (Warm-up)
               0.015754 seconds (Sampling)
               0.035955 seconds (Total)

 Elapsed Time: 0.019631 seconds (Warm-up)
               0.030389 seconds (Sampling)
               0.05002 seconds (Total)

 Elapsed Time: 0.020383 seconds (Warm-up)
               0.022288 seconds (Sampling)
               0.042671 seconds (Total)



James Camac

unread,
Dec 6, 2016, 4:08:50 AM12/6/16
to Stan users mailing list
Hi, I've now rerun those tests and the optimisation setting was definitely the problem. After setting the optimisation flags and using clang instead of g++, I'm getting pretty much identical results now.
I can't believe I forgot about that optimisation!

Daniel Falster

unread,
Dec 6, 2016, 5:55:04 PM12/6/16
to Stan users mailing list
Wow, great detective work. I can confirm that the other project where I experienced a slowdown running rstan in docker was also missing the custom ~/.R/Makevars configuration.