Looping in STAN vs. looping in R (20-fold faster)


Steve

May 16, 2013, 2:48:18 PM
to stan-...@googlegroups.com
In preparation for running a hierarchical model in STAN, we performed a timing test of two approaches. The first called STAN from R, passed a 1000x63 matrix to STAN, looped in STAN over the i = 1, ..., 1,000 samples (rows), with each sample consisting of 63 observations (columns) from a N(mu_i=0, sigma_i=1) distribution, and estimated mu_i and sigma_i. The second approach looped in R over the i = 1, ..., 1,000 samples and called STAN 1,000 times, once for each sample (row), passing STAN a vector of 63 observations to estimate mu_i and sigma_i.

The second approach - looping in R - was 20 times as fast as the first approach - looping in STAN.

This is a problem because we plan to use the first approach to implement a hierarchical model on data from 16,000 samples, each with 63 observations. Implementing the hierarchical model does not seem possible with the second approach. We are hoping someone can point out why our STAN code for the first approach is inefficient, and what STAN code can achieve the same speed (or better) as the second approach, so that a hierarchical model can be fitted for 16,000 mu_i, sigma_i in reasonable time.

Thanks!

###########################################################################################
# Filename: timing_20130516
# Use: Compare timing between looping in R and STAN using simple model/data
#
# Updates:
# 20130516 DJR Initial code created
###########################################################################################

######################################################
# PACKAGES
######################################################
library(rstan)

##############################################
# Data
##############################################
set.seed(444)
mat = matrix(rnorm(63*1000,0,1),ncol=63)
r = nrow(mat)
c = ncol(mat)

######################################################
# STAN models
# stan_model1: takes in matrix, est mu/sigma for each row
# stan_model2: takes in vector, est mu/sigma for vector
######################################################

stan_model1 <- '
data {
  int R;                   // number of rows
  int C;                   // number of columns
  real mat[R,C];           // matrix of outcomes
}
parameters {
  real mu[R];              // vector of locations
  real<lower=0> sigma[R];  // vector of scales
}
model {
  for (m in 1:R) {
    mat[m] ~ normal(mu[m], sigma[m]);
  }
}
'

stan_model2 <- '
data {
  int<lower=0> C;
  real vec[C];             // vector of outcomes
}
parameters {
  real mu;                 // location
  real<lower=0> sigma;     // scale
}
model {
  vec ~ normal(mu, sigma);
}
'

###########################################################
# TRANSLATE AND COMPILE STAN MODEL
###########################################################
compiled1 <- stan_model(model_code = stan_model1)
compiled2 <- stan_model(model_code = stan_model2)

#####################################
# METHOD 1: Loop in Stan 
#####################################

######################################################
# Assign data
######################################################
stan_data1 = list(mat = mat, R = r, C = c)
######################################################
# Run data through the model
######################################################
start = proc.time()
fit <- sampling(object=compiled1, data=stan_data1, iter=1000, chains=4, seed = 444)
time1 = proc.time() - start

#####################################
# METHOD 2: Loop in R 
#####################################
start = proc.time()
for (m in 1:r) {

######################################################
# Assign data for current marker
######################################################

stan_data2 = list(vec = mat[m,], C = c)  # index by the loop variable m, not r
######################################################
# Run data through the model
######################################################

fit2 <- sampling(object=compiled2, data=stan_data2, iter=1000, chains=4, seed = 444)
#################
# END FIRST LOOP
#################
}
time2 = proc.time() - start

#####################################
# Compare times between methods
#####################################

print(time1)
## Loop in Stan
##   user  system elapsed 
##  191.09    3.70  196.22 

print(time2)
## Loop in R
##   user  system elapsed 
##  37.00    0.19   37.85

Ben Goodrich

May 16, 2013, 3:19:36 PM
to stan-...@googlegroups.com
On Thursday, May 16, 2013 2:48:18 PM UTC-4, Steve wrote:
We are hoping someone can point out why our STAN code for the first approach is inefficient, and what STAN code can be used to achieve the same speed (or better) as the second approach, so that a hierarchical model can be fitted for 16,000 mu_i, sigma_i in reasonable time.

Part of it is using a plain two dimensional array for the data
 
stan_model1 <- '
data {
int R; //number of rows
int C; //number of columns
  real mat[R,C]; //matrix of outcomes
}

Instead you can use an array of vectors

vector[C] mat[R];

but make sure it gets passed in from R correctly.
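A minimal sketch of the revised data block (assuming the parameters and model blocks stay as in stan_model1; mat[m] then selects an existing vector instead of copying a row out of a 2-D array):

```stan
data {
  int R;             // number of rows
  int C;             // number of columns
  vector[C] mat[R];  // array of R vectors, one per row of outcomes
}
```

In rstan, passing the same 1000x63 R matrix in the data list should map onto this declaration (the matrix's first dimension becomes the array index), but it's worth verifying the orientation on a small example first.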

We've been planning for a long time to support some form of

mat ~ normal(mu,sigma);

where mu and sigma were scalars but we haven't gotten it in yet. And this case where the LHS is two-dimensional while mu and sigma are vectors is even more tricky to support in a robust way.

Ben


Bob Carpenter

May 16, 2013, 3:28:10 PM
to stan-...@googlegroups.com
I'm currently testing exactly this suggestion and have
it in a pending reply.

What machine were those timing results run on? The all-in-one model
seems to be running much slower than 3 minutes on my Mac.

Just to be clear, it's not the loops that are slow, per se,
it's the copying around of data by accessing rows of the
matrix (plus one other thing I can think of).

There's another issue which may be tricky here, which is
how we do adaptation of per-parameter step sizes. Because this
adaptation is normalized, there's extra overhead here
and the sampling may not be as efficient.

Definitely something we need to look into.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Steve

May 16, 2013, 5:02:02 PM
to stan-...@googlegroups.com
Thanks. Will try this approach and report on timing.

Steve

May 16, 2013, 5:09:40 PM
to stan-...@googlegroups.com
Bob:

Here is the sessionInfo() output.

We will try Ben's approach and report on the timing, but my guess is that being able to use a sampling statement such as:

         mat ~ normal (mu, sigma)    #  where mu and sigma are vectors of length nrow, mat is a matrix nrow x ncol

would speed up cpu time considerably. 
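In the meantime, one possible workaround (a sketch, not tested against Stan 1.3) is to store the data as an array of C columns, each a vector of length R. Each sampling statement is then vectorized over all R rows at once, and data access is column-major friendly:

```stan
data {
  int R;
  int C;
  vector[R] col[C];              // data as an array of C columns
}
parameters {
  vector[R] mu;
  vector<lower=0>[R] sigma;
}
model {
  for (n in 1:C)
    col[n] ~ normal(mu, sigma);  // elementwise over the R rows
}
```

with the data passed from R as col = t(mat), so the loop runs C = 63 times instead of R = 1,000.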

Thanks for all the team's work on STAN. 

Steve

> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: i386-w64-mingw32/i386 (32-bit)
 
locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252   
 
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base   

Bob Carpenter

May 16, 2013, 5:40:30 PM
to stan-...@googlegroups.com
What I had queued up to respond is below. It expands on
my earlier comment. I'm running the test now in another window
and will report when it's done.


On 5/16/13 2:48 PM, Steve wrote:
> In preparation for running a hierarchical model in STAN, we performed a timing test of two approaches. The first called
> STAN from R, passed a matrix of 1000x63 to STAN, looped in STAN over i=1,...,1,000 samples (rows), with each sample
> consisting of 63 observations from a N(mu_i=0, sigma_i=1) distribution (columns), and estimated mu_i and sigma_i. The
> second approach looped in R over i = 1, ...,1,000 samples and called STAN 1,000 times, once for each sample (row),
> passing to STAN a vector with 63 observations to estimate mu_i and sigma_i.
>
> The second approach - looping in R - was 20 times as fast as the first approach - looping in STAN.
>
> This is a problem because we plan to use the first approach to implement a hierarchical model on data on 16,000 samples
> each with 63 observations. Implementing the hierarchical model does not seem possible with the second approach. We are
> hoping someone can point out why our STAN code for the first approach is inefficient, and what STAN code can be used to
> achieve the same speed (or better) as the second approach, so that a hierarchical model can be fitted for 16,000 mu_i,
> sigma_i in reasonable time.

Thanks -- this is useful to know and a great example --- I'm
going to add a discussion to the optimization chapter of
the programming part.

Please let me start with a disclaimer for those who might
not read on: we're not talking about looping
per se, which is MUCH faster in Stan than in R (because Stan
compiles down to C loops), but the issue of whether fitting a big joint
model of independent parameters is faster than fitting many individual
models.

Here's the slower joint model:

> stan_model1 <- '
> data {
> int R; //number of rows
> int C; //number of columns
> real mat[R,C]; //matrix of outcomes
> }
> parameters {
> real mu[R]; //vector of locations
> real<lower=0> sigma[R]; //vector of scales
> }
> model {
> for (m in 1:R) {
> mat[m] ~ normal(mu[m],sigma[m]);
> }
> }
> '


The problem with the above model for speed is this
declaration:

> real mat[R,C]; //matrix of outcomes

coupled with this usage in the model:

> for (m in 1:R)
> mat[m] ~ normal(mu[m],sigma[m]);

There are two key issues:

1. The call to mat[m] creates a new short-lived vector object
which requires both a memory allocation and deallocation in every
iteration through the loop, which is much much slower than
arithmetic.

2. Because mat[m] is a row vector, the copy is very inefficient
because matrices are stored in column-major order. If R is
big, every element of mat[m] is going to cause a cache miss,
which is very expensive.

Instead, I would suggest declaring mat as follows:

row_vector[C] mat[R];

Then mat[m] just refers to an existing data structure in the
array of vectors so no copying is needed.

The reason you're seeing such a stark difference is because
this is about all this model's doing! Every other operation in
the model is pretty simple arithmetic.

- Bob

Steve

May 16, 2013, 8:04:24 PM
to stan-...@googlegroups.com
Bob:

Thanks for these insights. I have implemented this approach and it is slower than the previous two approaches. However, I ran this session on my laptop at home (MacBook Air), whereas the results in the first post were run on a desktop PC at work. The times for the three approaches on MacBook Air are:

        user.self sys.self elapsed
time1     126.676    0.974 142.138
time2      82.337    1.498  97.104
time3     162.242    1.901 181.235

The time1 vs. time2 sys.self comparison is reversed and no longer substantially different, and time3 is longer than both time1 and time2.

So I'm a little confused.....

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_1.3.0   inline_0.3.11 Rcpp_0.10.3  

loaded via a namespace (and not attached):
[1] codetools_0.2-8 tools_3.0.0    

stan_model3 <- '
data {
  int R;                   // number of rows
  int C;                   // number of columns
  row_vector[C] mat[R];    // matrix of outcomes, one row vector per row
}
parameters {
  real mu[R];              // vector of locations
  real<lower=0> sigma[R];  // vector of scales
}
model {
  for (m in 1:R) {
    mat[m] ~ normal(mu[m], sigma[m]);
  }
}
'
###########################################################
# TRANSLATE AND COMPILE STAN MODEL
###########################################################
compiled1 <- stan_model(model_code = stan_model1)
compiled2 <- stan_model(model_code = stan_model2)
compiled3 <- stan_model(model_code = stan_model3)

#####################################
# METHOD 3: Loop in Stan using row vectors
#####################################

######################################################
# Assign data
######################################################
stan_data3 = list(mat = mat, R = r, C = c)
######################################################
# Run data through the model
######################################################
start = proc.time()
fit <- sampling(object=compiled3, data=stan_data3, iter=1000, chains=4, seed = 444)
time3 = proc.time() - start





Steve

May 16, 2013, 8:44:45 PM
to stan-...@googlegroups.com
Ran the three methods again on the MacBook Air, all within the same command and this time without compiling method 3 in between, and obtained results consistent with before for user time and elapsed time, but different for sys time.

        user.self sys.self elapsed
time1     124.473    0.999 143.861   real mat[R,C]          big joint model of independent parameters
time2      82.162    1.902  97.861   real vec[C]            many individual models (via looping in R)
time3     161.069    1.288 182.942   row_vector[C] mat[R]   big joint model of independent parameters

I'm surprised that sys time gives a different ordering (time1, time3, time2) than user time (time2, time1, time3).
The previous run gave the sys time ordering (time1, time2, time3); in that run I compiled method 3 after running methods 1 & 2, then ran method 3.



Bob Carpenter

May 17, 2013, 1:11:24 AM
to stan-...@googlegroups.com
I got rather different results, more in line with what I was
expecting for the relative performance of the different approaches.

I'm running Stan 1.3.0 on R 3.0.0 on a Macbook Pro
with 16GB of memory and an i7 CPU. I just freshly
installed both because I was behind a revision or two.
I then edited my R Makeconf file here:

/Library/Frameworks/R.framework/Resources/etc/Makeconf

to set optimization level to -O3 in CFLAGS and CXXFLAGS
lines. Unlike before, it now says -mtune=core2, which I
left (but I have no idea whether I should have or not).

I also have

> set_cppo('fast')

set.

I then did two runs, first the script exactly as you sent it,
which used a matrix for the data.

RUN 1

matrix[R,C] mat;

loop over rows of matrix
user system elapsed
83.624 0.306 84.049

loop in R
user system elapsed
45.188 0.874 47.164

The values are much closer together than you had, with the
looping over data sets a little less than twice as fast.

Then I closed R and restarted it to make sure any comparison
issues weren't due to R's memory management. Then I did a second
run using an array of row vectors:

RUN 2

row_vector[C] mat[R]; // array of vector outcomes

loop over array of row vectors
user system elapsed
117.639 0.390 117.746

loop in R
user system elapsed
46.949 0.759 48.942

As I was expecting, the matrix version took about 50% longer
compared to the array of row vectors when accessed by row.

I had noticed that a lot of the time was being spent on
a very slow chain --- chain 3 with the original seed. So I
just mashed the numberline to generate a new seed

seed = 9879087987

and got more comparable results:

loop over array of row vectors
user system elapsed
63.450 0.243 63.476

loop in R
user system elapsed
45.495 0.875 47.609

I'm really surprised the array of row vectors was slower for
you.

It may be R's memory. So you might want to try only
running one of these tests per R session.

Do you know which optimization levels you compiled the Stan
libraries and Stan models with? The default optimization level
is -O2 if you didn't fiddle with it.

To answer the other question, from R help for proc.time:

The definition of 'user' and 'system' times is from your OS. Typically it is something like:
The 'user time' is the CPU time charged for the execution of user instructions of the calling process.
The 'system time' is the CPU time charged for execution by the system on behalf of the calling process.

So it looks like the values can be system dependent.

The doc also says the third column is:

the 'real' elapsed time since the process was started

Thanks again for bringing this up. I'm going to add a discussion
of matrices vs. arrays to the optimization chapter of the manual.

- Bob

Steve

May 17, 2013, 1:18:31 PM
to stan-...@googlegroups.com
Bob:

Thanks - this is all helpful, although puzzling on some points.

  1. Your times indicate that the matrix form is faster than the row_vectors form (like the results we posted yesterday), while your conclusion is that the row_vectors form speeds up the computation. Maybe I'm misinterpreting the proc.time output.  
  2. The time we care about is "elapsed time" - time it takes to obtain the results. It is surprising that sometimes the ordering given by elapsed time and by system CPU time are different, and not just by small margins.
  3. The Rstan manual has now omitted changing the makefile to set flags to -O3, and I understood the reason was that using the R command set_cppo("fast") achieved the same optimization in C++ compilation. Does editing Makeconf optimize C++ compilation further than solely using set_cppo("fast")?
Our timing from yesterday was based on R 2.15. We upgraded to R 3.0 and reinstalled STAN and Rstan. This is on a PC with a 3.46 GHz Intel Core Duo CPU. The new timing is:

        user.self sys.self elapsed
time1      105.50     3.53  109.26   real mat[R,C]          big joint model of independent parameters
time2       37.73   161.17  199.30   real vec[C]            many individual models via looping in R
time3       68.32     2.80   71.41   row_vector[C] mat[R]   big joint model of independent parameters

So this run exhibits the reduction in timing for row_vectors you originally suggested, about a 35% reduction. What is puzzling about this run is that the looping in R 3.0 now takes 199 for elapsed time instead of 37 in R 2.15.

Given these results, we plan on implementing the hierarchical model with the row_vector data structure.


Thanks for all your work on this issue. Looking forward to your discussion of it in the next version of the manual.


Steve





Bob Carpenter

May 17, 2013, 2:01:10 PM
to stan-...@googlegroups.com


On 5/17/13 1:18 PM, Steve wrote:
> Bob:
>
> Thanks - this is all helpful, although puzzling on some points.
>
> 1. Your times indicate that the matrix form is faster than the row_vectors form (like the results we posted yesterday),
> while your conclusion is that the row_vectors form speeds up the computation. Maybe I'm misinterpreting the
> proc.time output.

I mislabeled the results --- it was the other way around.
I reran below to make sure.

The array of vectors was faster than matrix.

> 2. The time we care about is "elapsed time" - time it takes to obtain the results. It is surprising that sometimes the
> ordering given by elapsed time and by system CPU time are different, and not just by small margins.

I think part of the system time may be waiting time for
access to things like file handles, which is going to be
impacted by other things going on in the system.

> 3. The Rstan manual has now omitted changing the makefile to set flags to -O3, and I understood the reason was that
> using the R command set_cppo("fast") achieved the same optimization in C++ compilation. Does editing Makeconf
> optimize C++ compilation further than solely using set_cppo("fast")?

I'm still trying to work this out myself. The key is that
you want to compile the rstan binaries with -O3 optimization.
We'll fix the instructions over the next week to make this
clearer.

> Our timing from yesterday was based on R 2.15. We upgraded to R 3.0, and reinstalled STAN and Rstan. This is on a PC
> with 3.46GB Intel DUO CPU. The new timing is:
>
> user.self sys.self elapsed
> time1 105.50 3.53 109.26 real mat[R,C] big joint model of independent parameters
> time2 37.73 161.17 199.30 real vec[C] many individual models via looping in R
> time3 68.32 2.80 71.41 row_vector[C] mat[R] big joint model of independent parameters
>
> So this run exhibits the reduction in timing for row_vectors you originally suggested, about a 35% reduction.

Sorry -- I mislabeled the output, because I'd done them
in the opposite order. I just reran all the tests to be sure.

Original seed=444
user system elapsed
116.242 0.303 116.397 loop over rows of matrix
81.561 0.274 81.581 loop over array of row vectors
49.315 0.625 50.781 loop in R

With a different seed, we're actually faster than looping in
R:

New seed=9879087987 (seed = 444 still used in R for data)
user system elapsed
63.034 0.517 63.575 loop over rows of matrix
46.661 1.034 48.728 loop in R
44.313 0.188 44.525 loop over array of row vectors

The looping in R is more stably estimated because it's essentially
an average of 1000 independent trials, one for each row.

This also highlights how much variability there is in Stan's
implementation of NUTS and warmup adaptation. We're in the process
of figuring out how to quantify all the timing and variability issues
across a large number of models.

I think the user time here is measuring the work that's actually
being done by the code locally. System time's going to include time
to get memory or file handles from the system. And probably
process management time.

The other big issue here in profiling is that what we really care
about is effective sample size per second, not number of iterations.
In this case, the two models are equal, so we don't need to adjust
for that, because the samples and hence effective sample sizes should
be the same.
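
A sketch of that calculation in rstan (assuming fit and time1 from the script above; n_eff is a column of the fit summary):

```r
ess <- summary(fit)$summary[, "n_eff"]    # n_eff per parameter (plus lp__)
ess_per_second <- ess / time1["elapsed"]  # normalize by wall-clock seconds
print(sort(ess_per_second)[1:5])          # the worst-mixing parameters
```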

> What is
> puzzling about this run is that the looping in R 3.0 now takes 199 for elapsed time instead of 37 in R 2.15.

I can imagine the system could get bogged down if there
are lots of open file handles or something --- I really don't
know what R does on the back end. You should try it again with
it the only thing running and other processes shut down.

But this is also one of the reasons why it's tough to measure
software performance. What can work well under low system load
(either file handles, threads, processes, whatever) might not work
well under high system load. In general, it's better to pool/batch
system requests as much as possible because they tend to be really
slow and scarce compared to executing code.

> Given these results, we plan on implementing the hierarchical model with the row_vector data structure.

That's what I'd recommend.

> Thanks for all your work on this issue. Looking forward to your discussion of it in the next version of the manual.

The next release is going to be a big one.

- Bob

Jiqiang Guo

May 17, 2013, 2:09:52 PM
to stan-...@googlegroups.com
Using set_cppo('fast') is enough.  If you are not sure whether -O3 was used for your currently installed rstan, you can reinstall rstan
after calling set_cppo('fast').  But if you have never used set_cppo('debug'), I am sure your rstan was installed with -O3.
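
As a sketch, that sequence in R would be (set_cppo() is the rstan 1.x interface; it records the flag for subsequent C++ compilation):

```r
library(rstan)
set_cppo("fast")  # request -O3 for C++ compilation of models
# then reinstall rstan from source so the package itself is built with -O3
```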

--
Jiqiang 

Bob Carpenter

May 17, 2013, 4:01:35 PM
to stan-...@googlegroups.com
The default for R in their binary make config is -O2.
Or at least that's what it looks like.

So if you have never installed Stan and install it, don't
you get -O2?

- Bob

On 5/17/13 2:09 PM, Jiqiang Guo wrote:
> Using set_cppo('fast') is enough. If you are not sure about if O3 was used with your current installed rstan, you can
> reinstall rstan
> after calling set_cppo('fast'). But if you have never used set_cppo('debug'), I am sure you rstan was installed with O3.
>
> --
> Jiqiang
>
> On Fri, May 17, 2013 at 1:18 PM, Steve <ssk...@gmail.com <mailto:ssk...@gmail.com>> wrote:
>
> The Rstan manual has now omitted changing the makefile to set flags to -O3, and I understood the reason was that
> using the R command set_cppo("fast") achieved the same optimization in C++ compilation. Does editing Makeconf
> optimize C++ compilation further than solely using set_cppo("fast")?
>
>
>

Ben Goodrich

May 17, 2013, 4:18:34 PM
to stan-...@googlegroups.com
On Friday, May 17, 2013 4:01:35 PM UTC-4, Bob Carpenter wrote:
The default for R in their binary make config is -O2.
Or at least that's what it looks like.

So if you have never installed Stan and install it, don't
you get -O2?

No, that optimization can be overwritten by each package:

git grep -n -F "O3" rstan/ | cat
Binary file rstan/rstan/R/sysdata.rda matches
rstan/rstan/man/makeconf_path.Rd:26:  Stan models, increase the optimization level to \code{-O3} defined
rstan/rstan/man/stanmodel-class.Rd:51:      used for compiling the model. The returned string is like \code{CXXFLAGS = -O3}.}
rstan/rstan/src/Makefile:25:CXXFLAGS = -O3 $(LTO)
rstan/rstan/src/Makefile.win:33:CXXFLAGS = -O3

unless that is overwritten by the configuration files in $HOME/.R/Makevars or the system's Makeconf (which can themselves be overridden using the new facilities in R 3.0.1).

Ben
