Process R killed--likely too much memory


Ross Boylan

Sep 12, 2013, 8:34:39 PM
to stan-...@googlegroups.com
Summary: While estimating a model, memory use increased steadily,
resulting in a dead job or unresponsive system around 8G. Anything I
can do about that?

I ran a model for 200 iterations and 1 chain; it finished in about 65
seconds.
> system.time(r <- stan(model_code = stan_model, data=stanDat,
+ iter=200, chains=1))

Then I used the model compiled at the previous step to do something more:
system.time(r2 <- stan(fit=r, data=stanDat, iter=2000, chains=3))

That ran for a few minutes and then I got "Process R killed".

I rebuilt without optimization (debugging level), as suggested in the
FAQ. The job took steadily increasing amounts of memory. I killed it
when it was just over 8G and my system became unresponsive.

Is this typical, or is something wrong with my model or the underlying
code (like a memory leak)? The problem doesn't seem that big: 15
parameters, ~2,500 observations with 3 variables each. It's a
multinomial model. I added the values to the comments.

I'm using rstan from git as of this morning. The model is
data {
  int<lower=0> N;                   // number of partners ~2,500
  int<lower=2> K;                   // number of types of partner race 5
                                    // K is the number of categories in the outcome pEthnic
  int<lower=0> J;                   // number of covariates 3
  int<lower=1,upper=K> pEthnic[N];  // partner race
  //int<lower=0> NCLUSTER;          // number of respondents
  matrix[N, J] x;                   // covariates
}
parameters {
  matrix[J, K] beta;
}
model {
  vector[K] beta_x;
  for (ir in 1:J)
    beta[ir] ~ normal(0, 5);
  for (n in 1:N) {
    beta_x <- (x[n] * beta)';
    pEthnic[n] ~ categorical(softmax(beta_x));
  }
}

The model has a number of oddities, at least when viewed from my
frequentist experience:
1. no intercept
2. overdetermined outcomes. The outcome is one of 5 categories; all 5
are modeled.
3. overdetermined covariates. Each observation is one of 3 categories;
each category has a dummy variable.
Other example models I looked at seemed similarly over-determined, so I
thought maybe this was OK (though the example I copied from did have an
intercept).

I based this on the "Speeding Up Multinomial Logistic Regressions in
Stan" thread started April 19. I tried to follow Bob's advice about
using vector and matrix arithmetic when possible, but my efforts to
imitate the code revisions he gave were not successful. The manual
suggests (x[n] * beta)' is not the most efficient way to go. (I skipped
the stuff about beta_raw to simplify).

Any tips or advice would be great.

Running under R 2.15.1, Debian wheezy, amd64 (64-bit). This is without any of
my efforts to use system boost or eigen headers. I refreshed rstan (and
the stan subproject) and built the tarball from the upper level
makefile. Then I did an R CMD INSTALL on that tarball, using a local
path for installation.

Ross Boylan

P.S. Time for the initial 200 iterations was 65 seconds with
optimization, 763 seconds without optimization. Wow!

Bob Carpenter

Sep 12, 2013, 10:57:40 PM
to stan-...@googlegroups.com
On 9/12/13 8:34 PM, Ross Boylan wrote:
> Summary: While estimating a model, memory use increased steadily, resulting in a dead job or unresponsive system around
> 8G. Anything I can do about that?

Running from the command line, models should allocate
memory before the first iteration and never need more
memory than the amount used after the first iteration.
If it's going up otherwise, it may be a signal of a memory
leak.

Your parameter is a J x K matrix and you'll need enough memory in
R to store the number of samples times a J x K matrix times 8 bytes.
At minimum (not accounting for R overhead), this will require

2000 iterations/chain
* 3 chains
* 8 bytes/double
* 1 double/parameter
* J * K parameters/iteration

= 2000 * 3 * 8 * J * K bytes
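
With J = 3 and K = 5 from the model's comments, a quick check in R:

2000 * 3 * 8 * 3 * 5   # = 720,000 bytes, well under 1 MB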

If that's a lot less than 8 GB, it may be the sign of a problem.

> I ran a model for 200 iterations and 1 chain; it finished in about 65 seconds.
> > system.time(r <- stan(model_code = stan_model, data=stanDat,
> + iter=200, chains=1))

Can you do that for 4 chains and if so, does it produce enough
effective samples? If so, you may not need to run longer.

If the memory's too big or there's too much correlation in the
samples, you can thin the output, which should cut down on memory
usage.
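
For example, a sketch using rstan's thin argument (with iter = 2000 the first
1000 iterations are warmup by default, so this keeps 100 draws per chain):

r2 <- stan(fit = r, data = stanDat, iter = 2000, chains = 3, thin = 10)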

> Then I used the model compiled at the previous step to do something more:
> system.time(r2 <- stan(fit=r, data=stanDat, iter=2000, chains=3))
>
> That ran for a few minutes and then I got "Process R killed".
>
> I rebuilt without optimization (debugging level), as suggested in the FAQ. The job took steadily increasing amounts of
> memory. I killed it when it was just over 8G and my system became unresponsive.

I have no idea what this does w.r.t. R, but just at the C++
level, the optimization level should have only a minimal impact on
memory usage.

> Is this typical, or is something wrong with my model or the underlying code (like a memory leak)? The problem doesn't
> seem that big: 15 parameters, ~2,500 observations with 3 variables each. It's a multinomial model. I added the values to
> the comments.
>
> I'm using rstan from git as of this morning.

Very brave of you! We're trying to keep the development branches
of RStan and Stan in working order.

> The model is
> data {
> int<lower=0> N; // number of partners ~2,500
> int<lower=2> K; // number of types of partner race 5
> // K is the number of categories in the outcome pEthnic
> int<lower=0> J; //number of covariates 3
> int<lower=1,upper=K> pEthnic[N]; // partner race
> //int<lower=0> NCLUSTER; //number of respondents
> matrix[N, J] x; // covariates
> }
> parameters {
> matrix[J, K] beta;
> }
> model {
> vector[K] beta_x;
> for(ir in 1:J)
> beta[ir] ~ normal(0, 5);

Replace this loop with the vectorized form:

beta ~ normal(0,5);

Is 5 a reasonable prior scale here in that you expect
the posterior for beta to be in the [-10,10] range or so?

> for(n in 1:N) {
> beta_x <- (x[n] * beta)';
> pEthnic[n] ~ categorical(softmax(beta_x));
> }
> }

Stan 2 (the current dev branch) will let you do this:

pEthnic[n] ~ categorical_logit(beta_x);

It should be a bit more efficient than categorical(softmax(beta_x)).


It'll also be more efficient if the covariate type is:

row_vector[J] x[N]; // covariates

Then there's no memory allocation for x[n].
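
Putting the two suggestions together, a sketch of the revised blocks (the
prior loop on beta is kept as in the original model):

data {
  ...
  row_vector[J] x[N];  // covariates, one row vector per observation
}
model {
  for (ir in 1:J)
    beta[ir] ~ normal(0, 5);
  for (n in 1:N)
    pEthnic[n] ~ categorical_logit((x[n] * beta)');  // row_vector[J] * matrix[J,K] -> row_vector[K]
}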


> The model has a number of oddities, at least when viewed from my frequentist experience:
> 1. no intercept
> 2. overdetermined outcomes. The outcome is one of 5 categories; all 5 are modeled.
> 3. overdetermined covariates. Each observation is one of 3 categories; each category has a dummy variable.
> Other example models I looked at seemed similarly over-determined, so I thought maybe this was OK (though the example I
> copied from did have an intercept).

None of this should matter for memory usage.

> I based this on the "Speeding Up Multinomial Logistic Regressions in Stan" thread started April 19. I tried to follow
> Bob's advice about using vector and matrix arithmetic when possible, but my efforts to imitate the code revisions he
> gave were not successful. The manual suggests (x[n] * beta)' is not the most efficient way to go. (I skipped the stuff
> about beta_raw to simplify).

If you have specific questions here, we're happy to help.

> Any tips or advice would be great.
>
> Running under R 2.15.1, Debian wheezy, amd64 (64-bit). This is without any of my efforts to use system boost or eigen
> headers. I refreshed rstan (and the stan subproject) and built the tarball from the upper level makefile. Then I did
> an R CMD INSTALL on that tarball, using a local path for installation.

Probably better if you stick to our headers for now.

Presumably the R CMD INSTALL is configured with optimization, too.

>
> Ross Boylan
>
> P.S. Time for the initial 200 iterations was 65 seconds with optimization, 763 seconds without optimization. Wow!

We built the code to be optimized --- that is, we didn't
try to build code that would be efficient if it wasn't
compiled with optimization.

- Bob

Ross Boylan

Sep 13, 2013, 1:22:56 AM
to stan-...@googlegroups.com
On Thu, 2013-09-12 at 22:57 -0400, Bob Carpenter wrote:
> On 9/12/13 8:34 PM, Ross Boylan wrote:
> > Summary: While estimating a model, memory use increased steadily, resulting in a dead job or unresponsive system around
> > 8G. Anything I can do about that?
>
> Running from the command line, models should allocate
> memory before the first iteration and never need more
> memory than the amount used after the first iteration.
> If it's going up otherwise, it may be a signal of a memory
> leak.
Would it be any different if run from R?

>
> Your parameter is a J x K matrix and you'll need enough memory in
> R to store the number of samples times a J x K matrix times 8 bytes.
> At minimum (not accounting for R overhead), this will require
>
> 2000 iterations/chain
> * 3 chains
> * 8 bytes/double
> * 1 double/parameter
> * J * K parameters/iteration
>
> = 2000 * 3 * 8 * J * K bytes
= 720,000 bytes << 8G

>
> If that's a lot less than 8 GB, it may be the sign of a problem.
>
> > I ran a model for 200 iterations and 1 chain; it finished in about 65 seconds.
> > > system.time(r <- stan(model_code = stan_model, data=stanDat,
> > + iter=200, chains=1))
>
> Can you do that for 4 chains and if so, does it produce enough
> effective samples? If so, you may not need to run longer.
Effective sample sizes for the 1 chain ranged from 10 through 22; even
times 4 that's pretty low. Plus 100 burn-in is not large either, though
Rhat wasn't bad (1.1 or less).
>
> If the memory's too big or there's too much correlation in the
> samples, you can thin the output, which should cut down on memory
> usage.
That would help if the problem were saving the results, but it appears
to be something else.
>
> > Then I used the model compiled at the previous step to do something more:
> > system.time(r2 <- stan(fit=r, data=stanDat, iter=2000, chains=3))
> >
> > That ran for a few minutes and then I got "Process R killed".
> >
> > I rebuilt without optimization (debugging level), as suggested in the FAQ. The job took steadily increasing amounts of
> > memory. I killed it when it was just over 8G and my system became unresponsive.
>
> I have no idea what this does w.r.t. R, but just at the C++
> level, the optimization level should have only a minimal impact on
> memory usage.
I have no reason to think it made any difference; I just didn't look at
the memory use in the earlier run. And the fact that the debug run was
slower made it easier to watch the progress.
>
> Is 5 a reasonable prior scale here in that you expect
> the posterior for beta to be in the [-10,10] range or so?
I did this partly because the code I copied had it and partly because it
seemed semi-reasonable for the parameters I was estimating. Even a
value as big as 2 or 3 would be surprising.
>
> > for(n in 1:N) {
> > beta_x <- (x[n] * beta)';
> > pEthnic[n] ~ categorical(softmax(beta_x));
> > }
> > }
>
> Stan 2 (the current dev branch), will let you do this:
>
> pEthnic[n] ~ categorical_logit(beta_x);
Cool; I'll try it out. I might as well get some goodies with the risk.
Because of the earlier thread I looked a little in the recent commits
and in the feature/ branches (just the branch names, not the contents)
for this feature, but didn't see it.
>
> It should be a bit more efficient than categorical(softmax(beta_x)).
>
>
> It'll also be more efficient if the covariate type is:
>
> row_vector[J] x[N]; // covariates
Will this affect the order (or type) of the x data I pass in?
>
> Then there's no memory allocation for x[n].
>
>
> > The model has a number of oddities, at least when viewed from my frequentist experience:
> > 1. no intercept
> > 2. overdetermined outcomes. The outcome is one of 5 categories; all 5 are modeled.
> > 3. overdetermined covariates. Each observation is one of 3 categories; each category has a dummy variable.
> > Other example models I looked at seemed similarly over-determined, so I thought maybe this was OK (though the example I
> > copied from did have an intercept).
>
> None of this should matter for memory usage.
That was an implicit question about whether this would cause problems in general.
Also, I figured if it made things go ape it might have something to do
with the memory problem.
>
> > I based this on the "Speeding Up Multinomial Logistic Regressions in Stan" thread started April 19. I tried to follow
> > Bob's advice about using vector and matrix arithmetic when possible, but my efforts to imitate the code revisions he
> > gave were not successful. The manual suggests (x[n] * beta)' is not the most efficient way to go. (I skipped the stuff
> > about beta_raw to simplify).
>
> If you have specific questions here, we're happy to help.
I was wondering what the optimal way to set things up was; I think you
answered that with the suggestion about x. Though I'm a little
confused, since in the earlier thread you suggested making x a matrix.
>
> > Any tips or advice would be great.
> >
> > Running under R 2.15.1, Debian wheezy amd64 bit. This is without any of my efforts to use system boost or eigen
> > headers. I refreshed rstan (and the stan subproject) and built the tarball from the upper level makefile. Then I did
> > an R CMD INSTALL on that tarball, using a local path for installation.
>
> Probably better if you stick to our headers for now.
I figured that was safer, as well as being the easiest thing to do.
>
> Presumably the R CMD INSTALL is configured with optimization, too.
I didn't do anything special, but the generated commands from the
makefile had -O3.

Next steps? valgrind (likely really slow)? Running via free-standing
stan rather than R? For the latter, is there a simple way to write the
data out from R in the necessary format?


>
> >
> > Ross Boylan
> >
> > P.S. Time for the initial 200 iterations with 65 seconds with optimization, 763 seconds without optimization. Wow!
>
> We built the code to be optimized --- that is, we didn't
> try to build code that would be efficient if it wasn't
> compiled with optimization.
>
> - Bob
>

P.S. I was surprised to see in the manual that MSVC can't handle the
templates. Template handling was a weak spot in MSVC when I used it ~ a
decade ago, but I would have expected them to fix it by now. Maybe C#
is getting all the love.


Jiqiang Guo

Sep 13, 2013, 8:46:34 AM
to stan-...@googlegroups.com
I think running Stan from the command line would be the next thing to try. The extra memory RStan needs is for storing the samples for all the iterations of all chains, which in fact is not a big deal in your example.

You can use the dump function in R or stan_rdump in rstan to dump the data for use with Stan from the command line.
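
A sketch, assuming stanDat is the named list of data used earlier in the thread:

library(rstan)
# writes N, K, J, pEthnic, and x in the dump format the command-line binary reads
stan_rdump(names(stanDat), file = "model_data.R", envir = list2env(stanDat))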

--
Jiqiang 





Ross Boylan

Sep 13, 2013, 11:18:50 AM
to stan-...@googlegroups.com
I think there's a leak at the C++ level, associated with
categorical(softmax()).

I reset the optimization level and redid the model with 200 iterations
and 1 chain. Each run added 3.5 to 4G of memory to the R process
permanently. If I understand this report, the memory use is not in R
objects:
> gc(verbose=TRUE)
Garbage collection 53 = 31+8+14 (level 2) ...
26.7 Mbytes of cons cells used (67%)
22.6 Mbytes of vectors used (67%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  498280 26.7     741108 39.6   597831 32.0
Vcells 2950231 22.6    4426304 33.8  3866037 29.5
# and after a second run in the same R session
# see below for model changes for 2nd run
> gc(verbose=TRUE)
Garbage collection 58 = 33+8+17 (level 2) ...
27.0 Mbytes of cons cells used (56%)
24.5 Mbytes of vectors used (63%)
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  504899 27.0     899071 48.1   818163 43.7
Vcells 3200629 24.5    5043999 38.5  3866037 29.5

Another clue: when I switched from
pEthnic[n] ~ categorical(softmax(beta_x));
to
pEthnic[n] ~ categorical_logit(beta_x);
the outrageous memory use went away: the total is just 67k after the run,
which was also much faster (37s).

On Thu, 2013-09-12 at 22:57 -0400, Bob Carpenter wrote:
> Replace this loop with the vectorized form:
>
> beta ~ normal(0,5);
That didn't work:
Error in stanc(file = file, model_code = model_code, model_name = model_name, :

failed to parse Stan model 'stan_model' with error message:
EXPECTATION FAILURE LOCATION: file=input; line=16, column=25

beta ~ normal(0, 5);
                    ^-- here


DIAGNOSTIC(S) FROM PARSER:
no matches for function name="normal_log"
arg 0 type=matrix
arg 1 type=int
arg 2 type=int

I'm guessing that one can't set the prior distribution as a matrix operation (beta is a matrix).

>
> Is 5 a reasonable prior scale here in that you expect
> the posterior for beta to be in the [-10,10] range or so?
>
> > for(n in 1:N) {
> > beta_x <- (x[n] * beta)';
> > pEthnic[n] ~ categorical(softmax(beta_x));
> > }
> > }
>
> Stan 2 (the current dev branch), will let you do this:
>
> pEthnic[n] ~ categorical_logit(beta_x);
>
> It should be a bit more efficient than categorical(softmax(beta_x)).
>
>
> It'll also be more efficient if the covariate type is:
>
> row_vector[J] x[N]; // covariates
>
> Then there's no memory allocation for x[n].
I made this change before the 2nd run. Run time was 58s vs 65 before,
though the elapsed time reported by stan was very slightly higher.


Bob Carpenter

Sep 13, 2013, 1:37:21 PM
to stan-...@googlegroups.com



On 9/13/13 11:18 AM, Ross Boylan wrote:
> I think there's a leak at the C++ level, associated with
> categorical(softmax()).

Thanks for looking into this further.

> I reset the optimization level

To 0 or 3?

> Another clue: when I switched from
> pEthnic[n] ~ categorical(softmax(beta_x));
> to
> pEthnic[n] ~ categorical_logit(beta_x);
> the outrageous memory use went away

Great. I'll track down what's going on with softmax.

There's an allocation in the GitHub version of softmax that
I may not have done right (the trick is integrating with
our arena-based memory allocation that we use for
calculating gradients with automatic differentiation).

More below.

> On Thu, 2013-09-12 at 22:57 -0400, Bob Carpenter wrote:
>> On 9/12/13 8:34 PM, Ross Boylan wrote:

>>> I'm using rstan from git as of this morning.

>> Replace this loop with the vectorized form:
>>
>> beta ~ normal(0,5);
> That didn't work:
> Error in stanc(file = file, model_code = model_code, model_name = model_name, :
>
> failed to parse Stan model 'stan_model' with error message:
> EXPECTATION FAILURE LOCATION: file=input; line=16, column=25
>
> beta ~ normal(0, 5);
> ^-- here
>
>
> DIAGNOSTIC(S) FROM PARSER:
> no matches for function name="normal_log"
> arg 0 type=matrix
> arg 1 type=int
> arg 2 type=int
>
> I'm guessing that one can't set the prior distribution as a matrix operation (beta is a matrix).

Right --- you need to do a loop if beta is a matrix. I
obviously didn't look closely enough at your variable declaration.
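
(Later Stan releases also added a one-statement alternative; a sketch,
assuming your Stan version has to_vector:)

to_vector(beta) ~ normal(0, 5);  // flattens the J x K matrix into a vector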

>> Stan 2 (the current dev branch), will let you do this:
>>
>> pEthnic[n] ~ categorical_logit(beta_x);
>>
>> It should be a bit more efficient than categorical(softmax(beta_x)).
>>
>>
>> It'll also be more efficient if the covariate type is:
>>
>> row_vector[J] x[N]; // covariates
>>
>> Then there's no memory allocation for x[n].

> I made this change before the 2nd run. Run time was 58s vs 65 before,
> though the elapsed time reported by stan was very slightly higher.

Was that for the whole model? We should figure out how to
micro-benchmark some of these functions so we can measure
the functions themselves. So much is context dependent that
it's hard to do well.

- Bob

Ross Boylan

Sep 13, 2013, 1:54:26 PM
to stan-...@googlegroups.com
On Fri, 2013-09-13 at 13:37 -0400, Bob Carpenter wrote:
>
> - Bob
>
> On 9/13/13 11:18 AM, Ross Boylan wrote:
> > I think there's a leak at the C++ level, associated with
> > categorical(softmax()).
>
> Thanks for looking into this further.
>
> > I reset the optimization level
>
> To 0 or 3?
3

Since the bug doesn't seem to depend on the optimization level I figured
I might as well encounter it quickly.
[snip]
> >> Stan 2 (the current dev branch), will let you do this:
> >>
> >> pEthnic[n] ~ categorical_logit(beta_x);
> >>
> >> It should be a bit more efficient than categorical(softmax(beta_x)).
> >>
> >>
> >> It'll also be more efficient if the covariate type is:
> >>
> >> row_vector[J] x[N]; // covariates
> >>
> >> Then there's no memory allocation for x[n].
>
> > I made this change before the 2nd run. Run time was 58s vs 65 before,
> > though the elapsed time reported by stan was very slightly higher.
>
> Was that for the whole model?
Yes, including compiling the model. I'm not sure how micro you are
looking for, but here are the full reports.

First try (x as matrix):
Iteration: 200 / 200 [100%]  (Sampling)
Elapsed Time: 23.18 seconds (Warm-up)
              13.22 seconds (Sampling)
              36.4 seconds (Total)

   user  system elapsed
 58.791   1.348  65.683

Second try (x as array of row vectors):
Iteration: 200 / 200 [100%]  (Sampling)
Elapsed Time: 21.85 seconds (Warm-up)
              14.67 seconds (Sampling)
              36.52 seconds (Total)

   user  system elapsed
 57.651   1.236  58.890

Both used categorical(softmax()).

Actually, it was only wall-clock time that went down.

If I do 200 iterations and throw away the first 100, does (Sampling)
refer to the last 100, or to all 200?

I later did multiple chains in one run; it looked as if stan repeats the
tuning with each chain. Is that how it operates? Is that desirable?

> We should figure out how to
> micro-benchmark some of these functions so we can measure
> the functions themselves. So much is context dependent that
> it's hard to do well.
>
> - Bob

Ross


Bob Carpenter

Sep 13, 2013, 2:00:18 PM
to stan-...@googlegroups.com


On 9/13/13 1:22 AM, Ross Boylan wrote:
> On Thu, 2013-09-12 at 22:57 -0400, Bob Carpenter wrote:
>> On 9/12/13 8:34 PM, Ross Boylan wrote:

>> Can you do that for 4 chains and if so, does it produce enough
>> effective samples? If so, you may not need to run longer.
> Effective sample sizes for the 1 chain ranged from 10 through 22; even
> times 4 that's pretty low. Plus 100 burn-in is not large either, though
> Rhat wasn't bad (1.1 or less).

In our applications we rarely need n_eff > 100, because there's
not enough data or enough trust in the model to support tighter
estimates.
>>
>>> I based this on the "Speeding Up Multinomial Logistic Regressions in Stan" thread started April 19. I tried to follow
>>> Bob's advice about using vector and matrix arithmetic when possible, but my efforts to imitate the code revisions he
>>> gave were not successful. The manual suggests (x[n] * beta)' is not the most efficient way to go. (I skipped the stuff
>>> about beta_raw to simplify).
>>
>> If you have specific questions here, we're happy to help.

> I was wondering what the optimal way to set things up was; I think you
> answered that with the suggestion about x. Though I'm a little
> confused, since in the earlier thread you suggested making x a matrix.

There are multiple ways to do things, as you pointed out
in your later message. As long as the dimensions are all the
same, the I/O is the same. I'm going to try to clarify this in the
manual along with the rest of your suggestions. Thanks again. They're
in our issue tracker now so they won't get lost.

>> Presumably the R CMD INSTALL is configured with optimization, too.
> I didn't do anything special, but the generated commands from the
> makefile had -O3.
>
> Next steps? valgrind (likely really slow)? Running via free-standing
> stan rather than R? For the latter, is there a simple way to write the
> data out from R in the necessary format?

Yes, there's a dump command in rstan (stan_rdump) that produces the
right format of output. R's own dump is too unpredictable. Our plan's to move
to a JSON format in the future.

> P.S. I was surprised to see in the manual that MSVC can't handle the
> templates. Template handling was a weak spot in MSVC when I used it ~ a
> decade ago, but I would have expected them to fix it by now. Maybe C#
> is getting all the love.

We're about to make another push here on Windows. But none of us are
really that good with Windows.

- Bob

Bob Carpenter

Sep 13, 2013, 2:03:22 PM
to stan-...@googlegroups.com


On 9/13/13 1:54 PM, Ross Boylan wrote:
>> Was that for the whole model?
> Yes, including compiling the model. I'm not sure how micro you are
> looking for, but here are the full reports

By "micro" I mean function specific. So that would be measuring C++ calls
in various configurations for just the categorical commands.

- Bob

Bob Carpenter

Sep 13, 2013, 2:07:43 PM
to stan-...@googlegroups.com


On 9/13/13 1:54 PM, Ross Boylan wrote:
> If I do 200 iterations and throw away the first 100, does (Sampling)
> refer to the last 100, or to all 200?

I'm not sure what context you're asking about, but when
the message changes from adapting to sampling, it means
you're past warmup. So the default in running 200 iterations
is to use the first 100 for warmup and keep the last 100
for inference.

For many problems, Stan converges much faster than that, but
it's a good default because it's at most going to cost you
a factor of two in time. If you need a gazillion samples,
it makes sense to set the warmup period lower.
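
In rstan the split can be set explicitly; a sketch, reusing the fit object r
from earlier in the thread:

# 200 warmup iterations, 1000 kept for inference, per chain
r2 <- stan(fit = r, data = stanDat, iter = 1200, warmup = 200, chains = 3)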

> I later did multiple chains in one run; it looked as if stan repeats the
> tuning with each chain. Is that how it operates?

Yes.

> Is that desirable?

Maybe. The motivation is partly practical. If we do this all
independently, we don't need any inter-chain communication, so it's
easy to parallelize chains. Another motivation is to diagnose
convergence problems that might not otherwise arise if you only
run a single adaptation. Often, at least in Stan 1.3 (hopefully a
bit less in Stan 2.0), a single chain will get stuck because of
adaptation failing. (We can also get stuck by transitioning to
low probability areas and having a tough time transitioning back ---
detailed balance is a tough law to respect :-))

Having said that, we are thinking about ways to do inter-chain
communication on adaptation.

Any suggestions anybody might have on how to do this would be
appreciated.

- Bob

Ross Boylan

Sep 13, 2013, 2:36:38 PM
to stan-...@googlegroups.com
Since it sounds as if Bob knows where to look for the leak, I'm not
planning on doing the valgrind/free-standing stan runs. Let me know if
you think that would still be valuable.


More below.
On Fri, 2013-09-13 at 14:07 -0400, Bob Carpenter wrote:
>
> On 9/13/13 1:54 PM, Ross Boylan wrote:
> > If I do 200 iterations and throw away the first 100, does (Sampling)
> > refer to the last 100, or to all 200?
>
> I'm not sure what context you're asking about,
The run I just did.
> but when
> the message changes from adapting to sampling, it means
> you're past warmup. So the default in running 200 iterations
> is to use the first 100 for warmup and keep the last 100
> for inference.

I thought there were 3 phases
1) tune algorithmic parameters (if not pre-specified)
2) run chain to get it to the "long-run" state, tossing results
3) run chain to get target distribution, keeping results.

Do 1 + 2 happen at the same time? Does "warmup" = 2 or 1+2?
>
> For many problems, Stan converges much faster than that, but
> it's a good default because it's at most going to cost you
> a factor of two in time. If you need a gazillion samples,
> it makes sense to set the warmup period lower.
>
> > I later did multiple chains in one run; it looked as if stan repeats the
> > tuning with each chain. Is that how it operates?
>
> Yes.
>
> > Is that desirable?
>
> Maybe. The motivation is partly practical. If we do this all
> independently, we don't need any inter-chain communication, so it's
> easy to parallelize chains. Another motivation is to diagnose
> convergence problems that might not otherwise arise if you only
> run a single adaptation. Often, at least in Stan 1.3 (hopefully a
> bit less in Stan 2.0), a single chain will get stuck because of
> adaptation failing. (We can also get stuck by transitioning to
> low probability areas and having a tough time transitioning back ---
> detailed balance is a tough law to respect :-))
>
> Having said that, we are thinking about ways to do inter-chain
> communication on adaptation.
>
> Any suggestions anybody might have on how to do this would be
> appreciated.

I was thinking of a simpler-to-implement (for stan developers, maybe not
simpler for users) approach: doing an initial run (phases 1 and maybe 2
in my list above) on one machine, and then parallelize step 3 with every
parallel process getting the results of the initial run. In particular,
model compilation and tuning would happen only once.

In this model the only inter-chain communication is from the master to
the slave at the start and from the slave to the master at the end,
i.e., same pattern as now.

Whether that would be a good idea depends partly on how safe it is to
tune once.

Ross



Ross Boylan

Sep 13, 2013, 6:18:28 PM
to stan-...@googlegroups.com
On Fri, 2013-09-13 at 14:07 -0400, Bob Carpenter wrote:
> On 9/13/13 1:54 PM, Ross Boylan wrote:
> > If I do 200 iterations and throw away the first 100, does (Sampling)
> > refer to the last 100, or to all 200?
>
> I'm not sure what context you're asking about, but when
> the message changes from adapting to sampling, it means
> you're past warmup. So the default in running 200 iterations
> is to use the first 100 for warmup and keep the last 100
> for inference.
>
> For many problems, Stan converges much faster than that, but
> it's a good default because it's at most going to cost you
> a factor of two in time. If you need a gazillion samples,
> it makes sense to set the warmup period lower
Since stan's default was 2,000 iterations, I figured 200 was grossly
inadequate, and probably not even sufficient for convergence.

I guess the main thing to do is watch Rhat and effective sample sizes.
For the more complicated model with clustering I just did 6 parallel
runs with 500 iterations each. The Rhats look as if they are almost all
1.0--suggesting I might get away with a shorter burn-in?--and n_eff is
around 100 for my key parameters, often much higher for the cluster
effects, but only 29 for sigma and 27 for lp__. The latter 2 have rhat
of 1.2.

Are the values for lp__ important? That is, if sigma had worked out
better but lp__ still had the low effective size and elevated Rhat,
would that be a sign that more iterations were in order?

I guess the values for sigma and lp__ indicate I need more iterations and
more burn-in. Is that a reasonable inference?

The model parameters are of secondary interest for my current project
which is to assess "causal" effects within a system for which the model
is one equation.

As you may recall, I've also been working on imputation. I speculate
that is the setting in which the standards for "good enough" estimates
are loosest.

Bob Carpenter

Sep 13, 2013, 6:35:11 PM
to stan-...@googlegroups.com


On 9/13/13 6:18 PM, Ross Boylan wrote:
> On Fri, 2013-09-13 at 14:07 -0400, Bob Carpenter wrote:
>> On 9/13/13 1:54 PM, Ross Boylan wrote:
>>> If I do 200 iterations and throw away the first 100, does (Sampling)
>>> refer to the last 100, or to all 200?
>>
>> I'm not sure what context you're asking about, but when
>> the message changes from adapting to sampling, it means
>> you're past warmup. So the default in running 200 iterations
>> is to use the first 100 for warmup and keep the last 100
>> for inference.
>>
>> For many problems, Stan converges much faster than that, but
>> it's a good default because it's at most going to cost you
>> a factor of two in time. If you need a gazillion samples,
>> it makes sense to set the warmup period lower

> Since stan's default was 2,000 iterations, I figured 200 was grossly
> inadequate, and probably not even sufficient for convergence.

We based that on JAGS, but then Stan tends to converge faster and
mix better, so we're cranking the default number of iterations down
to 100. It should probably be 20 --- just enough to get the model
adapting a little and sampling a little to make sure there are no
big problems in the model itself.

It's problem specific as to how fast convergence is and then what
the mixing rate is, and then application specific as to how many effective
samples you need.

> I guess the main thing to do is watch Rhat and effective sample sizes.
> For the more complicated model with clustering I just did 6 parallel
> runs with 500 iterations each. The Rhats look as if they are almost all
> 1.0--suggesting I might get away with a shorter burn-in?--and n_eff is
> around 100 for my key parameters, often much higher for the cluster
> effects, but only 29 for sigma and 27 for lp__. The latter 2 have rhat
> of 1.2.

Look at the traceplots in R. The warmup (what was previously called "burnin")
is the greyed over first part. You can usually eyeball about how many
iterations convergence takes --- then pad it a bit (say a factor of 2)
to be safe for future runs.
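
For example, a sketch, assuming fit is one of your stanfit objects:

traceplot(fit, pars = c("sigma", "lp__"), inc_warmup = TRUE)
# the greyed region on the left is warmup; look for where the chains settle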

> Are the values for lp__ important? That is, if sigma had worked out
> better but lp__ still had the low effective size and elevated Rhat,
> would that be a sign that more iterations were in order?

Not really, because lp__ isn't something we're using for inference.
It's the unnormalized log probability of the model. You do want to
look at it across runs to compare how well the model fits. You do want
it to mix, but it doesn't need a high effective sample size unless you're
doing inference on it.

> I guess the values for sigma and lp_ indicate I need more iterations and
> more burn-in. Is that a reasonable inference?

What do you need the output for? If you're just trying to estimate
posterior means, we report MCMC error --- that's the accuracy of
the mean estimate. If you need tail estimates for probabilities, you
need many more samples.

Usually when MCMC error is an order of magnitude less than the quantity
being estimated you're in pretty good shape using the samples for inference.
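
In rstan the MCMC error appears as the se_mean column of the fit summary;
a sketch, again assuming fit is your stanfit object:

print(fit, pars = "sigma")  # compare se_mean to the size of the mean estimate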

> The model parameters are of secondary interest for my current project
> which is to assess "causal" effects within a system for which the model
> is one equation.

The goal's always the same --- look at what you're trying to compute and
see how much that varies.

> As you may recall, I've also been working on imputation. I speculate
> that is the setting in which the standards for "good enough" estimates
> are loosest.

Same deal, but I can see why this wouldn't require tight estimates.

- Bob

Bob Carpenter

Sep 13, 2013, 6:39:30 PM
to stan-...@googlegroups.com
Thanks for the offer, but we can do that from here.

- Bob

Bob Carpenter

Sep 13, 2013, 6:43:51 PM
to stan-...@googlegroups.com


On 9/13/13 2:36 PM, Ross Boylan wrote:
> Since it sounds as if Bob knows where to look for the leak, I'm not
> planning on doing the valgrind/free-standing stan runs. Let me know if
> you think that would still be valuable.
>
>
> More below.

Oops -- missed that due to bad mailer formatting (on my end).

> On Fri, 2013-09-13 at 14:07 -0400, Bob Carpenter wrote:
>>
>> On 9/13/13 1:54 PM, Ross Boylan wrote:
>>> If I do 200 iterations and throw away the first 100, does (Sampling)
>>> refer to the last 100, or to all 200?
>>
>> I'm not sure what context you're asking about,
> The run I just did.
>> but when
>> the message changes from adapting to sampling, it means
>> you're past warmup. So the default in running 200 iterations
>> is to use the first 100 for warmup and keep the last 200
>> for inference.
>
> I thought there were 3 phases
> 1) tune algorithmic parameters (if not pre-specified)
> 2) run chain to get it to the "long-run" state, tossing results
> 3) run chain to get target distribution, keeping results.
>
> Do 1 + 2 happen at the same time? Does "warmup" = 2 or 1+2?

"Warmup" is just Andrew's new coinage for "burnin". It's a better
physical analogy, but is confusing our users. The new edition of
BDA no longer mentions "burnin".

Warmup = 1 + 2. This is just like in JAGS and BUGS, which also
do adaptation during "burnin".

In RStan, we keep the warmup "samples" too, so you can diagnose
how quickly we converge to the high mass part of the posterior.
One issue is that Andrew's very keen to start at diffuse starting points.

This interacts badly with adaptation. The problem we're facing with
adaptation is that the settings that let you move from a far-away initial
point to something more typical are different than the settings you want to
use once you've gotten to the high-mass part of the posterior.

The other issue is that if we only run a single chain, then say, randomize
around where it left off, we won't have as good a diagnostic on poor
mixing (my interpretation of what Andrew keeps saying).

> In this model the only inter-chain communication is from the master to
> the slave at the start and from the slave to the master at the end,
> i.e., same pattern as now.

Right.

> Whether that would be a good idea depends partly on how safe it is to
> tune once.

My understanding is not very --- see above.

What we're likely to do, since tuning is global, is to run multiple
chains then make sure they converge to the same tuning parameters, and
somehow allow them to share info across chains.

- Bob

Shira Mitchell

Sep 20, 2013, 8:40:38 PM
to stan-...@googlegroups.com
Hi all,

I'm brand new to RStan, and I just got my hierarchical model to run.  It can run just fine once or twice, but once I put it in a loop (for the purpose of a sample size calculation), R quits unexpectedly.  Is this likely a memory issue?  I am of course happy to supply any code or details that would help.

Thank you so much!

Shira

Ross Boylan

Sep 20, 2013, 9:10:19 PM
to stan-...@googlegroups.com
What exactly are you doing in the loop? You can just tell stan how many
chains and how many iterations you want.
You can use gc() to see how much memory you are using in R, or whatever
tool your OS has to watch the process.
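
For example, a sketch (ps is assumed available, as on Linux or a Mac):

gc(verbose = TRUE)                            # R-level allocations only
system(paste("ps -o rss= -p", Sys.getpid()))  # whole-process resident memory, in KB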

Ross Boylan

Shira Mitchell

Sep 20, 2013, 9:12:33 PM
to stan-...@googlegroups.com
Thanks! 

I got assistance from Jessy Hwang (see below) and I'm going to try that. Thanks again!

if you're doing

results <- list()
for (i in 1:N) {
    results[[i]] <- stan(model_code = blahblahblah, ... )
}
you can compile the model just one time, outside the loop, with a very small number of iterations, then use the "fit" argument to bypass the compilation in all other iterations, i.e.
my_model_fit <- stan(model_code = blahblahblah, ...)
for (i in 1:N){
    results[i] <- stan(fit = my_model_fit, ...)
}








--
---------------------------------------------
Shira Mitchell
http://scholar.harvard.edu/shira

Shira Mitchell

Sep 20, 2013, 10:07:22 PM
to stan-...@googlegroups.com
Hmm, R still quits unexpectedly.  I include my R code below, let me know if the stan code is useful as well.  Like I said, it runs fine for 3 or so loops, then quits. How would you recommend using gc()? 

Thank you!
Shira

M = 20
J = M*4
N = J*300
mu_gamma = 0
sigma_gamma = 4
beta = 3
theta_A = 0.6
theta_B = 0.6
theta_AB = 0.6
sigma_alpha = 1
A = rep(c(0,1), J/2)
B = rep(c(0,0,1,1), J/4)
numchains = 3

signif = rep(NA, n.sims)

gamma_pre = rnorm(M, mean = 0, sd = 3)
Xbeta_pre = as.vector(sapply(gamma_pre, rep, N/M))
x = rbinom(N, size = 1, prob = exp(Xbeta_pre)/(1+ exp(Xbeta_pre)))

gamma = rnorm(M, mean = mu_gamma, sd = sigma_gamma)
alpha = rnorm(J, mean = as.vector(sapply(gamma, rep, 4)) + theta_A * A + theta_B * B + theta_AB * A * B, sd = sigma_alpha)
Xbeta = as.vector(sapply(alpha, rep, N/J)) + beta * x
y = rbinom(N, size = 1, prob = exp(Xbeta)/(1+ exp(Xbeta)))

MVP_dat <- list(N = N, J = J, M = M,
village = as.vector(sapply(1:J, rep, N/J)),
block = as.vector(sapply(1:M, rep, J/M)),
y = y, x = x, A = A, B = B)

my_model_fit <- stan(model_code = MVP_code, data = MVP_dat, iter = 3, chains = numchains)

for (s in 1:n.sims) {

  ###### Generate fake data:
  gamma_pre = rnorm(M, mean = 0, sd = 3)
  Xbeta_pre = as.vector(sapply(gamma_pre, rep, N/M))
  x = rbinom(N, size = 1, prob = exp(Xbeta_pre)/(1 + exp(Xbeta_pre)))
  gamma = rnorm(M, mean = mu_gamma, sd = sigma_gamma)
  alpha = rnorm(J, mean = as.vector(sapply(gamma, rep, 4)) + theta_A * A + theta_B * B + theta_AB * A * B, sd = sigma_alpha)
  Xbeta = as.vector(sapply(alpha, rep, N/J)) + beta * x
  y = rbinom(N, size = 1, prob = exp(Xbeta)/(1 + exp(Xbeta)))

  MVP_dat <- list(N = N, J = J, M = M,
                  village = as.vector(sapply(1:J, rep, N/J)),
                  block = as.vector(sapply(1:M, rep, J/M)),
                  y = y, x = x, A = A, B = B)

  fit <- stan(fit = my_model_fit, data = MVP_dat, iter = 700, chains = numchains)

  theta_AB_sims = extract(object = fit, pars = "theta_AB")$theta_AB
  lower = quantile(theta_AB_sims, probs = 0.025)
  upper = quantile(theta_AB_sims, probs = 0.975)

  if (lower < 0 & upper > 0) { signif[s] = 0 } else { signif[s] = 1 }

}

power = mean(signif)

Bob Carpenter

Sep 21, 2013, 2:45:04 AM
to stan-...@googlegroups.com
Yes, we need more info in order to help. We
need to be able to reproduce what you're doing.

So we'd need the model and R code that defines n.sims,
etc. Enough that we can run what you have.

Are you using 32-bit or 64-bit R? Are you running
from a shell, from the R GUI or from a wrapper like
RStudio?

How much memory do you have on your machine? Can you
observe that to see if more memory's being used?

If you run for fewer simulations and run each simulation
for fewer chains, do you still run out of memory?

It doesn't look like you're saving the fit objects across
loops, so R should be able to recover the memory.

If you call gc() in R, it will try to free up available
memory and also report on how much memory you're using.

- Bob

Sergio Polini

Sep 21, 2013, 4:48:06 AM
to stan-...@googlegroups.com
I'd try:

## translation to C++ code
ret <- stanc("my_model.stan", model_name = "my_model")
## compilation
my_model.sm <- stan_model(stanc_ret = ret)

or:

my_model.sm <- stan_model("my_model.stan", model_name = "my_model")

then:

for (s in 1:n.sims) {
...
my_model.sf <- sampling(my_model.sm, data=..., iter=..., ...)
...
}

Sergio

Shira Mitchell

Sep 21, 2013, 12:46:38 PM
to stan-...@googlegroups.com
Thanks! See below:

> Are you using 32-bit or 64-bit R? Are you running
> from a shell, from the R GUI or from a wrapper like
> RStudio?

R 2.15.1 GUI 1.52 Leopard build 32-bit (6188)

> How much memory do you have on your machine? Can you
> observe that to see if more memory's being used?

Memory: 4 GB 1600 MHz DDR3

I'm not sure how to see how much is being used.

> If you run for fewer simulations and run each simulation
> for fewer chains, do you still run out of memory?

Nope, it runs great with only 3 iterations and 3 chains.

> It doesn't look like you're saving the fit objects across
> loops, so R should be able to recover the memory.
>
> If you call gc() in R, it will try to free up available
> memory and also report on how much memory you're using.

Thanks! I'll also try Sergio's suggestion. See my code below:

library(rstan)
set_cppo("fast") # for best running speed
#set_cppo('debug') # make debug easier

MVP_code <- '
data {
     int<lower=0> N; // number of households
     int<lower=0> J; // number of villages
     int<lower=0> M; // number of blocks
     int<lower=1, upper=J> village[N]; // indicates what villages each household is in
     int<lower=1, upper=M> block[J]; // indicates which block each village is in
     int<lower=0, upper=1> y[N];
     int<lower=0, upper=1> x[N]; // baseline value of y
     int<lower=0, upper=1> A[J]; // treatment bundle A
     int<lower=0, upper=1> B[J]; // treatment bundle B
}
parameters {
     vector[J] alpha; // village effect
     real beta;
     vector[M] gamma; // block effect
     real theta_A;
     real theta_B;
     real theta_AB;
     real mu_gamma;
     real<lower=0,upper=100> sigma_alpha;
     real<lower=0,upper=100> sigma_gamma;
}
transformed parameters {
     vector[N] Xbeta;
     vector[J] alpha_hat;

     for (i in 1:N)
        Xbeta[i] <- alpha[village[i]] + beta*x[i];
     for (j in 1:J)
        alpha_hat[j] <- gamma[block[j]] + theta_A * A[j] + theta_B * B[j] + theta_AB * A[j] * B[j];
}
model {
     mu_gamma ~ normal(0, 100); // flat priors
     beta ~ normal(0,100);
     theta_A ~ normal(0,100);
     theta_B ~ normal(0,100);
     theta_AB ~ normal(0,100);
     sigma_alpha ~ uniform(0,100);
     sigma_gamma ~ uniform(0,100);
     alpha ~ normal(alpha_hat, sigma_alpha); // village level
     gamma ~ normal(mu_gamma, sigma_gamma); // block level
     y ~ bernoulli_logit(Xbeta); // household level
}'

n.sims = 5
# iter includes the burnin = warmup (which defaults to half the number of iter)
my_model_fit <- stan(model_code = MVP_code, data = MVP_dat, iter = 3, chains = numchains)

#print(fit)

Shira Mitchell

Sep 22, 2013, 2:46:04 PM
to stan-...@googlegroups.com
Thanks Sergio, I tried your suggestion. After a few loops I get:

[attached screenshot: Screen Shot 2013-09-22 at 2.24.36 PM.png]