inconsistent run-times


Ross Boylan

Sep 16, 2013, 6:08:44 PM
to stan-users
A stan model took a bit under 6 minutes to run 200 iterations but 2,000
iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
expected the big run to take about 10 times as long, i.e., 1 hour.

Can anyone help me understand why my expectation was violated?

The other difference is that the big run had 3 computations running in
parallel; I have 4 physical cores and wasn't doing anything else heavy.

My first thought was that at least one of the 3 computations chose a
much different set of values for the tuning parameters than the trial
run. Is that a reasonable supposition? If so, what should I make of
the wide variability in times?

Another possibility is that the jobs interfered with each other. *If*
the user time is accurately summed from all threads, it is only about
2/3 of what I would expect with 100% utilization on all 3 threads. Of
course, that could also be a sign that some finished well before the
others. In round numbers the user time from the parallel run was 20,000
seconds and the wall time was 10,000 seconds. If all 3 jobs used 100%
CPU the whole time it would be 30,000, not 20,000, seconds of total CPU
time. In contrast, scaling up the CPU time for the trial run (343s) by
30 (3 chains x 10 times the iterations) gives about 10,000s, so the
total CPU time was "only" double what I would have expected.
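The arithmetic above can be checked directly; here is a base R sketch using the round numbers quoted in this message:

```r
# Round numbers from the runs described above (seconds).
trial_cpu <- 343     # CPU time, 200-iteration trial run, 1 chain
par_user  <- 20000   # user (CPU) time, parallel run, 3 chains x 2000 iters
par_wall  <- 10000   # wall-clock time, parallel run

# Naive scaling: 3 chains x 10x the iterations of the trial run.
expected_cpu <- trial_cpu * 3 * 10   # ~10,000 s
full_util    <- par_wall * 3         # 30,000 s at 100% on all 3 cores

par_user / expected_cpu   # ~1.9: total CPU "only" double the estimate
par_user / full_util      # ~0.67: about 2/3 of full 3-core utilization
```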

> system.time(r <- stan(model_code = stan_model, data = stanDat,
+                       iter = 200, chains = 1))

TRANSLATING MODEL 'stan_model' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'stan_model' NOW.

date()
SAMPLING FOR MODEL 'stan_model' NOW (CHAIN 1).
Iteration: 200 / 200 [100%] (Sampling)
Elapsed Time: 224.13 seconds (Warm-up)
              96.41 seconds (Sampling)
              320.54 seconds (Total)

    user  system elapsed
 343.417   0.788 345.196


> system.time(sflist <- mclapply(1:3, function(i) stan(fit = r, data = stanDat, iter = 2000,
+                                                      chains = 1, seed = seed, chain_id = i,
+                                                      refresh = -1),
+                                mc.cores = 3))



     user    system   elapsed
20633.038     5.144 10423.746

> 10423/60/60
[1] 2.895278

Ross Boylan

P.S. I don't think get_sampler_params is performing as advertised in
the documentation of stanfit-class.
> x2 <- get_sampler_params(sflist[[1]])
> class(x2)
[1] "list"
> length(x2)
[1] 1
> class(x2[[1]])
[1] "matrix"
> dim(x2[[1]])
[1] 2000 3
> colnames(x2[[1]])
[1] "accept_stat__" "stepsize__" "treedepth__"
> head(x2[[1]])
accept_stat__ stepsize__ treedepth__
[1,] 1.0 0.000781250 2
[2,] 0.0 0.019391133 -1
[3,] 0.0 0.007812500 -1
[4,] 0.5 0.002061385 0
[5,] 0.5 0.001872274 1
[6,] 0.5 0.001759471 1
The docs say

‘get_sampler_params’ ‘signature(object = "stanfit")’: obtain the
parameters used for the sampler such as ‘stepsize’ and
‘treedepth’. The results are returned as a list, each element
of which is an array for a chain. The array has number of
columns corresponding to the number of parameters used in the
sampler and its column names provide the parameter names.
Optional parameter ‘inc_warmup’ indicates whether to include
the warmup period.
The number of columns does not match the number of parameters; it's not
clear what the rows should be, but I seem to have gotten one row per
iteration. I assume that tuning parameters are not changed at each
iteration.

Bob Carpenter

Sep 16, 2013, 6:42:19 PM
to stan-...@googlegroups.com


On 9/16/13 6:08 PM, Ross Boylan wrote:
> A stan model took a bit under 6 minutes to run 200 iterations but 2,000
> iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
> expected the big run to take about 10 times as long, i.e., 1 hour.

Did adaptation wind up in the same place for both runs?

Did you start with the same seed?

Was this in RStan or command line?

Were you using the default diagonal mass matrix (varying step sizes)?

> Can anyone help me understand why my expectation was violated?

I would guess that it's because you wound up in different places
from adaptation and that Stan was taking more leapfrog steps in
the longer run. Answers to the above would clarify.

> The other difference is that the big run had 3 computations running in
> parallel; I have 4 physical cores and wasn't doing anything else heavy.
>
> My first thought was that at least one of the 3 computations chose a
> much different set of values for the tuning parameters than the trial
> run. Is that a reasonable supposition?

Yes, that's well within the variability we've seen. Depending
on where the chains are initialized and what the random seed is,
it can take more or less time to converge and simultaneously adapt
during warmup. Usually once you get past warmup and have converged,
sampling is pretty consistent. That is, most variability we've
observed is during warmup.

> If so, what should I make of
> the wide variability in times?

That we need to do some more work to speed up adaptation and make
it more consistent?

> Another possibility is that the jobs interfered with each other. *If*
> the user time is accurately summed from all threads, it is only about
> 2/3 of what I would expect with 100% utilization on all 3 threads.

Did you really mean different threads or is it different processes?
If you're using RStan's parallelism suggestions, it's different processes.

If you're talking about RStan, then it's definitely the time for all 3.
If you're going in parallel, each should report its own timing.

The reason things can go faster is the same reason they can go slower.
In 200 iteration runs, you spend 100 iterations warming up. Warmup is
slower when it's far away from the posterior mass. Usually when Stan's
converged and adapted, things speed up. So it may take 100 iterations
to converge, then everything would go much faster. You can look at
traceplots to see how long convergence is taking in practice.

> Of
> course, that could also be a sign that some finished well before the
> others.

Yes, that can happen if adaptation's much faster in one chain than
the others.

I don't know how timing works in this context. Jiqiang or Ben
should know.
> ‘get_sampler_params’ ‘signature(object = "stanfit")’: obtain the
> parameters used for the sampler such as ‘stepsize’ and
> ‘treedepth’. The results are returned as a list, each element
> of which is an array for a chain. The array has number of
> columns corresponding to the number of parameters used in the
> sampler and its column names provide the parameter names.
> Optional parameter ‘inc_warmup’ indicates whether to include
> the warmup period.
> The number of columns does not match the number of parameters; it's not
> clear what the rows should be, but I seem to have gotten one row per
> iteration. I assume that tuning parameters are not changed at each
> iteration.

They're not tuning parameters, per se. Tree depth (log base 2 of the
number of leapfrog steps in the Hamiltonian simulations) is adapted on
an iteration-by-iteration basis using NUTS.

Stepsize is adapted during warmup, but unless you added variability on top of
that, will stay the same after adaption.

The accept_stat is the number of states in the Hamiltonian simulation that
were above the slice sampling threshold. This is the stat that's tuned
during warmup the same way that a Metropolis acceptance rate might be tuned.
The NUTS paper has more info on how the slice sampling works.

The -1 tree depth is explained by the accept stat of 0.0. It tried to
take a step, but that step was rejected, so the sampler quit and tried again.
(Maybe -1 isn't the clearest thing to report here!)

Hope that helps, but please feel free to followup if it's not clear!

- Bob

Ross Boylan

Sep 16, 2013, 8:04:02 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Mon, Sep 16, 2013 at 06:42:19PM -0400, Bob Carpenter wrote:
>
>
> On 9/16/13 6:08 PM, Ross Boylan wrote:
>> A stan model took a bit under 6 minutes to run 200 iterations but 2,000
>> iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
>> expected the big run to take about 10 times as long, i.e., 1 hour.
>
> Did adaptation wind up in the same place for both runs?

What should I look at to determine the answer? I played around with
get_sampler_params and get_adaptation_info. For the former I was
stumped by the mismatch between the documentation and what I saw (as
described at the end of the original message) and for the latter the
info was just a giant string.

It seemed in either case there was not one or two tuning parameters
but a whole bunch, and I am not sure how to compare them between
chains in some summary way.

Hmm, maybe sum(2^treedepth), including the warmup period?
(treedepth from get_sampler_params--more below)
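One way to turn that idea into a summary number per chain; a base R sketch on a simulated treedepth vector (with a real fit, the vector would come from get_sampler_params as above; negative depths mark rejected steps and are counted here as a single evaluation, per the discussion later in the thread):

```r
# Total leapfrog work per chain, approximated as sum(2^treedepth).
# treedepth is log2 of the number of leapfrog steps in an iteration,
# so 2^treedepth is a proxy for the work done in that iteration.
leapfrog_work <- function(treedepth) {
  treedepth[treedepth < 0] <- 0  # rejected step: count as one evaluation
  sum(2^treedepth)
}

# Simulated example: a chain that warms up at depth 4, then saturates at 10.
td <- c(rep(4, 100), rep(10, 100))
leapfrog_work(td)  # 100*2^4 + 100*2^10 = 104000
```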

>
> Did you start with the same seed?

No. I forgot to set the seed in my trial run. My understanding is that
the use of chain_id in the parallel runs assures the chains will differ.

>
> Was this in RStan or command line?

RStan.

>
> Were you using the default diagonal mass matrix (varying step
> sizes)?

Yes.

>
>> Can anyone help me understand why my expectation was violated?
>
> I would guess that it's because you wound up in different places
> from adaptation and that Stan was taking more leapfrog steps in
> the longer run. Answers to the above would clarify.
>

The results below are that treedepth after burn-in was the same for
each chain. That's what controls leapfrog steps, right?

>> The other difference is that the big run had 3 computations running in
>> parallel; I have 4 physical cores and wasn't doing anything else heavy.
>>
>> My first thought was that at least one of the 3 computations chose a
>> much different set of values for the tuning parameters than the trial
>> run. Is that a reasonable supposition?
>
> Yes, that's well within the variability we've seen. Depending
> on where the chains are initialized and what the random seed is,
> it can take more or less time to converge and simultaneously adapt
> during warmup. Usually once you get past warmup and have converged,
> sampling is pretty consistent. That is, most variability we've
> observed is during warmup.

I guess I'm learning the hard way why you said the "tune once, run
many chains" strategy is not advisable.

>
> > If so, what should I make of
>> the wide variability in times?
>
> That we need to do some more work to speed up adaptation and make
> it more consistent?
>
>> Another possibility is that the jobs interfered with each other. *If*
>> the user time is accurately summed from all threads, it is only about
>> 2/3 of what I would expect with 100% utilization on all 3 threads.
>
> Did you really mean different threads or is it different processes?

I think I misread "forking" in the mclapply documentation to mean
threads, but it forks processes.

> If you're using RStan's parallelism suggestions, it's different processes.
>
> If you're talking about RStan, then it's definitely the time for all 3.
> If you're going in parallel, each should report its own timing.
>
> The reason things can go faster is the same reason they can go slower.
> In 200 iteration runs, you spend 100 iterations warming up. Warmup is
> slower when it's far away from the posterior mass. Usually when Stan's
> converged and adapted, things speed up.

By that logic, wouldn't we expect that 10x more iterations would need
less than 10x the time? That's the opposite direction from the
timings I got.

>So it may take 100 iterations
> to converge, then everything would go much faster. You can look at
> traceplots to see how long convergence is taking in practice.

To an extent traceplots have the same problem as the evaluation of the
tuning parameters: too many parameters to take in easily.

>
>> Of
>> course, that could also be a sign that some finished well before the
>> others.
>
> Yes, that can happen if adaptation's much faster in one chain than
> the others.
>
>> In round numbers the user time from the parallel run was 20,000
>> seconds and the wall time was 10,000 seconds. If all 3 jobs used 100%
>> CPU the whole time it would be 30,000, not 20,000, seconds of total CPU
>> time. In contrast, scaling up the CPU time for the trial run (343s) by
>> 30 (3 chains x 10 times the iterations) gives about 10,000s, so the
>> total CPU time was "only" double what I would have expected.

Going with your statement about how the timing works in R (and
ignoring your statement below that you're not sure)...

The wall vs cumulative CPU time seems to indicate that the chains
took much different times to complete; at least one chain took the
wall-clock time (10k seconds), leaving only 10k seconds for the other
chains to split between them, e.g., 5k each. Even the "fast" ones in
that scenario are slower than the 3.5k seconds one gets by multiplying
the trial run time by 10.
>> ‘get_sampler_params’ ‘signature(object = "stanfit")’: obtain the
>> parameters used for the sampler such as ‘stepsize’ and
>> ‘treedepth’. The results are returned as a list, each element
>> of which is an array for a chain. The array has number of
>> columns corresponding to the number of parameters used in the
>> sampler and its column names provide the parameter names.
>> Optional parameter ‘inc_warmup’ indicates whether to include
>> the warmup period.
>> The number of columns does not match the number of parameters; it's not
>> clear what the rows should be, but I seem to have gotten one row per
>> iteration. I assume that tuning parameters are not changed at each
>> iteration.
>
> They're not tuning parameters, per se. Tree depth (log base 2 of the
> number of leapfrog steps in the Hamiltonian simulations) is adapted on
> an iteration-by-iteration basis using NUTS.
>
> Stepsize is adapted during warmup, but unless you added variability on top of
> that, will stay the same after adaption.
>

I see that it does settle down.
> tail(x2[[1]])
accept_stat__ stepsize__ treedepth__
[1995,] 0.6362132 0.000760579 10
[1996,] 0.7824013 0.000760579 10
[1997,] 0.3782625 0.000760579 10
[1998,] 0.9890936 0.000760579 10
[1999,] 0.5919631 0.000760579 10
[2000,] 0.3530545 0.000760579 10

Here are the other chains
> tail(get_sampler_params(sflist[[2]])[[1]])
accept_stat__ stepsize__ treedepth__
[1995,] 0.9715171 0.0005428345 10
[1996,] 0.9134190 0.0005428345 10
[1997,] 0.9028156 0.0005428345 10
[1998,] 0.9333573 0.0005428345 10
[1999,] 0.7910091 0.0005428345 10
[2000,] 0.5037319 0.0005428345 10
> tail(get_sampler_params(sflist[[3]])[[1]])
accept_stat__ stepsize__ treedepth__
[1995,] 0.6505754 0.0004937197 10
[1996,] 0.9521053 0.0004937197 10
[1997,] 0.4135974 0.0004937197 10
[1998,] 0.4073594 0.0004937197 10
[1999,] 0.1488046 0.0004937197 10
[2000,] 0.3109802 0.0004937197 10

So there's fairly wide variation in stepsize, but treedepth, which I
would expect to control how long the algorithm takes, is the same for
all.
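For what it's worth, the post-warmup step sizes above differ by roughly 50% across the three chains (base R, numbers copied from the tail output):

```r
# Final adapted step sizes for chains 1-3, from the tail output above.
stepsizes <- c(0.000760579, 0.0005428345, 0.0004937197)
max(stepsizes) / min(stepsizes)  # ~1.54: widest vs narrowest step size
```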


> The accept_stat is the number of states in the Hamiltonian simulation that
> were above the slice sampling threshold. This is the stat that's tuned
> during warmup the same way that a Metropolis acceptance rate might be tuned.
> The NUTS paper has more info on how the slice sampling works.
>
> The -1 tree depth is explained by the accept stat of 0.0. It tried to
> take a step, but that step was rejected, so the sampler quit and tried again.
> (Maybe -1 isn't the clearest thing to report here!)
Not sure what my sum(2^treedepth) should do with the negative
numbers--use the most recent non-negative value?

Is treedepth on iteration i the number of leapfrog steps used for
iteration i, or the number of leapfrog steps determined to be optimal
after iteration i, and used in iteration i+1?

Ross

Bob Carpenter

Sep 17, 2013, 1:45:09 AM
to stan-...@googlegroups.com


On 9/16/13 8:04 PM, Ross Boylan wrote:
> On Mon, Sep 16, 2013 at 06:42:19PM -0400, Bob Carpenter wrote:
>>
>>
>> On 9/16/13 6:08 PM, Ross Boylan wrote:
>>> A stan model took a bit under 6 minutes to run 200 iterations but 2,000
>>> iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
>>> expected the big run to take about 10 times as long, i.e., 1 hour.
>>
>> Did adaptation wind up in the same place for both runs?
>
> What should I look at to determine the answer?

In the command-line version of Stan, it's in the .csv file printed
out. For RStan, I'm not sure if it's accessible in RStan 1.3. It
should be in RStan 2.0.

You can also just look at the traceplots. If one chain had trouble
adapting, it will be slow to converge.

> I played around with
> get_sampler_params and get_adaptation_info. For the former I was
> stumped by the mismatch between the documentation and what I saw (as
> described at the end of the original message)

I tried in the last message to explain what was going on.
I take it that didn't help?

> and for the latter the
> info was just a giant string.

It's probably a CSV file. If you have a lot of parameters, it's
going to be hard to tell, because there's a step size for each one.

> It seemed in either case there was not one or two tuning parameters
> but a whole bunch, and I am not sure how to compare them between
> chains in some summary way.
>
> Hmm, maybe sum(2^treedepth), including the warmup period?
> (treedepth from get_sampler_params--more below)
>
>>
>> Did you start with the same seed?
>
> No. I forgot to set the seed in my trial run. My understanding is that
> the use of chain_id in the parallel runs assures the chains will differ.

It will. But each chain is reproducible from the chain_id and
the random number seed. The actual starting point takes the
seed and then advances the RNG based on the chain_id.

>> Was this in RStan or command line?
>
> RStan.
>
>>
>> Were you using the default diagonal mass matrix (varying step
>> sizes)?
>
> Yes.
>
>>
>>> Can anyone help me understand why my expectation was violated?
>>
>> I would guess that it's because you wound up in different places
>> from adaptation and that Stan was taking more leapfrog steps in
>> the longer run. Answers to the above would clarify.
>>
>
> The results below are that treedepth after burn-in was the same for
> each chain. That's what controls leapfrog steps, right?

If adaptation winds up in the same place, you can still get a huge
amount of variation during warmup (what other packages call "burn-in").

>>> The other difference is that the big run had 3 computations running in
>>> parallel; I have 4 physical cores and wasn't doing anything else heavy.
>>>
>>> My first thought was that at least one of the 3 computations chose a
>>> much different set of values for the tuning parameters than the trial
>>> run. Is that a reasonable supposition?
>>
>> Yes, that's well within the variability we've seen. Depending
>> on where the chains are initialized and what the random seed is,
>> it can take more or less time to converge and simultaneously adapt
>> during warmup. Usually once you get past warmup and have converged,
>> sampling is pretty consistent. That is, most variability we've
>> observed is during warmup.
>
> I guess I'm learning the hard way why you said the "tune once, run
> many chains" strategy is not advisable.

It's mostly just to diagnose convergence. We want to guard
against false positives on convergence that can arise in a single
chain getting stuck in a small part of the posterior.

>>
>>> If so, what should I make of
>>> the wide variability in times?
>>
>> That we need to do some more work to speed up adaptation and make
>> it more consistent?
>>
>>> Another possibility is that the jobs interfered with each other. *If*
>>> the user time is accurately summed from all threads, it is only about
>>> 2/3 of what I would expect with 100% utilization on all 3 threads.
>>
>> Did you really mean different threads or is it different processes?
>
> I think I misread "forking" in the mclapply documentation to mean
> threads, but it forks processes.

We really do want to get Stan ramped up for multi-threading.
Our memory model will support it, but given that we can run in
multiple processes and memory for data hasn't seemed to be an issue,
we haven't gotten around to it.

>
>> If you're using RStan's parallelism suggestions, it's different processes.
>>
>> If you're talking about RStan, then it's definitely the time for all 3.
>> If you're going in parallel, each should report its own timing.
>>
>> The reason things can go faster is the same reason they can go slower.
>> In 200 iteration runs, you spend 100 iterations warming up. Warmup is
>> slower when it's far away from the posterior mass. Usually when Stan's
>> converged and adapted, things speed up.
>
> By that logic, wouldn't we expect that 10x more iterations would need
> less than 10x the time? That's the opposite direction from the
> timings I got.

It can go either way. It may be that in 100 iterations, you didn't
have enough time to do warmup properly. What often happens is that
you start off in a tail somewhere and need to take large
step sizes to get to high mass volumes of the posterior. Then when
the sampler gets there, it might have to crank down the step size
in order to sample properly. If you never converge in 100 warmup
iterations, you never reach that faster sampling regime.

>> So it may take 100 iterations
>> to converge, then everything would go much faster. You can look at
>> traceplots to see how long convergence is taking in practice.
>
> To an extent traceplots have the same problem as the evaluation of the
> tuning parameters: too many parameters to take in easily.

If adaptation is struggling, you usually see it on
all of the parameters. Do:

traceplot(fit, "lp__")

for instance, and that's just the overall log probability. Usually
that won't converge until the other parameters have converged (though
it can, especially if the model's unidentifiable or only weakly identified
by priors).

>>> Of
>>> course, that could also be a sign that some finished well before the
>>> others.
>>
>> Yes, that can happen if adaptation's much faster in one chain than
>> the others.
>>
>>> In round numbers the user time from the parallel run was 20,000
>>> seconds and the wall time was 10,000 seconds. If all 3 jobs used 100%
>>> CPU the whole time it would be 30,000, not 20,000, seconds of total CPU
>>> time. In contrast, scaling up the CPU time for the trial run (343s) by
>>> 30 (3 chains x 10 times the iterations) gives about 10,000s, so the
>>> total CPU time was "only" double what I would have expected.
>
> Going with your statement about how the timing works in R (and
> ignoring your statement below that you're not sure)...
>
> The wall vs cumulative CPU time seems to indicate that the chains
> took much different times to complete; at least one chain took the
> wall-clock time (10k seconds), leaving only 10k seconds for the other
> chains to split between them, e.g., 5k each. Even the "fast" ones in
> that scenario are slower than the 3.5k seconds one gets by multiplying
> the trial run time by 10.

Not surprising --- we see that much difference all the time.

It's why we're looking into timing-based chains that will
allow us to stop poorly converging chains short. The trick is
to make it fast and preserve the fact that we're sampling from
the posterior.
>>> ‘get_sampler_params’ ‘signature(object = "stanfit")’: obtain the
>>> parameters used for the sampler such as ‘stepsize’ and
>>> ‘treedepth’. The results are returned as a list, each element
>>> of which is an array for a chain. The array has number of
>>> columns corresponding to the number of parameters used in the
>>> sampler and its column names provide the parameter names.
>>> Optional parameter ‘inc_warmup’ indicates whether to include

This is bad. Treedepth of 10 means the algorithm's
maxed out and is taking 1000 leapfrog steps in the NUTS
HMC simulation (each one involving a log probability and gradient
eval) per iteration.

You also see a very tiny stepsize.

This kind of thing can happen if the model's very tightly
constrained in some dimension. It usually signals either a very
complicated model or some issue with the model specification.

Michael might have more advice here on how to diagnose what's
going on.

>> The accept_stat is the number of states in the Hamiltonian simulation that
>> were above the slice sampling threshold. This is the stat that's tuned
>> during warmup the same way that a Metropolis acceptance rate might be tuned.
>> The NUTS paper has more info on how the slice sampling works.
>>
>> The -1 tree depth is explained by the accept stat of 0.0. It tried to
>> take a step, but that step was rejected, so the sampler quit and tried again.
>> (Maybe -1 isn't the clearest thing to report here!)
> Not sure what my sum(2^treedepth) should do with the negative
> numbers--use the most recent non-negative value?

I'd treat it as 0. It definitely has to evaluate at
least one log probability, and that's what this is a proxy
for.

> Is treedepth on iteration i the number of leapfrog steps used for
> iteration i, or the number of leapfrog steps determined to be optimal
> after iteration i, and used in iteration i+1?

The number used in iteration i. The depth expands each iteration
until either (a) the trajectory does a U-turn [see the NUTS paper
for details], or (b) you hit the limit [10 by default].
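That doubling schedule can be sketched as a toy loop (base R; `u_turn_at` is a hypothetical stand-in for the real U-turn criterion, not how NUTS is actually implemented):

```r
# Toy sketch: each NUTS iteration doubles the trajectory until a
# (simulated) U-turn or the depth cap; at depth d the tree holds
# 2^d leapfrog steps.
simulate_depth <- function(u_turn_at, max_depth = 10) {
  depth <- 0
  while (depth < max_depth && depth < u_turn_at) depth <- depth + 1
  depth
}

simulate_depth(u_turn_at = 4)    # U-turn first: depth 4 (16 steps)
simulate_depth(u_turn_at = Inf)  # never U-turns: capped at depth 10 (1024 steps)
```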

- Bob

P.S. If there are some RStan specific questions, it'd probably help to
call them out in a separate message with "RStan" on the subject
line. We're getting bombed with message board traffic, so things
may slip through the cracks.

Michael Betancourt

Sep 17, 2013, 2:57:10 AM
to stan-...@googlegroups.com
> This is bad. Treedepth of 10 means the algorithm's
> maxed out and is taking 1000 leapfrog steps in the NUTS
> HMC simulation (each one involving a log probability and gradient
> eval) per iteration.
>
> You also see a very tiny stepsize.
>
> This kind of thing can happen if the model's very tightly
> constrained in some dimension. It usually signals either a very
> complicated model or some issue with the model specification.
>
> Michael might have more advice here on how to diagnose what's
> going on.


This is very, very bad. I wouldn't trust any of your results in the slightest.

At the very least you'll have to expand the maximum tree depth until you're
no longer saturating the maximum. Also you'll want to look at the elements
of the mass matrix -- if the step size is really small and you're hitting the
max tree depth then there's either (a) huge ill-conditioning that the
adaptation is taking too long to find or (b) huge off-diagonal elements
for which the default NUTS algorithm is not particularly suited.

I'm going to go folk theorem here -- simplify your model until you have
something that fits decently and then add the complex bits until you
start to identify where the fits go crazy.


>>> The accept_stat is the number of states in the Hamiltonian simulation that
>>> were above the slice sampling threshold. This is the stat that's tuned
>>> during warmup the same way that a Metropolis acceptance rate might be tuned.
>>> The NUTS paper has more info on how the slice sampling works.
>>>
>>> The -1 tree depth is explained by the accept stat of 0.0. It tried to
>>> take a step, but that step was rejected, so the sampler quit and tried again.
>>> (Maybe -1 isn't the clearest thing to report here!)
>> Not sure what my sum(2^treedepth) should do with the negative
>> numbers--use the most recent non-negative value?
>
> I'd treat it as 0. It definitely has to evaluate at
> least one log probability, and that's what this is a proxy
> for.

This is an artifact of the old implementation. It should not occur in 2.0.

Sean O'Riordain

Sep 17, 2013, 6:27:53 AM
to stan-...@googlegroups.com

I know nothing of this problem. However, I might check some intermediate values of N, plot time against N, and see how nonlinear it is.
Sean


Bob Carpenter

Sep 17, 2013, 9:44:31 AM
to stan-...@googlegroups.com


On 9/17/13 2:57 AM, Michael Betancourt wrote:
>> This is bad. Treedepth of 10 means the algorithm's
>> maxed out and is taking 1000 leapfrog steps in the NUTS
>> HMC simulation (each one involving a log probability and gradient
>> eval) per iteration.
>>
>> You also see a very tiny stepsize.
>>
>> This kind of thing can happen if the model's very tightly
>> constrained in some dimension. It usually signals either a very
>> complicated model or some issue with the model specification.
>>
>> Michael might have more advice here on how to diagnose what's
>> going on.
>
>
> This is very, very bad. I wouldn't trust any of your results in the slightest.

This used to seem very mysterious to me, so let me try
to elaborate.

The basic idea is that if each iteration doesn't go all
the way out to a U-turn, then the sampler reverts to a random
walk. Random-walk Metropolis mixes very slowly for most
complicated models.


> At the very least you'll have to expand the maximum tree depth until you're
> no longer saturating the maximum.

That should be more efficient than the random walk you get from
truncating.

We had a long discussion up front about whether to truncate the
treedepth or not, with Andrew being pro, Matt being con, and the
rest of us shrugging our shoulders, not being able to decide whether
random-walk Metropolis or an effectively infinite loop was better.

> Also you'll want to look at the elements
> of the mass matrix --

This is the varying step size part. It gives you the step size
multiplier for each dimension.

> if the step size is really small and you're hitting the
> max tree depth then there's either (a) huge ill-conditioning that the

Ill conditioning means the sampler is much more constrained in
one direction compared to others.

> adaptation is taking too long to find or (b) huge off-diagonal elements
> for which the default NUTS algorithm is not particularly suited.
>
> I'm going to go folk theorem here -- simplify your model until you have
> something that fits decently and then add the complex bits until you
> start to identify where the fits go crazy.

Andrew's folk theorem is that if you have computational problems,
it could very well signal a problem with the model, not the
computation. (He developed his "theorem" before Stan -- it wasn't
a retrofitted excuse.)

- Bob

Ross Boylan

Sep 17, 2013, 1:26:23 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
Thanks, everyone, for the feedback.

On Tue, Sep 17, 2013 at 07:57:10AM +0100, Michael Betancourt wrote:
> > This is bad. Treedepth of 10 means the algorithm's
> > maxed out and is taking 1000 leapfrog steps in the NUTS
> > HMC simulation (each one involving a log probability and gradient
> > eval) per iteration.
> >
> > You also see a very tiny stepsize.
> >
> > This kind of thing can happen if the model's very tightly
> > constrained in some dimension. It usually signals either a very
> > complicated model or some issue with the model specification.
> >
> > Michael might have more advice here on how to diagnose what's
> > going on.
>
>
> This is very, very bad. I wouldn't trust any of your results in the slightest.

Bummer.

Some of the parameter estimates did appear to have the wrong sign (not
only vs theory but vs the trial run), though the model has enough
interactions that reading off individual coefficients is hazardous.

>
> At the very least you'll have to expand the maximum tree depth until you're
> no longer saturating the maximum.

Does that require modifying the stan source? At the moment that's for
future reference, since I'm more inclined to try to figure out and fix
why it's misbehaving.

> Also you'll want to look at the elements
> of the mass matrix -- if the step size is really small and you're hitting the
> max tree depth then there's either (a) huge ill-conditioning that the
> adaptation is taking too long to find or (b) huge off-diagonal elements
> for which the default NUTS algorithm is not particularly suited.

I thought the adaptation on the mass matrix (just the diagonal) would
prevent situations in which the step size was limited by one
particular parameter.

If step size is x and there are k leapfrog steps, does that mean each
individual leapfrog step is size x or x/k? How does the mass modify
that?

Would it be worth trying allowing non-zero off-diagonal mass elements?

>
> I'm going to go folk theorem here -- simplify your model until you have
> something that fits decently and then add the complex bits until you
> start to identify where the fits go crazy.
>

The current model has interactions with respondent race (3 categories)
for every term. The correlations between many terms are substantial
as a result. Is there a chance that changing the basis of the
contrasts could help?

R's model.matrix seems to have produced data that do not have
redundant columns; the interaction terms only involve 2 of the 3
groups.

There are 2 other peculiarities of the data. Many clusters are
singletons. And respondent race is correlated with cluster number.

The only term that varies within cluster is the outcome, partner race.

>
> >>> The accept_stat is the number of states in the Hamiltonian simulation that
> >>> were above the slice sampling threshold. This is the stat that's tuned
> >>> during warmup the same way that a Metropolis acceptance rate might be tuned.
> >>> The NUTS paper has more info on how the slice sampling works.
> >>>
> >>> The -1 tree depth is explained by the accept stat of 0.0. It tried to
> >>> take a step, but that step was rejected, so the sampler quit and tried again.
> >>> (Maybe -1 isn't the clearest thing to report here!)
> >> Not sure what my sum(2^treedepth) should do with the negative
> >> numbers--use the most recent non-negative value?
> >
> > I'd treat it as 0. It definitely has to evaluate at
> > least one log probability, and that's what this is a proxy
> > for.
>
> This is an artifact of the old implementation. It should not occur in 2.0.

I'm working off the development version of rstan. Is that 2.0 enough?

Ross

Michael Betancourt

Sep 17, 2013, 1:43:40 PM
to stan-...@googlegroups.com
> Does that require modifying the stan source? At the moment that's for
> future reference, since I'm more inclined to try to figure out and fix
> why it's misbehaving.

It's a configurable option, --max_treedepth=<int>, unless you're using the new argument structure,
in which case it's sample algorithm=hmc engine=nuts max_treedepth=<int>

> I thought the adaptation on the mass matrix (just the diagonal) would
> prevent situations in which the step size was limited by one
> particular parameter.

That assumes the adaptation converged to the true variances. Models
that don't mix well stress the adaptation.

> If step size is x and there are k leapfrog steps, does that mean each
> individual leapfrog step is size x or x/k? How does the mass modify
> that?

Each leapfrog step is x, so the total time integrated is x * k. The mass matrix
scales the update of the parameters, so some parameters might move more
in each iteration while others might move less.
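A back-of-the-envelope sketch in R, using the step size and saturated tree depth reported earlier in this thread (NUTS takes up to 2^treedepth leapfrog steps per iteration):

```r
x <- 0.000760579  # adapted step size from one chain above
d <- 10           # saturated tree depth
k <- 2^d          # up to 1024 leapfrog steps per iteration
k * x             # total integration time per iteration, roughly 0.78
```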

> Would it be worth trying allowing non-zero off-diagonal mass elements?

My guess is that if you can't adapt to a diagonal mass matrix, you won't
have much luck with an off-diagonal mass matrix. This is where having a
simpler model and building up helps.

> The current model has interactions with respondent race (3 categories)
> for every term. The correlations between many terms are substantial
> as a result. Is there a chance that changing the basis of the
> contrasts could help?

Matt trick, Matt trick, Matt trick. See the manual under "Reparameterizations".

> I'm working off the development version of rstan. Is that 2.0 enough?

That should be enough. I think the only way that should happen is if the sampler bails upon
encountering a NaN.

Ross Boylan

Sep 17, 2013, 2:06:45 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Tue, Sep 17, 2013 at 01:45:09AM -0400, Bob Carpenter wrote:
>
>
> On 9/16/13 8:04 PM, Ross Boylan wrote:
>> On Mon, Sep 16, 2013 at 06:42:19PM -0400, Bob Carpenter wrote:
>>>
>>>
>>> On 9/16/13 6:08 PM, Ross Boylan wrote:
>>>> A stan model took a bit under 6 minutes to run 200 iterations but 2,000
>>>> iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
>>>> expected the big run to take about 10 times as long, i.e., 1 hour.
>>>
>>> Did adaptation wind up in the same place for both runs?
>>
>> What should I look at to determine the answer?
>
> In the command line version of RStan, it's in the .csv file printed
> out. For RStan, I'm not sure if it's accessible in RStan 1.3. It
> should be in RStan 2.0.

Is this ability in the current development code (develop branch) for rstan?

>
> You can also just look at the traceplots. If one chain had trouble
> adapting, it will be slow to converge.
>
>> I played around with
>> get_sampler_params and get_adaptation_info. For the former I was
>> stumped by the mismatch between the documentation and what I saw (as
>> described at the end of the original message)
>
> I tried in the last message to explain what was going on.
> I take it that didn't help?

It did, but I'm still not sure what I should be looking for.
Is sum(2^pmax(treedepth, 0)) informative?

>
>> and for the latter the
>> info was just a giant string.
>
> It's probably a CSV file. If you have a lot of parameters, it's
> going to be hard to tell, because there's a step size for each one.
Does that have the info that you mentioned earlier would be in the .csv file?

It starts
# Adaptation terminated\n# Step size = 0.000760579\n# Diagonal elements of inverse mass matrix:\n# 7.43178, 9.32494, 5.90983

>
>> It seemed in either case there was not one or two tuning parameters
>> but a whole bunch, and I am not sure how to compare them between
>> chains in some summary way.
>>
>> Hmm, maybe sum(2^treedepth), including the warmup period?
>> (treedepth from get_sampler_params--more below)
>>
>>>
....
>
> We really do want to get Stan ramped up for multi-threading.
> Our memory model will support it, but given that we can run in
> multiple processes and memory for data hasn't seemed to be an issue,
> we haven't gotten around to it.
>

I think R is not architected for threads, though if your C code uses
threads behind the scenes that's OK.


...

>> To an extent traceplots have the same problem as the evaluation of the
>> tuning parameters: too many parameters to take in easily.
>
> If adaptation is struggling, you usually see it on
> all of the parameters. Do:
>
> traceplot(fit,"lp__")
>
> for instance, and that's just the overall log probability. Usually
> that won't converge until the other parameters have converged (though
> it can, especially if the model's unidentifiable or only weakly identified
> by priors).

For the first chain traceplot is dominated by the first few steps
being way off. Tossing them to allow closer inspection suggests the
chain doesn't get in range til about iteration 500 (of 2000).

The value is not constant after that, but bounces around a
range--which I think is what is expected. It shows high
autocorrelation, which perhaps is not expected if the Hybrid MC is
working well?

The second chain settles down some after 500, but appears to be
trending down throughout. By my inexperienced eye it may not have
converged.

The third chain appears to have converged for iterations 400-1800, but
then seems to be heading up at the end.

All 3 have high autocorrelation.

Ross

Bob Carpenter

Sep 17, 2013, 3:27:46 PM
to stan-...@googlegroups.com
On 9/17/13 2:06 PM, Ross Boylan wrote:
> On Tue, Sep 17, 2013 at 01:45:09AM -0400, Bob Carpenter wrote:
>>
>>
>> On 9/16/13 8:04 PM, Ross Boylan wrote:
>>> On Mon, Sep 16, 2013 at 06:42:19PM -0400, Bob Carpenter wrote:
>>>>
>>>>
>>>> On 9/16/13 6:08 PM, Ross Boylan wrote:
>>>>> A stan model took a bit under 6 minutes to run 200 iterations but 2,000
>>>>> iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
>>>>> expected the big run to take about 10 times as long, i.e., 1 hour.
>>>>
>>>> Did adaptation wind up in the same place for both runs?
>>>
>>> What should I look at to determine the answer?
>>
>> In the command line version of RStan, it's in the .csv file printed
>> out. For RStan, I'm not sure if it's accessible in RStan 1.3. It
>> should be in RStan 2.0.
>
> Is this ability in the current development code (develop branch) for rstan?


It must be --- you're dumping it out below. :-)


>>
>> You can also just look at the traceplots. If one chain had trouble
>> adapting, it will be slow to converge.
>>
>>> I played around with
>>> get_sampler_params and get_adaptation_info. For the former I was
>>> stumped by the mismatch between the documentation and what I saw (as
>>> described at the end of the original message)
>>
>> I tried in the last message to explain what was going on.
>> I take it that didn't help?
>
> It did, but I'm still not sure what I should be looking for.
> Is sum(2^pmax(treedepth, 0)) informative?


In general, it's usually more informative to
look at a scatterplot of the tree depths than a summary
statistic. For simple models, depth is shallow. If
it's inconsistent across chains, there's probably an
issue with adaptation.
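
For example, with an rstan fit object r as in this thread (a sketch; this assumes get_sampler_params returns one matrix per chain with a treedepth__ column, as in the output quoted elsewhere in the thread):

```r
library(rstan)

sp <- get_sampler_params(r)  # one matrix per chain
depths <- lapply(sp, function(m) m[, "treedepth__"])

# Scatterplot of tree depth against iteration, one panel per chain.
par(mfrow = c(length(depths), 1))
for (i in seq_along(depths))
  plot(depths[[i]], pch = ".", xlab = "iteration",
       ylab = "treedepth", main = paste("chain", i))

# Rough proxy for total leapfrog work per chain (treating -1 as 0).
sapply(depths, function(d) sum(2^pmax(d, 0)))
```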


>>> and for the latter the
>>> info was just a giant string.
>>
>> It's probably a CSV file. If you have a lot of parameters, it's
>> going to be hard to tell, because there's a step size for each one.
> Does that have the info that you mentioned earlier would be in the .csv file?
>
> It starts
> # Adaptation terminated\n# Step size = 0.000760579\n# Diagonal elements of inverse mass matrix:\n# 7.43178, 9.32494, 5.90983


Right --- that's a small overall step size. What you want to do
is pull some of these out across chains and make sure they're
the same. As a start, make sure the chains all wind up with roughly
the same step size.
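
One way to pull the adapted step sizes out across chains (a sketch; it assumes get_adaptation_info returns one string per chain in the format quoted above, and parses it with a regular expression):

```r
info <- get_adaptation_info(r)
# Each element looks like "# Adaptation terminated\n# Step size = 0.000760579\n# ..."
stepsizes <- sapply(info, function(s)
  as.numeric(sub(".*Step size = ([0-9.eE+-]+).*", "\\1", s)))
stepsizes
max(stepsizes) / min(stepsizes)  # close to 1 means the chains adapted alike
```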


>>
>>> It seemed in either case there was not one or two tuning parameters
>>> but a whole bunch, and I am not sure how to compare them between
>>> chains in some summary way.
>>>
>>> Hmm, maybe sum(2^treedepth), including the warmup period?
>>> (treedepth from get_sampler_params--more below)
>>>
>>>>
> ....
>>
>> We really do want to get Stan ramped up for multi-threading.
>> Our memory model will support it, but given that we can run in
>> multiple processes and memory for data hasn't seemed to be an issue,
>> we haven't gotten around to it.
>>
>
> I think R is not architected for threads, though if your C code uses
> threads behind the scenes that's OK.


That's also my understanding. We also have use cases
outside of R.


> ...
>
>>> To an extent traceplots have the same problem as the evaluation of the
>>> tuning parameters: too many parameters to take in easily.
>>
>> If adaptation is struggling, you usually see it on
>> all of the parameters. Do:
>>
>> traceplot(fit,"lp__")
>>
>> for instance, and that's just the overall log probability. Usually
>> that won't converge until the other parameters have converged (though
>> it can, especially if the model's unidentifiable or only weakly identified
>> by priors).
>
> For the first chain traceplot is dominated by the first few steps
> being way off. Tossing them to allow closer inspection suggests the
> chain doesn't get in range til about iteration 500 (of 2000).
>
> The value is not constant after that, but bounces around a
> range--which I think is what is expected.

Right.

> It shows high
> autocorrelation, which perhaps is not expected if the Hybrid MC is
> working well?

Ideally, there's no autocorrelation and n_eff = number of iterations.
But that rarely happens. See below for a bit more on why.

> The second chain settles down some after 500, but appears to be
> trending down throughout. By my inexperienced eye it may not have
> converged.


> The third chain appears to have converged for iterations 400-1800, but
> then seems to be heading up at the end.


These are both bad and the reason Andrew wanted the split-R-hat
he and Kenny Shirley developed. If one chain is trending up and another
trending down during sampling, R-hat should not be good.

> All 3 have high autocorrelation.

Sometimes that's inevitable with HMC. HMC is rotation invariant,
but not scale invariant. The diagonal mass matrix (varying step sizes
in 1.3) just adjusts the scales per dimension. The result is then neither
technically rotation nor scale invariant. You need a full mass matrix
to deal with full generality.

But even that's still only a global adaptation. If you look at the funnel
example in the manual, you'll see a simple example of a hierarchical model where
the posterior scale varies dramatically by position. That we can't adapt
to.

Riemann-Manifold HMC will do everything here at the cost of some fairly
expensive matrix operations (solving, essentially) that apply to
a P x P matrix, where P is the number of parameters.

- Bob

Ross Boylan

Sep 17, 2013, 3:40:00 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Tue, Sep 17, 2013 at 03:27:46PM -0400, Bob Carpenter wrote:
> On 9/17/13 2:06 PM, Ross Boylan wrote:
>> On Tue, Sep 17, 2013 at 01:45:09AM -0400, Bob Carpenter wrote:
>>>
>>>
>>> On 9/16/13 8:04 PM, Ross Boylan wrote:
>>>> On Mon, Sep 16, 2013 at 06:42:19PM -0400, Bob Carpenter wrote:
>>>>>
>>>>>
>>>>> On 9/16/13 6:08 PM, Ross Boylan wrote:
...
[from get_adaptation_info]
>>
>> It starts
>> # Adaptation terminated\n# Step size = 0.000760579\n# Diagonal elements of inverse mass matrix:\n# 7.43178, 9.32494, 5.90983
>
>
> Right --- that's a small overall step size. What you want to do
> is pull some of these out across chains and make sure they're
> the same. As a start, make sure the chains all wind up with roughly
> the same step size.

The step sizes ended up with close to 3:2 variation; is that considered "roughly the same"?

Sizes were 0.0004937197, 0.0005428345, 0.000760579.

Is it reasonable to look at step size without considering mass?

Ross

Michael Betancourt

Sep 17, 2013, 4:59:45 PM
to stan-...@googlegroups.com
> The step sizes ended up with close to 3:2 variation; is that considered "roughly the same"?
>
> Sizes were 0.0004937197, 0.0005428345, 0.000760579.
>
> Is it reasonable to look at step size without considering mass?

Not really. But if the step sizes are this low AND the elements of the mass matrix are roughly the same then you must have huge correlations that a diagonal mass matrix will not help. If you're building a hierarchical model then you really should look at the Matt trick.

Ross Boylan

Sep 17, 2013, 5:45:01 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
I tried it, and am trying to interpret the diagnostics. It took about
twice as long as my non-trick version, though since the seed is random
(now fixed) that may not mean much. The step size was 5 to 7 x bigger
than any of the ones above, which I guess is progress. treedepth
still maxed out.
> tail(get_sampler_params(r)[[1]])
accept_stat__ stepsize__ treedepth__
[195,] 0.8915498 0.003433122 10
[196,] 0.8669619 0.003433122 10
[197,] 0.8636663 0.003433122 10
[198,] 0.9534177 0.003433122 10
[199,] 0.7526947 0.003433122 10
[200,] 0.6689675 0.003433122 10


I set some of the priors arbitrarily; perhaps I need to add them as
parameters? In particular, I simply set the beta sd to 5.

The estimated sd of the cluster parameters is tiny:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000394 0.0078710 0.0201600 0.0279500 0.0437700 0.1239000
which seems substantively implausible. OTOH there are 5 of these per
cluster (one for each possible categorical outcome) and without
priors they are only determined up to a constant.

Many clusters are singletons and almost all will lack one or more of
the partner race categories. There is some degeneracy lurking
there. A singleton is the simplest case: set alpha[k] = Infinity for k
the race of the partner, alpha[j] = -Infinity for j != k, and
you've fit the data perfectly. Maybe that's screwing things up.


data {
  int<lower=0> N;                  // number of partners
  int<lower=2> K;                  // number of types of partner race
                                   // K is the number of categories in the outcome pEthnic
  int<lower=0> J;                  // number of covariates
  int<lower=1,upper=K> pEthnic[N]; // partner race
  int<lower=0> NCLUSTER;           // number of respondents
  int iCluster[N];                 // cluster id
  row_vector[J] x[N];              // covariates
}
parameters {
  matrix[J, K] beta_raw;           // standard covariate coefficients
  vector[K] alpha_raw[NCLUSTER];   // cluster effects
  real<lower=0> sigma;
}
transformed parameters {
  matrix[J, K] beta;
  vector[K] alpha[NCLUSTER];
  for (ir in 1:J)
    beta[ir] <- beta_raw[ir] * 5;
  for (i in 1:NCLUSTER)
    alpha[i] <- alpha_raw[i] * sigma;
}
model {
  vector[K] beta_x;
  sigma ~ cauchy(0, 5);
  for (ir in 1:J)
    beta_raw[ir] ~ normal(0, 1);
  for (i in 1:NCLUSTER)
    alpha_raw[i] ~ normal(0, 1);
  for (n in 1:N) {
    beta_x <- (x[n] * beta)' + alpha[iCluster[n]];
    pEthnic[n] ~ categorical_logit(beta_x);
  }
}

Ross

Ross Boylan

Sep 17, 2013, 6:34:31 PM
to Ross Boylan, stan-...@googlegroups.com
I let sigma_beta be a parameter parallel to the existing sigma (which
is for alpha) and the number of leapfrog steps has finally come off
the bound.

> tail(get_sampler_params(rb)[[1]])
accept_stat__ stepsize__ treedepth__
[195,] 0.9814381 0.01364035 7
[196,] 0.5849223 0.01364035 7
[197,] 0.8015708 0.01364035 8
[198,] 0.9844606 0.01364035 7
[199,] 0.8069554 0.01364035 7
[200,] 0.4474741 0.01364035 8

Variances of alpha are much larger
> summary(x[[1]]) # sigma for alpha
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01375 1.43600 1.52000 1.42600 1.58500 1.71400
> summary(x[[2]])# sigma for beta
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09503 0.12450 0.14640 0.15080 0.18880 0.20500

Since alpha (cluster effects) and beta (other effects) are both
centered at 0, this suggests most of the beta are small. That's
plausible, even though we'd like some of them to matter!

Trying to think through why the sigma_beta had such an effect on the
sigma estimate, all I can come up with is that most of the beta in the
chain would have been relatively extreme values. Perhaps it's noise
that tends to drive other estimates toward 0?

Bob Carpenter

Sep 17, 2013, 8:13:48 PM
to stan-...@googlegroups.com
On 9/17/13 3:40 PM, Ross Boylan wrote:

> The step sizes ended up with close to 3:2 variation; is that considered "roughly the same"?

We don't try to get much closer than that. It's one of
those effort vs. utility tradeoffs.

> Sizes were 0.0004937197, 0.0005428345, 0.000760579.
>
> Is it reasonable to look at step size without considering mass?

There are really two issues here. I believe the mass matrix is
normalized, so that wildly different step sizes would signal
differing adaptations.

That's still true in 2.0, isn't it Michael?

- Bob

Bob Carpenter

Sep 17, 2013, 8:41:08 PM
to stan-...@googlegroups.com


On 9/17/13 5:45 PM, Ross Boylan wrote:
> On Tue, Sep 17, 2013 at 09:59:45PM +0100, Michael Betancourt wrote:
>>> The step sizes ended up with close to 3:2 variation; is that considered "roughly the same"?
>>>
>>> Sizes were 0.0004937197, 0.0005428345, 0.000760579.
>>>
>>> Is it reasonable to look at step size without considering mass?
>>
>> Not really. But if the step sizes are this low AND the elements of the mass matrix are roughly the same then you must have huge correlations that a diagonal mass matrix will not help. If you're building a hierarchical model then you really should look at the Matt trick.
>
> I tried it, and am trying to interpret the diagnostics. It took about
> twice as long as my non-trick version, though since the seed is random
> (now fixed) that may not mean much.

Is this measured in n_eff / time? We only really care about
the rate of producing effective samples.

Basic time per iteration should be about the same.

> The step size was 5 to 7 x bigger
> than any of the ones above, which I guess is progress. treedepth
> still maxed out.

You need to expand treedepth until it doesn't max out to investigate
what's going on. This is indicative that the HMC dynamics wasn't run
long enough. It may slow down iterations even further, but it's going
to be

>> tail(get_sampler_params(r)[[1]])
> accept_stat__ stepsize__ treedepth__
> [195,] 0.8915498 0.003433122 10
> [196,] 0.8669619 0.003433122 10
> [197,] 0.8636663 0.003433122 10
> [198,] 0.9534177 0.003433122 10
> [199,] 0.7526947 0.003433122 10
> [200,] 0.6689675 0.003433122 10
>
>
> I set some of the priors arbitrarily; perhaps I need to add them as
> parameters? In particular, I simply set the beta sd to 5.

It can make a difference if you have too wide a prior.

> The estimated sd of the cluster parameters is tiny:
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 0.0000394 0.0078710 0.0201600 0.0279500 0.0437700 0.1239000
> which seems substantively implausible. OTOH there are 5 of these per
> cluster (one for each possible categorical outcome) and without
> priors they are only determined up to a constant.

Is that the parameter group that has a prior sd of 5?

What this can mean is that if you have a hierarchical model, it
wants to fully pool because there's not enough data to separate
the entries.

> Many clusters are singletons and almost all will lack one or more of
> the partner race categories. There is some degeneracy lurking
> there. A singleton is the simplest case: set alpha[k] = Infinity for k
> the race of the partner, alpha[j] = -Infinity for j != k, and
> you've fit the data perfectly. Maybe that's screwing things up.

This is definitely an issue. If the model is separable in this
sense, then you need stronger priors to control the alpha value.

- Bob

>
> data {
>   int<lower=0> N;                  // number of partners
>   int<lower=2> K;                  // number of types of partner race
>                                    // K is the number of categories in the outcome pEthnic
>   int<lower=0> J;                  // number of covariates
>   int<lower=1,upper=K> pEthnic[N]; // partner race
>   int<lower=0> NCLUSTER;           // number of respondents
>   int iCluster[N];                 // cluster id
>   row_vector[J] x[N];              // covariates
> }
> parameters {
>   matrix[J, K] beta_raw;           // standard covariate coefficients
>   vector[K] alpha_raw[NCLUSTER];   // cluster effects
>   real<lower=0> sigma;
> }
> transformed parameters {
>   matrix[J, K] beta;
>   vector[K] alpha[NCLUSTER];
>   for (ir in 1:J)
>     beta[ir] <- beta_raw[ir] * 5;
>   for (i in 1:NCLUSTER)
>     alpha[i] <- alpha_raw[i] * sigma;
> }

You don't need to put these in the transformed
parameters block, which will cause them to be written out
each iteration. You can make them local variables in the model.

> model {
>   vector[K] beta_x;
>   sigma ~ cauchy(0, 5);
>   for (ir in 1:J)
>     beta_raw[ir] ~ normal(0, 1);
>   for (i in 1:NCLUSTER)
>     alpha_raw[i] ~ normal(0, 1);

These can be vectorized to just

beta_raw ~ normal(0,1);

and

alpha_raw ~ normal(0,1);

>   for (n in 1:N) {
>     beta_x <- (x[n] * beta)' + alpha[iCluster[n]];
>     pEthnic[n] ~ categorical_logit(beta_x);
>   }
> }

It won't matter for speed, but Stan lets you put that beta_x
expression inside categorical_logit (unlike BUGS, but like
JAGS in that respect, I believe).

- Bob

Ross Boylan

Sep 17, 2013, 9:35:26 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Tue, Sep 17, 2013 at 08:41:08PM -0400, Bob Carpenter wrote:
>
>
> On 9/17/13 5:45 PM, Ross Boylan wrote:
>> On Tue, Sep 17, 2013 at 09:59:45PM +0100, Michael Betancourt wrote:
>>>> The step sizes ended up with close to 3:2 variation; is that considered "roughly the same"?
>>>>
>>>> Sizes were 0.0004937197, 0.0005428345, 0.000760579.
>>>>
>>>> Is it reasonable to look at step size without considering mass?
>>>
>>> Not really. But if the step sizes are this low AND the elements of the mass matrix are roughly the same then you must have huge correlations that a diagonal mass matrix will not help. If you're building a hierarchical model then you really should look at the Matt trick.
>>
>> I tried it, and am trying to interpret the diagnostics. It took about
>> twice as long as my non-trick version, though since the seed is random
>> (now fixed) that may not mean much.
>
> Is this measured in n_eff / time? We only really care about
> the rate of producing effective samples.

Good point. I did not consider n_eff.

Since there's one n_eff per parameter some simplification is
necessary; do you think taking the min is reasonable?

I'm also slightly inhibited by the (relative) complexity of getting
n_eff, though I did figure out how to get it.
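
One way to reduce it to a single number (a sketch; this assumes summary() on a stanfit object returns a matrix with one row per parameter and an n_eff column, which may vary across rstan versions):

```r
s <- summary(r)$summary
n_eff <- s[, "n_eff"]
min(n_eff)  # the bottleneck parameter's effective sample size
# Divide by the run's elapsed time to compare n_eff / time across models.
```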

>
> Basic time per iteration should be about the same.

The same as what? Different chains should have about the same time?
Or the different models should run in about the same time?

>
>> The step size was 5 to 7 x bigger
>> than any of the ones above, which I guess is progress. treedepth
>> still maxed out.
>
> You need to expand treedepth until it doesn't max out to investigate
> what's going on. This is indicative that the HMC dynamics wasn't run
> long enough. It may slow down iterations even further, but it's going
> to be
>
>>> tail(get_sampler_params(r)[[1]])
>> accept_stat__ stepsize__ treedepth__
>> [195,] 0.8915498 0.003433122 10
>> [196,] 0.8669619 0.003433122 10
>> [197,] 0.8636663 0.003433122 10
>> [198,] 0.9534177 0.003433122 10
>> [199,] 0.7526947 0.003433122 10
>> [200,] 0.6689675 0.003433122 10
>>
>>
>> I set some of the priors arbitrarily; perhaps I need to add them as
>> parameters? In particular, I simply set the beta sd to 5.
>
> It can make a difference if you have too wide a prior.

Boy was that true. Things got much better when I estimated the sd of
beta, getting a very small number (described in a followup to the
message we're talking about here). The treedepth was 7 or 8 and the
stepsize was more like .01.

In my little 200 iteration trial the plot of treedepth vs iteration
looked a little funky. It bounced around a low number for a while,
then shifted to 10 for a good stretch, and then went down to a low
number before it got to the 100th iteration. After the burn-in, it
was almost always 7 or 8, which are values that occurred rarely during
burn-in.

That's one reason I figured I'd try 2000 iterations.

I was going to ask if treedepth < 10 is enough to declare victory, but
I think I already know the answer is no :( I'm running more chains
with more iterations now.

>
>> The estimated sd of the cluster parameters is tiny:
>> Min. 1st Qu. Median Mean 3rd Qu. Max.
>> 0.0000394 0.0078710 0.0201600 0.0279500 0.0437700 0.1239000
>> which seems substantively implausible. OTOH there are 5 of these per
>> cluster (one for each possible categorical outcome) and without
>> priors they are only determined up to a constant.
>
> Is that the parameter group that has a prior sd of 5?

No. These are for the cluster-specific effects, alpha. The
distribution above is the distribution of the estimated sigma
parameter for the alpha.

These alpha sd changed radically (to ~ 1.5) when I allowed the beta
parameters' sd to be estimated rather than assuming it was 5.

>
> What this can mean is that if you have a hierarchical model, it
> wants to fully pool because there's not enough data to separate
> the entries.
>
>> Many clusters are singletons and almost all will lack one or more of
>> the partner race categories. There is some degeneracy lurking
>> there. A singleton is the simplest case: set alpha[k] = Infinity for k
>> the race of the partner, alpha[j] = -Infinity for j != k, and
>> you've fit the data perfectly. Maybe that's screwing things up.
>
> This is definitely an issue. If the model is separable in this
> sense, then you need stronger priors to control the alpha value.

I'm not sure what you mean by stronger. I had explicit, proper, prior
distributions on the parameters. I guess a prior with a narrower
dispersion is stronger than one with a wider dispersion.

>
> - Bob
>
>>
>> data {
>>   int<lower=0> N;                  // number of partners
>>   int<lower=2> K;                  // number of types of partner race
>>                                    // K is the number of categories in the outcome pEthnic
>>   int<lower=0> J;                  // number of covariates
>>   int<lower=1,upper=K> pEthnic[N]; // partner race
>>   int<lower=0> NCLUSTER;           // number of respondents
>>   int iCluster[N];                 // cluster id
>>   row_vector[J] x[N];              // covariates
>> }
>> parameters {
>>   matrix[J, K] beta_raw;           // standard covariate coefficients
>>   vector[K] alpha_raw[NCLUSTER];   // cluster effects
>>   real<lower=0> sigma;
>> }
>> transformed parameters {
>>   matrix[J, K] beta;
>>   vector[K] alpha[NCLUSTER];
>>   for (ir in 1:J)
>>     beta[ir] <- beta_raw[ir] * 5;
>>   for (i in 1:NCLUSTER)
>>     alpha[i] <- alpha_raw[i] * sigma;
>> }
>
> You don't need to put these in the transformed
> parameters block, which will cause them to be written out
> each iteration. You can make them local variables in the model.

Just copying the manual! I know: written for clarity, not speed.

>
>> model {
>>   vector[K] beta_x;
>>   sigma ~ cauchy(0, 5);
>>   for (ir in 1:J)
>>     beta_raw[ir] ~ normal(0, 1);
>>   for (i in 1:NCLUSTER)
>>     alpha_raw[i] ~ normal(0, 1);
>
> These can be vectorized to just
>
> beta_raw ~ normal(0,1);
>
> and
>
> alpha_raw ~ normal(0,1);

Didn't we have this exchange before? beta_raw is a matrix and alpha
is an array of vectors, and so I have to subscript.

>
>>   for (n in 1:N) {
>>     beta_x <- (x[n] * beta)'+alpha[iCluster[n]];
>>     pEthnic[n] ~ categorical_logit(beta_x);
>>   }
>> }

So would this become (with the sigma_beta I introduced later)
beta_x <- (x[n] *beta_raw*sigma_beta)'+alpha[iCluster[n]]*sigma
?

>
> It won't matter for speed, but Stan lets you put that beta_x
> expression inside categorical_logit (unlike BUGS, but like
> JAGS in that respect, I believe).
Are you saying I could just as well skip beta_x and do
categorical_logit((x[n]*beta_raw*sigma_beta)'+alpha[iCluster[n]]*sigma)
?

I thought some earlier discussion said it was faster to compute it
separately, I guess because there's less need to allocate temporaries.

Ross

Michael Betancourt

Sep 18, 2013, 3:39:37 AM
to stan-...@googlegroups.com
> There are really two issues here. I believe the mass matrix is
> normalized, so that wildly different step sizes would signal
> differing adaptations.
>
> That's still true in 2.0, isn't it Michael?

Mass matrix isn't normalized anymore, but given that the mass matrix is targeting the global covariance it should be constant across chains. Same with the step size. Consistency is still a good diagnostic.

Bob Carpenter

Sep 18, 2013, 4:05:58 AM
to stan-...@googlegroups.com
I see --- it's identified, but not explicitly normalized.

I like that this gives the mass matrix an interpretation.

(And I think I've asked you this and you've answered before.)

- Bob