On Mon, Sep 16, 2013 at 06:42:19PM -0400, Bob Carpenter wrote:
>
>
> On 9/16/13 6:08 PM, Ross Boylan wrote:
>> A stan model took a bit under 6 minutes to run 200 iterations but 2,000
>> iterations took almost 3 (2.9) hours. Since 2,000 = 10 x 200 I had
>> expected the big run to take about 10 times as long, i.e., 1 hour.
>
> Did adaptation wind up in the same place for both runs?
What should I look at to determine the answer? I played around with
get_sampler_params and get_adaptation_info. For the former I was
stumped by the mismatch between the documentation and what I saw (as
described at the end of the original message) and for the latter the
info was just a giant string.
In either case there seemed to be not one or two tuning parameters but
a whole bunch, and I'm not sure how to compare them across chains in
any summary way.
Hmm, maybe sum(2^treedepth), including the warmup period?
(treedepth from get_sampler_params--more below)
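Something like this R sketch is what I have in mind (assuming sflist is
the list of single-chain stanfit objects from the parallel runs, and
clamping any -1 treedepth entries to zero):

```r
# Total leapfrog work per chain, warmup included.
leapfrog_work <- sapply(sflist, function(fit) {
  sp <- get_sampler_params(fit, inc_warmup = TRUE)[[1]]
  sum(2 ^ pmax(sp[, "treedepth__"], 0))  # pmax clamps -1 (rejections) to 0
})
leapfrog_work  # one number per chain; larger means more gradient evaluations
```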
>
> Did you start with the same seed?
No. I forgot to set the seed in my trial run. My understanding is that
the use of chain_id in the parallel runs ensures the chains will differ.
>
> Was this in RStan or command line?
RStan.
>
> Were you using the default diagonal mass matrix (varying step sizes)?
Yes.
>
>> Can anyone help me understand why my expectation was violated?
>
> I would guess that it's because you wound up in different places
> from adaptation and that Stan was taking more leapfrog steps in
> the longer run. Answers to the above would clarify.
>
The results below are that treedepth after burn-in was the same for
each chain. That's what controls leapfrog steps, right?
>> The other difference is that the big run had 3 computations running in
>> parallel; I have 4 physical cores and wasn't doing anything else heavy.
>>
>> My first thought was that at least one of the 3 computations chose a
>> much different set of values for the tuning parameters than the trial
>> run. Is that a reasonable supposition?
>
> Yes, that's well within the variability we've seen. Depending
> on where the chains are initialized and what the random seed is,
> it can take more or less time to converge and simultaneously adapt
> during warmup. Usually once you get past warmup and have converged,
> sampling is pretty consistent. That is, most variability we've
> observed is during warmup.
I guess I'm learning the hard way why you said the "tune once, run
many chains" strategy is not advisable.
>
>> If so, what should I make of
>> the wide variability in times?
>
> That we need to do some more work to speed up adaptation and make
> it more consistent?
>
>> Another possibility is that the jobs interfered with each other. *If*
>> the user time is accurately summed from all threads, it is only about
>> 2/3 of what I would expect with 100% utilization on all 3 threads.
>
> Did you really mean different threads or is it different processes?
I think I misread "forking" in the mclapply documentation to mean
threads, but it forks processes.
> If you're using RStan's parallelism suggestions, it's different processes.
>
> If you're talking about RStan, then it's definitely the time for all 3.
> If you're going in parallel, each should report its own timing.
>
> The reason things can go faster is the same reason they can go slower.
> In 200 iteration runs, you spend 100 iterations warming up. Warmup is
> slower when it's far away from the posterior mass. Usually when Stan's
> converged and adapted, things speed up.
By that logic, wouldn't we expect that 10x more iterations would need
less than 10x the time? That's the opposite direction from the
timings I got.
> So it may take 100 iterations
> to converge, then everything would go much faster. You can look at
> traceplots to see how long convergence is taking in practice.
To an extent traceplots have the same problem as the evaluation of the
tuning parameters: too many parameters to take in easily.
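One way I could keep a traceplot digestible is to restrict it to a few
quantities; "lp__" (the log posterior) is always available, and anything
else named would just be a placeholder for whatever the model declares:

```r
# Trace just the log posterior, warmup included, for one chain's fit.
traceplot(sflist[[1]], pars = "lp__", inc_warmup = TRUE)
```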
>
>> Of
>> course, that could also be a sign that some finished well before the
>> others.
>
> Yes, that can happen if adaptation's much faster in one chain than
> the others.
>
>> In round numbers the user time from the parallel run was 20,000
>> seconds and the wall time was 10,000 seconds. If all 3 jobs used 100%
>> CPU the whole time it would be 30,000, not 20,000, seconds of total CPU
>> time. In contrast, scaling up the CPU time for the trial run (343s) by
>> 30 (3 chains x 10 times the iterations) gives about 10,000s, so the
>> total CPU time was "only" double what I would have expected.
Going with your statement about how the timing works in R (and
ignoring your statement below that you're not sure)...
The wall vs. cumulative CPU time seems to indicate that the chains
took much different times to complete: at least one chain took the
full wall-clock time (10k seconds), leaving only 10k CPU seconds for
the other two chains to split between them, e.g., 5k each. Even the
"fast" ones in that scenario are slower than the ~3.4k seconds one
gets by multiplying the trial-run time (343 s) by 10.
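Spelling that arithmetic out with the rounded numbers above:

```r
wall  <- 10000  # wall-clock seconds for the parallel run
user  <- 20000  # summed user CPU seconds reported
trial <- 343    # CPU seconds for the 200-iteration trial run

user - wall        # CPU seconds left for the other two chains: 10000
(user - wall) / 2  # ~5000 s each if split evenly
trial * 10         # naive 10x scaling of the trial run: 3430
```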
>> ‘get_sampler_params’ ‘signature(object = "stanfit")’: obtain the
>> parameters used for the sampler such as ‘stepsize’ and
>> ‘treedepth’. The results are returned as a list, each element
>> of which is an array for a chain. The array has number of
>> columns corresponding to the number of parameters used in the
>> sampler and its column names provide the parameter names.
>> Optional parameter ‘inc_warmup’ indicates whether to include
>> the warmup period.
>> The number of columns does not match the number of parameters; it's not
>> clear what the rows should be, but I seem to have gotten one row per
>> iteration. I assume that tuning parameters are not changed at each
>> iteration.
>
> They're not tuning parameters, per se. Tree depth (log base 2 of the
> number of leapfrog steps in the Hamiltonian simulations) is adapted on
> an iteration-by-iteration basis using NUTS.
>
> Stepsize is adapted during warmup, but unless you added variability on top of
> that, will stay the same after adaption.
>
I see that it does settle down.
> tail(x2[[1]])
accept_stat__ stepsize__ treedepth__
[1995,] 0.6362132 0.000760579 10
[1996,] 0.7824013 0.000760579 10
[1997,] 0.3782625 0.000760579 10
[1998,] 0.9890936 0.000760579 10
[1999,] 0.5919631 0.000760579 10
[2000,] 0.3530545 0.000760579 10
Here are the other chains:
> tail(get_sampler_params(sflist[[2]])[[1]])
accept_stat__ stepsize__ treedepth__
[1995,] 0.9715171 0.0005428345 10
[1996,] 0.9134190 0.0005428345 10
[1997,] 0.9028156 0.0005428345 10
[1998,] 0.9333573 0.0005428345 10
[1999,] 0.7910091 0.0005428345 10
[2000,] 0.5037319 0.0005428345 10
> tail(get_sampler_params(sflist[[3]])[[1]])
accept_stat__ stepsize__ treedepth__
[1995,] 0.6505754 0.0004937197 10
[1996,] 0.9521053 0.0004937197 10
[1997,] 0.4135974 0.0004937197 10
[1998,] 0.4073594 0.0004937197 10
[1999,] 0.1488046 0.0004937197 10
[2000,] 0.3109802 0.0004937197 10
So there's fairly wide variation in stepsize, but treedepth, which I
would expect to control how long the algorithm takes, is the same for
all.
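For a compact comparison, the post-warmup stepsize can be pulled from
each chain in one line (again assuming sflist holds the fits):

```r
# Last (post-warmup) stepsize per chain.
sapply(sflist, function(fit)
  tail(get_sampler_params(fit)[[1]][, "stepsize__"], 1))
```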
> The accept_stat is the number of states in the Hamiltonian simulation that
> were above the slice sampling threshold. This is the stat that's tuned
> during warmup the same way that a Metropolis acceptance rate might be tuned.
> The NUTS paper has more info on how the slice sampling works.
>
> The -1 tree depth is explained by the accept stat of 0.0. It tried to
> take a step, but that step was rejected, so the sampler quit and tried again.
> (Maybe -1 isn't the clearest thing to report here!)
Not sure what my sum(2^treedepth) should do with the negative
numbers--use the most recent non-negative value?
Is treedepth on iteration i the number of leapfrog steps used for
iteration i, or the number of leapfrog steps determined to be optimal
after iteration i, and used in iteration i+1?
Ross