diagonal adaptation failing


Matt Hoffman

May 4, 2013, 2:54:12 PM
to stan-dev
Hi stan-dev,

I'm rerunning some experiments from the NUTS paper, and discovered
that diagonal adaptation is failing quite badly in some cases on a
simple logistic regression. Here are minimum effective sample sizes
(obtained from bin/print) for 16 independent runs of 1K warmup/1K
sampling:

0
3
6
7
7
8
9
24
59
77
92
122
144
205
212
227

Averages are similar. Compare with the results for --equal_step_sizes:

6
35
39
43
49
59
88
103
122
147
158
162
167
172
203
217

I don't like that 6, but at least there's only one single-digit
result, compared with almost half when we use diagonal adaptation!

I checked out an old version from December, and that version of
diagonal adaptation produced much, much more sensible results:

153
198
198
204
221
224
224
228
231
248
250
253
265
286
300
307

So it seems like something's been broken in the new version of
diagonal adaptation. My question is: what's changed, and how are we
going to fix it?

Matt

Matt Hoffman

May 4, 2013, 2:56:18 PM
to stan-dev
PS: This highlights the need for standardized performance testing! We
can't just assume that an algorithmic change that seems to make sense
will actually produce good results.

Bob Carpenter

May 4, 2013, 4:17:13 PM
to stan...@googlegroups.com
Agreed. Michael --- do you have time to look
into this? I don't think we can release something
that's this much worse than our last version (assuming
Matt's results are reproducible).

I've also been very anxious about our master branch being
in a not-ready-to-release-anytime-soon state.

Maybe we could revert master back to 1.3, patch
the bugs we know about, release, and give Michael
all the time he needs to refactor and do speed
tests before refactoring everything. This'd probably
also put Metropolis on hold, not that anyone's waiting
for it.

- Bob

Ben Goodrich

May 4, 2013, 4:21:31 PM
to stan...@googlegroups.com, mdho...@cs.princeton.edu
On Saturday, May 4, 2013 2:54:12 PM UTC-4, Matt Hoffman wrote:
> So it seems like something's been broken in the new version of
> diagonal adaptation. My question is: what's changed, and how are we
> going to fix it?

If Mike doesn't know off the top of his head, we could do a git bisect. Essentially, what is needed is a script that returns 0 if the ESS is acceptable, 125 if the model can't be compiled or doesn't run, and something else if the ESS is unacceptable. Then git bisect run should be able to find the first broken commit in a few hours with no supervision.
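
For concreteness, the exit-code contract could be wrapped up like this (a minimal C++ sketch with placeholder thresholds; extracting the ESS from bin/print output is assumed to happen upstream in a wrapper):

#include <cstdlib>

// Hypothetical bisect checker: takes the min ESS as its argument,
// assumed extracted from bin/print by a wrapper script. For
// git bisect run, exit 0 = good commit, 1 = bad commit,
// 125 = skip this commit.
int main(int argc, char* argv[]) {
  if (argc < 2) return 125;      // model didn't compile or run: skip
  double ess = std::atof(argv[1]);
  if (ess >= 200.0) return 0;    // acceptable ESS: good
  if (ess <= 50.0) return 1;     // unacceptable ESS: bad
  return 125;                    // ambiguous: skip
}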

Ben

Ben Goodrich

May 4, 2013, 4:31:41 PM
to stan...@googlegroups.com
On Saturday, May 4, 2013 4:17:13 PM UTC-4, Bob Carpenter wrote:
> I've also been very anxious about our master branch being
> in a not-ready-to-release-anytime-soon state.
>
> Maybe we could revert master back to 1.3, patch
> the bugs we know about, release, and give Michael
> all the time he needs to refactor and do speed
> tests before refactoring everything. This'd probably
> also put Metropolis on hold, not that anyone's waiting
> for it.

Branching from v1.3.0 is easy, but cherry-picking so many commits from master is hard, so I think we should only do this if the problem with diagonal adaptation is really time consuming to identify and fix.

For a long time I've resisted suggesting we adopt the gitflow workflow:

http://nvie.com/posts/a-successful-git-branching-model/

The main criticism that I have heard of it is that it is just additional syntax and structure that only enforces what good developers would do anyway. But we've proven ourselves to be incapable of that on our own, so maybe gitflow should be considered.

Ben

Bob Carpenter

May 4, 2013, 4:39:50 PM
to stan...@googlegroups.com
Ironically, we did use that model this time. The problem
is that we didn't test the refactored branch enough
either for efficiency of sampling or for backward
compatibility before merging it into master.

I think most of the work since 1.3 has gone into
the refactoring, so I don't think it'd be that hard
to revert. I've been trying to keep my fingers out of
the code until it stabilizes.

Alternatively, we could start working on a branch from
1.3 and release from there and hope that we can stabilize
master enough to eventually release it.

I have zero experience in managing this kind of development.
So I really appreciate any suggestions!

- Bob

Ben Goodrich

May 4, 2013, 4:48:03 PM
to stan...@googlegroups.com
On Saturday, May 4, 2013 4:39:50 PM UTC-4, Bob Carpenter wrote:
> Ironically, we did use that model this time. The problem
> is that we didn't test the refactored branch enough
> either for efficiency of sampling or for backward
> compatibility before merging it into master.

Then we didn't follow the model, and at a minimum we didn't use the gitflow software extensions. Under the gitflow model, the master branch is always in a production-ready state.

> I think most of the work since 1.3 has gone into
> the refactoring, so I don't think it'd be that hard
> to revert. I've been trying to keep my fingers out of
> the code until it stabilizes.
>
> Alternatively, we could start working on a branch from
> 1.3 and release from there and hope that we can stabilize
> master enough to eventually release it.

The second way is easier, but still a last resort. Bisecting shouldn't be too bad, but it's hard to say how difficult the fix will be once the problem is found.

Ben

Daniel Lee

May 4, 2013, 4:56:29 PM
to stan...@googlegroups.com
I think this is related to the differences I'm seeing between Windows and Mac.

Matt, do you have access to a Windows machine? I'm curious if it works under Windows.





Michael Betancourt

May 4, 2013, 5:28:51 PM
to stan...@googlegroups.com
To be fair, the branch passed all unit tests before merging. Matt, is it the variances or the stepsize that's failing? I have seen cases where the dual averaging seems to fail randomly (the acceptance rate during sampling is way too low, but the variances are spot on) but I don't understand what's causing it. If we can assemble some problematic models I can look into it more.

Matt Hoffman

May 4, 2013, 5:30:24 PM
to stan-dev
I don't have easy access to a Windows machine; I've been running my
experiments under Linux.

I've attached the .stan and data files in case someone wants to try
to reproduce. Remember that it only fails sometimes.

Matt
logistic_regression.stan
germancredit.Rdump

Daniel Lee

May 4, 2013, 5:32:52 PM
to stan...@googlegroups.com
Can you take a look at some of the output files and give us a problematic seed? I can run it in Windows with that seed to see if we have differences.


Daniel

Matt Hoffman

May 4, 2013, 5:32:55 PM
to stan-dev
That's part of the problem: we don't have any unit tests that screen
for this sort of thing.

I don't know exactly what's causing the problem; I only know that it
goes away when I check out a version from December and that it's much
less severe when I turn off diagonal adaptation.

Matt Hoffman

May 4, 2013, 5:34:37 PM
to stan-dev
Sure. It'll take a minute though.

Daniel Lee

May 4, 2013, 5:37:24 PM
to stan...@googlegroups.com
No rush. I'm not sitting and waiting for it. I won't get to it until tomorrow.


Bob Carpenter

May 4, 2013, 5:58:18 PM
to stan...@googlegroups.com


On 5/4/13 5:28 PM, Michael Betancourt wrote:
> To be fair, the branch passed all unit tests before merging.

Right --- that's why I said it was ironic that this is the merge
that killed us, even though we followed the Git branching paradigm and
unit tested. But now that I see Ben mentioning "gitflow"
tools, I realize I don't even know what it means to follow the
proposed paradigm.

At root, I think the problem we had (and continue to have) are
caused by:

1. too few unit tests for the command line. What's there
now only weakly tests that the various options don't break
the command line, not that they do the right thing

2. no unit tests for the samplers themselves

3. no performance-based regression tests.

I believe (3) is what Matt was talking about, but we really need
to collectively address all these issues before we can stably
test new additions.

We could probably also do with some more written specs for
major changes like these, so everyone knows what's being proposed.
There was a bit of misunderstanding in rolling out the new changes
about what kind of backward compatibility was being assumed beyond
what was being tested by unit tests.

- Bob

Matt Hoffman

May 4, 2013, 6:06:50 PM
to stan-dev
Ok, here we go:
2024218701
2024256492
2024318542
are all pretty bad.

Also, Mike's right: the diagonal adaptation seems to be giving
basically reasonable results, insofar as it's roughly matching the
marginal variances. That leads me to a hypothesis, one that we've
discussed previously on this list.

The mass matrix is effectively adjusting the overall step size,
insofar as multiplying the mass matrix by a scalar is equivalent to
multiplying the step size by the square root of that scalar. This
wasn't a problem in the old implementation, where the overall scale of
the mass matrix was held constant. Here it is a problem, because every
change to the mass matrix's overall scale is effectively a shock to
the step size adaptation problem that the dual averaging algorithm is
trying to solve. And if there's one final change at the end of the
adaptation period, then the step size adaptation never has a chance to
recover.
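
To spell out the equivalence (a sketch, assuming the adapted diagonal
"mass matrix" is the inverse metric $M^{-1}$, i.e. the estimated
variances): the leapfrog position update is

$$q \leftarrow q + \epsilon\, M^{-1} p, \qquad p \sim \mathcal{N}(0, M),$$

so substituting $M^{-1} \to c\, M^{-1}$ rescales the momenta as
$p \to p/\sqrt{c}$ in distribution and turns the update into

$$q \leftarrow q + \epsilon\,(c\, M^{-1})\,\frac{p}{\sqrt{c}} = q + (\epsilon\sqrt{c})\, M^{-1} p.$$

The momentum half-steps match under the same rescaling, so the whole
trajectory is exactly what you'd get from step size $\epsilon\sqrt{c}$
with the original matrix.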

One of two things needs to happen to remedy this: either the mass
matrix adaptation has to keep its overall scale constant, or the step
size returned by the step size adaptation has to be adjusted to
account for the overall scale of the mass matrix.

Also, I'm not sure that the refactored stepsize_adaptation.hpp
correctly implements the dual averaging algorithm. The new
implementation seems to be returning the averaged iterates ("xbar" in
the old implementation) instead of the iterates ("xk") themselves. The
way it's supposed to work is that during adaptation a bunch of
iterates are generated. Then, once adaptation's over, we use a
weighted average of that sequence of iterates. Using the average
prematurely undermines the algorithm's proof of correctness, I think.
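
For reference, the dual averaging is supposed to look roughly like this
(a minimal C++ sketch following the NUTS paper's notation, not the
refactored stepsize_adaptation.hpp; the point is which iterate gets
used when):

#include <cmath>

// Dual-averaging stepsize adaptation, sketched per the NUTS paper.
struct DualAverage {
  double mu;                // shrinkage target, e.g. log(10 * initial_eps)
  double delta;             // target acceptance statistic
  double gamma, t0, kappa;  // tuning constants (0.05, 10, 0.75 in the paper)
  double Hbar, log_eps, log_eps_bar;
  int t;

  DualAverage(double mu_, double delta_ = 0.8)
    : mu(mu_), delta(delta_), gamma(0.05), t0(10), kappa(0.75),
      Hbar(0), log_eps(0), log_eps_bar(0), t(0) {}

  // One warmup iteration: returns the stepsize for the next iteration.
  // This is exp(x_t), the raw iterate -- NOT the running average.
  double update(double accept_stat) {
    ++t;
    Hbar += ((delta - accept_stat) - Hbar) / (t + t0);
    log_eps = mu - std::sqrt(static_cast<double>(t)) / gamma * Hbar;
    double eta = std::pow(static_cast<double>(t), -kappa);
    log_eps_bar = eta * log_eps + (1 - eta) * log_eps_bar;
    return std::exp(log_eps);
  }

  // Once warmup ends, the weighted average xbar is what gets frozen in.
  double final_stepsize() const { return std::exp(log_eps_bar); }
};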

Matt

Bob Carpenter

May 4, 2013, 6:56:12 PM
to stan...@googlegroups.com
On 5/4/13 6:06 PM, Matt Hoffman wrote:
> Ok, here we go:
> 2024218701
> 2024256492
> 2024318542
> are all pretty bad.
>
> Also, Mike's right, the diagonal adaptation seems to be giving
> basically reasonable results, insofar as it's roughly matching the
> marginal variances. Which leads me to a hypothesis, one that we've
> discussed previously on this list.
>
> The mass matrix is effectively adjusting the overall step size,
> insofar as multiplying the mass matrix by a scalar is equivalent to
> multiplying the step size by the square root of that scalar. This
> wasn't a problem in the old implementation, where the overall scale of
> the mass matrix was held constant.

Thanks -- this is what I was struggling to articulate recently.

> Here it is a problem, because every
> change to the mass matrix's overall scale is effectively a shock to
> the step size adaptation problem that the dual averaging algorithm is
> trying to solve. And if there's one final change at the end of the
> adaptation period, then the step size adaptation never has a chance to
> recover.

Does the scale change that much during adaptation? And won't it
stabilize at some point as the warmup iterations converge?

> One of two things needs to happen to remedy this: either the mass
> matrix adaptation has to keep its overall scale constant, or the step
> size returned by the step size adaptation has to be adjusted to
> account for the overall scale of the mass matrix.

How do you measure the overall scale of a diagonal or full mass
matrix?

I thought previously that the varying step sizes (~ diagonal mass)
were standardized to unit size (can't recall if it was L1 or L2).
So for L1, I guess that'd be the trace if I could generalize from
the diagonal to full mass matrices.

- Bob

Ben Goodrich

May 4, 2013, 7:48:53 PM
to stan...@googlegroups.com, mdho...@cs.princeton.edu
On Saturday, May 4, 2013 6:06:50 PM UTC-4, Matt Hoffman wrote:
> Ok, here we go:
> 2024218701
> 2024256492
> 2024318542
> are all pretty bad.

Attached is a (possibly unportable) script that tests whether the average ESS --- excluding _stepsize --- is below 50 (bad) or above 200 (good), using 7 chains and seed 2024218701, so that one can do

git checkout master
git bisect start
git bisect bad
git bisect good v1.2.0
git bisect run ./test.sh

Let me know if you think it should be tweaked or tweak it yourself and do the git bisect stuff (you will also need to change some paths in the script).

Ben

test.sh

Ben Goodrich

May 4, 2013, 7:50:49 PM
to stan...@googlegroups.com, mdho...@cs.princeton.edu
On Saturday, May 4, 2013 7:48:53 PM UTC-4, Ben Goodrich wrote:
> On Saturday, May 4, 2013 6:06:50 PM UTC-4, Matt Hoffman wrote:
>> Ok, here we go:
>> 2024218701
>> 2024256492
>> 2024318542
>> are all pretty bad.
>
> Attached is a (possibly unportable) script that tests whether the average ESS --- excluding _stepsize --- is below 50 (bad) or above 200 (good), using 7 chains and seed 2024218701

Average ESS values between 50 and 200 are considered ambiguous, so the script exits with the magic 125 code.

Ben

Ben Goodrich

May 4, 2013, 8:41:30 PM
to stan...@googlegroups.com, mdho...@cs.princeton.edu
On Saturday, May 4, 2013 2:54:12 PM UTC-4, Matt Hoffman wrote:
> I'm rerunning some experiments from the NUTS paper, and discovered
> that diagonal adaptation is failing quite badly in some cases on a
> simple logistic regression. Here are minimum effective sample sizes
> (obtained from bin/print) for 16 independent runs of 1K warmup/1K
> sampling:

Argh, we have a separate bug on master where bin/print does not consistently get the spacing right between n_eff and Rhat, as in

goodrich@CYBERPOWERPC:/opt/stan$ ./bin/print /tmp/SU/MH/samples*.csv
Inference for Stan model:
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

              mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff   Rhat
beta[1]        1.3     0.0  0.2    0.9    1.2    1.3    1.4    1.7    711.1e+00
beta[2]        0.5     0.0  0.1    0.2    0.4    0.5    0.5    0.7    361.1e+00
...

This is preventing my script from parsing the effective sample size correctly when it git bisects. Does anyone know anything about this?

Ben

Ben Goodrich

May 4, 2013, 8:57:42 PM
to stan...@googlegroups.com, mdho...@cs.princeton.edu

Okay, attached is a modified version of the script to pass to git bisect run that:

-- Cleans up more
-- Only averages the ESS of the betas
-- Only does 1 chain to work around the print bug
-- Changes the thresholds for failure and success to 20 and 50 respectively.

Ben

test.sh

Matt Hoffman

May 4, 2013, 9:22:06 PM
to stan-dev
> Does the scale change that much during adaptation? And won't it
> stabilize at some point as the warmup iterations converge?

Yes, it'll stabilize eventually. But we want it to stabilize quickly.
And it'll stabilize more slowly if the warmup iterations are going
badly, which they will if the adaptation is bad.

> How do you measure the overall scale of a diagonal or full mass
> matrix?
>
> I thought previously that the varying step sizes (~ diagonal mass)
> were standardized to unit size (can't recall if it was L1 or L2).
> So for L1, I guess that'd be the trace if I could generalize from
> the diagonal to full mass matrices.

There's no single right way to do it. I think I was using L1 before,
keeping the average per-variable step size equal to 1. We should
definitely be normalizing for the number of parameters.
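
Something like this for the L1 version (a sketch of the convention I
mean; the exact scaling choice is an assumption):

#include <numeric>
#include <vector>

// Sketch of L1-style normalization: rescale the diagonal inverse mass
// matrix so its entries average to 1 (equivalently, its trace equals
// the number of parameters). Whatever overall scale is removed here is
// exactly the scale the stepsize adaptation would otherwise chase.
void normalize_diag(std::vector<double>& m_inv) {
  double mean = std::accumulate(m_inv.begin(), m_inv.end(), 0.0)
                / m_inv.size();
  for (double& v : m_inv)
    v /= mean;
}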

Matt

Ben Goodrich

May 4, 2013, 9:42:36 PM
to stan...@googlegroups.com, mdho...@cs.princeton.edu
On Saturday, May 4, 2013 8:57:42 PM UTC-4, Ben Goodrich wrote:
> Okay, attached is a modified version of the script to pass to git bisect run that:
>
> -- Cleans up more
> -- Only averages the ESS of the betas
> -- Only does 1 chain to work around the print bug
> -- Changes the thresholds for failure and success to 20 and 50 respectively.

And git bisect points to

https://github.com/stan-dev/stan/commit/30c7e04cd2973c409a4d187ee60f393b40adccdb

30c7e04cd2973c409a4d187ee60f393b40adccdb is the first bad commit
commit 30c7e04cd2973c409a4d187ee60f393b40adccdb
Author: Michael Betancourt <betanalpha@gmail.com>
Date:   Sun Apr 28 17:36:42 2013 +0100

    Snap final var/covar adaptation window to warmup period

:040000 040000 8a68d5f951ab3702fa3204876af775a0f8af85bc 0ddba1a8e57542dbbf04d350372c30606b5bc1ee M      src
bisect run success
goodrich@CYBERPOWERPC:/opt/stan$ git bisect log
git bisect start
# bad: [e3a6abd21814a28b95f74e293bb2255dea4bd1a3] shuffle function_signatures*.stan in test-unit
git bisect bad e3a6abd21814a28b95f74e293bb2255dea4bd1a3
# good: [1cde932d15a348df17e9dafce17a79299b9d8a2d] upgrade rstan verion to 1.2.0
git bisect good 1cde932d15a348df17e9dafce17a79299b9d8a2d
# good: [e8cc58cf487193b21f2235ee9570bc92accc46c1] updating release notes with proper dates: 2013 instead of 2012
git bisect good e8cc58cf487193b21f2235ee9570bc92accc46c1
# good: [fa4764f9c1300e25aabff02cde2082b66a66aaa9] fixed inv_cloglog in fvar with test
git bisect good fa4764f9c1300e25aabff02cde2082b66a66aaa9
# good: [2514d39580b0f88123526636855f7e84802f96d8] added transpose_test for fvar/matrix
git bisect good 2514d39580b0f88123526636855f7e84802f96d8
# good: [1611ac0515dd73d3332755c58ccd861b9c62d357] rstan, add option of delta and gamma
git bisect good 1611ac0515dd73d3332755c58ccd861b9c62d357
# bad: [f524331401dfa8895f63b9fa021f6746d566c8af] More detailed output during BFGS.
git bisect bad f524331401dfa8895f63b9fa021f6746d566c8af
# good: [9ef525b3cbad99d383185ada191235dc1013a7a6] fixed type inference problem for int with multiply; need to do elsewhere
git bisect good 9ef525b3cbad99d383185ada191235dc1013a7a6
# bad: [81803474d0b84d976f3ffb7410bd1bcd817daed1] Revert to quadratic interpolation when selecting initial step-size.
git bisect bad 81803474d0b84d976f3ffb7410bd1bcd817daed1
# bad: [30c7e04cd2973c409a4d187ee60f393b40adccdb] Snap final var/covar adaptation window to warmup period
git bisect bad 30c7e04cd2973c409a4d187ee60f393b40adccdb
# good: [c422874ffff99394953ed479a2fbd3d4517d0fc5] more block operations, patch in block to return row vec
git bisect good c422874ffff99394953ed479a2fbd3d4517d0fc5
# good: [fcdf6b4be181a7e49f334fe903e84c3ae84dd25d] Woops, have to actually implement a destructor
git bisect good fcdf6b4be181a7e49f334fe903e84c3ae84dd25d


Ben

Bob Carpenter

May 5, 2013, 1:47:50 PM
to stan...@googlegroups.com
Thanks for pointing this out. Daniel already knows about
the issue and was looking into a fix. I'm happy to
help out if I can.

- Bob

Bob Carpenter

May 5, 2013, 1:52:40 PM
to stan...@googlegroups.com
Awesome. This seems like a likely culprit.

- Bob

Michael Betancourt

May 5, 2013, 2:14:33 PM
to stan...@googlegroups.com
Yeah --- during the refactor I made sure to account for the issue during adaptation, but I completely forgot about it after adaptation! The fix is simple and works on my local machine, and I'll push as soon as I get into work tomorrow (no Internet at home for a few weeks).

Matt was also correct in that the dual averaging is slightly off. It works fine in practice, but to be technically correct I'll fix that as well.

Michael Betancourt

May 6, 2013, 6:09:10 AM
to stan...@googlegroups.com
I pushed a fix -- now the adaptation spends the last bit of warmup, max(10% of total iterations, 100 iterations), adapting the step size. I also fixed the minor misimplementation of the dual averaging. Everything looks great on the few examples I ran (including logistic regression with the German Credit data for all three seeds Matt posted) -- let me know if there are any remaining performance issues.
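
The terminal window computation amounts to something like this (a
sketch, not the committed code):

#include <algorithm>

// Sketch of the terminal stepsize-only window described above: the
// last max(10% of warmup, 100) iterations adapt only the stepsize.
unsigned int terminal_window(unsigned int num_warmup) {
  unsigned int ten_percent = num_warmup / 10;
  return std::max(ten_percent, 100u);
}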

-Mike