So, I did some very quick tests with a 15-D correlated Gaussian with unit variance and various degrees of correlation (up to rho = 0.99).
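For concreteness, here is a sketch of this kind of target in Python (an equicorrelated covariance, i.e. rho on all off-diagonals, is just one simple way to set it up; the exact structure is not essential to the point):

```python
import numpy as np

def make_target(D=15, rho=0.9):
    """Correlated Gaussian test target with unit variances.

    Uses an equicorrelated covariance (rho on all off-diagonals),
    which is one simple choice; positive-definite for rho in
    (-1/(D-1), 1), so rho = 0.99 is fine for D = 15.
    """
    cov = np.full((D, D), rho) + (1.0 - rho) * np.eye(D)
    prec = np.linalg.inv(cov)  # precision matrix

    def logpdf(x):
        # Unnormalized log-density is enough for MCMC.
        return -0.5 * x @ prec @ x

    def grad_logpdf(x):
        # Exact gradient, for comparison against numerical schemes.
        return -prec @ x

    return logpdf, grad_logpdf
```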
I compared three samplers: NUTS with numerical gradient, NUTS with exact gradient, and an adaptive slice sampling method that estimates the covariance matrix, and subsequently the slicing directions, from warmup samples (the method I would otherwise be using; so both NUTS and the slice sampler have an adaptive component).
Up to rho = 0.9, the numerical gradient does okay at least on this very simple case.
The number of effective samples (Neff) per "gradient evaluation" is comparable between the numerical and analytical gradients (perhaps with a tiny reduction in efficiency of the former, less than 5-10%).
Of course the numerical method takes a much larger number of function evaluations (2*D function evals per gradient evaluation, if using a central difference scheme).
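For reference, the central difference scheme I mean is the standard one, which costs exactly 2*D evaluations of the log-density per gradient:

```python
import numpy as np

def central_diff_grad(f, x, h=1e-5):
    """Central-difference gradient estimate: 2*D evaluations of f.

    Truncation error is O(h^2) per component, but in practice the
    achievable accuracy is limited by floating-point cancellation,
    which is part of the problem when HMC trajectories get long.
    """
    D = x.size
    g = np.empty(D)
    for i in range(D):
        e = np.zeros(D)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g
```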
However, performance of the numerical gradient drops for the largest value of rho that I tried (rho = 0.99); there is a difference of almost a factor of 10 in Neff per gradient evaluation between numerical and exact gradient.
For values of rho of 0.9 or less, the comparison between NUTS with numerical gradient and the slice sampler is strongly in favour of the slice sampler, which is up to 20 times better in terms of Neff per function evaluation. The advantage of the slice sampler becomes astronomical for rho = 0.99.
I tried using HMC (with dual-averaging) and numerical gradient, and it seems to perform better than NUTS with numerical gradient (I just picked a unit-length trajectory) -- this suggests that the trajectories in NUTS end up being too long and accumulate too much numerical error.
This was very quick and dirty -- I just wanted to get some ballpark.
Clearly this is not a fair comparison, since the adaptive slice sampler can quite easily estimate the shape of the covariance matrix.
NUTS would get a bigger advantage for complex density shapes. However, if the shape of the distribution is particularly complex, NUTS with numerical differentiation would also start to perform poorly.
Things can get better with better differentiation schemes, but they would require more function evaluations.
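For example, a fourth-order central difference scheme reduces the truncation error to O(h^4) per component, but doubles the cost to 4*D function evaluations per gradient -- a sketch:

```python
import numpy as np

def fourth_order_grad(f, x, h=1e-3):
    """Fourth-order central-difference gradient: 4*D evaluations of f.

    Standard five-point stencil (the center point cancels out):
    f'(x) ~ (f(x-2h) - 8 f(x-h) + 8 f(x+h) - f(x+2h)) / (12 h),
    with O(h^4) truncation error -- more accurate, but twice the
    cost of the plain central-difference scheme.
    """
    D = x.size
    g = np.empty(D)
    for i in range(D):
        e = np.zeros(D)
        e[i] = h
        g[i] = (f(x - 2*e) - 8*f(x - e) + 8*f(x + e) - f(x + 2*e)) / (12.0 * h)
    return g
```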
In conclusion, I think it might not be worth it for me to implement HMC/NUTS with naive numerical differentiation.
It seems that any gain might be swamped by the loss in efficiency due to accumulated numerical error and the increased cost in function evaluations to estimate the gradient numerically.
Of course there might be clever solutions to this problem, but I'd rather focus on other methods at this point -- coming back to HMC/NUTS when I can compute an exact gradient.
Best,
Luigi