Rosenbrock test function with NUTS


Nigel/essence

Sep 2, 2012, 9:02:47 AM
to stan...@googlegroups.com
Has anybody tried NUTS with standard optimisation test functions such as Rosenbrock?

I am doing some tests, using these functions as the negative log likelihood.

Preliminary tests, with N = 16 and without any mass weighting, show that DE methods are vastly superior to NUTS, and indeed that a random walk is better than NUTS, but this is expected to change as soon as I implement the exact Hessian weights in NUTS.

My evaluation is very simplistic - whether the sampler gets anywhere near the minimum in a few million function/gradient evals.

As a cross check against my Java code, I will also try the original Matlab NUTS code on this example.

I'm not sure whether there are analytical integrations for these functions against which MCMC-sampled CDFs can be compared.

I guess my question is - how many test functions has NUTS been evaluated against? In the optimisation world, there is a standard set of test problems, but I haven't seen a similar set for MCMC methods - it might be a useful exercise to generate a standard set so the community can properly evaluate different approaches.

Not very exciting work, but extremely valuable, and IMHO a necessary condition for a mature technology.

I will shortly test NUTS with the exact Rosenbrock Hessian and report back, and then some time soon try some BFGS methods for comparison.

I also need to examine whether my actual problem has an analytical Hessian - it probably does, but has some lengthy algebra to evaluate, and I am sure it is not positive definite everywhere - but maybe I can see a reasonable analytic approximation which is.

Ben Goodrich

Sep 2, 2012, 12:11:02 PM
to stan...@googlegroups.com
On Sunday, September 2, 2012 9:02:47 AM UTC-4, Nigel/essence wrote:
Has anybody tried NUTS with standard optimisation test functions such as Rosenbrock?

I am doing some tests, using these functions as the negative log likelihood.

Stan includes many tests, although not this one. One difficulty with subjecting Stan (or any NUTS algorithm) to some difficult minimization problem is that you have to ensure that the "posterior" is a proper distribution. In many situations, the objective function would not integrate to a finite number in the typical case where each parameter can take any value on the real line. So, I would think that a necessary condition for such a test is that you would have to bound the parameters to [-100,100] or some other wide interval that includes the optimum.
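In Python-style pseudocode, the truncation might look like this (a sketch, not Stan code; the two-parameter calls and the bound of 100 are just the illustrative numbers from above):

```python
import math

def neg_rosenbrock(theta):
    """Negative Rosenbrock function, treated here as a log density."""
    return -sum((1.0 - theta[i]) ** 2 + 100.0 * (theta[i + 1] - theta[i] ** 2) ** 2
                for i in range(len(theta) - 1))

def log_posterior(theta, bound=100.0):
    """Truncate to the box [-bound, bound]^D so the 'posterior' is proper."""
    if any(abs(t) > bound for t in theta):
        return -math.inf  # outside the box: zero posterior mass
    return neg_rosenbrock(theta)
```

Outside the box the log density is minus infinity, so the sampler simply rejects; inside, it is the usual exponentiated Rosenbrock.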

The original plan for Stan was to include support for optimization using automatic differentiation to obtain the gradient and/or Hessian in situations where one or both is not easy to calculate analytically. Indeed, optimization is mentioned in TO-DO.txt. I don't know that optimization algorithms are a top priority for any of us at the moment, but if someone else wanted to push forward with them, great. Although I would think we should start with a design for an optimization module and then implement the textbook optimization algorithms that wouldn't work well on Rosenbrock's function.

Ben

Bob Carpenter

Sep 2, 2012, 1:28:43 PM
to stan...@googlegroups.com
Speaking of Rosenbrock, I'm going to implement
Vrugt et al.'s DREAM:

http://jasper.eng.uci.edu/pdf/47.pdf

It should handle the Rosenbrock reasonably well.
It's a multi-point method that's related to
Nelder-Mead optimization (an analogy brought up
by Goodman and Weare in their ensemble sampling
paper, which is very very closely related
to DREAM's precursor, which the above paper
calls DE-MC).

The DREAM algorithm's very straightforward
and it doesn't require gradients. It will
require

- templating the log prob function in the
model for efficiency, and

- coming up with a multi_sampler and
adaptive_multi_sampler interface.

I'm thinking something simple for the latter following
adaptive_sampler, but with a method signature

void multi_sample(vector<double>& thetas);

I could probably drop the "multi_" as it's implicit
in the argument.

Speaking of multi-point samplers, does anyone know how
to deal with ESS calcs for something like DREAM
where the different chains are correlated? There's
some discussion in Goodman and Weare's paper:

http://msp.berkeley.edu/camcos/2010/5-1/camcos-v5-n1-p04-p.pdf

Also, does anyone know if there's something I should
read about implementation or parameters, or are
the defaults in that paper still standard operating
procedure?

- Bob

Bob Carpenter

Sep 2, 2012, 1:47:49 PM
to stan...@googlegroups.com


On 9/2/12 9:02 AM, Nigel/essence wrote:
> Has anybody tried NUTS with standard optimisation test functions such as Rosenbrock?
>
> Preliminary tests, with N = 16, and without any mass weight,
> shows that DE methods are vastly superior to NUTS, and indeed random walk
> is better than NUTS, but this is expected to change as soon as I implement the exact Hessian weights in NUTS.

I'm not surprised. I responded about DREAM and its
predecessor DE. In short, we can add DREAM to Stan.

Stan's basic modeling and sampling framework was intended
to be flexible enough to let us play with different
samplers. We started with HMC and NUTS, but we never
assumed those would be the final word in sampling!

We've talked about adding auto-dif for Hessians, but
the main drawback's scalability.

> My evaluation is very simplistic - whether the sampler gets anywhere near the minimum in a few million function/gradient evals.
>
> As a cross check against my Java code, I will also try the original Matlab NUTS code on this example.
>
> I'm not sure whether there are analytical integations for these functions, against which MCMC sampled cdf's can be compared.
>
> I guess my question is - how many test functions has NUTS been evaluated against? In the optimisation,
> world, there is a standard set of test problems, but I haven't seen a similar set for MCMC methods -

We implemented and tested almost all of the BUGS
models, which are a kind of MCMC de facto test
standard.

> might be a useful exercise to generate a standard set so the community can properly evaluate different approaches.
>
> Not very exciting work, but extremely valuable, and IMHO a necessary condition for a mature technology.

We completely agree. Building out the model set is one of
the deliverables for our NSF grant for Stan, so we plan to
do much more of it beyond the basic BUGS models. There are
entire textbooks devoted to BUGS examples that we can port.

We're mostly focused on practical statistical models we
care about involving multilevel regressions, particularly
ones requiring multivariate priors.

I also think it helps to have toy test problems like the Rosenbrock
to understand the theoretical properties of an algorithm.

Other simple cases to look at in addition to banana-shaped
densities like the Rosenbrock are thin and thick tails, as
well as the funnel-shaped distributions that approximate what
happens in some multilevel models.

- Bob

Andrew Gelman

Sep 2, 2012, 4:06:36 PM
to stan...@googlegroups.com
I recommend global replication: just replicate the entire Dream thing 2 or 4 times to get convergence monitoring. This will require a small bit of extra overhead (as there will be #chains per Dream and #Dreams) but I think it will be well worth it.
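The between/within variance calculation being invoked here is the standard Gelman-Rubin R-hat; a minimal scalar sketch (assuming K equal-length chains of scalar draws; this is an illustration, not Stan's exact implementation):

```python
def rhat(chains):
    """Basic Gelman-Rubin potential scale reduction for K chains of length N.

    `chains` is a list of K equal-length lists of scalar draws.
    """
    k = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / k
    # Between-chain variance, scaled by chain length.
    b = n / (k - 1) * sum((m - grand) ** 2 for m in means)
    # Average within-chain variance.
    w = sum(sum((x - m) ** 2 for x in c) / (n - 1)
            for c, m in zip(chains, means)) / k
    var_plus = (n - 1) / n * w + b / n  # pooled variance estimate
    return (var_plus / w) ** 0.5
```

With K replicated DREAM runs, each run contributes one "chain" here, so the diagnostic ignores the within-run particle correlations entirely.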

Nigel/essence

Sep 2, 2012, 5:09:59 PM
to stan...@googlegroups.com
I am only using these optimisation examples because I know they are hard, and my actual problem is hard. I am doing very simple tests, and just hoping the sampler will at least sample somewhere near the minimum at some stage. This seems to me to be a pretty necessary condition! In my actual problem, I often start a sampler at the minimum using various optimisation codes, heuristic and quasi newton.

The reference you have for DREAM is the one I have been using; it has evolved a bit from the original. This reference tries to avoid the need for 2xD chains, but in my tests 2xD chains are certainly beneficial if possible.

The algorithm needed a bit of head scratching on my part to work out how it gets the effective dimensions - I decided that it chooses the dimensions first and calculates the effective dimension, and then does the sampling. It may be helpful to ask the authors for the exact code.

It does indeed work well on Rosenbrock. I implemented the quoted algorithm, but, as ever, with some of my own tweaks.

I have tried Rosenbrock on the original NUTS and NUTS fails badly. I used the code (which I hope is free from any bug - I am not familiar with Matlab, and I am differentiating late on Sunday after a glass of wine):


function [logp grad] = rosenbrock(theta)
    D = length(theta);
    logp = 0.0;
    grad = zeros(1,D);
    for iMod = 1:D-1
        tmp1 = (1.0 - theta(iMod));
        tmp2 = theta(iMod + 1) - theta(iMod) * theta(iMod);
        logp = logp - tmp1 * tmp1 - 100.0 * tmp2 * tmp2;
        grad(iMod) = grad(iMod) + 2.0 * tmp1 + 100.0 * 2.0 * tmp2 * 2.0 * theta(iMod);
    end
    tmp2 = theta(D) - theta(D-1) * theta(D-1);
    grad(D) = grad(D) - 100.0 * 2.0 * tmp2;
end

Nigel/essence

Sep 2, 2012, 5:50:31 PM
to stan...@googlegroups.com
I'm an idiot. The correct derivative is of course:


function [logp grad] = rosenbrock(theta)
    global maxValue;

    D = length(theta);
    logp = 0.0;
    grad = zeros(1,D);
    for iMod = 1:D-1
        tmp1 = (1.0 - theta(iMod));
        tmp2 = theta(iMod + 1) - theta(iMod) * theta(iMod);
        logp = logp - tmp1 * tmp1 - 100.0 * tmp2 * tmp2;
        grad(iMod) = grad(iMod) + 2.0 * tmp1 + 400.0 * tmp2 * theta(iMod);
        grad(iMod + 1) = grad(iMod + 1) - 200.0 * tmp2;
    end
end

This drastically improves NUTS. Its performance is now similar to DREAM's.

Still a worthwhile test for comparison.

Ignore all my previous comments on NUTS v DREAM.
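For cross-checking outside Matlab, the corrected gradient above ports directly to Python, and a central-difference check guards against exactly this kind of late-night differentiation slip (a sketch; the step size h is an illustrative choice):

```python
def rosenbrock_logp_grad(theta):
    """Log density logp = -Rosenbrock(theta) and its analytic gradient."""
    d = len(theta)
    logp = 0.0
    grad = [0.0] * d
    for i in range(d - 1):
        t1 = 1.0 - theta[i]
        t2 = theta[i + 1] - theta[i] ** 2
        logp += -t1 * t1 - 100.0 * t2 * t2
        grad[i] += 2.0 * t1 + 400.0 * t2 * theta[i]
        grad[i + 1] -= 200.0 * t2
    return logp, grad

def grad_error(theta, h=1e-6):
    """Max abs difference between analytic and central-difference gradient."""
    _, g = rosenbrock_logp_grad(theta)
    err = 0.0
    for i in range(len(theta)):
        tp = list(theta); tp[i] += h
        tm = list(theta); tm[i] -= h
        num = (rosenbrock_logp_grad(tp)[0] - rosenbrock_logp_grad(tm)[0]) / (2 * h)
        err = max(err, abs(num - g[i]))
    return err
```

At the minimum (1, 1, ..., 1) both logp and the gradient vanish, and the numerical check agrees with the analytic gradient elsewhere.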

Bob Carpenter

Sep 2, 2012, 6:00:30 PM
to stan...@googlegroups.com


On 9/2/12 4:06 PM, Andrew Gelman wrote:
> I recommend global replication: just replicate the entire Dream thing 2 or 4 times to
> get convergence monitoring. This will require a small bit of extra overhead
> (as there will be #chains per Dream and #Dreams) but I think it will be well worth it.

Let's say we do K runs of dream, each with N
particles run for M draws. So we have K * N * M
total draws.

What constitutes a chain in the usual ESS calc?

Would we just concatenate the N * M draws and
thus effectively ignore the correlation among
the N particles of each run of DREAM?

Or do you mean just take a single one of the N
particles? That would be kosher, but it seems
wasteful.

- Bob

Nigel/essence

Sep 2, 2012, 6:32:17 PM
to stan...@googlegroups.com
ps. but DREAM is still remarkably good given that it doesn't require derivatives.

One last comment - could the differential concepts in DREAM be used to improve the sampling of momenta in NUTS, so that we tend to 'kick' in a good direction? Is there any possible synergy between DE and HMC? The differential concept replaces the need for an approximate Hessian?

I've implemented diagonal weights for Rosenbrock, and it improves it somewhat. For positive definiteness I simply check if diagonals fall below a small number or not. Tomorrow, as a test case, I will implement a full Hessian with appropriate better adjustments for positive definiteness. This can then be used as a comparison with various different approximate Hessian approaches.
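The diagonal check described might look like the following sketch (the diagonal of the Rosenbrock Hessian is derived by hand here, and the floor value is an illustrative assumption, not the value used above):

```python
def rosenbrock_hessian_diag(theta):
    """Diagonal of the Hessian of the Rosenbrock function (negative log density)."""
    d = len(theta)
    h = [0.0] * d
    for i in range(d - 1):
        # d^2/dtheta_i^2 of (1-t_i)^2 + 100*(t_{i+1} - t_i^2)^2
        h[i] += 2.0 - 400.0 * (theta[i + 1] - theta[i] ** 2) + 800.0 * theta[i] ** 2
        # d^2/dtheta_{i+1}^2 of the same pair term
        h[i + 1] += 200.0
    return h

def diagonal_metric(theta, floor=1e-6):
    """Floor each diagonal entry so the resulting mass matrix is positive definite."""
    return [max(v, floor) for v in rosenbrock_hessian_diag(theta)]
```

Away from the parabolic valley the leading diagonal entries go negative, which is exactly where the floor kicks in.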

Matt Hoffman

Sep 2, 2012, 7:28:19 PM
to stan...@googlegroups.com
Hi Nigel,

This is absolutely worth thinking about. In fact, I'd say that the
LBFGS-HMC method of Zhang and Sutton is doing something quite similar,
although it's inspired from a different direction.

One point to note about using the (approximate) Hessian in the
sampler—it's not valid to let the Hessian change from iteration to
iteration unless you're very careful (as in Girolami and Calderhead's
RM-HMC or Zhang and Sutton's LBFGS-HMC). This is potentially a big
issue for a nasty distribution like the exponentiated Rosenbrock
function, where the local curvature information shows substantial
local variation.

Matt

Nigel/essence

Sep 2, 2012, 7:28:47 PM
to stan...@googlegroups.com
Hmmm...apples and oranges....

I was using NUTS with a single chain starting at the origin.

The Rosenbrock minimum is at (1,1,1,...). The value at my starting point is D (the dimension).

I was using DREAM with multiple chains based on a Latin Hypercube (LHC) design. The initial function values were much higher.

So NUTS was starting at a much better point.

If I start NUTS with multiple chains based on an LHC design, like DREAM, it does much worse than DREAM again. It never gets anywhere.

Just goes to show how difficult it is to compare apples and oranges.

Certainly Rosenbrock is a good test function, and it can behave very badly depending on the starting point.

In short, I'm tending to favour DREAM again, until I try NUTS with a good Hessian.

Andrew Gelman

Sep 2, 2012, 10:17:35 PM
to stan...@googlegroups.com
I haven't been following the details of Dream, I'm just saying that whatever version of Dream we do, we replicate the entire thing 2 or 4 times, and use these to get the between-chain variance that we need to get R-hat and ESS.

Matt Hoffman

Sep 3, 2012, 12:15:27 AM
to stan...@googlegroups.com
It's totally possible that DREAM is better during the search phase of
warmup/burn-in than NUTS for many problems. NUTS was designed to
maximize effective sample sizes rather than search speed, after all.

Matt

Andrew Gelman

Sep 3, 2012, 4:05:26 PM
to stan...@googlegroups.com
M3,
When you say "warmup/burn-in," I think you mean "warmup."
A

Nigel/essence

Sep 3, 2012, 5:00:01 PM
to stan...@googlegroups.com
I came across a paper in a recent evolutionary computing journal by Kok and Sandrock

'Locating and characterizing the stationary points of the extended Rosenbrock function'

which explains the problem quite well. It is easy to get stuck at saddle points, and the authors found that simple random walk MCMC got stuck easily.

They got better results with particle swarm, and I got better results with DE.

Whatever, you need to jolt the system away from saddle points, and it seems most saddle points only have one negative eigenvalue of the Hessian, so you have to kick (that is a technical term, OK?) in the right direction.

I can't see why an approximate Hessian which had a small positive eigenvalue to make it positive definite could not, in an HMCMC scheme, give a suitable kick in approximately the right direction.

In contrast, a diagonal weight matrix would stay stuck.

Also, a Hessian approximation which was not reflective of the saddle point would not help either.

DE helps because it can project over the saddle point. Similarly particle swarm (which behaves under certain conditions like a spring) will oscillate over the saddle point. Like pushing a swing and the child goes flying over the fence.
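The DE move that "projects over" the saddle point is essentially the DE-MC proposal: step along the difference of two other chains. A sketch (the gamma = 2.38/sqrt(2D) default follows the DE-MC literature; the uniform jitter is an illustrative stand-in for DREAM's noise term):

```python
import math
import random

def de_proposal(chains, i, gamma=None, noise=1e-6, rng=random):
    """DE-MC proposal for chain i: current point plus a scaled difference
    of two other randomly chosen chains, plus a small jitter."""
    d = len(chains[i])
    if gamma is None:
        gamma = 2.38 / math.sqrt(2 * d)  # standard DE-MC scaling
    r1, r2 = rng.sample([j for j in range(len(chains)) if j != i], 2)
    return [chains[i][k] + gamma * (chains[r1][k] - chains[r2][k])
            + rng.uniform(-noise, noise)
            for k in range(d)]
```

Because the difference vector is drawn from the population itself, the proposal automatically aligns with directions in which the population is spread out, including across a saddle.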

Bob Carpenter

Sep 3, 2012, 6:42:44 PM
to stan...@googlegroups.com
Let's talk about this at the meeting if we're
not too bogged down with Stan technicalities.

For a problem with D parameters, the default
setting of DREAM produces a Markov chain over
a space of dimensionality D*D. Each draw is
made up of D replications of each parameter.
Every replicated parameter bundle has the right
stationary marginal distribution, but there's some
dependence among the replicates.

- Bob

Matt Hoffman

Sep 3, 2012, 7:37:11 PM
to stan...@googlegroups.com
I was a little confused by this discussion at first, since it seems
like we shouldn't have to worry so much about saddle points in MCMC.
After all, it's not like in optimization where we stop dead when we
find a point with 0 gradient—in (say) HMC we'll just keep going
merrily on our way without altering the momentum if we ever find such
a point (which for most distributions is a 0 probability event
regardless). But then I realized that the issue might have to do with
the region around the sort of saddle point you describe, where there's
only one (possibly hard to find) direction that leads to the mode. Is
that the problem you're describing, where the system gets stuck in a
"false optimum" that happens to have a very difficult-to-find escape
direction?

Matt

Nigel/essence

Sep 4, 2012, 8:47:45 AM
to stan...@googlegroups.com, mdho...@cs.princeton.edu
Matt, I don't know whether that is the problem, but the symptoms I see are the symptoms you would see if that were the case. It is, as you say, difficult to go in the direction of the single negative eigenvector in a high-dimensional space.

Have you tried Rosenbrock yourself in NUTS/Matlab? I sent the Rosenbrock function and gradient above.

BTW, I have done some initial DE/NUTS integration, and it looks promising. They complement each other.

Not sure how much sharing I can do on this at present - but if anybody is interested in discussing this I am interested in talking. Maybe not in a public forum at present.

Bob Carpenter

Sep 4, 2012, 3:28:56 PM
to stan...@googlegroups.com


On 9/2/12 6:32 PM, Nigel/essence wrote:
> ps. but DREAM is still remarkably good given that it doesn't require derivatives.
>
> One last comment - could the differential concepts in DREAM be used to improve the sampling of
> momenta in NUTS, so that we tend to 'kick' in a good direction?

Yes, that's what the HMC-BFGS stuff was doing. Using
multiple chains to get some handle on covariance more
cheaply than going to RM-HMC.

> Is there any possible
> synergy between DE and HMC ? The differential concept replaces the need for approximate Hessian?

The synergy could also go the other way. You could
take something like DREAM and instead of adding
a tiny bit of random noise via a Norm(0,epsilon) after
the interpolation, you could start HMC from that point.
I think you get detailed balance, but I haven't worked
it out in detail.

As I said, I added an implementation of DREAM to the
to-do list. The algorithm's easier than the interface
at this point.

- Bob

Nigel/essence

Sep 9, 2012, 9:25:02 AM
to stan...@googlegroups.com
Couple of points:

I came across this

www.maths.uq.edu.au/~kroese/ps/generalized_splitting.pdf

which includes an example using Rosenbrock, and also introduced me to splitting methods for rare events and many other articles, which look interesting if I get time to digest them. It seems to be a fairly well-known technique, but I had not come across it.

Secondly, I can see a way to integrate NUTS and DREAM and BFGS into a single sampler, and will report back when or if I get some decent results. If anybody wants to know more, contact me privately at goo...@essence.co.uk

Bob Carpenter

Sep 9, 2012, 11:07:07 PM
to stan...@googlegroups.com


On 9/9/12 9:25 AM, Nigel/essence wrote:
> Couple of points:
>
> I came across this
>
> www.maths.uq.edu.au/~kroese/ps/generalized_splitting.pdf

Thanks for the ref.

I'll have to look up basic splitting methods
first.

>
> which includes an example using Rosenbrock, and also introduced me to splitting methods for rare events, and many other articles, which looks interesting if I have time to digest it. It seems to be a fairly well known technique but I had not come across it.
>
> Secondly, I can see a way to integrate NUTS and DREAM and BFGS into a single sampler, and will report back when or if I get some decent results. If anybody wants to know more, contact me privately at goo...@essence.co.uk

Andrew is very keen on interleaving samplers within
other samplers. You can think of DREAM as doing that
with its epsilon term -- it's like interleaving Metropolis
and the part of differential evolution that mixes particles.

We had thought about using something like (L-)BFGS to
get near a mode, but it turned out for everything we looked
at that the Hamiltonians did pretty well on that front.

Are you aware of the Welling and Teh paper on interleaving
optimization and sampling?

http://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf

It describes a system that starts with SGD-style optimization
and then gradually morphs to Langevin.

- Bob


Nigel/essence

Sep 10, 2012, 5:11:12 AM
to stan...@googlegroups.com
Thanks for the reference - I will check it out.

Some arm waving:

- with multiple chains, different chains can use different samplers.

- there needs to be some way of cross fertilising between different chains, e.g. as in parallel tempering, or using 'other' chains to generate Hessians.

- different samplers have different overheads/function evaluations per sample. You can have some chains heavy duty, some chains light duty.

- you don't necessarily have to integrate different techniques into a single sampler, different techniques can be spread across different chains.

- adapting more than one parameter in a chain (epsilon plus something else) starts to get tricky and can easily become unstable. I tried once to adapt temperatures and epsilon in a single tempered scheme, and it was a disaster.

Concerning Hessians:

- there are (at least) two distinct problems.

- One is how to move between different modes. In this case, too much forcing of positive definiteness can hinder such moves in a Hamiltonian scheme. When at a saddle point, we want to move across the saddle point to another mode, so we want a Hessian with a suitable very small positive eigenvalue in that direction. But it can remain very difficult to get up to the saddle point in the first place, so some scheme which allows the Hessian to be ill conditioned and do long steps is perversely useful.

- I'm not sure how it works to have negative weights .... I once had a student friend at Cambridge who did a PhD in what happens to the laws of physics if you go faster than the speed of light, and he never recovered his sanity.

- Another is how to sample efficiently within a mode. In this case a good Hessian can be useful, but we don't want it contaminated by information from chains which are in some different mode.

- It is almost as if we want both a local positive definite Hessian and a long range ill conditioned Hessian.

- this is similar to proxy modelling which has a linear regression component which models long range effects, and a more local radial basis function component to ensure interpolation.

- As we know, defining Hessian approximations which preserve balance etc. are tricky.

- maybe stochastic techniques such as CMAES could be useful - I have tried and rejected this well known method for optimisation simply because I didn't like the way it handled simple bound constraints.

In short, BFGS or other Hessian schemes which have only been tested on relatively simple single-mode problems are probably going to demonstrate some shortcomings in more complex problems, which their proponents already acknowledge. The background of these methods is in local optimisation, not global optimisation.

The splitting papers referred to above have a lot of examples of application to multi mode problems.
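One way to picture "forcing positive definiteness" at the eigenvalue level rather than the diagonal level, in the 2-D case only (a hand-rolled sketch; the floor delta is an illustrative assumption):

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenvalues (descending) and orthonormal eigenvectors of [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    r = math.hypot(0.5 * (a - c), b)
    lam1, lam2 = mean + r, mean - r
    if b == 0.0:
        vecs = ([1.0, 0.0], [0.0, 1.0]) if a >= c else ([0.0, 1.0], [1.0, 0.0])
    else:
        v1 = [lam1 - c, b]          # (A - lam1*I) v1 = 0
        n1 = math.hypot(*v1)
        v1 = [v1[0] / n1, v1[1] / n1]
        vecs = (v1, [-v1[1], v1[0]])
    return (lam1, lam2), vecs

def floored_hessian_2x2(a, b, c, delta=1e-3):
    """Replace eigenvalues below delta by delta, keeping the eigenvectors."""
    (l1, l2), (v1, v2) = eig_sym_2x2(a, b, c)
    l1, l2 = max(l1, delta), max(l2, delta)
    # Reassemble H = l1 * v1 v1^T + l2 * v2 v2^T
    return [[l1 * v1[i] * v1[j] + l2 * v2[i] * v2[j] for j in range(2)]
            for i in range(2)]
```

A small delta leaves the metric nearly ill-conditioned along the saddle direction (long steps remain possible), while a large delta forces near-isotropy and can pin the chain near the saddle.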

Nigel/essence

Sep 10, 2012, 5:40:53 AM
to stan...@googlegroups.com
ps. a very extensive review of saddle point issues in optimisation and linear algebra may be found in

http://www.mathcs.emory.edu/~benzi/Web_papers/bgl05.pdf


Of course, when modelling PDEs, indefinite Hessians are not unusual, so presumably a Hamiltonian MCMC formulation should be able to handle indefinite Hessians?

Multi level/multi grid methods are also discussed.

Bob Carpenter

Sep 10, 2012, 1:52:15 PM
to stan...@googlegroups.com


On 9/10/12 5:11 AM, Nigel/essence wrote:
> Some arm waving:
>...
> - with multiple chains, different chains can use different samplers.
> ...
> - you don't necessarily have to integrate different techniques into a single sampler, different techniques can be spread across different chains.

We want to run cross-chain convergence diagnostics, so you'd
have to be careful with this. If you just threw in a slowly
moving random-walk metropolis algorithm, it would be right in
theory, but would get rejected by the convergence diagnostics
as not mixing well.

> Concerning Hessians:
> ...
> - Another is how to sample efficiently within a mode. In this case a good Hessian can be useful, but we don't want it contaminated by information from chains which are in some different mode.

This isn't just modes, but parts of the space with very
different geometries.

> - It is almost as if we want both a local positive definite Hessian and a long range ill conditioned Hessian.
>...
> - As we know, defining Hessian approximations which preserve balance etc. are tricky.

:-)

- Bob

Michael Betancourt

Sep 10, 2012, 2:41:28 PM
to stan...@googlegroups.com
> - One is how to move between different modes. In this case, too much forcing of positive definiteness can hinder such moves in a Hamiltonian scheme. When at a saddle point, we want to move across the saddle point to another mode, so we want a Hessian with a suitable very small positive eigenvalue in that direction. But it can remain very difficult to get up to the saddle point in the first place, so some scheme which allows the Hessian to be ill conditioned and do long steps is perversely useful.

There are two issues here: Hamiltonian evolution and MCMC.

Indefinite metrics are a problem for Hamiltonian evolution because the
trajectories will never be able to cross the boundary where the metric
changes signature. (Remember the log determinant term! If the metric
changes signature smoothly then there has to be a point where one of
the eigenvalues is zero, hence the determinant vanishes, - log |
\Lambda | = \infty, and an infinite potential wall appears between the
regions of varying signature. The resulting joint distribution for
(p, q) is reducible and your chain is no longer ergodic). Not only do
indefinite metrics not help transition between regions of opposite
curvature they isolate them exactly!

On the MCMC side, If the metric is not positive definite (e.g. Hessian
outside of a convex support) then the conditional distribution for the
momenta is not a Gaussian and the Gibbs iterations don't make sense
anymore.

It's called "Riemannian manifold" HMC for a reason. Lorentzian
signature'd metrics mess up the MCMC side, and even if you could get
past that you'd have to worry about problems with Hamiltonian
evolution (really you want to move to a Poisson manifold, but those
things are crazy weird and I don't think anyone knows how to
simulate flows on them.)

Nigel/essence

Sep 10, 2012, 4:55:44 PM
to stan...@googlegroups.com
On Monday, 10 September 2012 18:52:19 UTC+1, Bob Carpenter wrote:
> On 9/10/12 5:11 AM, Nigel/essence wrote:
> > Some arm waving:
> >...
> > - with multiple chains, different chains can use different samplers.
> > ...
> > - you don't necessarily have to integrate different techniques into a single sampler, different techniques can be spread across different chains.
>
> We want to run cross-chain convergence diagnostics, so you'd
> have to be careful with this. If you just threw in a slowly
> moving random-walk metropolis algorithm, it would be right in
> theory, but would get rejected by the convergence diagnostics
> as not mixing well.

who is talking about a random-walk metropolis algorithm? :-)

> > Concerning Hessians:
> > ...
> > - Another is how to sample efficiently within a mode. In this case a good Hessian can be useful, but we don't want it contaminated by information from chains which are in some different mode.
>
> This isn't just modes, but parts of the space with very
> different geometries.

agreed :-)

Nigel/essence

Sep 10, 2012, 4:58:25 PM
to stan...@googlegroups.com
> it would be right in theory, but would get rejected by the
> convergence diagnostics as not mixing well.

ps. but with suitable swapping, all the chains can mix nicely?


Marcus Brubaker

Sep 10, 2012, 5:08:09 PM
to stan...@googlegroups.com
They would, but then the convergence diagnostics (Rhat at the very
least) are no longer applicable because the chains are not
independent. It's a non-trivial question to derive a convergence
diagnostic that can properly handle multiple, interacting chains. In
the worst case, you can try to treat all chains as a single chain and
use the single chain diagnostics, but that is decidedly
unsatisfying...

Cheers,
Marcus

Nigel/essence

Sep 10, 2012, 5:19:03 PM
to stan...@googlegroups.com
OK :-)

Let's remember the problem which simple Hamiltonian methods (or MH or Gibbs) are unlikely to handle adequately, from my limited understanding: multi-mode problems. Parallel tempering may be an approach, but it is quite expensive IMHO, and self-adaptation of temperatures, as well as epsilons, is still seeking a robust, stable solution?

An example of the problem is given in section 4.4 of http://espace.uq.edu.au/eserv/UQ:178513/splitting_method.pdf.

The splitting method seems to give remarkable results.

Caveat - I am well known for jumping on the last paper I read as the silver bullet for everything. Always the optimist.

Bob Carpenter

Sep 10, 2012, 5:26:11 PM
to stan...@googlegroups.com
We were discussing this issue at the Stan meeting last week
w.r.t. DE/DREAM/ensemble sampling. These samplers evolve
a bunch of parallel particles consisting of multiple copies of
the parameters, which together form a Markov chain. But
each copy of a parameter has some dependence on the other copies
introduced by the differential evolution (ensemble sampling).

If I understood Matt properly, he was saying the ESS calcs still work even
if you just concatenate a bunch of unrelated Markov chains.
I still don't follow, because then it would also
seem you could just scramble everything. Goodman and Weare
discuss the issue in their paper, but I didn't understand it.
They're local (NYU), so I could perhaps go talk to them and figure
out what's going on.

- Bob


Nigel/essence

Sep 10, 2012, 5:34:41 PM
to stan...@googlegroups.com
The papers I refer to above on splitting methods discuss a different convergence diagnostic and compare it (in remark 4.4.2) to the Gelman-Rubin diagnostic. For what it is worth.

Matt Hoffman

Sep 10, 2012, 5:37:33 PM
to stan...@googlegroups.com
Just a word of caution on multimodal problems: no method can handle
the general case. In general, finding the global maximum of (say) a
continuously differentiable objective function f simply cannot be done
with finite computation. Sampling from exp(f) is a harder problem than
finding the maximum of f, which means that in the general case
sampling is harder than impossible.

That's not to say that we shouldn't pay attention to multimodal
problems. But it's probably best to temper one's expectations when
thinking about methods that promise to solve global optimization.
(Sorry for the pun.)

Matt

Michael Betancourt

Sep 10, 2012, 5:44:56 PM
to stan...@googlegroups.com
I've long been a fan of nested sampling for these kinds of problems.
There's no guarantee it will find all of the modes without infinite
chains, but it has fewer pathologies than annealing/tempering (due to
the lack of phase transition as you pass points of inflection in the
tempered distributions) --and-- you can use HMC (shameless plug:
http://arxiv.org/abs/1005.0157).

Matt Hoffman

Sep 10, 2012, 5:47:27 PM
to stan...@googlegroups.com
You're right, you can't just scramble everything post-hoc when
computing ESS. The reason is a little subtle—the identity underlying
the ESS calculation assumes a stationary process, and the process "run
a Markov chain for N iterations and then scramble the result" is not
stationary (it only happens once). But as long as you're working with
a stationary process all is well. You basically just have to think of
the multiple interacting chains as one big Markov chain whose state
consists of a number of vectors each of whose marginal stationary
distributions is the target distribution.

Matt
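The stationary-process identity behind ESS, N_eff = N / (1 + 2 * sum_k rho_k), can be sketched as follows. This is illustrative code only (not from any particular package), and it truncates the autocorrelation sum at the first negative estimate; real implementations use more careful truncation rules such as Geyer's initial positive sequence.

```java
// Effective sample size for a (presumed stationary) chain:
// ESS = N / (1 + 2 * sum_k rho_k), truncating the sum at the
// first negative autocorrelation estimate (a crude rule).
public class Ess {
    public static double ess(double[] x) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        var /= n;
        double sumRho = 0.0;
        for (int k = 1; k < n; k++) {
            double acov = 0.0;
            for (int t = 0; t + k < n; t++)
                acov += (x[t] - mean) * (x[t + k] - mean);
            double rho = acov / (n * var);
            if (rho < 0.0) break;   // stop at first negative estimate
            sumRho += rho;
        }
        return n / (1.0 + 2.0 * sumRho);
    }
}
```

For the multi-chain case Matt describes, the same formula applies once the interacting chains are viewed as one big stationary process.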

Nigel/essence

unread,
Sep 10, 2012, 5:49:40 PM9/10/12
to stan...@googlegroups.com, mdho...@cs.princeton.edu
Of course!

It is very easy to define a problem which no algorithm will solve in practice.

ps. I am not using any of these techniques to do optimisation; I have much better optimisation methods for that. My interest is Bayesian uncertainty analysis using difficult, complex likelihood functions. I only introduced optimisation because it is my very firm belief that cross-fertilisation between disciplines will (almost) always lead to useful new ideas. It is one of the benefits of commerce (or real, actual problems) that it jolts academia out of its silos.

As Boris Johnson, Mayor of London, said today, these Olympics have inspired a generation, and "have produced such paroxysms of tears and joy on the sofas of Britain that you probably not only inspired a generation but helped to create one as well."

http://www.telegraph.co.uk/sport/olympics/news/9533902/Olympics-2012-Parade-athletes-brought-this-country-together-says-Boris-Johnson.html

Marcus Brubaker

unread,
Sep 10, 2012, 6:03:45 PM9/10/12
to stan...@googlegroups.com
Also relevant to this discussion (and fitting well with the HMC nature
of NUTS) is Hyperdynamics Importance Sampling which augments HMC in an
attempt to encourage transitions between modes. See
http://www.springerlink.com/index/8F5UUKLKMYKFUWBR.pdf and
http://link.aps.org/doi/10.1103/PhysRevLett.78.3908 for reference.

However, I agree with Matt. The general problem of sampling from
multimodal posteriors is incredibly hard and no one has a silver
bullet which will work well for every problem. Stan is a great
platform on which these methods can be tested, but we can't expect
miracles.

Cheers,
Marcus

Nigel/essence

unread,
Sep 10, 2012, 6:09:07 PM9/10/12
to stan...@googlegroups.com
Is Rhat applicable to DREAM, inasmuch as DREAM chains are not independent?

Andrew Gelman

unread,
Sep 10, 2012, 8:10:57 PM9/10/12
to stan...@googlegroups.com
I still think the solution here is to replicate the entire procedure (mixing and all) 2 or 4 times so that R-hat can be computed. A factor of 2 or 4 is a small price to pay . . . especially given that it produces 2 or 4 times the number of iterations.
A
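Once the replications exist, the R-hat computation itself is cheap. A minimal sketch of the standard between/within-chain formula (without the split-chain refinement; names are illustrative):

```java
// Gelman-Rubin potential scale reduction factor for M chains of length n:
//   Rhat = sqrt( ((n-1)/n * W + B/n) / W )
// where B is the between-chain and W the mean within-chain variance.
public class Rhat {
    public static double rhat(double[][] chains) {
        int m = chains.length, n = chains[0].length;
        double[] means = new double[m];
        double grand = 0.0;
        for (int j = 0; j < m; j++) {
            for (double v : chains[j]) means[j] += v;
            means[j] /= n;
            grand += means[j];
        }
        grand /= m;
        double b = 0.0, w = 0.0;
        for (int j = 0; j < m; j++) {
            b += (means[j] - grand) * (means[j] - grand);
            double s2 = 0.0;
            for (double v : chains[j]) s2 += (v - means[j]) * (v - means[j]);
            w += s2 / (n - 1);
        }
        b *= (double) n / (m - 1);   // between-chain variance
        w /= m;                      // mean within-chain variance
        double varPlus = (n - 1.0) / n * w + b / n;
        return Math.sqrt(varPlus / w);
    }
}
```

Values near 1 suggest the replications agree; chains stuck in different modes push R-hat well above 1.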

Nigel/essence

unread,
Sep 11, 2012, 5:10:43 AM9/11/12
to stan...@googlegroups.com
But if an engineer has an urgent meeting to discuss whether or not to dispose of an oil reservoir, and needs the uncertainty analysis from the sampling, and the full sampling takes a day to perform - they are not going to wait 2 - 4 days. An hour is about their limit.

It reminds me, going back to the multi mode issue. I had a client a few years ago, who had an oil reservoir model, and they seemed to have two different modes, one with high permeability/low porosity, and the other with low permeability/high porosity. Both modes matched historical measurements equally well. One mode gave a high future production forecast, the other had production falling rapidly. But they had to make a decision whether to divest the field or not (worth many billions of USD), and had little way of knowing whether the sampling was properly weighting the likelihoods in the two modes.

In practice, they decided to perform further measurements to distinguish the modes. Turned out the low production forecast model was the correct one.

Of course, they should have presented the 'high production forecast' model as the truth and flogged the oil field off to somebody who didn't know how to do uncertainty analysis.

Does that help explain why multi-modal and complex distributions are of great interest, even though they may pose many difficulties? HMCMC makes many intractable problems possible, which is why I use it as the basis.

To put it into perspective, proper Bayesian forecasting is still in its infancy in the oil and gas reservoir engineering industry.

The same issues arise, for example, in weather forecasting models.

Got to go catch a flight to Houston from London.

Andrew Gelman

unread,
Sep 11, 2012, 4:42:57 PM9/11/12
to stan...@googlegroups.com
I said 2 or 4 _replications_, not 2 or 4 _days_. If an hour is the limit, I'd run 2 replications for 30 minutes each or 4 replications for 15 minutes each. And I'd like an algorithm that gives reasonable results in 15 minutes!

Bob Carpenter

unread,
Sep 11, 2012, 4:57:45 PM9/11/12
to stan...@googlegroups.com


On 9/11/12 5:10 AM, Nigel/essence wrote:
> But if an engineer has an urgent meeting to discuss whether to dispose an oil reservoir or not, and needs the uncertainty analysis from the sampling, and the full sampling takes a day to perform - they are not going to wait 2 - 4 days. An hour is about their limit.
>
> It reminds me, going back to the multi mode issue. I had a client a few years ago, who had an oil reservoir model, and they seemed to have two different modes, one with high permeability/low porosity, and the other with low permeability/high porosity. Both modes matched historical measurements equally well. One mode gave a high future production forecast, the other had production falling rapidly. But they had to make a decision whether to divest the field or not (worth many billions of USD), and had little way of knowing whether the sampling was properly weighting the likelihoods in the two modes.
>
> In practice, they decided to perform further measurements to distinguish the modes. Turned out the low production forecast model was the correct one.

Hopefully the decisions made with one-hour deadlines are
sub-billion dollar!

> Of course, they should have presented the 'high production forecast' model as the truth and flogged the oil field off to somebody who didn't know how to do uncertainty analysis.

On the model of "guns don't kill people, people kill
people", we should have "statisticians don't swindle customers,
businesspeople swindle customers".

> Does that help explain why multi mode and complex distributions are of great interest,

I think we all agree that multimodal posterior models
are of great interest.

Part of the utility of the R-hat statistic is in
diagnosing when a sampler has chains stuck in different
modes. Or when the distribution's so complex that it's
not mixing.

In some cases, like having two known modes,
it can even be tractable to sample if you can find the
weight of each mode somehow (for instance, through
appropriate MCMC methods that can jump between modes
while preserving stationarity).

In other cases, like the kinds of models Matt and I have
spent lots of time working on (HMMs, LDA, etc.),
the posterior is so multimodal, there's truly no hope
for MCMC to draw an effective sample. Further, some
algorithmic tricks are needed to even compute
basic expectations from HMMs. So despite the interest,
we're not really focusing on HMM-like structured models in
Stan, either.

- Bob

Nigel/essence

unread,
Sep 12, 2012, 5:40:33 PM9/12/12
to stan...@googlegroups.com
So would I, that is the holy grail for me, haven't found it yet!

I want an algorithm which can reliably sample a 400 dimensional complex likelihood function, multi modal with very high peaks and troughs, in, say, 100,000 function evals.

On a standard laptop or workstation, quad core CPU.

Fully adaptive, no user intervention or tuning.

Yes, I'm a dreamer!

At present 100,000 function evals takes several hours on my example test problem.

If I can speed up by a factor of ten from my current DREAM implementation, I should be OK, or at least ahead of the pack. NUTS is, I hope, a major contributor to achieve this.

I may wait until next year to implement a reliable method to handle large numbers of distinct modes....

Nigel/essence

unread,
Sep 12, 2012, 5:52:05 PM9/12/12
to stan...@googlegroups.com
You would be surprised!

The reservoir simulation and decision process is part of a much longer process, building geological models, determining critical parameters, etc etc etc.

So there may be a 3 month project, considering much more than just technical factors, safety etc is important, and also partners have to be kept on board.

Of which your simulation and uncertainty/optimisation project is squeezed into the last week.

You may still have a hard deadline, and only a few days before have an agreed reliable reservoir model, so the week before you try to do the final steps, find bugs in the software and/or model, and then hope and pray that when you restart it late on Friday (a) it will work (b) the mother in law isn't expecting you to entertain her (c) the IT department don't do a reboot of all the computers.

Then you present the P50, P10, P90 to the manager on Monday morning, and the manager asks what the realistic possible downside is, and you give some answer in a very confident voice, which is probably just the worst simulation you have done SO FAR, not at all the worst possible downside.

Ten years ago multi billion dollar decisions were made on the basis of a SINGLE reservoir model, which did not even match historical production and pressures.

Matt Hoffman

unread,
Sep 12, 2012, 6:21:05 PM9/12/12
to stan...@googlegroups.com
Also, these replications are very very easy to run in parallel. So
(depending on memory requirements and other I/O/resource limits)
running 4 replications on a quad-core machine should be close to as
fast as running a single replication.

Nigel/essence

unread,
Sep 12, 2012, 9:54:29 PM9/12/12
to stan...@googlegroups.com, mdho...@cs.princeton.edu
Not so simple.

I'm already running all my hyperthreads at 100%, but use threads to parallelise the function calculation rather than to run the different chains. I used to use threads for chains, but sometimes you want different samplers in different chains, and the samplers have different performance characteristics, so splitting chains between threads does not fully utilise them. E.g. a DREAM thread and a NUTS thread - not the best way to do it.

Fear not, I am fully multi threaded from the ground up, at an appropriate level of granularity.

Of course, how to multi thread is very much problem specific. Most people would want to have a thread/chain architecture.

But even with a thread/chain architecture, you could have 21 simple chains using 7 threads and 1 complex chain (e.g. NUTS) using 1 thread, depending on how many leapfrog steps NUTS takes per iteration.

I have also, btw, experimented with GPU, and have done a lot of profiling and benchmarking and tuning.

For example - the cdf of a Gaussian is very expensive, so I use a good enough approximation which is very fast. Similarly, I use Matern correlation functions, but not the exact function, a good approximation is good enough.

There is a general lesson - balance out your errors! If your sampling is going to give you something +/- 2%, why are you looking for exact values of special functions?

I even do an approximation of exp.

Of course, you have to be consistent and careful your approximations are smooth and differentiable etc., and you use them consistently.

But the biggest improvements, once you have done all the tuning, is to find better algorithms.
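As one illustration of the trade-off (not necessarily the approximation used here), the logistic curve 1 / (1 + exp(-1.702 x)) approximates the standard normal CDF to within about 0.01 everywhere, which may be plenty when the sampling itself is only good to a couple of percent:

```java
// Cheap logistic approximation to the standard normal CDF.
// Maximum absolute error is roughly 0.0095; it is smooth and
// differentiable, so it can be used inside gradient-based samplers,
// provided the same approximation is applied consistently everywhere.
public class FastPhi {
    public static double phi(double x) {
        return 1.0 / (1.0 + Math.exp(-1.702 * x));
    }
}
```

The constant 1.702 is the classic choice that minimises the maximum error of the logistic fit to the probit.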

Matt Hoffman

unread,
Sep 12, 2012, 10:19:57 PM9/12/12
to stan...@googlegroups.com
Not so simple on one hand, but on another hand, much simpler! Just use
more computers!

Do we have a way in Stan to aggregate the results of independent runs?

Daniel Lee

unread,
Sep 12, 2012, 10:40:10 PM9/12/12
to stan...@googlegroups.com
On Wed, Sep 12, 2012 at 10:19 PM, Matt Hoffman
<mdho...@cs.princeton.edu> wrote:
> Do we have a way in Stan to aggregate the results of independent runs?

Yes, there are methods to construct a chains object from independent
runs. We don't have an easy way of accessing that from generated
models, but it's there. RStan does something similar too.

Nigel/essence

unread,
Sep 13, 2012, 2:05:49 PM9/13/12
to stan...@googlegroups.com, mdho...@cs.princeton.edu
Getting seriously off track, but -};!} (I'm not very good at smileys).

I am not in control of the computers, I am developing an end user application. Some of the clients have access to thousands of powerful computers in a cluster, some only have a stand alone laptop.

If it was just for me, I would buy a 4-socket AMD 16-core machine; you can get them for under $10K. 64 threads. I don't have the time and energy to build a simple cluster myself.

I would guess that a set of completely independent runs is less efficient than a set of chains which can mix and cross fertilise.

For example, with DREAM it is definitely beneficial to have a large number of chains in a run. 2xN is recommended.

So you can't just set off independent jobs to independent computers and aggregate.

I am only writing this now because I am taking a break as one of my clients has set off 200 simulation jobs to their cluster, and the results of the engineering jobs will be complete tomorrow morning. My own software controls and manages all these simulation runs, in a multi threaded environment using a producer/consumer paradigm (Ben at least will know what that means). The jobs can take different times, so as soon as one is finished I set off another one, and each time I do a MCMC simulation.

Which reminds me - potentially one of the biggest algorithmic improvements I envisage is learning how to take full advantage of previous MCMC samples when generating a new set - each time, the underlying likelihood function has changed, but it gradually converges. I use this fact to initialise the chains intelligently when I start the MCMC again, but could do a lot more. One obvious thing is that I can avoid the burn-in after the first time, or at least make it much shorter. Can I sample from the previous set of samples? I'm sure I can do a lot in this area, but not this week.

To be clear I have a workflow:

(a) construct likelihood function
(b) perform MCMC (ideally << 1 hour)
(c) perform full engineering simulation study (0.1 - 24 hours)
(d) update likelihood function
(e) perform MCMC (ideally << 10 mins)
(f) go to (c) until finished
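One simple way to exploit the previous run at step (e), assuming the old draws are still roughly in the right region of the gradually changing likelihood, is to initialise the new chains from draws of the previous posterior sample. A hypothetical sketch:

```java
import java.util.Random;

// Warm-start: seed each new chain from a randomly chosen draw of the
// previous run's posterior sample. This can shorten (not eliminate)
// burn-in when the likelihood has only changed gradually.
public class WarmStart {
    public static double[][] inits(double[][] prevDraws, int numChains, long seed) {
        Random rng = new Random(seed);
        double[][] starts = new double[numChains][];
        for (int c = 0; c < numChains; c++) {
            double[] draw = prevDraws[rng.nextInt(prevDraws.length)];
            starts[c] = draw.clone();   // copy so chains don't share state
        }
        return starts;
    }
}
```

The chains still need some adaptation after restarting, since the old draws are not exactly distributed according to the new posterior.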


ps one of my clients did, indeed, use their heavyweight computer cluster to do the MCMC part, but that was with a basic RW algorithm back in 2005 or so. In the dark ages.

Seriously, tell me if my input is of interest or a distraction, and I will stop writing essays. There can't be many people who are developing commercial end user applications using advanced MCMC methods, are there?

Bob Carpenter

unread,
Sep 13, 2012, 3:19:48 PM9/13/12
to stan...@googlegroups.com


On 9/12/12 9:54 PM, Nigel/essence wrote:

> I'm already running all my hyperthreads at 100%, but use threads calculating the function rather than using threads for the different chains. I used to use threads for chains, but sometimes you want different samplers in different chains, and the samplers have different performance characteristics, so then splitting chains between threads does not fully utilise them. Eg a DREAM thread and NUTS thread - not the best way to do it.
>
> Fear not, I am fully multi threaded from the ground up, at an appropriate level of granularity.
>
> Of course, how to multi thread is very much problem specific. Most people would want to have a thread/chain architecture.
>
> But even with a thread/chain architecture, you could have 21 simple chains using 7 threads and 1 complex chain (e.g. NUTS) using 1 thread, depending on how many leadfrogs NUTS takes per step.

Are you using the executor framework in java.util.concurrent?
I'd think you'd just queue up Runnable objects and then bound the
number of concurrent threads to not overwhelm your machine.

Unless the objects needed to communicate with each other.
Then you'd wind up with a complicated scheduling problem.

Boy, do I ever miss util.concurrent in C++! Unfortunately,
Doug Lea wasn't happy with the new java community organization
under Oracle and opted out of participating.

I assume you're also cutting down all short-term object
allocation so one of those threads isn't just hammering away
full time on GC. The GC's so good now that it's not even
noticeable until you start getting contention across as many
threads as you have CPUs.

Do you find using the hyperthreads helps? I've found it
a very mixed bag. For instance, MATLAB works better with the
number of threads set to the number of cores, not the number
of "hyper"-cores. But it defaults to the number of hypercores.

Are you using an external matrix library in Java? I
never found one that looked good, but I stopped looking
many years ago.

> I have also, btw, experimented with GPU, and have done a lot of profiling and benchmarking and tuning.

Specialized architectures are a tradeoff in speed
versus flexibility. For instance, I'm pretty sure
most GPUs only support 32-bit arithmetic.

GPUs are effective if you can formulate your problem
as 32-bit single-instruction/multiple-data operations.

Everyone I've ever talked to has concluded they're
not worth the effort, but that's because I don't talk
to people who tend to need to do the same computation
every week, just really really fast.

> For example - the cdf of a Gaussian is very expensive, so I use a good enough approximation which is very fast. Similarly, I use Matern correlation functions, but not the exact function, a good approximation is good enough.
>
> There is a general lesson - balance out your errors!

Absolutely. That's why you usually don't need to run
very many effective samples to get estimates within a
fraction of the standard deviation of the variable you're estimating.

> If your sampling is going to give you something +/- 2%, why are you
> looking for exact values of special functions?

Having said that, the reason to not do approximations in
NUTS/HMC is that the leapfrog step acceptance is pretty sensitive
to such errors.

The main reason not to accept +/- 2% approximation errors (as opposed
to something like 1e-10) is that the errors can compound.

In the best situation, the errors would all cancel out -- is
that what you mean by "balance"?

- Bob

Bob Carpenter

unread,
Sep 13, 2012, 4:02:15 PM9/13/12
to stan...@googlegroups.com


On 9/13/12 2:05 PM, Nigel/essence wrote:
> Getting seriously off track, but -};!} (I'm not very good at smileys).

> If it was just for me, I would but a 4 socket AMD 16 core machine, you can get them for under $10K. 64 threads. I don't have time and energy to build a simple cluster myself.

I also much prefer scaling in a single machine than
having to scale out across machines, which is a huge
pain.

> I would guess that a set of completely independent runs is less efficient than a set of chains which can mix and cross fertilise.

At least the independent runs are easier to run and
easier to analyze :-)

> For example, with DREAM it is definitely beneficial to have a large number of chains in a run. 2xN is recommended.

Yup. That's difficult to spread across machines.
Of course, if they were really difficult steps, then
you could map-reduce by partitioning the 2*N chains
each step. But I doubt the communication overhead's
worth it.

It then becomes an issue of whether it's a win to
partition, run for more than one iteration (still kosher
for stationarity) and then aggregate.

The machine learning community's been exploring issues
of breaking data across nodes, computing estimates on each
chunk of data, then combining the estimates.

> So you can't just set off independent jobs to independent computers and aggregate.

You could just replicate the whole thing across machines.

> I am only writing this now because I am taking a break as one of my clients has set off 200 simulation jobs to their cluster, and the results of the engineering jobs will be complete tomorrow morning. My own software controls and manages all these simulation runs, in a multi threaded environment using a producer/consumer paradigm (Ben at least will know what that means). The jobs can take different times, so as soon as one is finished I set off another one, and each time I do a MCMC simulation.

Absolutely. It's what I was asking about in my
last e-mail about util.concurrent executor frameworks.

You can also do this higher up with message queues
in J2EE, and the nice thing about that, is they can
even be made transactional if you really care about
not losing data or corrupting state.

> Which reminds me - potentially one of the biggest algorithmic improvements I envisage is when I learn about how to take full advantage of previous MCMC samples when generating a new set of samples - each time, the underlying likelihood function has changed, but gradually converges. I use this fact to have an intelligent way of initiating the chains when I start the MCMC again, but could do a lot more. One obvious thing is I can avid the burn-in after the first time, or at least make it much shorter. Can I sample from the previous set of samples? I'm sure I can do a lot in this area, but not this week.
>
> To be clear I have a workflow:
>
> (a) construct likelihood function
> (b) perform MCMC (ideally << 1 hour)
> (c) perform full engineering simulation study (0.1 - 24 hours)
> (d) update likelihood function
> (e) perform MCMC (ideally << 10 mins)
> (f) go to (c) until finished
>
>
> ps one of my clients did, indeed, use their heavyweight computer cluster to do the MCMC part, but that was with a basic RW algorithm back in 2005 or so. In the dark ages.


> I am not in control of the computers, I am developing an end user application.
> Some of the clients have access to thousands of powerful computers in a cluster,
> some only have a stand alone laptop.

In my experience, customers are very illogical in what
they're willing to buy and what they're not. We've had
customers at our company buy way more hardware than necessary,
even when we told them we could run our tools over all
their data on a single box in a matter of minutes. Others
have money for consultants but not hardware or vice-versa.

It does seem like if you're making gigabuck decisions based
in part on stats models, you wouldn't mind spending a few kilobucks
on hardware. But companies usually don't run that way.

>
> Seriously, tell me if my input is of interest or a distraction, and I will stop writing essays. There can't be many people who are developing commercial end user applications using advanced MCMC methods, are there?

We've thought about lots of these issues, but insight from
outside parties is always welcome.

I think I'm more interested in these underlying engineering issues
than most of the team and have the most experience in varying
enterprise production environments. But Matt's done serious
production coding, too.

- Bob

Nigel/essence

unread,
Sep 13, 2012, 4:38:18 PM9/13/12
to stan...@googlegroups.com
I won't respond to every comment, mainly because we are on the same page.

Yes, I use the latest concurrent package in Java. I played around a lot with how to reduce overhead:

- callable v runnable
- how to communicate back results (e.g. using Futures)
- new inner classes or reuse existing classes
- how to allocate 1000 jobs to 8 threads using queues (do it one job at a time with a fixed length thread pool, or group it into sets of 125 or whatever)

These things can make a very big difference if your threads are at a lower level of granularity. For example, I have a common pool which is reused throughout the life of the application.

I have tested threads at very fine granularity (e.g. per row in a matrix multiplication), but generally do it at a higher level.

Unrolling loops does not help much (cf. C), but it is important to keep loops simple so the Java optimiser has a chance.

Java is very fast, but as far as I know is still restricted because the JVM cannot take advantage of vector processing on Intel chips. To be honest, it is difficult to find definitive answers about this - but I think I am correct. It would be so nice if the JVM was intelligent enough to be truly portable across different CPU's with different vector processing capabilities.

Bottom line is that java threads have little overhead IF AND ONLY IF you do it right. They can have a very big overhead if you do it according to some template or tutorial you find on the web. The only thing to do is to try ten different ways and benchmark. Also be very wary of profiling tools, they can seriously mislead. There is some very low level locking code which is normally very fast, but becomes a bottleneck as soon as you try to profile.
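For the "1000 jobs to 8 threads" case, one version of the fixed-pool-and-Futures pattern discussed here might look like the following sketch. The job bodies, batch sizes and pool sizing are exactly the things worth benchmarking, as described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Submit many small jobs to a fixed-size pool and collect results
// through Futures. invokeAll blocks until every job has completed.
public class BatchRunner {
    public static long runSquares(int numJobs, int numThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Callable<Long>> jobs = new ArrayList<>();
        for (int i = 0; i < numJobs; i++) {
            final long k = i;
            jobs.add(() -> k * k);   // stand-in for a real likelihood eval
        }
        long sum = 0;
        try {
            for (Future<Long> f : pool.invokeAll(jobs)) sum += f.get();
        } finally {
            pool.shutdown();         // always release the threads
        }
        return sum;
    }
}
```

For a long-lived application, one shared pool reused for the life of the process (as described above) avoids repeated pool creation overhead.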

Some managers in oil and gas don't want to even hear the word probability or statistics. But it is changing.

Hardware for major oil companies moves slowly. They are sometimes a generation behind in operating systems (eg running Windows XP today).

Nigel/essence

unread,
Sep 13, 2012, 4:56:29 PM9/13/12
to stan...@googlegroups.com
On another topic you asked...

Yes, I write all my own matrix arithmetic. I only ever use (up to now) symmetric positive definite matrices.

I did a lot of benchmarking to find the fastest approach.

I likewise rejected all available Java matrix packages I could find. None of them deal with a symmetric matrix as a row (or column) ordered single array. Hopeless.

I translated L-BFGS-B from Fortran to Java, and so also had to translate a little LAPACK (or it may have been LINPACK) from Fortran to Java, but those were the easy bits to translate.

I also do a lot of tricks with things like B * V^-1 * B^T - I don't do any matrix multiplications; I construct U^-1 * B^T, and then construct the upper triangular result matrix from that. The result is then, of course, factorised again.

Obvious tricks, but worth the effort, they can double speed.

[ B is a matrix, V^-1 is the inverse of a positive definite matrix V, for which I have the Cholesky factor U. ]
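The trick: writing V = L * L^T (taking the lower factor L = U^T for concreteness), B * V^-1 * B^T = X^T * X, where X solves L * X = B^T. That is one triangular solve plus one symmetric product, with no explicit inverse. A minimal dense sketch (the packed single-array symmetric storage described above is omitted for clarity):

```java
// Compute B * V^-1 * B^T given the lower Cholesky factor L of V
// (V = L * L^T): forward-solve L * X = B^T, then form X^T * X,
// which is symmetric positive semi-definite by construction.
public class CholTrick {
    public static double[][] bVinvBt(double[][] L, double[][] B) {
        int n = L.length;   // V is n x n
        int m = B.length;   // B is m x n, result is m x m
        double[][] X = new double[n][m];
        for (int col = 0; col < m; col++) {
            for (int i = 0; i < n; i++) {
                double s = B[col][i];                 // (B^T)[i][col]
                for (int k = 0; k < i; k++) s -= L[i][k] * X[k][col];
                X[i][col] = s / L[i][i];              // forward substitution
            }
        }
        double[][] r = new double[m][m];
        for (int p = 0; p < m; p++)
            for (int q = 0; q < m; q++) {
                double s = 0.0;
                for (int i = 0; i < n; i++) s += X[i][p] * X[i][q];
                r[p][q] = s;
            }
        return r;
    }
}
```

With B = I this reduces to V^-1 itself, which makes a handy correctness check.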

By default my number of threads is the number of hyperthreads. My threads tend to be small, which probably helps. My CPU is certainly showing 100% utilisation.

Memory bandwidth tends to become the bottleneck.

Unless I put my i7 Toshiba laptop on top of a couple of coffee mugs to aid cooling, I find that my MCMC calculations cause overheating 99% of the way through and I lose everything.

I had a similar problem a few years back with a Pentium laptop in Rio working for Petrobras, my laptop got even hotter than the friendly natives and used to crash. Must make sure I get some tickets for the next Olympics. I never got to go to any London event even though my home is London. The two tickets I got went to my sons. I maxed out in 1972 when I got to go to every single athletics event at Munich.
