I am doing some tests, using these functions as the negative log likelihood.
Preliminary tests with N = 16, and without any mass weights, show that DE methods are vastly superior to NUTS, and indeed that random walk beats NUTS, but I expect this to change as soon as I implement the exact Hessian weights in NUTS.
My evaluation is very simplistic - whether the sampler gets anywhere near the minimum in a few million function/gradient evals.
As a cross check against my Java code, I will also try the original Matlab NUTS code on this example.
I'm not sure whether there are analytical integrations for these functions, against which MCMC sampled cdf's can be compared.
I guess my question is - how many test functions has NUTS been evaluated against? In the optimisation world, there is a standard set of test problems, but I haven't seen a similar set for MCMC methods - it might be a useful exercise to generate a standard set so the community can properly evaluate different approaches.
Not very exciting work, but extremely valuable, and IMHO a necessary condition for a mature technology.
I will shortly test NUTS with the exact Rosenbrock Hessian and report back, and then some time soon try some BFGS methods for comparison.
I also need to examine whether my actual problem has an analytical Hessian - it probably does, but has some lengthy algebra to evaluate, and I am sure it is not positive definite everywhere - but maybe I can see a reasonable analytic approximation which is.
Has anybody tried NUTS with standard optimisation test functions such as Rosenbrock?
The reference you have for DREAM is the one I have been using, it has evolved a bit from the original. This reference tries to avoid the need for 2xD chains, but in my tests 2xD chains are certainly beneficial if possible.
The algorithm needed a bit of head scratching on my part to work out how it gets the effective dimensions - I decided that it chooses the dimensions first and calculates the effective dimension, and then does the sampling. It may be helpful to ask the authors for the exact code.
It does indeed work well on Rosenbrock. I implemented the quoted algorithm, but, as ever, with some of my own tweaks.
I have tried Rosenbrock on the original NUTS and NUTS fails badly. I used the code (which I hope is free from any bug - I am not familiar with Matlab, and I am differentiating late on Sunday after a glass of wine):
function [logp grad] = rosenbrock(theta)
    D = length(theta);
    logp = 0.0;
    grad = zeros(1,D);
    for iMod = 1:D-1
        tmp1 = (1.0 - theta(iMod));
        tmp2 = theta(iMod + 1) - theta(iMod) * theta(iMod);
        logp = logp - tmp1 * tmp1 - 100.0 * tmp2 * tmp2;
        grad(iMod) = grad(iMod) + 2.0 * tmp1 + 100.0 * 2.0 * tmp2 * 2.0 * theta(iMod);
    end
    tmp2 = theta(D) - theta(D-1) * theta(D-1);
    grad(D) = grad(D) - 100.0 * 2.0 * tmp2;
end
Here is the corrected version:

function [logp grad] = rosenbrock(theta)
    D = length(theta);
    logp = 0.0;
    grad = zeros(1,D);
    for iMod = 1:D-1
        tmp1 = (1.0 - theta(iMod));
        tmp2 = theta(iMod + 1) - theta(iMod) * theta(iMod);
        logp = logp - tmp1 * tmp1 - 100.0 * tmp2 * tmp2;
        grad(iMod) = grad(iMod) + 2.0 * tmp1 + 400.0 * tmp2 * theta(iMod);
        grad(iMod + 1) = grad(iMod + 1) - 200.0 * tmp2;
    end
end
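For cross checking against the Java side, a direct port of the corrected version might look like this (a sketch; the class and method names are mine, and it returns logp while filling grad in place):

```java
/** Extended Rosenbrock as a negative log likelihood, ported from the
 *  corrected Matlab version above. Returns logp and fills grad in place.
 *  Class and method names are illustrative. */
public class Rosenbrock {
    /** logp = -sum[(1 - x_i)^2 + 100 (x_{i+1} - x_i^2)^2] and its gradient. */
    public static double logp(double[] theta, double[] grad) {
        int d = theta.length;
        java.util.Arrays.fill(grad, 0.0);
        double logp = 0.0;
        for (int i = 0; i < d - 1; i++) {
            double t1 = 1.0 - theta[i];
            double t2 = theta[i + 1] - theta[i] * theta[i];
            logp -= t1 * t1 + 100.0 * t2 * t2;
            grad[i]     += 2.0 * t1 + 400.0 * t2 * theta[i]; // d(logp)/d(theta_i), local terms
            grad[i + 1] -= 200.0 * t2;                        // coupling to the next coordinate
        }
        return logp;
    }
}
```

At the minimum (1,1,...,1) both logp and the gradient vanish, which makes a convenient sanity check against the Matlab version.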
This drastically improves NUTS. Its performance is now similar to DREAM's.
Still a worthwhile test for comparison.
Ignore all my previous comments on NUTS v DREAM.
One last comment - could the differential evolution concepts in DREAM be used to improve the sampling of momenta in NUTS, so that we tend to 'kick' in a good direction? Is there any possible synergy between DE and HMC? Does the differential concept replace the need for an approximate Hessian?
I've implemented diagonal weights for Rosenbrock, and it improves things somewhat. For positive definiteness I simply check whether the diagonals fall below a small number. Tomorrow, as a test case, I will implement a full Hessian with better adjustments for positive definiteness. This can then be used as a baseline for comparison with various approximate Hessian approaches.
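The diagonal check described above amounts to flooring each diagonal entry; a minimal sketch in Java (the floor value 1e-8 is an illustrative choice, not a tuned one):

```java
/** Clamps raw diagonal Hessian entries to a small positive floor so the
 *  resulting diagonal mass matrix is positive definite. A sketch; the
 *  floor EPS is an illustrative choice, not a tuned value. */
public class DiagonalMass {
    static final double EPS = 1e-8; // illustrative floor

    /** Returns a positive definite diagonal mass from raw Hessian diagonals. */
    public static double[] clampDiagonal(double[] hessDiag) {
        double[] mass = new double[hessDiag.length];
        for (int i = 0; i < hessDiag.length; i++) {
            mass[i] = Math.max(hessDiag[i], EPS); // floor small or negative curvature
        }
        return mass;
    }
}
```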
I was using NUTS with a single chain starting at the origin.
The Rosenbrock minimum is at (1,1,1,...). The value at my starting point is D (the dimension).
I was using DREAM with multiple chains based on a Latin Hypercube (LHC) design. The initial function values were much higher.
So NUTS was starting at a much better point.
If I start NUTS with multiple chains from an LHC design like DREAM, it does much worse than DREAM again. It never gets anywhere.
Just goes to show how difficult it is to compare apples and oranges.
Certainly Rosenbrock is a good test function, and it can behave very badly depending on the starting point.
In short, I'm tending to favour DREAM again, until I try NUTS with a good Hessian.
There is a paper, 'Locating and characterizing the stationary points of the extended Rosenbrock function', which explains the problem quite well. It is easy to get stuck at saddle points, and the authors found that simple random walk MCMC got stuck easily.
They got better results with particle swarm, and I got better results with DE.
Whatever, you need to jolt the system away from saddle points, and it seems most saddle points have only one negative eigenvalue of the Hessian, so you have to kick (that is a technical term, OK?) in the right direction.
I can't see why an approximate Hessian, given a small positive eigenvalue to make it positive definite, could not, in an HMCMC scheme, give a suitable kick in approximately the right direction.
In contrast, a diagonal weight matrix would stay stuck.
Also, a Hessian approximation which did not reflect the saddle point would not help either.
DE helps because it can project over the saddle point. Similarly particle swarm (which behaves under certain conditions like a spring) will oscillate over the saddle point. Like pushing a swing and the child goes flying over the fence.
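One way to give that kick while still keeping a usable mass matrix is to clamp the Hessian's eigenvalues rather than its diagonal, so the negative-eigenvalue direction gets a very small positive eigenvalue and hence a light mass. A 2x2 sketch in Java using the closed-form eigendecomposition (the floor value is again illustrative):

```java
/** Clamps the eigenvalues of a symmetric 2x2 Hessian [[a,b],[b,c]] to a
 *  small positive floor while keeping the eigenvectors. At a saddle point
 *  the negative-eigenvalue direction then gets a very light mass, so a
 *  Hamiltonian kick along it stays possible. Sketch only. */
public class EigenClamp {
    /** Returns {a', b', c'} of the clamped symmetric matrix. */
    public static double[] clamp2x2(double a, double b, double c, double floor) {
        double mean = 0.5 * (a + c);
        double r = Math.hypot(0.5 * (a - c), b);
        double l1 = mean + r, l2 = mean - r;        // eigenvalues, l1 >= l2
        // Unit eigenvector for l1: (b, l1 - a) unless b == 0.
        double vx, vy;
        if (b != 0.0)      { vx = b;   vy = l1 - a; }
        else if (a >= c)   { vx = 1.0; vy = 0.0;    }
        else               { vx = 0.0; vy = 1.0;    }
        double n = Math.hypot(vx, vy); vx /= n; vy /= n;
        double m1 = Math.max(l1, floor), m2 = Math.max(l2, floor);
        // Reassemble Q diag(m1, m2) Q^T; the second eigenvector is (-vy, vx).
        double na = m1 * vx * vx + m2 * vy * vy;
        double nb = (m1 - m2) * vx * vy;
        double nc = m1 * vy * vy + m2 * vx * vx;
        return new double[]{na, nb, nc};
    }
}
```

An already positive definite matrix passes through unchanged; an indefinite one keeps its eigenvectors but gets the small positive eigenvalue in the former downhill direction.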
Have you tried Rosenbrock yourself in NUTS/Matlab? I sent the Rosenbrock function and gradient above.
BTW, I have done some initial DE/NUTS integration, and it looks promising. They complement each other.
Not sure how much sharing I can do on this at present - but if anybody is interested in discussing this I am interested in talking. Maybe not in a public forum at present.
I came across this
www.maths.uq.edu.au/~kroese/ps/generalized_splitting.pdf
which includes an example using Rosenbrock, and also introduced me to splitting methods for rare events, among many other things - it looks interesting if I have time to digest it. It seems to be a fairly well known technique, but I had not come across it.
Secondly, I can see a way to integrate NUTS and DREAM and BFGS into a single sampler, and will report back when or if I get some decent results. If anybody wants to know more, contact me privately at goo...@essence.co.uk
Some arm waving:
- with multiple chains, different chains can use different samplers.
- there needs to be some way of cross fertilising between different chains, e.g. as in parallel tempering, or using 'other' chains to generate Hessians.
- different samplers have different overheads/function evaluations per sample. You can have some chains heavy duty, some chains light duty.
- you don't necessarily have to integrate different techniques into a single sampler, different techniques can be spread across different chains.
- adapting more than one parameter in a chain (epsilon plus something else) starts to get tricky and can easily become unstable. I tried once to adapt temperatures and epsilon in a single tempered scheme, and it was a disaster.
Concerning Hessians:
- there are (at least) two distinct problems.
- One is how to move between different modes. In this case, too much forcing of positive definiteness can hinder such moves in a Hamiltonian scheme. When at a saddle point, we want to move across the saddle point to another mode, so we want a Hessian with a suitable very small positive eigenvalue in that direction. But it can remain very difficult to get up to the saddle point in the first place, so some scheme which allows the Hessian to be ill conditioned and do long steps is perversely useful.
- I'm not sure how it works to have negative weights .... I once had a student friend at Cambridge who did a PhD in what happens to the laws of physics if you go faster than the speed of light, and he never recovered his sanity.
- Another is how to sample efficiently within a mode. In this case a good Hessian can be useful, but we don't want it contaminated by information from chains which are in some different mode.
- It is almost as if we want both a local positive definite Hessian and a long range ill conditioned Hessian.
- this is similar to proxy modelling which has a linear regression component which models long range effects, and a more local radial basis function component to ensure interpolation.
- As we know, defining Hessian approximations which preserve balance etc. are tricky.
- maybe stochastic techniques such as CMAES could be useful - I have tried and rejected this well known method for optimisation simply because I didn't like the way it handled simple bound constraints.
In short, BFGS or other Hessian schemes which have been tested only on relatively simple single mode problems are probably going to demonstrate some shortcomings in more complex problems, which their proponents already acknowledge. The background of these methods is in local optimisation, not global optimisation.
The splitting papers referred to above have a lot of examples of application to multi mode problems.
I also came across http://www.mathcs.emory.edu/~benzi/Web_papers/bgl05.pdf
Of course, when modelling PDE's, indefinite Hessians are not unusual, so presumably a Hamiltonian HMCMC formulation should be able to handle indefinite Hessians?
Multi level/multi grid methods are also discussed.
who is talking about a random-walk metropolis algorithm? :-)
> > Concerning Hessians:
> > ...
> > - Another is how to sample efficiently within a mode. In this case a good Hessian can be useful, but we don't want it contaminated by information from chains which are in some different mode.
>
> This isn't just modes, but parts of the space with very different geometries.
agreed :-)
ps. but with suitable swapping, all the chains can mix nicely?
Let's remember the problem which simple Hamiltonian methods (or MH or Gibbs) are unlikely to handle adequately, from my limited understanding: multi mode problems. Parallel tempering may be an approach, but it is quite expensive IMHO, and self adaptation of temperatures, as well as epsilons, is still seeking a robust stable solution?
An example of the problem is given in section 4.4 of http://espace.uq.edu.au/eserv/UQ:178513/splitting_method.pdf.
The splitting method seems to give remarkable results.
Caveat - I am well known for jumping on the last paper I read as the silver bullet for everything. Always the optimist.
It is very easy to define a problem which no algorithm will solve in practice.
ps. I am not using any of these techniques to do an optimisation, I have much better optimisation methods to do that. My interest is Bayesian uncertainty analysis using difficult complex likelihood functions. I only introduced optimisation because it is my very firm belief that cross fertilisation from different disciplines will (almost) always lead to useful new ideas. It is one of the benefits of commerce (or real actual problems) that they jolt academia away from their silos.
As Boris Johnson, Mayor of London, said today, these Olympics have inspired a generation, and "have produced such paroxysms of tears and joy on the sofas of Britain that you probably not only inspired a generation but helped to create one as well."
It reminds me, going back to the multi mode issue. I had a client a few years ago, who had an oil reservoir model, and they seemed to have two different modes, one with high permeability/low porosity, and the other with low permeability/high porosity. Both modes matched historical measurements equally well. One mode gave a high future production forecast, the other had production falling rapidly. But they had to make a decision whether to divest the field or not (worth many billions of USD), and had little way of knowing whether the sampling was properly weighting the likelihoods in the two modes.
In practice, they decided to perform further measurements to distinguish the modes. Turned out the low production forecast model was the correct one.
Of course, they should have presented the 'high production forecast' model as the truth and flogged the oil field off to somebody who didn't know how to do uncertainty analysis.
Does that help explain why multi mode and complex distributions are of great interest, even though they may pose many difficulties? HMCMC makes many intractable problems tractable, which is why I use it as the basis.
To put it into perspective, proper Bayesian forecasting is still in its infancy in the oil and gas reservoir engineering industry.
The same issues arise, for example, in weather forecasting models.
Got to go catch a flight to Houston from London.
I want an algorithm which can reliably sample a 400 dimensional complex likelihood function, multi modal with very high peaks and troughs, in, say, 100,000 function evals.
On a standard laptop or workstation, quad core CPU.
Fully adaptive, no user intervention or tuning.
Yes, I'm a dreamer!
At present 100,000 function evals takes several hours on my example test problem.
If I can speed up by a factor of ten from my current DREAM implementation, I should be OK, or at least ahead of the pack. NUTS is, I hope, a major contributor to achieve this.
I may wait until next year to implement a reliable method to handle large numbers of distinct modes....
The reservoir simulation and decision process is part of a much longer process, building geological models, determining critical parameters, etc etc etc.
So there may be a 3 month project, considering much more than just technical factors, safety etc is important, and also partners have to be kept on board.
Of which your simulation and uncertainty/optimisation project is squeezed into the last week.
You may still have a hard deadline, and only a few days before have an agreed reliable reservoir model, so the week before you try to do the final steps, find bugs in the software and/or model, and then hope and pray that when you restart it late on Friday (a) it will work (b) the mother in law isn't expecting you to entertain her (c) the IT department don't do a reboot of all the computers.
Then you present the P50, P10, P90 to the manager on Monday morning, and the manager asks what the realistic possible downside is, and you give some answer in a very confident voice, which is probably just the worst simulation you have done SO FAR, not at all the worst possible downside.
Ten years ago multi billion dollar decisions were made on the basis of a SINGLE reservoir model, which did not even match historical production and pressures.
I'm already running all my hyperthreads at 100%, but use threads calculating the function rather than using threads for the different chains. I used to use threads for chains, but sometimes you want different samplers in different chains, and the samplers have different performance characteristics, so then splitting chains between threads does not fully utilise them. Eg a DREAM thread and NUTS thread - not the best way to do it.
Fear not, I am fully multi threaded from the ground up, at an appropriate level of granularity.
Of course, how to multi thread is very much problem specific. Most people would want to have a thread/chain architecture.
But even with a thread/chain architecture, you could have 21 simple chains using 7 threads and 1 complex chain (e.g. NUTS) using 1 thread, depending on how many leapfrog steps NUTS takes per sample.
I have also, btw, experimented with GPU, and have done a lot of profiling and benchmarking and tuning.
For example - the cdf of a Gaussian is very expensive, so I use a good enough approximation which is very fast. Similarly, I use Matern correlation functions, but not the exact function, a good approximation is good enough.
There is a general lesson - balance out your errors! If your sampling is going to give you something +/- 2%, why are you looking for exact values of special functions?
I even do an approximation of exp.
Of course, you have to be careful that your approximations are smooth and differentiable etc., and you have to use them consistently.
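As an illustration of the 'good enough special function' idea (not the production approximation): the classic Abramowitz & Stegun 7.1.26 erf approximation gives the Gaussian cdf to roughly 1.5e-7 absolute error with one exp and a short polynomial:

```java
/** Fast Gaussian cdf via the Abramowitz & Stegun 7.1.26 erf
 *  approximation (max abs error about 1.5e-7). Illustrates trading a
 *  little accuracy for speed; not the exact production approximation. */
public class FastPhi {
    public static double cdf(double x) {
        double z = Math.abs(x) / Math.sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * z);
        // Horner form of a1*t + a2*t^2 + ... + a5*t^5
        double poly = t * (0.254829592 + t * (-0.284496736
                    + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        double erf = 1.0 - poly * Math.exp(-z * z);
        double phi = 0.5 * (1.0 + erf);
        return x >= 0.0 ? phi : 1.0 - phi; // symmetry handles negative x
    }
}
```

Against a sampler whose own error is measured in percent, an absolute error of 1.5e-7 in the cdf is invisible, which is the balance-your-errors point above.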
But the biggest improvements, once you have done all the tuning, is to find better algorithms.
I am not in control of the computers, I am developing an end user application. Some of the clients have access to thousands of powerful computers in a cluster, some only have a stand alone laptop.
If it were just for me, I would buy a 4 socket AMD 16 core machine - you can get them for under $10K. 64 threads. I don't have the time and energy to build a simple cluster myself.
I would guess that a set of completely independent runs is less efficient than a set of chains which can mix and cross fertilise.
For example, with DREAM it is definitely beneficial to have a large number of chains in a run; 2xD is recommended.
So you can't just set off independent jobs to independent computers and aggregate.
I am only writing this now because I am taking a break as one of my clients has set off 200 simulation jobs to their cluster, and the results of the engineering jobs will be complete tomorrow morning. My own software controls and manages all these simulation runs, in a multi threaded environment using a producer/consumer paradigm (Ben at least will know what that means). The jobs can take different times, so as soon as one is finished I set off another one, and each time I do a MCMC simulation.
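The producer/consumer pattern described above can be sketched with Java's `ExecutorCompletionService`: prime the pool, then each time a job completes, consume its result and immediately submit the next. The job body here is a stand-in for a real simulation run:

```java
import java.util.concurrent.*;

/** Producer/consumer job management: keep N workers busy, and as each
 *  job finishes, collect its result and submit the next one. The job
 *  itself is a stand-in; real jobs are external simulation runs. */
public class JobManager {
    public static long runAll(int totalJobs, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CompletionService<Long> done = new ExecutorCompletionService<>(pool);
        int submitted = 0;
        long sum = 0;
        // Prime the pool with one job per worker.
        for (; submitted < Math.min(workers, totalJobs); submitted++) {
            final long id = submitted;
            done.submit(() -> id * id); // stand-in for a simulation job
        }
        for (int finished = 0; finished < totalJobs; finished++) {
            sum += done.take().get();    // consume the next completed result...
            if (submitted < totalJobs) { // ...and immediately produce another job
                final long id = submitted++;
                done.submit(() -> id * id);
            }
        }
        pool.shutdown();
        return sum;
    }
}
```

Because `take()` returns whichever job finishes first, jobs of very different durations still keep all workers saturated, which is the point when individual simulations can take anywhere from minutes to hours.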
Which reminds me - potentially one of the biggest algorithmic improvements I envisage is learning how to take full advantage of previous MCMC samples when generating a new set: each time, the underlying likelihood function has changed, but it gradually converges. I already use this fact to initialise the chains intelligently when I restart the MCMC, but could do a lot more. One obvious thing is that I can avoid the burn-in after the first time, or at least make it much shorter. Can I sample from the previous set of samples? I'm sure I can do a lot in this area, but not this week.
To be clear I have a workflow:
(a) construct likelihood function
(b) perform MCMC (ideally << 1 hour)
(c) perform full engineering simulation study (0.1 - 24 hours)
(d) update likelihood function
(e) perform MCMC (ideally << 10 mins)
(f) go to (c) until finished
ps one of my clients did, indeed, use their heavyweight computer cluster to do the MCMC part, but that was with a basic RW algorithm back in 2005 or so. In the dark ages.
Seriously, tell me if my input is of interest or a distraction, and I will stop writing essays. There can't be many people who are developing commercial end user applications using advanced MCMC methods, are there?
Yes, I use the latest concurrent package in Java. I played around a lot with how to reduce overhead
- callable v runnable
- how to communicate back results (e.g. using Future's)
- new inner classes or reuse existing classes
- how to allocate 1000 jobs to 8 threads using queues (do it one job at a time with a fixed length thread pool, or group it into sets of 125 or whatever)
These things can make a very big difference if your threads are at a lower level of granularity. For example, I have a common pool which is reused throughout the life of the application.
I have tested threads at very fine granularity (e.g. per row in a matrix multiplication), but generally do it at a higher level.
Unrolling loops does not help much (unlike in C), but it is important to keep loops simple so the Java optimiser has a chance.
Java is very fast, but as far as I know is still restricted because the JVM cannot take advantage of vector processing on Intel chips. To be honest, it is difficult to find definitive answers about this - but I think I am correct. It would be so nice if the JVM was intelligent enough to be truly portable across different CPU's with different vector processing capabilities.
Bottom line is that java threads have little overhead IF AND ONLY IF you do it right. They can have a very big overhead if you do it according to some template or tutorial you find on the web. The only thing to do is to try ten different ways and benchmark. Also be very wary of profiling tools, they can seriously mislead. There is some very low level locking code which is normally very fast, but becomes a bottleneck as soon as you try to profile.
Some managers in oil and gas don't want to even hear the word probability or statistics. But it is changing.
Hardware for major oil companies moves slowly. They are sometimes a generation behind in operating systems (eg running Windows XP today).
Yes, I write all my own matrix arithmetic. I only ever use (up to now) symmetric positive definite matrices.
I did a lot of benchmarking to find the fastest approach.
I likewise rejected all available Java matrix packages I could find. None of them deal with a symmetric matrix as a row (or column) ordered single array. Hopeless.
I translated L-BFGS-B from Fortran to Java, and so had to translate a little Fortran LAPACK (or it may be Linpack) from Fortran to Java, but those were the easy bits to translate.
I also do a lot of tricks with things like B·V⁻¹·Bᵀ - I don't do any matrix multiplications, I construct U⁻¹·Bᵀ, and then construct the upper triangular result matrix from that. The result is then, of course, factorised again.
Obvious tricks, but worth the effort, they can double speed.
[ B is a matrix, V⁻¹ is the inverse of a positive definite matrix V, for which I have the Cholesky factor U. ]
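Spelled out under one concrete convention (lower Cholesky V = L·Lᵀ, so B·V⁻¹·Bᵀ = Wᵀ·W with W = L⁻¹·Bᵀ from a triangular solve), the trick looks like this sketch with plain dense arrays - the real code would work on packed symmetric storage:

```java
/** Computes R = B * V^{-1} * B^T without forming V^{-1}: lower Cholesky
 *  V = L L^T, forward-substitute W = L^{-1} B^T, then R = W^T W.
 *  Plain dense arrays for illustration only. */
public class SymTrick {
    /** Lower Cholesky factor of SPD matrix v (n x n). */
    static double[][] chol(double[][] v) {
        int n = v.length;
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j <= i; j++) {
                double s = v[i][j];
                for (int k = 0; k < j; k++) s -= l[i][k] * l[j][k];
                l[i][j] = (i == j) ? Math.sqrt(s) : s / l[j][j];
            }
        }
        return l;
    }

    /** R = B V^{-1} B^T for B (m x n) and SPD V (n x n). */
    public static double[][] bvinvbt(double[][] b, double[][] v) {
        int m = b.length, n = v.length;
        double[][] l = chol(v);
        double[][] w = new double[n][m];          // W = L^{-1} B^T by forward substitution
        for (int col = 0; col < m; col++) {
            for (int i = 0; i < n; i++) {
                double s = b[col][i];             // (B^T)[i][col]
                for (int k = 0; k < i; k++) s -= l[i][k] * w[k][col];
                w[i][col] = s / l[i][i];
            }
        }
        double[][] r = new double[m][m];          // R = W^T W, symmetric: fill one triangle
        for (int i = 0; i < m; i++)
            for (int j = 0; j <= i; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++) s += w[k][i] * w[k][j];
                r[i][j] = r[j][i] = s;
            }
        return r;
    }
}
```

The saving is that everything is triangular solves and a symmetric rank update, never a general matrix multiply or an explicit inverse.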
By default my number of threads is the number of hyperthreads. My threads tend to be small, which probably helps. My CPU is certainly showing 100% utilisation.
Memory bandwidth tends to become the bottleneck.
Unless I prop my i7 Toshiba laptop up on a couple of coffee mugs to aid cooling, I find that my MCMC calculations cause overheating 99% of the way through and I lose everything.
I had a similar problem a few years back with a Pentium laptop in Rio working for Petrobras, my laptop got even hotter than the friendly natives and used to crash. Must make sure I get some tickets for the next Olympics. I never got to go to any London event even though my home is London. The two tickets I got went to my sons. I maxed out in 1972 when I got to go to every single athletics event at Munich.