> Or maybe we need to test on some other models? As Michael said, these
> models are all relatively easy to fit.
These tests were only to validate that the new code wasn't messing anything
up, not to motivate its existence. If every algorithmic change gets micromanaged
like this we're going to very quickly end up with stagnant samplers.
> Are the MCSE plots absolute or normalized by expected parameter
> size? I ask because I'm surprised there isn't more range given the wide
> range of param values. Or are all these models standardized in some way?
Absolute. The parameters plotted are those with expected values
in the tests so any standardization is incidental.
> What is the plot labeled "Variances"?
Those are the marginal variances used for the mass matrix elements.
The fact that they're the same indicates that the adaptation parameters
are not changing.
> If we don't trust the expected values, how am I supposed to
> interpret (mean - expected)/MCSE? I see no reason why the
> flex adaptation wouldn't be doing better because of the lower
> step size. Can I trust the MCSE values?
Yes, there are no evident pathologies to suggest otherwise.
> Good cases for
> comparison are oxford and salm/salm2, where MCSE looks better for the flex
> adaptation but (mean - expected)/MCSE looks better for develop.
There's really no statistical difference in those distributions; they're
both equally consistent with N(0, 1). In fact, that's true for just about
all of these. Trying to make any judgement by eye is a terrible idea;
these plots are just about identifying gross features.
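For concreteness, here's a minimal sketch of the check being described (Python, with made-up placeholder numbers rather than the actual test output): form z = (mean - expected) / MCSE for each parameter and ask whether the collection is consistent with N(0, 1).

import numpy as np
from scipy import stats

# Placeholder numbers for illustration only; the real values come from the
# sampler output and the known expectations for each test model.
posterior_mean = np.array([1.02, -0.31, 4.87, 0.15])
expected_value = np.array([1.00, -0.30, 5.00, 0.10])
mcse           = np.array([0.03,  0.02, 0.10, 0.05])

# If the chains are unbiased and the MCSE estimates are accurate, these
# standardized errors should look like independent draws from N(0, 1).
z = (posterior_mean - expected_value) / mcse

# Crude consistency check; with only a handful of parameters it has little
# power, which is exactly why eyeballing differences between branches is
# unreliable.
print(stats.kstest(z, "norm"))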
> It seems that the speed issues are directly related to the
> lower step sizes and increased tree depth. This is why I'd
> like to see flexible adaptation tested with the old target acceptance
> rate. Or the develop branch tested with a higher target acceptance
> rate. Can't that be done by just changing the config for acceptance
> rates on develop and flexible adaptation branches?
The speed difference is really due to subtleties in NUTS. As the step size
drops the ESS gets better, but not enough to compensate for the superlinear
cost of building the NUTS tree.
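To make that trade-off concrete, a rough back-of-the-envelope sketch (my own simplifying assumptions, not measurements from either branch): if the integration time needed to explore stays roughly fixed, halving the step size adds one to the tree depth and so doubles the gradient evaluations per iteration, while the ESS per iteration improves by much less than a factor of two.

import math

# Illustrative assumptions only: a fixed total integration time T and a
# hypothetical ESS-per-iteration curve that improves slowly as the step
# size shrinks.  None of these numbers come from the actual test runs.
T = 1.0

def grad_evals_per_iter(step_size):
    # NUTS doubles the trajectory until it spans roughly T, so the number
    # of leapfrog steps (and gradient evaluations) grows like 2**depth.
    depth = math.ceil(math.log2(T / step_size))
    return 2 ** depth

def ess_per_iter(step_size):
    # Placeholder curve: smaller steps integrate more accurately, but the
    # gain saturates.  This is an assumption, not a NUTS result.
    return 0.5 / (1.0 + step_size)

for eps in (0.4, 0.2, 0.1, 0.05):
    cost = grad_evals_per_iter(eps)
    print(f"step={eps:5.2f}  grads/iter={cost:3d}  ESS/grad={ess_per_iter(eps) / cost:.4f}")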
Hopefully, RMHMC will improve on all the statistical properties. If we want to change the defaults, we are going to have the same discussion with the same level of scrutiny.
Damien
Ben
- I think the approach Stan has taken to choosing defaults is wrong. It shouldn't be what settings work best for most people most of the time. The defaults should be chosen to avoid disastrous behavior as much as possible. A small percentage of users can tweak the settings to make it go faster for particular special cases.
- Stan still needs more users and more developers. But I don't think there are many instances where we have convinced someone to use Stan over BUGS because the effective sample size per time is larger with Stan. Much more often, they have a model that is infeasible in BUGS that turns out to be feasible with Stan. Or conversely, they can't switch even though they want to because they heavily use some distribution / parameterization that is available for some variant of BUGS but is not in Stan yet.
Assuming nobody else has anything to add at this point,
this leaves open the question of what Marcus,
Ben, Michael, and Andrew think the process should be. I'm
specifically thinking about new code and doc in terms of:
* designing
* prioritizing
* coding
* testing
* continuously integrating
* documenting
* reviewing and merging
* releasing
* integrating (R, Python, etc.)
An extreme answer would be to skip overall design and
prioritization, put no requirements on testing and doc,
shut down the continuous integration server,
skip code review, give everyone push permission, forgo
official releases, and leave those working on RStan
and PyStan to fend for themselves on integration.
I think all these issues are very typical of projects like this. Many different pulls and pushes. Technical, personal.
My interest is purely as a commercial developer who is using y'alls good work. I do not use Stan itself; I merely took the original NUTS Matlab code, converted it to Java, and have continued to integrate it with other samplers and tune it for my own needs.
I keep an eye on this group to find out your latest research ideas, not software details. I can understand if you have difficulties letting a commercial developer see inside your project. In practice, I glean very little from this discussion forum. Presumably your research ideas are conducted much more privately.
Having previously interfaced between Java and R for commercial software, I decided for ease of deployment and control to nowadays do everything in Java.
Not sure what I can add to the discussion apart from:
- for testing, use the principle of Myers' 'The Art of Software Testing'. Take cases where there is a known correct solution. Increase dimensions until it fails. Take cases which are difficult. Take cases which are expensive. A test failure is good; try to make tests which will fail. The purpose of testing is to find problems and fix them. Simple easy tests are not tests at all.
- My main tests are a high-dimension, highly correlated Gaussian and a high-dimension adapted Haario banana (sketched after this list). Even with a million samples, NUTS has difficulty (but not as much difficulty as RWM!).
- a year or so ago I did lots of tests with NUTS, DREAM, RWM and combinations on hard problems; it is very time consuming but was absolutely necessary to persuade my client (a major oil company) that this work was valuable. Trust me, it is. I ended up using all these samplers; they all have merits, and the proportion of time for each sampler depends on the situation. I am also using them for importance sampling, with the resample-move approach.
- although I have integrated NUTS with other samplers, my basic NUTS code is unchanged from the original Matlab code. So I am interested in any algorithmic changes, and would rather not have to go through thousands of lines of code to find the changes. So maybe there is value in maintaining the original Matlab code (which is hardly any lines at all).
- so Stan is more than just code, it is the technical documentation as well.
- I do wonder about scope creep, and probably that is because I do not understand it. There are existing implementations of LBFGS-B, and there is an implementation provided in existing R libraries. So, as an outsider, I wasn't sure why it was being redone in Stan. There is an up to date Fortran version, which is very mature, so why redo it? A year ago I converted that free version to Java, and am using it, I only converted because I want to deploy everything in Java. BTW, the conversion was not at all easy, and I had to do very careful tests to make sure the Fortran and Java gave exactly the same results at every stage. It would surely be madness to try to write a new LBFGS-B from scratch, rather than starting from the existing mature code?
- similarly, I don't know how much you are using existing low level maths utilities or writing your own. Who wants another Cholesky factorisation? Particularly when you start looking at GPUs, multithreading, and other tuning. Again, I did not follow this advice myself, I wrote all my own matrix algebra routines, but that was because I have experience in that from the 80's, and because the existing Java libraries were not up to scratch.
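Since the correlated Gaussian and the Haario banana keep coming up, here is a minimal sketch of the two log densities I mean (Python for brevity; my own code is Java, and the constants below are common illustrative choices rather than exactly what I use):

import numpy as np

def log_correlated_gaussian(x, rho=0.95):
    # Zero-mean Gaussian with constant pairwise correlation rho,
    # up to an additive constant.
    d = x.shape[0]
    cov = rho * np.ones((d, d)) + (1.0 - rho) * np.eye(d)
    return -0.5 * x @ np.linalg.solve(cov, x)

def log_haario_banana(x, b=0.03, sigma1=10.0):
    # One common form of the Haario banana: a Gaussian with a large first
    # variance, twisted in the second coordinate.  The constants are
    # illustrative; published variants differ.
    y = x.copy()
    y[1] = x[1] + b * (x[0] ** 2 - sigma1 ** 2)
    return -0.5 * (y[0] ** 2 / sigma1 ** 2 + np.sum(y[1:] ** 2))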
Sorry for the intrusion....and for any misunderstanding I might have about what you actually do. I just want to support any efforts to continue your excellent work, and am intrigued what the next generation (or maintenance) of NUTS will look like. The rest of it 'I could care less' in your AmeriEnglish.
One day I will publish my validated research in an SPE (Society of Petroleum Engineers) journal. I am currently under a non-compete restriction.
I implemented a mass matrix version of NUTS. One of my implementations built it using BFGS during the burn-in phase, and then fixed it.
What was interesting was that this worked very well for my highly correlated Gaussian distribution (which is not surprising, since the mass matrix is constant over all space), but did worse than no mass matrix for the Haario banana test function.
So any test suites should include cases where mass matrices are useful and where they are not useful, and guidance given in the documentation.
Does your experience reflect mine?
In the end, I implemented diagonal mass matrices similar to the suggestions in Nocedal and Wright for a starting point for BFGS.
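To make that concrete, a sketch of the kind of scaling I mean, assuming the Nocedal and Wright heuristic in question is their standard initial inverse-Hessian scaling H0 = (s.y / y.y) I; this is an illustration of the idea, not my production code:

import numpy as np

def nocedal_wright_scale(x_prev, x_curr, grad_prev, grad_curr):
    # Nocedal & Wright's initial-scaling heuristic for BFGS:
    # H0 = (s.y / y.y) I approximates the inverse Hessian along the most
    # recent step, where s is the change in position and y the change in
    # the gradient of the negative log density.
    s = x_curr - x_prev
    y = grad_curr - grad_prev
    return float(s @ y) / float(y @ y)

def diagonal_mass_matrix(x_prev, x_curr, grad_prev, grad_curr, dim):
    # For HMC the mass matrix plays the role of a local Hessian of the
    # negative log density, so invert the inverse-Hessian scale.  This is
    # one plausible reading of the idea, not Stan's adaptation.
    h0 = nocedal_wright_scale(x_prev, x_curr, grad_prev, grad_curr)
    return np.full(dim, 1.0 / max(h0, 1e-12))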
But my point is that in just this area, testing is quite a difficult and time consuming process, even where test cases have a known analytical solution.
It is counterintuitive, but tests should focus on test cases where NUTS has a hard time. Push NUTS to its limits; find cases where it is hopeless.
The objectives in testing are not the same as the objectives when publishing in journals. The former tries to make it fail, the latter tries to make it work!
At least NUTS should fail less spectacularly than alternatives.
When you have these hard test cases, you can then make recommendations and rationales for why they are difficult and what settings are needed.
Having got a test suite of difficult problems, with analytic solutions, you can then do regression testing with every new algorithm or algorithm change.
For each test, repeat 100 times, and also use different settings and starting points.
It might make your workstations warm.
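As a skeleton of what I mean (Python for brevity; run_sampler and the model dictionary entries are placeholders, not anybody's actual API):

import numpy as np

def regression_suite(models, run_sampler, n_reps=100):
    # Re-run every model in the suite many times with different seeds and
    # starting points, and flag runs whose standardized errors against the
    # known analytic means are implausibly large.
    rng = np.random.default_rng(2013)
    failures = []
    for name, model in models.items():
        for rep in range(n_reps):
            seed = int(rng.integers(1_000_000_000))
            init = model["random_init"](seed)   # varied starting point
            draws = run_sampler(model["log_density"], init, seed=seed)
            z = (draws.mean(axis=0) - model["expected_mean"]) / model["mcse"](draws)
            if np.any(np.abs(z) > 4.0):
                failures.append((name, rep, seed))
    return failures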
In the optimisation world there is a 'standard' set of difficult test problems. What we need is an equivalent in the MCMC world, together with 'known' solutions. BUGS examples appear, from comments above, not to be this test suite. Finding the 'known' solution may be difficult.
This kind of process is what my oil company client is requesting. They do want to have some confidence (sic) in their reserves uncertainty quantification. So does the SEC. With NUTS, we have some chance of achieving this, but it needs to be demonstrated. Otherwise every Tom, Dick and Harry continues to present 10,000 RWM samples in 40-dimensional space as if it is the truth, whereas in fact it represents a biased fraction of the total uncertainty.
I only talk about this because of the discussions you are having about algorithmic changes and the testing process.
I think overall, within reason, I would rather get the right answer than the wrong answer in half the time.
I have found very few in the commercial software industry understand this basic principle! They go on and on about unit testing etc. and yet have the completely wrong psychological approach to testing.
I will be free to publish next year, and have a very open approach, even though all my competitors have the opposite tendency (Schlumberger, Halliburton et al.). I will be happy to share my work over the coming months. I am starting to write it all up. It is more an integration of lots of different recent research than anything new.
I enclose an article I wrote for an SPE publication aimed at young engineers. It hints at the issues, but my next publication will be much more revealing.
But graphs of number of function/gradient evaluations required to converge v. dimensions would be interesting, for different types of model. At least then would be able to predict how long it will take before convergence is reached - an hour or a week or a year.
What we can be sure of is that as soon as algorithms (and hardware) improve, somebody will want to use them on ever bigger and more complex models.
I have my commercial code, and also a separate simple test environment to test MCMC methods. This is all in Java, derived from Matlab.
I need to continually demonstrate to my clients the reliability of my methods, using different simple but difficult test problems, and I would normally do this using my Java test environment.
But there is no big reason why I shouldn't do this in a Matlab environment, starting again from Matt's code.
Either way, I would be happy to share code and results.
I once did a project to optimise the optimisers - I optimised the tuning parameters for genetic algorithms, including weighting and parameters of different operators, including conventional crossover and mutation, plus particle swarm and DE. A similar project could be done for MCMC. E.g. for problem X, how much time should be spent on NUTS v. DREAM v. RWM?
I guess we could continue the discussion talking about what excites those in academia and enhances their careers and gets papers published, v. what is really important for practitioners. I don't think a post doc would get very excited about spending a year doing testing of existing algorithms, although it requires somebody of post doc capability and experience.