Hi there,
I have a few questions about emcee, some specific to the package and others about the statistical analysis itself, that I couldn't find answers to, or that I would like to confirm before presenting the results of my research.
1. The thin parameter
Using thin=n takes the chain and only returns every nth sample.
As I see it, the only advantages are saving disk space when storing the chain and saving some time in post-processing once the chain is complete (e.g. rendering fewer points on a plot).
However, it has a fairly obvious disadvantage: if I set n too high, I discard relevant information.
(Question 1.1) Is this analysis correct or are there any further advantages/disadvantages?
(Question 1.2) Which analysis should I be making to obtain the optimal thin value?
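To make sure I understand the mechanics, here is how I picture what thin=n does, sketched with plain NumPy slicing on a toy array (the shapes and thin value are made up for illustration):

```python
import numpy as np

# Toy "chain": 10000 steps, 32 walkers, 3 parameters (made-up shapes,
# just to illustrate what thinning does to the sample count).
rng = np.random.default_rng(0)
chain = rng.normal(size=(10000, 32, 3))

thin = 15
thinned = chain[::thin]  # keep every 15th step, as thin=15 would

print(chain.shape)    # (10000, 32, 3)
print(thinned.shape)  # (667, 32, 3): ceil(10000 / 15) steps survive
```

So a factor of thin=15 cuts the stored samples by roughly 15x, which matches my reading that it mainly trades information for space and post-processing time.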
2. The autocorrelation time
In the Saving & monitoring progress section of the emcee documentation, we create a plot that shows the autocorrelation time τ (averaged across dimensions) and set the criteria to stop running our MCMC: the number of steps N must satisfy N > 100τ, and τ should be changing by less than 1%. However, in the paper emcee: The MCMC Hammer (page 11), we're told that we should run the sampler for at least 10 autocorrelation times.
(Question 2.1) Is there a reason for this difference?
(Question 2.2) Shouldn't the autocorrelation time τ be actually called the autocorrelation steps?
(Question 2.3) On a high level, what are the differences between the methods used to measure the autocorrelation time in the two previously mentioned sections of the documentation?
In the paper mentioned previously, another way to assess convergence was proposed: inspecting how the parameter values change as a function of the number of steps, which is also done in the Fitting a model to data section.
(Question 2.4) Is there any difference between testing for convergence using the autocorrelation time (i.e. requiring N > 100τ and Δτ < 1%) and using the change in parameter values as a function of the number of steps?
(Question 2.5) Out of curiosity, are there any other methods traditionally used to measure convergence?
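For concreteness, this is my reading of the stopping rule from the Saving & monitoring progress tutorial, written as a standalone function (the function name and signature are mine, not emcee's; tau_old and tau_new would be successive per-parameter estimates, e.g. from sampler.get_autocorr_time):

```python
import numpy as np

def converged(tau_old, tau_new, nsteps, length_factor=100, rel_change=0.01):
    """My reading of the tutorial's stopping rule: the chain is long
    enough (nsteps > length_factor * tau for every parameter) AND the
    tau estimate has stabilized (relative change below rel_change for
    every parameter)."""
    tau_old = np.asarray(tau_old, dtype=float)
    tau_new = np.asarray(tau_new, dtype=float)
    long_enough = np.all(nsteps > length_factor * tau_new)
    stable = np.all(np.abs(tau_old - tau_new) / tau_new < rel_change)
    return bool(long_enough and stable)

# tau barely changed and the chain is longer than 100 tau -> converged
print(converged([24.8, 30.1], [25.0, 30.0], nsteps=5000))  # True
# same tau estimates, but the chain is too short -> not converged
print(converged([24.8, 30.1], [25.0, 30.0], nsteps=2000))  # False
```

If that reading is right, the 100τ threshold is just a (stricter) length requirement of the same kind as the paper's 10τ, plus an extra check that the τ estimate itself has settled.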
3. Saving data to .h5 files
Following the Saving & monitoring progress section of the emcee documentation, I am able to save my chain to an HDF5 file on disk, to prevent data loss if my program crashes unexpectedly.
To my surprise, performance did not decrease when writing to disk. However, not only do I have no idea how much information the chain actually holds (the file size is ≈ 13 KB for a 10000-step chain), but I also have an NVMe SSD.
(Question 3.1) Considering that I might leave this running on a server with an older HDD, how should I expect the chain size to grow as a function of the number of steps?
(Question 3.2) If many steps are added, will the write operations eventually slow the program down, or does the HDF5 file format already account for that?
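My own back-of-envelope estimate (assuming float64 samples and ignoring HDF5 overhead and any extra arrays such as log-probabilities, so this is a sketch, not emcee's actual on-disk layout) would be that the chain grows linearly with the number of steps:

```python
def chain_bytes(nsteps, nwalkers, ndim, bytes_per_value=8):
    """Rough size of the raw samples alone, assuming one float64 per
    (step, walker, parameter); real HDF5 files add some overhead."""
    return nsteps * nwalkers * ndim * bytes_per_value

# e.g. 10000 steps, 32 walkers, 3 parameters:
size = chain_bytes(10000, 32, 3)
print(size)                 # 7680000 bytes
print(size / 1024 / 1024)   # ~7.3 MiB
```

That estimate is far larger than the ≈ 13 KB I'm seeing, which is part of why I'm unsure how much information the file actually holds.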
4. The number of walkers
In the previously mentioned paper (page 11) we are told that we are meant to use hundreds of walkers. However, in a discussion here on this mailing list, the author of this package, Dan, said that he uses 64 walkers or fewer, and another emcee user in that same discussion said that you can have at least 2N walkers, where N is the number of cores, without performance issues.
Doubling the number of walkers does seem to double the execution time, just as the paper says, but convergence does not seem to speed up, even though it should be returning more independent samples.
(Question 4.1) What analysis should I make to find the optimal number of walkers for the problem at hand (i.e. not spending too much time, nor risking walkers getting stuck)?
(Question 4.2) Out of curiosity, why does having up to 2N walkers make no difference in performance?
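My guess at 4.2, which I'd like confirmed: since the parallel stretch move updates the ensemble in two halves, each half-step issues nwalkers/2 independent log-probability calls, so with N cores anything up to 2N walkers can be evaluated in a single parallel batch per half-step. A toy cost model (entirely my own simplification, counting only sequential batches and ignoring per-call variation and overhead):

```python
import math

def halfstep_batches(nwalkers, ncores):
    """Toy cost model: each half-step runs nwalkers / 2 independent
    log-probability calls; with ncores workers that takes
    ceil((nwalkers / 2) / ncores) sequential batches, and a full
    step is two half-steps."""
    return 2 * math.ceil((nwalkers / 2) / ncores)

ncores = 4
for nwalkers in (4, 8, 16):
    print(nwalkers, halfstep_batches(nwalkers, ncores))
# 4 walkers -> 2 batches; 8 walkers (= 2N) -> still 2 batches,
# i.e. no extra wall time; 16 walkers -> 4 batches, twice as long
```

If this model is roughly right, it would explain both the "2N walkers for free" remark and why doubling the walkers beyond that doubles my execution time.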
I have to say that the documentation, and the tutorials in particular, is really good; I was able to start using MCMC very quickly and to successfully replicate plots found in the literature.
This post ended up being (much) longer than expected; hopefully it can lead to an interesting discussion (it definitely will for me!) and be useful to future emcee users.
Thank you for your time,
José Ferreira