A few questions regarding emcee and MCMC


JP

Jul 20, 2021, 10:10:21 AM
to emcee users
Hi there,

I have a few questions regarding emcee, some specific to the package and others concerning the statistical analysis itself, that I couldn't find answers for, or that I would just like to confirm are correct before presenting the results obtained in my research.


1. The thin parameter
Using thin=n takes the chain and only returns every nth sample.
The way I see it, the only advantages are saving disk space when storing the chain and saving time in operations performed after the chain is complete (like rendering fewer points on a plot).
However, it has a pretty obvious disadvantage: if I set n too high, I will be discarding relevant information.
(Question 1.1) Is this analysis correct or are there any further advantages/disadvantages?
(Question 1.2) What analysis should I be doing to obtain the optimal thin value?
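For concreteness, this is roughly what I'm doing, following the tutorials (just a sketch; sampler is an emcee.EnsembleSampler that has already run, and the choice of thinning by about half the autocorrelation time is taken from the docs):

import numpy as np

# assumes sampler has already been run
tau = sampler.get_autocorr_time()
burnin = int(2 * np.max(tau))   # discard a couple of autocorrelation times
thin = int(0.5 * np.min(tau))   # keep about one sample every tau/2 steps
flat_samples = sampler.get_chain(discard=burnin, thin=thin, flat=True)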


2. The autocorrelation time
In the Saving & monitoring progress section of the emcee documentation, we create a plot showing the autocorrelation time τ (averaged across dimensions) and set the criteria to stop running our MCMC: the number of steps N must satisfy N > 100τ, and τ should be changing by less than 1% (see the sketch at the end of this section). However, in the paper emcee: The MCMC Hammer, on page 11, we're told that we should run the sampler for at least 10 autocorrelation times.
(Question 2.1) Is there a reason for this difference?
(Question 2.2) Shouldn't the autocorrelation time τ be actually called the autocorrelation steps?
In the section Autocorrelation analysis & convergence, further ways of measuring the autocorrelation time are shown; however, honestly, I don't quite understand what is being done there.
(Question 2.3) On a high level, what are the differences between the methods used to measure the autocorrelation time in the two previously mentioned sections of the documentation?
In the paper mentioned previously, another way to measure convergence was proposed: watching how the parameter values change as a function of the number of steps, which is also done in the Fitting a model to data section.
(Question 2.4) Is there any difference between testing for convergence using the autocorrelation time (i.e. using N > 100τ and Δτ < 1%) and watching how the values change as a function of the number of steps?
(Question 2.5) Out of curiosity, are there any other methods traditionally used to measure convergence?
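For reference, this is the stopping rule from Saving & monitoring progress as I've implemented it (a sketch; sampler and initial_state are set up elsewhere in my code):

import numpy as np

max_n = 100000
old_tau = np.inf

for sample in sampler.sample(initial_state, iterations=max_n, progress=True):
    # only check convergence every 100 steps
    if sampler.iteration % 100:
        continue

    # tol=0 always returns an estimate, even for a short chain
    tau = sampler.get_autocorr_time(tol=0)

    # the two criteria: N > 100 * tau, and tau changed by less than 1%
    converged = np.all(tau * 100 < sampler.iteration)
    converged &= np.all(np.abs(old_tau - tau) / tau < 0.01)
    if converged:
        break
    old_tau = tau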

3. Saving data to .h5 files
Following the section Saving & monitoring progress of the emcee documentation, I am able to save my chain to a HDF5 file on my disk, to prevent data loss, if my program were to crash unexpectedly.
To my surprise, the performance did not decrease when writing to disk. However, not only do I have no idea how much information the chain actually holds (the file size is ≈ 13 KB for a 10,000 step chain), but I also have an NVMe SSD.
(Question 3.1) Considering that I might leave this running on a server with an older HDD, how should I expect the chain size to grow as a function of the number of steps?
(Question 3.2) If a lot of steps are added, will the write operations eventually slow the program down, or does the HDF5 file format already account for that?

4. The number of walkers
In the previously mentioned paper we are told, on page 11, that we are meant to use hundreds of walkers. However, in a discussion here on this mailing list, the author of this package, Dan, said that he uses 64 walkers or fewer, and another emcee user in that same discussion said that you can have at least 2N walkers, where N here is the number of cores, without performance issues.
Doubling the number of walkers does seem to double the execution time, just as the paper says, but convergence does not seem to speed up, even though more independent samples should be returned.
(Question 4.1) What analysis should I be doing to ensure the optimal number of walkers for the problem at hand (i.e. not spending too much time, nor risking walkers getting stuck)?
(Question 4.2) Out of curiosity, why does having up to 2N walkers make no difference in performance?


I have to say that the documentation, at least the tutorials, is really good; I was able to start using MCMC really quickly and successfully replicate plots found in the literature.
This post ended up being (much) longer than expected; hopefully it can lead to an interesting discussion (it definitely will for me!) and be useful to future emcee users.

Thank you for your time,
José Ferreira

Dustin Lang

Jul 20, 2021, 10:17:30 AM
to JP, emcee users
By the 2N walkers, what I meant was:

Each step, emcee steps half your walkers and computes their likelihoods in parallel.
Then it steps the other half of the walkers and computes likelihoods, in parallel.
If you have N cores and 2N walkers,
each half-step will run N likelihoods in parallel on N cores.

That's the only point I'm trying to make.
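Schematically (just a sketch; ndim, log_prob, initial_state, and nsteps are placeholders):

from multiprocessing import Pool
import emcee

ncores = 4               # N cores
nwalkers = 2 * ncores    # 2N walkers: every core busy on each half-step

with Pool(ncores) as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
    sampler.run_mcmc(initial_state, nsteps)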





JP

Jul 29, 2021, 1:54:33 PM
to emcee users
I would like to answer my own question (3.2), because only now have I found out that I wasn't actually saving the data at all: I had forgotten to pass the backend to emcee.

As it turns out the size is much bigger: for a chain with 25,000 steps (a maximum of 50,000 was possible, but it converged earlier) and 2 parameters, I have a 189 MB file holding the entire chain. Writing to the backend also cut performance to less than half, even on an NVMe SSD (although this particular partition is NTFS): it was 120-140 it/s without the backend and is 40-60 it/s with it.
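For anyone else hitting this, the missing piece was passing the backend to the sampler, as in the tutorial (the filename and other names are from my setup):

import emcee

backend = emcee.backends.HDFBackend("chain.h5")
backend.reset(nwalkers, ndim)  # clears any existing data in the file
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, backend=backend)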

So now I would like to ask: is it possible to merely store the data in memory until a given threshold is reached (either in GB or as a percentage of total memory) and, after that criterion is met, dump the values to disk?

William Heymann

Jul 29, 2021, 3:15:54 PM
to emcee users
This is actually pretty easy to do. Instead of running for a static number of steps (which may be far too many or far too few), you can run emcee as an iterator. That makes it very easy to take care of all of the writing to disk yourself. I accumulate in RAM and then write out periodically.

I run for 52 integrated autocorrelation times and discard the first two as burn-in. This should ensure that your chain is independent of the starting point and doesn't use a static number of steps, which could be wildly wrong.

I use something approximately like this (the code has been cut down to simplify it). You can easily add a time check to this loop and write to disk periodically.

# Run walkers as an iterator, accumulating the chain in RAM.
# Assumed defined elsewhere: sampler (an emcee.EnsembleSampler) and
# starting_state (the initial walker positions).
import numpy as np
from emcee import autocorr

def addChain(*args):
    # Concatenate chain segments along the step axis, skipping None
    # (so the first call can start from an empty accumulator).
    temp = [arg for arg in args if arg is not None]
    if len(temp) > 1:
        return np.concatenate(temp, axis=1)
    else:
        return np.array(temp[0])

finished = False
step = 0
tol = 52          # run for 52 integrated autocorrelation times
tau_check = 100   # re-estimate tau every tau_check steps
chain = None      # accumulated samples, shape (nwalkers, nsteps, ndim)
prob = None       # accumulated log-probabilities, shape (nwalkers, nsteps)
taus = None       # history of tau estimates

state = starting_state

while not finished:
    state = next(sampler.sample(state, iterations=1))

    p = state.coords
    ln_prob = state.log_prob

    chain = addChain(chain, p[:, np.newaxis, :])
    prob = addChain(prob, ln_prob[:, np.newaxis])

    if step % tau_check == 0:
        try:
            # Raises AutocorrError while the chain is shorter than tol * tau.
            tau = sampler.get_autocorr_time(tol=tol)

            print(f"step {step}  progress {step / (tau * tol)}")

            # A successful call can still return nan if checked too early,
            # so only stop on a clean estimate.
            if not np.any(np.isnan(tau)):
                finished = True

        except autocorr.AutocorrError as err:
            tau = err.tau  # the current (not yet reliable) estimate

            print(f"step {step}  progress {step / (tau * tol)}")

        taus = addChain(taus, tau[:, np.newaxis])
    step = step + 1

print("done processing")

# Discard the first two integrated autocorrelation times as burn-in.
burn = int(np.ceil(np.max(tau)) * 2)

burn_chain = chain[:, :burn, :]
burn_prob = prob[:, :burn]

chain = chain[:, burn:, :]
prob = prob[:, burn:]

print("finished")


JP

Jul 29, 2021, 5:21:30 PM
to emcee users
Interesting, I wasn't aware that you could do that.

This surely beats my idea of creating a RAM disk, which would be outside the scope of my program to manage automatically, since it would require root privileges.

Could you explain why you're catching an exception when getting the autocorrelation?

William Heymann

Jul 30, 2021, 2:47:08 AM
to emcee users
The autocorrelation function has two failure modes.

If you check it too early, it might think the system has converged but return all or mostly NaN values. The other is that if you call it before the chain has run for the number of autocorrelation times given in the function call, it raises an error, which makes it really easy to work with. You can just run until it no longer raises an error or returns NaNs.
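In other words, schematically (sampler is assumed to be your EnsembleSampler):

from emcee import autocorr

try:
    tau = sampler.get_autocorr_time(tol=50)
except autocorr.AutocorrError as err:
    tau = err.tau  # chain still too short; err carries the current estimate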
