Longer time to run MCMC with the marginalized dDHMMo distribution, and further potential reductions in memory use and calculation time.


Frédéric LeTourneux

Mar 5, 2022, 11:57:31 AM
to nimble-users

Hello Nimble community!

I’m an enthusiastic new nimble user, and I’m sorry in advance for the dictionary I’ve written here… I’m just trying to provide some context and go over the steps I’ve already taken. I’m trying to run a very memory-hungry CMR model, and I’m looking for input on how to reduce calculation time and memory use. I’m using a joint live-encounter and dead-recovery model. I have ringed birds that can be physically caught in the summer and collared ones that can be observed from a distance all year long. One of the issues is that I have a relatively large dataset (45k individuals and 30 years of data), and I’m trying to estimate survival on a seasonal basis (119 occasions total). All observation probabilities are fairly variable through time (probabilities of recapture, collar observations and recovery of dead birds), so the number of parameters to estimate quickly rises into the 300s when everything is time-dependent. On top of this, there is heterogeneity in live-encounter probability, which varies between seasons (the individuals that are often seen in winter are not the same ones that are often seen in fall or spring). To reduce the number of parameters to estimate, I’m using time random effects to model variation in the encounter probabilities and individual-level random effects to model heterogeneity in live-encounter probability.

Looking around the forum I have come across a few tips that could help with memory use and calculation time, like using the calculate = FALSE and useConjugacy = FALSE options when building the model and MCMC. I’ve gotten rid of the initial-states matrices in the model code and directly feed the initial observations and initial states from pre-defined constants. I’ve modified the traditional code that loops over all occasions to generate observation matrices, in order to specify some parameters that only occur during particular seasons. For example, I generate observation matrices with capture probabilities only for summer occasions, to avoid having the capture probability appear in all matrix slices and be turned on and off with an indicator variable (it was my understanding that these probabilities still get sampled by the MCMC even if they are multiplied by 0 to ‘turn them off’). Doing this should save on calculation time if I understand things correctly. I have also read Perry’s suggestion (here) about creating my own distribution that would ‘use the ingredients for the MCMC directly’, which would avoid inefficiently creating gigantic arrays like I’m doing now. However, I’m a bit at a loss and I’m not sure how or where to look to get started on that. I’ve seen a lot of people on this forum who have managed to write their own distributions, and I’m eager to learn, but I feel like I’m quite a ways away from being able to do that.

I have also looked at marginalizing over the latent states using the nimbleEcology distribution, as I do not necessarily need to track the latent states. Using the marginalized formulation does speed up the model and MCMC building. However, I was surprised to see that when using the dDHMMo distribution, the MCMC takes much longer to run than simply using the ‘traditional’ likelihood formulation with dcat() and the latent states. Is this normal? Are there cases where this formulation should take longer? With my sample dataset and running a very short chain (1000 iterations), running the MCMC with the marginalized parameterization takes 35-37 minutes vs. 10-12 minutes with the unmarginalized formulation. Did I make a mistake somewhere?

Finally, I’m wondering if there are any other steps I can take to reduce memory usage or calculation time. I will eventually put this on a cluster and run chains in parallel, which should help with calculation time, but right now I’m trying to reduce memory use: even with only a fraction of my dataset (11k individuals, 29 occasions), I’m already using up almost all 128 GB of RAM... I imagine that the higher the thinning, the more we save on memory usage, so some savings can probably be made there too.

I’ve attached my script and a sample of my dataset in case it helps (‘only’ 5k individuals and 7 years of data, so 29 occasions). I’ve tried to make the code as clear as possible, but I realize that there is a lot going on, so don’t hesitate to let me know if I can clarify anything. There are two nimbleModel codes in the script, one with the marginalized formulation and one without, so that anyone can compare time and memory use for both parameterizations.

Thanks in advance for any help with this!

Sincerely,


Frédéric

seasonal_model_sample_data.R
sample_data_nimbleUsers.RData

Olivier Gimenez

Mar 5, 2022, 12:51:33 PM
to nimble-users, Frédéric LeTourneux
Hi Frédéric,

Welcome to the club ;-)

You seem well advanced on the optimisation front; this is impressive.

There is another trick you might want to try. The idea is to pool individuals with the same encounter history together, potentially saving lots of iterations in the likelihood loop at lines 815-824 of your script. There are hundreds of individuals in your sample dataset with the same encounter history (with a single 1 or a single 2), so that’s worth a try. The catch is that you have individual effects (if I’m not mistaken), which might make the pooling less efficient.

The « pooled » likelihood is used routinely in frequentist programs like MARK or E-SURGE, and was suggested for NIMBLE by Daniel, Perry and Chris in https://link.springer.com/article/10.1007/s10651-016-0353-z. Chloé Nater showed me how to implement it in NIMBLE, with only slight modifications to the nimbleEcology functions to account for the number of individuals sharing a particular encounter history. This shouldn’t be a problem for you, as you have clearly become a NIMBLE expert.

In case it’s useful, here is an example: https://oliviergimenez.github.io/banana-book/hmmcapturerecapture.html#pooled-likelihood. See also live demo #8 (« skip your coffee break ») here: https://oliviergimenez.github.io/bayesian-cr-workshop/.
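And in case it helps to see the weighting idea on its own, here is a minimal, uncompiled sketch (this is not Chloé's code; the name dDHMMo_pooled and the mult argument are made up for illustration, and the dDHMMo argument names are assumed to match the current nimbleEcology release, so check them against your installed version):

library(nimble)
library(nimbleEcology)

# Hypothetical sketch: log-likelihood of one *unique* encounter history,
# weighted by the number of individuals (mult) that share that history.
dDHMMo_pooled <- nimbleFunction(
  run = function(x = double(1), init = double(1),
                 probObs = double(3), probTrans = double(3),
                 len = double(), mult = double(),
                 log = integer(0, default = 0)) {
    returnType(double())
    # log-probability of a single history, as computed by nimbleEcology's dDHMMo
    logL_one <- dDHMMo(x, init, probObs, probTrans, len, log = TRUE)
    # pooling: mult identical histories contribute mult * logL_one
    if (log) return(mult * logL_one)
    return(exp(mult * logL_one))
  }
)

To use it as a distribution in model code you would also need a matching rDHMMo_pooled (it can just return a dummy vector if you never simulate from the model) and, depending on your setup, a registerDistributions() call.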

Good luck ! 
Olivier

----------------------------------------------------------------------------------
Dr Olivier Gimenez | senior scientist at CNRS | @oaggimenez 
olivier...@cefe.cnrs.fr | https://oliviergimenez.github.io/

Due to my own family/work balance, you may get emails from 
me outside of normal working hours. Please do not feel any 
pressure to respond outside of your own working pattern.
----------------------------------------------------------------------------------

Frédéric LeTourneux

Mar 7, 2022, 11:33:02 AM
to nimble-users
Salut Olivier!

Thanks for the instantaneous reply!

You are right that this is also an avenue to explore. I had considered it but was not sure whether it was a good idea given the individual-level random effects. If we reason through this, and if I'm not mistaken, pooling all individuals with the same capture history basically assumes that they share the same fate (e.g. we kind of assume that all individuals caught at t, released and never seen again died at the same moment, even though we don't explicitly determine when that is once we marginalize over the latent states, right?). With individual-level random effects, this pooling would thus also assume that all individuals sharing a history share the same random intercept, right? I imagine some simulations would be needed to test whether this simplification is acceptable math-wise, but from a logical point of view it seems reasonable to assume that individuals with a shared capture history could also share heterogeneity in encounter probability (after all, they are either always seen on the same occasions or never seen again...)? Does this make any sense?

Another good point raised in a reply I received off-list is that with very long histories, especially with low capture probabilities like in my case, calculating the likelihood of these histories involves multiplying a lot of very small numbers, which could be computationally inefficient and could also offset the gain of marginalizing over the latent states (I have the same problem even in E-SURGE, where this model takes almost a week to run...). Does this make sense? If that is already true with 29 capture occasions, it might become prohibitive when I use the whole 119-occasion dataset. I don't know if you have any thoughts on this.

In any case, I'll give this a shot (along with the simulations that come with it) if calculations using my whole dataset turn out to be impossibly long and memory-demanding.

Cheers!

Fred

Perry de Valpine

Mar 7, 2022, 12:50:06 PM
to Frédéric LeTourneux, nimble-users
Hi Frédéric,

This is an interesting case.  Thanks for posting.  Let me comment on several issues.

A better metric of MCMC performance is effective sample size / computation time.  So I wouldn't conclude that the marginalization is performing worse just because it is slower.  If it mixes correspondingly better, then the rate of generating effectively independent samples could be higher, which would mean you could run it for fewer iterations than the non-marginalized version.  That may or may not be the case, but run time alone doesn't tell you.
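For example (a rough sketch, not code from your script: samples_marg and samples_latent stand for matrices of posterior samples from the two runs, and t_marg and t_latent for their run times in seconds), you could compare the two versions with the coda package:

library(coda)
# Effective sample size per second of computation, per parameter,
# for the marginalized and latent-state formulations.
eff_marg   <- effectiveSize(as.mcmc(samples_marg)) / t_marg
eff_latent <- effectiveSize(as.mcmc(samples_latent)) / t_latent
summary(eff_marg / eff_latent)  # ratios > 1 favour the marginalized version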

I would suspect that much of the memory cost arises from the time-indexing of the gamma and omega arrays, many elements of which are not really time-dependent.  This makes the size of the model graph very large, and that is likely involved in the large memory use.  Let me explain more of how you can modify dDHMMo for your own purposes.  You can find the source code for it here.  You can start by copying and pasting that into your own code, renaming it, and using the new name in your model code.  Now you have a version you can modify.  You can see there is a fair bit of checking, followed by the actual calculations, which take only 11 lines of code.  You could remove the checking code to focus on the calculations (and put back checking if you want it later).  What you want to do is reduce the size of the inputs to achieve the same calculations.  Here are a few steps you could take in that direction.

1. Instead of passing gamma[1:6,1:6,first[i]:(K-1)] into dDHMMo, pass only gamma[1:2,1:6,first[i]:(K-1)].  Remove the rest of the gamma declarations from the model.  They are only creating static 1s and 0s.

2. Replace pi <- ((Zpi %*% probTrans[,,t])/sumZpi)[1, ] with something like this:

pi <- (Zpi[1:2] %*% probTrans[1:2,,t] / sumZpi)[1,]
pi[5] <- pi[5] + (Zpi[3] + Zpi[5])/sumZpi # For the gamma[3,5,t] and gamma[5,5,t] values of 1
pi[6] <- pi[6] + (Zpi[4] + Zpi[6])/sumZpi # For the gamma[4,6,t] and gamma[6,6,t] values of 1

In effect you will be customizing dDHMMo to the details of your model, allowing you in this step to reduce gamma by about 2/3, with a corresponding reduction in nimble's graph management.

(I hope I got the indexing correct to match what ((Zpi %*% probTrans[,,t])/sumZpi) would produce. You should of course check!)

3. Taking this a step further, you could decide not to form gamma at all and instead pass the scalars phi_ring[t], UTI[t], recov_ring[t+1], recov_occs[t+1], C, phi_col[t], and recov_call[t] (did I get them all?) into your version of dDHMMo and, inside that function, do the calculations of Zpi %*% probTrans[,,t] one by one.  E.g.

pi[1] <- (Zpi[1] *  pow(phi_ring[t], UTI[t]) + Zpi[2] * (1 - pow(C, UTI[t]))  *   pow(phi_ring[t], UTI[t]))/sumZpi
pi[2] <- (Zpi[2] * pow(C, UTI[t])     *   pow(phi_col[t], UTI[t]) )/sumZpi
# ...etc...
# where variables might have different names inside the function where you don't need the "[t]", but I hope it illustrates the idea.

4. You could do something similar with omega.  It looks like it is mostly 1s and 0s, so you could remove most of it and use pobs[i,t] and pcap[i,co] in the places needed (see the small self-contained sketch after this list for the general shape such a function could take).
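To make the overall shape of such a customized function concrete, here is a small self-contained example of the same idea for a plain 2-state CJS-type model.  This is deliberately not your 6-state model, and the name dCJS_scalar and its inputs are purely illustrative; the point is just that the forward filtering can work directly on scalars like phi[t] and p[t], so no gamma or omega arrays ever enter the model graph:

library(nimble)

# Minimal illustration: 2 states (alive/dead), observations coded 1 = seen,
# 2 = not seen, with x[1] being the first capture.
dCJS_scalar <- nimbleFunction(
  run = function(x = double(1),
                 phi = double(1),   # survival for intervals 1..len-1
                 p = double(1),     # detection for occasions 2..len
                 len = double(),
                 log = integer(0, default = 0)) {
    returnType(double())
    Zpi <- c(1.0, 0.0)              # alive with probability 1 at first capture
    logL <- 0
    for (t in 1:(len - 1)) {
      # transition written out element by element instead of Zpi %*% gamma[,,t]
      piAlive <- Zpi[1] * phi[t]
      piDead  <- Zpi[1] * (1 - phi[t]) + Zpi[2]
      # observation update using the scalars instead of omega[,,t]
      if (x[t + 1] == 1) {          # seen: must be alive and detected
        Zpi <- c(piAlive * p[t], 0.0)
      } else {                      # not seen: alive but missed, or dead
        Zpi <- c(piAlive * (1 - p[t]), piDead)
      }
      sumZpi <- sum(Zpi)
      logL <- logL + log(sumZpi)
      Zpi <- Zpi / sumZpi
    }
    if (log) return(logL)
    return(exp(logL))
  }
)

In your real function the same loop would just have six states, with the phi_ring, phi_col, UTI, pobs and pcap pieces above taking the place of phi and p.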

It should also help that you can check your calculations with uncompiled use of the functions, outside of any model.  What I mean is that you can set up R code with gamma and omega as you have them and call dDHMMo directly as a function, outside of model code.  Then you can call a modified version of dDHMMo with the corresponding modified inputs, also uncompiled and outside of a model, to verify that you are doing the calculations correctly and matching the original version.  In addition, if it is hard to follow what the original dDHMMo is doing, you can run it uncompiled and use R's debugging tools, e.g. debug(dDHMMo), to step through it and make sense of what happens.
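For instance, here is a tiny self-contained toy example of calling dDHMMo directly, uncompiled and outside any model (the numbers are made up just to show the mechanics, and I'm assuming the argument names and the len - 1 time dimension of probTrans from the current nimbleEcology release, as in your script):

library(nimbleEcology)

# Toy inputs: 2 states, 2 observation classes, 4 occasions.
K <- 4
init <- c(1, 0)                                          # starts alive
probTrans <- array(c(0.8, 0, 0.2, 1), c(2, 2, K - 1))    # rows sum to 1; survival 0.8
probObs   <- array(c(0.6, 0, 0.4, 1), c(2, 2, K))        # rows sum to 1; detection 0.6
y <- c(1, 2, 1, 2)                                       # seen, missed, seen, missed

ll_orig <- dDHMMo(x = y, init = init, probObs = probObs,
                  probTrans = probTrans, len = K, log = TRUE)

# A modified version (hypothetically called dDHMMo_custom) with its reduced
# inputs should return exactly the same value:
# ll_new <- dDHMMo_custom(...)        # fill in the reduced inputs
# all.equal(ll_orig, ll_new)          # should be TRUE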

I hope that makes some sense and helps.

Perry


Frédéric LeTourneux

Mar 7, 2022, 2:18:51 PM
to nimble-users
Hello Perry,

Thanks for the reply and input! You make a good point that my metric of performance is not really the right one. Since my model had taken quite a while to build and run, I had not run it until all chains mixed correctly, but I realize now that I will definitely have to do that in order to figure out which approach is more efficient at generating a satisfactory effective sample size.

I had said in my original question that I was a bit confused about where to start adapting distributions to my needs, and you have definitely provided the starting point I needed. It is now much clearer how I would go about doing that, and I'll get on it right away. So a big thanks for taking the time to make this accessible to me.

I'll let you know how this works out!

Cheers!

Fred