Calculating WAIC when not all parameters are monitored

Frédéric LeTourneux

Aug 24, 2023, 4:58:58 PM
to nimble-users

Hello Nimble community,

I'm running a capture-recapture model where I'm estimating survival of first-year geese, and I'm using the difference between the date of peak plant quality and their hatching date (i.e., trophic mismatch) as an individual-level covariate on survival.

Because the hatching date of most juveniles (~20 000 out of 22 000) is unknown, I'm estimating the age of these birds from the relationship between the length of the 9th primary feather and age, which predicts age to within ~1-2 days, which is quite good.

I wanted to make things very clean and incorporate the uncertainty in the age estimation of the goslings when estimating the effect of the mismatch covariate on survival, so I'm running both these analyses (individual age prediction + CR-survival) in a single joint analysis (i.e., in the same nimble model).

My issue lies in calculating the WAIC, or some other measure I can use to compare models. Because my dataset is quite large (22 000 juveniles + 30 000 adults) and the model is complex, computation time and memory use are already almost prohibitive, which makes cross-validation techniques difficult and is why I opted for WAIC. Currently, using the marginalized likelihood functions of nimbleEcology, and having optimized my model as much as my programming skills allow, it still takes about a week to run a bit under 30 000 iterations on the fastest cluster I have access to.

Consequently, I'm running each chain separately and saving the MCMC samples every 1000 iterations, restarting the MCMC without resetting the model and MCMC state (reset = FALSE) but resetting the model values (resetMV = TRUE) to keep memory use reasonable (I'm still using around 60 GB as things stand). I'm calculating the WAIC 'on the fly' as the model runs (enableWAIC = TRUE, then mcmc$getWAIC()). But reading into how WAIC, and specifically pWAIC2, is calculated, I'm realizing (I think...) that because I reset the samples stored by the model every 1000 iterations, the effective number of parameters is estimated from the sample variance of the last 1000 iterations rather than the entire chain. That is likely problematic, and would explain why I'm getting quite different WAIC and pWAIC values between runs and between chains.

I thought I could then calculate the WAIC post hoc, after running everything and pooling all my samples. Unfortunately, I'm now realizing that I would need to monitor all stochastic parent nodes of the data nodes to do that. That is also a problem, since I'm not monitoring the estimated ages of the 20 000 juveniles; doing so would also require far too much memory.

So I'm wondering what my options are at this point. Since I'm estimating the age of birds in the same way in all the models I aim to compare, I think I could simply leave that part out of the pWAIC calculation, but I'm not sure of the consequences of doing so, or whether there is even a way to do this with the post-hoc WAIC calculation method. Has anyone run into a similar problem before and/or have ideas on how I could proceed?


Thanks for the help!

Cheers,

Fred L.


heather...@gmail.com

Aug 24, 2023, 9:29:25 PM
to Frédéric LeTourneux, nimble-users
One approach that I have used in large-memory situations is to calculate the WAIC myself within the model code and just monitor the one "likelihood" variable. You can then process it outside of NIMBLE to estimate WAIC. This has its own pros and cons: it's still a large output (but often much smaller than monitoring all those nodes), it still takes a long time to run, and you have to make sure you've coded the data likelihoods correctly, but it does the job fairly well.

Happy to show you an example or how to implement it in your own code if you’re interested in going down that route. 

Heather Gaya 


Frédéric LeTourneux

Aug 25, 2023, 10:08:35 AM
to nimble-users
Heather,

What a neat idea. Can't believe I hadn't thought of that... ^^ I would be really glad to see an example of how you've implemented this.

Thanks for the help,

Fred

heather...@gmail.com

Aug 28, 2023, 9:54:01 AM
to nimble-users
Traditionally, WAIC has only worried about the fit of the observed data to the model (aka p(y|p, N)), but the full-likelihood approach is a better indicator of overall model fit (and is similar to what NIMBLE uses internally).

The easiest way to estimate this is to find the log-likelihood of all data nodes in the model via R notation or a NIMBLE function and store it as a variable "loglike" in your model. You then monitor this variable and perform the following (in R) on the posterior chains once converged:

calc.waic <- function(x){
  vars <- grep("loglike", colnames(x$mcmc[[1]])) # find the output columns that hold the log-likelihoods
  like <- as.matrix(x$mcmc[, vars, ])            # samples x observations matrix of log-likelihoods
  fbar <- colMeans(exp(like))                    # posterior mean likelihood for each observation
  Pw   <- sum(apply(like, 2, var))               # pWAIC2: summed posterior variance of the log-likelihoods
  WAIC <- -2 * sum(log(fbar)) + 2 * Pw
  return(WAIC)
}
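
For instance, assuming the samples object has the structure this function expects (an $mcmc element that is a coda mcmc.list whose column names contain "loglike"), usage is simply:

waic <- calc.waic(fit) # 'fit' is a stand-in name for your own samples object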

I've attached an example of what I mean, though it's a hierarchical distance-sampling example, so potentially less applicable. For something like a multi-state CMR model, where you have a z state (true state) and a y state (observed state), it might look something like what I've written below, where you monitor the "like" variable:

 z[i,1] ~ dcat(p.plot[1:11]) # year 1 z
 logzt[i,1] <- logdensity.cat(z[i,1], p.plot[1:11]) # for likelihood estimation
 y[i,1] ~ dcat(po[z[i,1],i,1,]) # observed year 1
 logyt[i,1] <- logdensity.cat(y[i,1], po[z[i,1],i,1,]) # for likelihood estimation
 like[i,1] <- logzt[i,1] + logyt[i,1] # likelihood for the first occasion

 for (t in 2:n.occasions){
    # State process: draw S(t) given S(t-1)
    z[i,t] ~ dcat(pp[z[i,t-1],i,t-1,])
    logzt[i,t] <- logdensity.cat(z[i,t], pp[z[i,t-1],i,t-1,]) # for likelihood estimation
    # Observation process: draw O(t) given S(t)
    y[i,t] ~ dcat(po[z[i,t],i,t,])
    logyt[i,t] <- logdensity.cat(y[i,t], po[z[i,t],i,t,]) # for likelihood estimation (note index t, not 1)

    like[i,t] <- logzt[i,t] + logyt[i,t] # likelihood estimation
  } #t

One thought I have (that I've never tried) is that you could probably save a lot of memory by putting logzt[i,t] and logyt[i,t] into a single NIMBLE function and having it calculate the log densities and spit out the log-likelihood for that particular individual/time period. Creating more nodes inside the model slows down compile time and can eat up memory, but I've noticed this effect is reduced if you do the calculations inside nimbleFunctions.
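
As a rough sketch of what that might look like (untested, with illustrative names; ns stands in for the number of states and is not from the attachment):

logLikCMR <- nimbleFunction(
  run = function(zt = double(0), yt = double(0),
                 ppRow = double(1), poRow = double(1)) {
    # sum of the state-transition and observation log densities
    returnType(double(0))
    return(dcat(zt, prob = ppRow, log = TRUE) +
           dcat(yt, prob = poRow, log = TRUE))
  }
)

In the model code this would collapse the three deterministic declarations into a single node per individual/time period, e.g.:

like[i,t] <- logLikCMR(z[i,t], y[i,t], pp[z[i,t-1],i,t-1,1:ns], po[z[i,t],i,t,1:ns])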

Hope that points you in the right direction!


Example_WAIC.rtf

heather...@gmail.com

Aug 28, 2023, 9:57:11 AM
to nimble-users
Adding to my previous comment - I inadvertently switched between JAGS and NIMBLE in the multistate CMR code I wrote out! For NIMBLE it would look like:
 z[i,1] ~ dcat(p.plot[1:11]) # year 1 z
 logzt[i,1] <- dcat(z[i,1], p.plot[1:11], log = TRUE) # for likelihood estimation

 y[i,1] ~ dcat(po[z[i,1],i,1,]) # observed year 1
 logyt[i,1] <- dcat(y[i,1], po[z[i,1],i,1,], log = TRUE) # for likelihood estimation

 like[i,1] <- logzt[i,1] + logyt[i,1] # likelihood for the first occasion

 for (t in 2:n.occasions){
    # State process: draw S(t) given S(t-1)
    z[i,t] ~ dcat(pp[z[i,t-1],i,t-1,])
    logzt[i,t] <- dcat(z[i,t], pp[z[i,t-1],i,t-1,], log = TRUE) # for likelihood estimation

    # Observation process: draw O(t) given S(t)
    y[i,t] ~ dcat(po[z[i,t],i,t,])
    logyt[i,t] <- dcat(y[i,t], po[z[i,t],i,t,], log = TRUE) # for likelihood estimation (note index t, not 1)

    like[i,t] <- logzt[i,t] + logyt[i,t] # likelihood estimation
  } #t

Much easier than the JAGS version of log-density distributions :) 

Chris Paciorek

Aug 28, 2023, 9:26:19 PM
to Frédéric LeTourneux, nimble-users
Building on Heather's helpful suggestions, I think you can monitor the logProb values for the variables containing data nodes and then compute WAIC yourself based on all the chains post hoc. We provide functionality to monitor the logProbs so you shouldn't have to put that calculation into the model code. This makes things easier and the model should run faster than if you add such nodes to the model.

1) See the NIMBLE documentation for how to monitor logProb values for a given variable.
2) To save memory, you may want to use `monitors2` with a longer `thin2` interval for the logProb values. You can estimate in advance how much memory will be used from the number of elements in the data nodes and the number of samples you'll be saving: e.g., 50,000 logProb values × 1000 samples × 8 bytes per double is 400 MB.
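
As a sketch of that configuration (object and variable names are illustrative; "logProb_y" assumes your data variable is called y):

mcmcConf <- configureMCMC(model)
mcmcConf$addMonitors2("logProb_y") # put the logProb values in the second monitor set
mcmcConf$setThin2(10)              # thin the logProb samples more heavily than the parameters

The logProb samples then come back separately (as samples2) when you run the MCMC.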

Let me know if that is not clear.

Unfortunately, I don't see a way to avoid doing some manual coding on your part - the WAIC system we devised isn't able to nicely handle this situation.

If that doesn't work, let me know -- there may be a way to take the summary statistics that we accumulate iteration by iteration for our "online" calculation of WAIC (which uses Welford's algorithm) from a single chain and combine those across the multiple chains to allow calculation of the full WAIC, but it would involve some guidance from me on how to grab those statistics and make use of them.
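
For reference, the algebra for pooling per-chain running summaries is the standard parallel update for Welford-style statistics; the function below is just the general formula, not NIMBLE's internal API:

combineWelford <- function(nA, meanA, M2A, nB, meanB, M2B) {
  # pool two (count, mean, sum of squared deviations) summaries
  n <- nA + nB
  delta <- meanB - meanA
  list(n = n,
       mean = meanA + delta * nB / n,
       M2 = M2A + M2B + delta^2 * nA * nB / n) # pooled variance = M2 / (n - 1)
}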

-chris

Frédéric LeTourneux

Aug 31, 2023, 12:01:27 PM
to nimble-users
Hello Heather and Chris,

Thanks for these insightful answers. With your suggestions I think I can manage to compute WAIC for my models. No worries that there needs to be some manual coding on my part; where would be the fun if it was all done automatically lol.

Some confirmation that I'm getting all this right would be great, though :D.

I also still have one question, but it may be more in the 'understanding the statistics' department than in understanding NIMBLE, so I'm not sure if it's OK to discuss it here; feel free to let me know if this is beyond the scope of this forum and skip to the next chunk with code.

From what I understand of Heather's suggestion, what she is proposing monitors the likelihood of the entire model by looking at how each individual observation fits the estimated survival/observation probabilities. Combining the likelihood of every observation (y) and underlying state (z) would effectively monitor the fit of the overall model, without having to include any information pertaining to covariates for survival or observation probability. I imagine the individual-level survival/observation probability associated with each observation in the po/pp matrices from her post already includes the covariate effects, so that does capture how the observed data fit the estimated effects of covariates on observation/survival probabilities. (Clearly I need to put things into words to understand correctly ^^)

Building on this, if I'm using the marginalized distributions from nimbleEcology, with some code like:

y[i, first[i]:n.occasions] ~ dDHMMo(init = delta[age[i], 1:6, sex[i]],
                                    probTrans = gamma[1:6, 1:6, ms.class[i], first[i]:(n.occasions-1), sex[i]],
                                    probObs = omega[1:6, 1:3, first[i], first[i]:n.occasions, sex[i]],
                                    len = length(first[i]:n.occasions),
                                    checkRowSums = 0)

then the way to get the equivalent of Heather's logyt[i,t] would basically be to use the same formula but plugging y[i, first[i]:n.occasions] into the distribution function as x, like so:

logy[i] <- dDHMMo(x = y[i, first[i]:n.occasions],
                  init = delta[age[i], 1:6, sex[i]],
                  probTrans = gamma[1:6, 1:6, ms.class[i], first[i]:(n.occasions-1), sex[i]],
                  probObs = omega[1:6, 1:3, first[i], first[i]:n.occasions, sex[i]],
                  len = length(first[i]:n.occasions),
                  checkRowSums = 0,
                  log = 1)

(dDHMMo returns a single log density for the whole capture history, so the left-hand side is the scalar logy[i] rather than a vector). Since I'm marginalizing over the latent states (z[i,t]), I guess my full likelihood for the survival model would be just that? Or am I missing something?

Building on this and on Chris's suggestion, instead of adding to my model a calculation that NIMBLE already does, I could simply add a line like

mcmcConf$addMonitors("logProb_y")

when setting up the configuration, and simply save the log probability of each individual capture history given the estimated parameters, which could then be used to calculate all components of the WAIC post hoc as in Heather's code. Basically having

lppd   <- sum(log(colMeans(exp(logProb_y[, 1:n]))))
pwaic2 <- sum(apply(logProb_y[, 1:n], 2, var))
WAIC   <- -2 * (lppd - pwaic2) # following the calculations of Gelman et al. 2014?

How would it work if I wanted to calculate one WAIC value from all chains? Simply rbind() the results from all chains and perform the calculations as if they all came from the same output?

One other question I have is whether it's problematic not to monitor any parameters related to the age-estimation part of the analysis. Since all the models I'm fitting estimate age in exactly the same way, can it be a problem to leave that out of the likelihood comparisons, or would you say it's a 'legit' way to compare models (since any differences between models pertain only to the survival part)?

Thanks so much for your help!

Cheers,

Fred

Chris Paciorek

Sep 2, 2023, 1:12:24 PM
to Frédéric LeTourneux, nimble-users
I haven't tried to absorb the full structure of your model, so I won't comment much on some of your detailed discussion of the model structure.

If you monitor `logProb_y` then you are monitoring the log density of the 'y' values conditional on all the parameters (including latent states) that are included in the model. `logProb_y` should give you the same thing as manually putting the `<- dDHMMo` declaration into the model.

To get a single WAIC, yes, you would `rbind` the quantities from all the chains together and then "manually" implement the WAIC calculation following, say, the Gelman reference. As discussed in our WAIC documentation, what you get for WAIC will depend on what you define as an "observation" (i.e., what "y_i" is in the equations in the Gelman reference), which in cases with correlated data can be tricky to decide; in such cases there is probably no single "right" choice.
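
As a concrete sketch of that pooling step (assuming each chain's monitored logProb_y samples are stored as a matrix in a list called chains; names are illustrative):

logProb_y <- do.call(rbind, chains)          # stack the samples from all chains
lppd <- sum(log(colMeans(exp(logProb_y))))   # log pointwise predictive density
pwaic2 <- sum(apply(logProb_y, 2, var))      # pWAIC2 penalty
waic <- -2 * (lppd - pwaic2)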

As far as monitoring parameters related to age estimation, if you're manually calculating WAIC based on monitoring `logProb_y` then what parameters you monitor is no longer relevant because you have the logProb values directly. The question for you becomes which logProb values to include in the WAIC calculation. If you leave out the logProb values for certain parts of the data in calculating WAIC, you are saying that your model comparison is done based only on how well the model fits the data that are included in the WAIC calculation. It seems plausible to me that you could do a reasonable comparison based only on survival data, but as I said initially, I haven't absorbed the actual structure of your model.

-chris
