Matrix occupancy model - missing visits/values

282 views
Skip to first unread message

Ryan Burner

unread,
Apr 4, 2022, 4:25:43 PM4/4/22
to nimble-users
Hi All,

Thanks for the help! I've got an occupancy model with p and Psi covariates that I set up using matrices rather than vectors for convenience. 

Most sites have 3 visits per year in each of three years, so I have a response variable array of:
y[1:n_sites , 1:n_species , 1:2 (years), 1:3 (surveys per year)]

However, some sites (e.g. 5%) were only surveyed twice in a given year, and a few (3%) were only surveyed in one or two (rather than all three) years. I think all of the NAs in those years are giving me problems though I can't pinpoint exactly where/why. For example, I'm not getting estimates of my covariate coefficients (they are stuck at initial values).

So my main question is how I should be dealing with those missing values. One solution I tried (in for the formula below) is to index year and survey numbers for each site. For years I do this with a years-per-site vector (yrs), and for surveys per year I do this with a matrix of k[1:n_sites,1:n years] that shows how many visits there were. 

This should prevent the model from ever encountering the missing data, but then the parameter matrices that it is estimating are not 'full', and I assume that is causing the model to fail (as it does when using the code below).

Note: this model also involves some latent variables, which you can ignore

#Psi model
for (yr in 1:(yrs[j])) {
  eta[j,i,yr] <-  inprod(u.b[i,1:Vocc], Xocc[j, ,yr]) + inprod(lv.coef[i,1:nlv],LV[j,1:nlv])
  u[j,i,yr] ~ dnorm(eta[j,i,yr],1/(1-sum(lv.coef[i,1:nlv]^2)))
  z[j,i,yr] <- step(u[j,i,yr])
}

#p model
for (yr in 1:(yrs[j])) {
  for (kv in 1:(k[j,yr])) {
    logit(p[j,i,yr,kv]) <-   inprod(v.b[i,1:Vobs], Xobs[j, ,yr,kv])
    mu.p[j,i,yr,kv] <- p[j,i,yr,kv]*z[j,i,yr]
    y[j,i,yr,kv] ~ dbern(mu.p[j,i,yr,kv])
  }
}

Is there a better way to deal with these missing surveys and years at some sites? Or am I stuck using a vector setup of this model?

The answer to the above question may also answer my other question, but I occasionally have missing p and Psi covariates as well. I don't care about estimates for these values except to the extent that Nimble needs them to get estimates for the parameters of interest for those sites. If I need to specify priors for those values can I ask the model to estimate mean and sd of each covariate column to create a prior? And do I need to give a formula in the model that is 'solved' for those missing values (which would be annoying for models with many covariates and a few missing values in each)? Or can Nimble use the information it already has (equations containing those covariates) to infer their values?

Thank you! 

Perry de Valpine

unread,
Apr 6, 2022, 12:02:30 AM4/6/22
to Ryan Burner, nimble-users
Dear Ryan,

For the first problem, can you report more about how the model fails?  Is there an error message in R or does it fail in the C++ compilation step?  The idea of using a vector of start and/or end indices to manage a ragged array is a typical and good way to handle what you need, but it's hard to tell without more information what is going wrong.  A minimal reproducible example would be the most helpful.  One possible thing that caught my eye is the empty index in Xobs.  Sometimes nimble can handle that, depending on what you've provided, and sometimes you would need to insert something like "1:N" explicitly.

For the second issue, I think this is more of a statistical question.  Sometimes with missing covariates, one makes a part of the model for the distribution of the covariates, in which the covariates are the "response", i.e. they are on the left-hand side of a stochastic declaration.  Then a missing covariate, entered into nimble as data with value NA, will be sampled by the MCMC, effectively integrating over its distribution as informed by other values of the covariate.  I hope that helps.

Perry


--
You received this message because you are subscribed to the Google Groups "nimble-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nimble-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nimble-users/5eaeefb9-3b6b-4d0e-9997-6214b39f961cn%40googlegroups.com.

Ryan Burner

unread,
Apr 6, 2022, 6:13:40 PM4/6/22
to Perry de Valpine, nimble-users, Burner, Ryan
Dear Perry, (and all)

Thank you so much for the response and helpful suggestions! Just knowing the proper terminology ('ragged array') let me do a bit more Googling. An immediate question that raised was whether I should be padding the arrays with zeros or NA (I had been using NAs).

Easiest point first - I have only a very small number of missing covariates, so I will just replace them with the mean.

I should have been more specific on what kind of 'failure' I was experiencing - in most cases the model will compile and run now, but I get initialization warnings (which may not be a problem) such as:
> Missing values (NAs) or non-finite values were found in model variables: eta, z, p, mu.p. This is not an error....

Then during mcmc I would also get hundreds of messages something like this (although I can't replicate these in my example):
'Warning, the log probability of z[a,b,c] is -INF'

I've now attached example code that simulates data that look like mine. It makes complete data (no missing years or surveys), and you can run the model with complete data by running all the code as a first test. Then, to remove surveys and years that don't exist:
-comment out lines 249-50
-uncomment lines 257-279

That should allow folks to try the model with and without missing surveys. It seems to work fine in both cases (i.e. I get parameter estimates) but I don't know if any of the warnings are important, whether to pad with zeros or NAs, and whether it is doing what I intend.

One final question (that is lower priority) - in my model code, at the end, I have an example of one of the derived parameters that I am interested in calculating. But I haven't been able to make that work with the ragged arrays (can't seem to get the index value to work, or something). 

Thank you so much!

Best,

Ryan
Burner_Nimble_example.R

Ryan Burner

unread,
Apr 7, 2022, 8:53:37 AM4/7/22
to Perry de Valpine, nimble-users, Burner, Ryan
New, slightly modified version of code with a few errors fixed....
Burner_Nimble_example.R

Ryan Burner

unread,
Apr 13, 2022, 8:08:15 AM4/13/22
to nimble-users
Hi All,

Would anyone be willing to run the test code and help me figure out if it is working? My main questions are:

a) Should I be filling in covariates from non-existent surveys (in the 'ragged array') with zeros or NAs?
b) Are any of the warnings/errors compromising the integrity of my results?

I've attached example code that simulates data that look like mine. It makes complete data (no missing years or surveys), and you can run the model with complete data by running all the code as a first test. Then, to remove surveys and years that don't exist:
-comment out lines 249-50
-uncomment lines 257-279

Thank you!

Ryan

Burner_Nimble_example.R

Chris Paciorek

unread,
Apr 18, 2022, 8:50:38 PM4/18/22
to Ryan Burner, nimble-users
Hi Ryan,

I don't have time at the moment to dig into the details of your code/what you are doing, but a few brief comments:

1) It is very likely a problem if you are getting messages about -Inf as logProb values during the MCMC. Often this would indicate that there are invalid parameter values. 

2) While missing initial values (i.e., NAs) may not be a problem in some cases, in that Nimble can draw from the prior to fill them in, it's usually a good idea to initialize everything to reasonable values. So that may be a good place to start.

3) As far as padding your arrays, presumably your model doesn't actually use the padded elements, so either 0 or NA may be fine. But it may be safest to put in zeros. 

-chris

Perry de Valpine

unread,
Apr 19, 2022, 10:09:07 AM4/19/22
to Chris Paciorek, Ryan Burner, nimble-users
Thanks Chris for jumping in here.  Ryan, my apologies that your thread went hanging unanswered.  We try hard to respond to everyone but recently we've been very busy and this slipped past me.

I will just add a bit to what Chris said.  Usually the warning messages come from some kind of incomplete or invalid inits.  Nimble generally benefits from complete inits.  Sometimes an MCMC can recover and start exploring sets of valid values and then operate correctly.  Often it can't.  The way to check is to see how the mixing looks, but that includes mixing of latent states that you might not be monitoring by default.  Usually if it is stuck in invalid values, then some nodes will not mix at all.  A good strategy to be sure your results are valid and also to gain confidence that your model is in order is to track down why there are incomplete or invalid inits. You can use the model object in R to do so.  If you want to browse through previous postings on nimble-users, you might find other examples of doing this.  Often it looks something like the following.

# Assume model was returned from nimbleModel.
model$calculate() #Somewhere you get warnings or -Infs, or you get these when you call nimbleMode with default calculate = TRUE.
# If you get a message saying there is an issue with a node z[1, 2, 3], check it:
model$z[1, 2, 3]
model$calculate("z[1, 2, 3]")
# Then check its parent nodes.  Maybe it depends on a node called prob[1, 2, 3], so check and re-calculate that
model$prob[1, 2, 3]
# etc. and the idea is by studying each node and sequence of calculations from one node to another, you 
# can determine where some init is missing or has a value that is resulting in a NA or -Inf in a later calculation.

# If you do not get a clear message on where there is a problem node, you can track it down by calculating narrower and narrower parts of the model.
model$calculate("z") # will do all z nodes and reveal if there is a problem somewhere
model$calculate("z[1, , ]") will do this subset of z nodes, etc.
# You can also look at the log probabilities of stochastic nodes to help see where there was a problem
model$logProb_z

I hope that helps.  Thanks for providing your code.  Would it work for you to try this strategy and then get back to us if you are still stuck and one of us can dive into your code at that point?

-Perry


Ryan Burner

unread,
Apr 20, 2022, 11:08:31 AM4/20/22
to Perry de Valpine, Chris Paciorek, nimble-users
Hi Chris and Perry,

Yes, thank you both for the messages! 

I don't get the '-Inf' issues with my reproducible example, so it is probably some issue with my inits using the real data. The example does show that some nodes are not initialized, but when I look at the chains they just have an NA for the first iteration and then have reasonable values so I don't think those are fatal errors either. But great to know about the model$calculate() option...

The main issues seems to be something with my indexing - even when I run the full simulated data in the example (no missing values) with the simulated k matrix of index values (which shows that fewer than the full number of sites and year were surveyed) I get a large number of nodes (nearly all of them) stuck on z = 1 for the entire chain length (and a smaller number stuck on z = 0). Any thoughts on this issue? 

Thanks again,

Ryan


Reply all
Reply to author
Forward
0 new messages