Dealing with Missing information in covariate


ARJUN BANIK

Aug 2, 2023, 5:08:32 AM
to nimble-users
Hi All,

I am fitting a capture-recapture model to estimate survival, capture, and breeding-transition probabilities, with sex as a covariate. When an individual's sex is unknown, we code it as "NA" and initialize it accordingly, and the model runs well. The problem is that the WAIC comes out higher when the covariate has missing values. I expected the model to improve after incorporating sex as a covariate, but that does not happen when sex is missing for some individuals. However, when I drop the individuals with missing sex and model only those with known sex, the WAIC is lower after incorporating sex as a covariate for the state transition probabilities.
Can you please help me with this? Why is this happening? Is it a Nimble problem or a problem with my code? I am confused.
Looking forward to your response. I have attached the code for both models (with and without sex) and one sample data set.

Thank you,
Arjun
--
BB_w_sex_uknown_OG.R
BB_wo_sex_OG.R
try_b_sex.csv

PierGianLuca

Aug 2, 2023, 9:22:03 AM
to nimble...@googlegroups.com
Apologies for this impromptu and maybe off-topic comment.

I see regular requests for help regarding WAIC, and my fear is that people might end up discarding either Nimble or the results of their (often expensive) computations, just because they obtain "unsatisfying" WAIC results.

Strange WAIC results may of course signal a bug in one's code or in one's probability representation, and in that case it's useful to investigate. But, as has been pointed out before on this list, strange WAIC values don't mean that something is wrong (the opposite may even happen!).

WAIC is a very coarse and in some cases even inappropriate approximation of what should be done using utility functions or matrices in an exact application of Bayesian probability & decision theory. A low or high WAIC doesn't really say anything, if one doesn't first show that it is an appropriate approximation for that specific problem.

The point is that Nimble allows one to do much more than just WAIC.

If one is using WAIC because one has no idea of the real gain/costs in the model-choice problem, then it is honest and respectable (it was also recommended by Savage) to simply report a painfully calculated posterior distribution with Nimble, letting the readers supply their problem-dependent utility functions. On the other hand, if one does have an idea of the real gains/costs, then WAIC becomes pointless.

See also this recent report <https://doi.org/10.1038/s41598-021-04694-7> (unrelated to me), which puts things into perspective.

Just my point of view and exhortation to use the full powers of Nimble :) Apologies if it was out of line!
Cheers,
Luca

John D Clare

Aug 4, 2023, 4:47:06 PM
to ARJUN BANIK, nimble-users
Hi Arjun,

Just had a few quick (read: potentially dubious!) thoughts on this:

--Based on your description, it's plausible to me that sex is an informative/useful predictor when known, and that it also inflates the variance of the pointwise fit/log-likelihood (the WAIC penalty) for those individuals where sex is unknown. My initial thought would be that retaining sex seems like the right thing to do (particularly if estimates of the sex-specific parameters look pretty different).

--The developers would know better, but I did wonder if adding sex to "data" with some missing values means that these data-likelihoods (sex_i|delta_m) are included in the WAIC calculations. If so, the comparison is not quite correct.

--For models like this, I think the correct point-wise likelihood unit is really the individual observation history (over time), rather than treating each observation of an individual at a given time as a point-wise datum. So it may help--or be more valid--to use the WAIC dataGroups control argument to specify this. I also think that the point-wise data likelihood should include the state process [z_it|z_it-1, gamma]. By default, I *think* the data-likelihood being calculated is just [y_it|z_it, omega]. Not sure if this would work, but maybe using the marginalizeNodes argument (passing along z) would also help. Note, I think using one of the custom dHMM distributions provided by nimbleEcology instead of specifying z explicitly would deal with the grouping issue and also ensure that the data likelihood used to calculate WAIC includes both state and observation processes.
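
To illustrate the grouping idea outside of NIMBLE, here is a minimal plain-Python sketch. It is not NIMBLE's implementation: `waic` and `log_mean_exp` are hypothetical helper names, `loglik[s][k]` is assumed to hold the log-likelihood of data node k under posterior draw s, and the toy numbers are invented for illustration only. Summing the log-likelihoods of all occasions for one individual before taking the mean and variance across draws makes the capture history the point-wise unit. The penalty computed here is the variance term mentioned in the first point above.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(loglik, groups=None):
    """WAIC from posterior log-likelihood draws.

    loglik[s][k]: log-likelihood of data node k under posterior draw s.
    groups: list of lists of node indices forming one point-wise unit each;
            by default every node is its own unit.
    """
    n_nodes = len(loglik[0])
    if groups is None:
        groups = [[k] for k in range(n_nodes)]
    lppd = 0.0
    p_waic = 0.0
    for g in groups:
        # Joint log-likelihood of the whole group, one value per draw.
        per_draw = [sum(draw[k] for k in g) for draw in loglik]
        lppd += log_mean_exp(per_draw)
        # Penalty: sample variance of the log-likelihood across draws.
        p_waic += variance(per_draw)
    return {"lppd": lppd, "pWAIC": p_waic, "WAIC": -2.0 * (lppd - p_waic)}


# Invented toy values: 3 posterior draws, 2 individuals x 2 occasions.
# Columns 0-1 belong to individual 1, columns 2-3 to individual 2.
toy_loglik = [
    [-0.9, -1.1, -0.4, -1.9],
    [-1.0, -1.0, -1.8, -0.5],
    [-1.1, -0.9, -0.5, -1.7],
]

by_observation = waic(toy_loglik)
by_individual = waic(toy_loglik, groups=[[0, 1], [2, 3]])
print(by_observation)
print(by_individual)
```

The two calls generally give different lppd and penalty values, which is why the choice of point-wise unit has to be the same in every model being compared.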

Hopefully this makes sense, and happy to be corrected.

John


Chris Paciorek

Aug 10, 2023, 11:48:26 AM
to ARJUN BANIK, nimble-users, John D Clare
Just to follow up on John's 2nd comment, the only case in which there would be additional terms in the likelihood calculation in WAIC in such situations would be if a user added the known covariate values to the model _with_ a distribution assigned to them and flagged those elements as 'data'. In this case and probably most others, I don't think it would make sense to do that for observed covariate values. Of course for missing covariate values, one needs to assign a distribution to them, as they are being treated as parameters in the model that are being sampled by the MCMC.

Thanks to the other responders for useful discussion here.

-chris

John D Clare

Aug 10, 2023, 12:41:10 PM
to paci...@stat.berkeley.edu, ARJUN BANIK, nimble-users
Thanks for the clarification, Chris!