I have a really basic question, which I just can't seem to figure out. I have a two-category variable with a lot of missing data. It's my understanding that the easiest way to deal with this -- since Stan doesn't sample discrete parameters -- is to marginalize over the missing data.
increment_log_prob(log_mix(p, bernoulli_logit_log(1, p), bernoulli_logit_log(0, p)));
Thanks. The chapters are useful. log_mix makes sense to me when the discrete parameter is unobserved, such as in a change-point model or a mixture model. It doesn't make sense to me in this context. My data are structured in a database-type format, as described in the manual (p. 136). If the data were complete, I would specify:
What confuses me is how I am supposed to loop through y_i when observations are missing. Shouldn't the code below,

increment_log_prob(log_mix(p, bernoulli_logit_log(1, p), bernoulli_logit_log(0, p)));

operate on a variable coded 1 for observed, 0 for missing?
int<lower=0,upper=1> M[N]; // indicator for missingness
Thanks, Ben. This is now starting to make more sense. To be sure, the code should look something like this:

// gamma is the probability of missing data, m is the vector of missing-data
// indicators (coded 1 = missing, 0 = present), and y is the vector of observed data.
I can't see any reason why you would want to impose the constraint that p = q.
Ben
Okay, so for each observed unit I have the response equation:

y[n] ~ bernoulli_logit(alpha[kk[n]] * theta[jj[n]] - beta[kk[n]]);
p <- alpha[kk[n]] * theta[jj[n]] - beta[kk[n]];
if (m[n,i] == 0) {
  y[n,i] ~ bernoulli_logit(p[i]);
  m[n,i] ~ bernoulli_logit(something[i]); // I previously called something[i] q
}
if (m[n,i] == 1) {
  increment_log_prob(log_mix(p[i], bernoulli_logit_log(1, p[i]), bernoulli_logit_log(0, p[i])));
  m[n,i] ~ bernoulli_logit(something[i]);
}
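To make the flow concrete, here is a minimal self-contained sketch of the whole model in the older Stan syntax used in this thread. The priors, the gamma parameter for the missingness model, and the long-format data layout are my assumptions for illustration, not something settled in the thread:

data {
  int<lower=1> N;                  // response events in long format
  int<lower=1> J;                  // students
  int<lower=1> K;                  // questions
  int<lower=1,upper=J> jj[N];      // student for event n
  int<lower=1,upper=K> kk[N];      // question for event n
  int<lower=0,upper=1> y[N];       // response (placeholder value when missing)
  int<lower=0,upper=1> m[N];       // 1 = missing, 0 = observed
}
parameters {
  vector[J] theta;                 // ability
  vector<lower=0>[K] alpha;        // discrimination
  vector[K] beta;                  // difficulty
  vector[K] gamma;                 // per-question missingness propensity (logit scale)
}
model {
  theta ~ normal(0, 1);
  alpha ~ lognormal(0, 1);
  beta ~ normal(0, 2);
  gamma ~ normal(0, 2);
  for (n in 1:N) {
    real eta;
    eta <- alpha[kk[n]] * theta[jj[n]] - beta[kk[n]];
    if (m[n] == 0)
      y[n] ~ bernoulli_logit(eta);              // observed response
    else
      increment_log_prob(log_mix(inv_logit(eta),  // marginalize over missing y[n]
                                 bernoulli_logit_log(1, eta),
                                 bernoulli_logit_log(0, eta)));
    m[n] ~ bernoulli_logit(gamma[kk[n]]);       // m is observed, so no mixing needed
  }
}

Note that this missingness model does not depend on y, i.e., it bakes in a MAR-style assumption; the selection factorization discussed below relaxes that.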
I have an IRT model. As is often the case, I have a fair amount of missing data. In the past, I've used JAGS with these models, and as you know, NAs are permitted there. So, I'm trying to figure out something comparable here that doesn't require dropping the data. The sample code I've included is just a variant of the IRT model in the manual. Rather than dropping the data, I'm trying to account for the missing data. The problem is that y is a discrete variable. One of the suggestions seems to be to use a mixture approach to do this. Ben had suggested that for y_miss, I use

increment_log_prob(log_mix(p[i], bernoulli_logit_log(1, p[i]), bernoulli_logit_log(0, p[i])));

when m == 1 (m is coded 0 when y is observed, 1 when y is missing). p is the conditional probability of y = 1. I'd rather do something other than drop the data. If this is the wrong approach, then I'm honestly not sure what the correct approach should be.
log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))
log_mix(p, bernoulli_logit_log(1, eta), bernoulli_logit_log(0, eta))
increment_log_prob(log_mix(inv_logit(eta[i]), bernoulli_logit_log(1, eta[i]), bernoulli_logit_log(0, eta[i])));
increment_log_prob(bernoulli_logit_log(y[i], eta[i]));
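Putting those two contributions together in a loop (a sketch only, assuming m[i] is the observed missingness indicator discussed earlier):

for (i in 1:N) {
  if (m[i] == 1)  // y[i] missing: marginalize over both possible values
    increment_log_prob(log_mix(inv_logit(eta[i]),
                               bernoulli_logit_log(1, eta[i]),
                               bernoulli_logit_log(0, eta[i])));
  else            // y[i] observed: ordinary Bernoulli-logit contribution
    increment_log_prob(bernoulli_logit_log(y[i], eta[i]));
}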
At this point, I'm fine with assuming MAR. This is a large cross-sectional survey in which questions were randomized, so that seems reasonable to me.
With nothing else, the above would assume MAR conditional on the student's latent ability, but even for a dataset like PISA, where some students are asked some questions but not others by the design of the research, you still have some missingness on questions a student was actually asked, which provides additional information about the student's (lack of) ability. The most obvious reasons why a student would not answer an asked question are that it seemed difficult and/or it was toward the end of the test and the student never got to it. Thus, it is pretty easy to construct another likelihood contribution for m, which could also be a logit model and does not require any mixing because m is observed for all students and all questions.
In the case where a student is not asked a question by design, that is not a random variable and does not contribute anything to the likelihood.
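For instance, such a contribution might look like the fragment below. This is only a sketch: asked[n], pos[], and the delta coefficients are hypothetical names introduced for illustration, not something from a posted model; beta[kk[n]] is the question's difficulty as above.

if (asked[n] == 1)  // skip questions not asked by design: they contribute nothing
  m[n] ~ bernoulli_logit(delta0 + delta1 * beta[kk[n]] + delta2 * pos[kk[n]]);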
So, you need a likelihood function for that four-headed monster. The usual way of proceeding (the "selection" approach) is to factor this as
Pr(y, m | stuff) = Pr(y | stuff) * Pr(m | y, stuff)
Conversely, whether the student would have gotten the answer right or wrong tells you a lot about the probability that the student would answer a question that was, in fact, asked. In other words, Pr(m | y, stuff) != Pr(m | stuff), and you need a model for Pr(m | y, stuff). That could be anything but at least seems to depend on the difficulty of the question and the position of the question in the test.
All this is a bit more tedious in Stan than BUGS because HMC needs derivatives and thus cannot do discrete unknowns. So, when m = 1 (and thus y is missing), we have to write
Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff) +
                       Pr(y = 0 | stuff) * Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)
log(Pr(y, m = 1 | stuff)) = log(exp(log(Pr(y = 1 | stuff)) + log(Pr(y = 1 | stuff)) + log(Pr(m = 1 | y = 1, stuff))) +
                                exp(log(Pr(y = 0 | stuff)) + log(Pr(y = 0 | stuff)) + log(Pr(m = 1 | y = 0, stuff))))
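In Stan, that last expression is just a log_sum_exp. A fragment like the following would compute it, where eta is the logit of Pr(y = 1 | stuff), and eta_m1 and eta_m0 are hypothetical logits for Pr(m = 1 | y = 1, stuff) and Pr(m = 1 | y = 0, stuff):

// one term per possible value of the missing y, on the log scale
increment_log_prob(log_sum_exp(
  bernoulli_logit_log(1, eta) + bernoulli_logit_log(1, eta) + bernoulli_logit_log(1, eta_m1),
  bernoulli_logit_log(0, eta) + bernoulli_logit_log(0, eta) + bernoulli_logit_log(1, eta_m0)));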
> is that if you believe, prior to seeing the data, that a binary random variable Y is distributed Bernoulli with some probability p (that may vary from observation to observation as a function of the IRT stuff), how could you possibly believe that the likelihood contribution for a missing outcome is anything other than
>
> log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))
This expression reduces to 0. In fact, so does the more general

log_mix(q, bernoulli_log(1, p), bernoulli_log(0, p))
p * [p^1 * (1 - p)^(1 - 1)] + (1 - p) * [p^0 * (1 - p)^(1 - 0)] = p^2 + (1 - p)^2 = 1 - 2p + 2p^2
Why is the missingness related to the probability that the outcome is one?
Also, in the IRT setting, do we get any information by modeling the missingness? I suppose you would, because you could see a student answering just a couple of questions they knew the answers to and leaving the rest blank, from which you don't want to infer that their ability is +infinity. We should be working on some of these cases for our IES grant!
= log(2 * p^2 - 2 * p + 1)

So it doesn't reduce to a no-op. So this is missing missingness information or something? I still don't understand why p is used twice here.
p <- 2 / 3 # or whatever depending on the student and the IRT stuff
y <- rbinom(n = 10^6, size = 1, prob = p) # simulate the binary outcome
log(mean(dbinom(y, size = 1, prob = p))) # about -0.588
log(1 - 2 * p + 2 * p^2) # basically also -0.588
log(1 - 2 * p + 2 * p^2)
or equivalently

log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))

to get a likelihood contribution for Pr(y | stuff) when y is missing. Then we need another likelihood contribution for the same observation(s) that is Pr(m = 1 | y, stuff), which would be another Bernoulli mixture that now marginalizes over the missing predictor (y), unless we made the assumption that Pr(m = 1 | y, stuff) = Pr(m = 1 | stuff).

In terms of the four-headed monster (great term... I completely agree), if we factor p(y, m) in the usual selection specification, so

Pr(y, m | stuff) = Pr(y | stuff) * Pr(m | y, stuff)

are you sure the likelihood contribution when m == 1 is

Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff) +
                       Pr(y = 0 | stuff) * Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)

It's not clear why there are two Pr(y = 1 | stuff) and Pr(y = 0 | stuff) terms....
Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * y1_branch + (1 - Pr(y = 1 | stuff)) * y0_branch
                     = Pr(y = 1 | stuff) * [Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff)] +
                       Pr(y = 0 | stuff) * [Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)]