Missing Categorical Data


Chris Weber

Nov 13, 2015, 11:44:54 AM
to Stan users mailing list
I have a really basic question which I just can't seem to figure out. I have a two-category variable with a lot of missing data. It's my understanding that the easiest way to deal with this -- since Stan doesn't sample discrete parameters -- is to marginalize over the missing data. Suppose I am estimating an IRT model, where y is the observed data vector (aa indexes the item and bb indexes the respondent); then I believe something like this should work in the model block (and produce the equivalent of bernoulli_logit):

for (n in 1:N) {
  logp[n] <- bernoulli_logit_log(y[n], alpha[aa[n]] * theta[bb[n]] - beta[aa[n]]);
}
increment_log_prob(log_sum_exp(logp));

This of course assumes fully observed data. The problem is how do I structure the data and specify logp[n] if y contains both missing and observed data?  The chapter on latent discrete parameterizations is really useful in explaining the issue for a variety of models, but I can't seem to figure out how to apply it to this embarrassingly simple case.  I entirely understand the logic, though for whatever reason, I don't quite grasp how it would be set up in Stan.  Thanks in advance.




Ben Goodrich

Nov 13, 2015, 11:56:58 AM
to Stan users mailing list
On Friday, November 13, 2015 at 11:44:54 AM UTC-5, Chris Weber wrote:
I have a really basic question which I just can't seem to figure out. I have a two-category variable with a lot of missing data. It's my understanding that the easiest way to deal with this -- since Stan doesn't sample discrete parameters -- is to marginalize over the missing data.

If a binary outcome is Missing At Random, then it is relatively easy to deal with the missingness using the log_mix() function:

increment_log_prob(log_mix(p, bernoulli_logit_log(1, p), bernoulli_logit_log(0, p)));

where p is the probability of success, which may be a function of some other parameters. What this is doing is getting the log probability if y were 1 and weighting that by p and then adding the log probability if y were 0, weighted 1 - p.
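(Editor's note: what log_mix() computes can be checked outside Stan. The sketch below re-implements it in Python purely for illustration -- the `log_mix` and `log_sum_exp` functions here are stand-ins, not Stan's actual source.)

```python
import math

def log_sum_exp(a, b):
    # Numerically stable log(exp(a) + exp(b))
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_mix(theta, lp1, lp2):
    # Stand-in for Stan's log_mix(theta, lp1, lp2):
    # log(theta * exp(lp1) + (1 - theta) * exp(lp2))
    return log_sum_exp(math.log(theta) + lp1, math.log1p(-theta) + lp2)

# Marginalizing a missing Bernoulli outcome: weight the log probability
# of y = 1 by p and the log probability of y = 0 by 1 - p.
p = 0.3
marginal = log_mix(p, math.log(p), math.log1p(-p))
direct = math.log(p * p + (1 - p) * (1 - p))  # same quantity, computed directly
```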

However, in an IRT context, the Missing at Random assumption is usually implausible (even if you are ostensibly conditioning on latent ability) because students skip questions that they initially have trouble answering or fail to make it to the end of the test in time because they struggled too much with the questions at the beginning of the test. In that case, you should also somehow be modeling a binary variable that indicates whether the student answered the question.

Ben

Chris Weber

Nov 13, 2015, 12:35:34 PM
to Stan users mailing list
Thanks for the quick reply. Just a few more questions (sorry).

In 

increment_log_prob(log_mix(p, bernoulli_logit_log(1, p), bernoulli_logit_log(0, p)));

Are the last two components in reference to the missing data?  If so, where do I define the model:

y[n] ~ bernoulli_logit(alpha[aa[n]] * theta[bb[n]] - beta[aa[n]]);

Also, I'm a bit confused about the data. If y is coded [1, 0, NA], should I be declaring NA to be a third category?

Thanks again.

Bob Carpenter

Nov 13, 2015, 12:58:53 PM
to stan-...@googlegroups.com
There is no NA in Stan. There are two relevant chapters in
the user's guide part of the manual:

* missing data

* latent discrete parameters

We don't discuss the log_mix function there because the
feature got in without an example in the doc, but it just
simplifies the computations we do discuss there.

- Bob

Chris Weber

Nov 13, 2015, 6:15:19 PM
to Stan users mailing list
Thanks. The chapters are useful. log_mix makes sense to me when the discrete parameter is unobserved, such as in a change-point model or a mixture model, but it doesn't make sense to me in this context. My data are in a long, database-style format, as described in the manual (p. 136). If the data were complete, I would specify:

(1)
for (n in 1:N) {
  y[n] ~ bernoulli_logit(alpha[kk[n]] * theta[jj[n]] - beta[kk[n]]);
}
What confuses me is how I am supposed to loop through y_i when observations are missing. Shouldn't the code below: 

(2) increment_log_prob(log_mix(p, bernoulli_logit_log(1, p), bernoulli_logit_log(0, p)));

operate on a variable coded 1 for observed, 0 for missing? If this is the case, then how is the actual response model (1) applied? Is it something like y[n[observed]] and then y[n[missing]] in the two bernoulli_logit() calls? Of course, y[n[missing]] will not work, because NAs are not permitted; y_miss could be declared as an unobserved variable in the parameters block. Thanks again for all the help with this -- I know the answer should be obvious... it's just not clear to me how the data should be structured.

Ben Goodrich

Nov 13, 2015, 7:51:22 PM
to Stan users mailing list
On Friday, November 13, 2015 at 6:15:19 PM UTC-5, Chris Weber wrote:
What confuses me is how I am supposed to loop through y_i when observations are missing. Shouldn't the code below: 

(2) increment_log_prob(log_mix(p, bernoulli_logit_log(1, p), bernoulli_logit_log(0, p)));

operate on a variable coded 1 for observed, 0 for missing?

Yes, you can pass in an integer array like

int<lower=0,upper=1> M[N];

and do the above for observations where M is 1 and do a regular bernoulli_logit when M is 0.

Ben


Chris Weber

Nov 15, 2015, 3:48:07 PM
to Stan users mailing list

Thanks, Ben. This is now starting to make more sense. To be sure, the code should look something like this:

// gamma is the probability of missing data, m is the vector of missing data
// (coded 1 = missing, 0 = present), and y is the vector of observed data.
for (n in 1:N) {
  increment_log_prob(log_mix(gamma,
                             bernoulli_logit_log(m[n], gamma),
                             bernoulli_logit_log(y[n], alpha[kk[n]] * theta[jj[n]] - beta[kk[n]])));
}

I'm still not entirely sure this is right, but if I'm marginalizing over the missing data, I believe bernoulli_logit_log(m, gamma) is the correct statement.  

Or am I to specify the response model for y_miss in the second bernoulli_logit statement? The problem with doing this is that it must be declared as a parameter, and I can't use bernoulli_logit_log, since that function requires a discrete value as the first entry.

It's extremely slow. I simulated a dataset with 5 items and 500 respondents with 10% randomly generated missing data, and it takes about two hours to run 1000 iterations (compared to about a minute if the missing data are dropped). I've attached the full model code below, in case that's helpful.

Thanks again for all the help.

# K, kk index items; J, jj index respondents. N is the length of the observed
# data, N_miss the missing data, and N_full the full data, including missing
# codes. m indexes missingness: 1 = missing, 0 = not missing.

dat <- list(J = J,
            K = K,
            N_full = N,
            N = N_obs,
            N_miss = N_miss,
            jj = jj,
            kk = kk,
            y = y_obs,
            m = abs(as.numeric(missing)))



two_pl<-"
data {
  int<lower=1> J;                // number of respondents
  int<lower=1> K;                // number of roll calls
  int<lower=1> N;                
  int<lower=1> N_full;                
  int<lower=1> N_miss;               
  int<lower=1,upper=J> jj[N];    
  int<lower=1,upper=K> kk[N];    // Roll Call
  int<lower=0,upper=1> y[N];     // Roll call for respondent n
  int<lower=0,upper=1> m[N_full];     // Missing data code for respondent n

}
parameters {   
  real theta[J];                // latent score for respondent
  real<lower=0> alpha[K];                 // discrimination of roll call k
  real beta[K];                  //  difficulty of roll call k
  real<lower=0, upper=1> gamma; //probability of data missing

}
model {
  gamma ~ uniform(0,1);
  theta ~ normal(0,1);
  beta ~ normal(0,2);
  alpha ~ normal(0,2);
  for (n in 1:N) {
    increment_log_prob(log_mix(gamma,
                               bernoulli_logit_log(m, gamma),
                               bernoulli_logit_log(y[n], alpha[kk[n]] * theta[jj[n]] - beta[kk[n]])));
  }
}
"



fit <- stan(model_code = two_pl,  pars=c("alpha", "beta"),
            data = dat, iter=1000,
            chains=1) 




Ben Goodrich

Nov 15, 2015, 3:53:07 PM
to Stan users mailing list
On Sunday, November 15, 2015 at 3:48:07 PM UTC-5, Chris Weber wrote:

Thanks, Ben. This is starting to now make more sense. To be sure, the code should look something like this:

//gamma is the probability of missing data, m is the vector of missing data, coded 1=missing, 0=present, and y is the vector of observed data.

No, the first argument to log_mix() should be the conditional probability that the missing value is 1. You need a second part to the likelihood to model the probability that a value goes missing.

Ben
 

Chris Weber

Nov 15, 2015, 5:54:40 PM
to Stan users mailing list

Sorry -- it should read:

increment_log_prob(log_mix(gamma, bernoulli_log(m, gamma), bernoulli_logit_log(y[n], alpha[kk[n]] * theta[jj[n]] - beta[kk[n]])));

Since bernoulli_log collects the log probability that a value is missing (m is the vector of missing data), gamma should represent the probability that a value is missing. Why can't this value then be used as the first argument, which is also the probability that a value is missing? I understand this is overly simplistic, and I could also model the probability that a value is missing, but shouldn't it work?
 

Ben Goodrich

Nov 15, 2015, 8:57:07 PM
to Stan users mailing list
It is even simpler.
  • For each unit that is observed, you have bernoulli_log(y[i,j], p) where p = Pr(y = 1 | x)
  • For each unit that is missing, you need a mixture of the two possibilities (y = 1 and y = 0) where the mixing weight is again p = Pr(y = 1 | x). So, that would be log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p)).
  • For each observation,  you need another likelihood contribution for q = Pr(m = 1 | z), which is just bernoulli_log(m[i,j], q).

I can't see any reason why you would want to impose the constraint that p = q.


Ben

Chris Weber

Nov 16, 2015, 8:38:45 PM
to Stan users mailing list
Ben,

Okay, so for each observed unit I have the response equation:

y[n]~bernoulli_logit(alpha[kk[n]]*theta[jj[n]]- beta[kk[n]]);

Are you suggesting that I need to declare p = alpha[kk[n]] * theta[jj[n]] - beta[kk[n]] as p in the mixing equation? This won't work, since p is a vector of length N_obs, and m is of length N_obs + N_miss.

Or, are you suggesting two parts need to be included, something like:

for (n in 1:N_obs) {
  y[n] ~ bernoulli_logit(alpha[kk[n]] * theta[jj[n]] - beta[kk[n]]); // IRT model that y = 1
  increment_log_prob(log_mix(p, bernoulli_log(m, p), bernoulli_log(m, q)));
}

The problem is that y[i,j] and m[i,j] are of different dimensions. I'm also a bit puzzled by:
  • For each observation,  you need another likelihood contribution for q = Pr(m = 1 | z), which is just bernoulli_log(m[i,j], q). 
The way that I coded m, it's m=1 if missing, m=0 if observed. Isn't the first bernoulli_log in

< increment_log_prob(log_mix(gamma, bernoulli_log(m, gamma), bernoulli_logit_log(y[n], alpha[kk[n]] * theta[jj[n]] - beta[kk[n]]))); >

the likelihood contribution of m = 1 | z? And isn't the second the likelihood contribution when m = 0, which is simply the IRT function? Shouldn't the mixing parameters then be p = pr(missing) and q = pr(observed) = 1 - pr(missing)? Again, I really appreciate all the help with this.

Thanks,
Chris






Ben Goodrich

Nov 17, 2015, 9:44:55 PM
to Stan users mailing list
On Monday, November 16, 2015 at 8:38:45 PM UTC-5, Chris Weber wrote:
Okay, so for each observed unit I have the response equation:

y[n]~bernoulli_logit(alpha[kk[n]]*theta[jj[n]]- beta[kk[n]]);

Let's call that a vector of length N_obs:

p <- alpha[kk[n]] * theta[jj[n]] - beta[kk[n]];

You are going to have to loop over N_obs. If y[n,i] is observed, then you have two nonmixing likelihood contributions

if (m[n,i] == 0) {
  y[n,i] ~ bernoulli_logit(p[i]);
  m[n,i] ~ bernoulli_logit(something[i]); // I previously called something[i] q
}

where something is a vector whose logit transformation is the conditional probability of being missing that you will have to come up with an expression for depending on what makes sense in your research context. For example, the probability of being missing may be a monotonically increasing function of the order in which the questions were asked (presuming some students do not finish the test).

If y[n,i] is missing, then you have two likelihood contributions, one of which is a mixture of success and failure weighted by the probability of success, and the other of which is not a mixture:
if (m[n,i] == 1) {
  increment_log_prob(log_mix(p[i], bernoulli_logit_log(1, p[i]), bernoulli_logit_log(0, p[i])));
  m[n,i] ~ bernoulli_logit(something[i]);
}

Ben
 

Chris Weber

Nov 20, 2015, 11:46:31 AM
to Stan users mailing list

Ben, 

I see. This now makes a lot of sense. I didn't realize that the code is nearly identical to what it would be with a continuous missing variable. For some reason, I thought the mixing proportion was based on the proportion of missing data.

Because p is of length N_obs, I assume it needs to be defined separately when missing == 1. I moved things around a bit because of several indexing exceptions. Also, in the mixing function, I defined the weight as inv_logit(p), since I assume this should be a mixing proportion. I also moved the missing-data component, m ~ bernoulli_logit(q), outside the loops. I haven't yet defined a conditional probability of missingness and, for the time being, have just specified it as drawn from a distribution bounded at 0 and 1.

Does this match what you've described? The model is quite slow: with a small simulated dataset (500 observations, 10 items, and 25% missing at random), it takes about 60 minutes to complete 1000 iterations. Any hints on what I might do to make this more efficient? The real dataset I'll be analyzing is much larger. Perhaps this is expected given the mixture component, and 25% missing is a bit extreme.

I know I've said it several times, but I really do appreciate all the help.


model {
  real p1[N_obs];
  real p2[N_miss];
  theta ~ normal(0,1);
  beta ~ normal(0,2);
  alpha ~ normal(0,2);

  for (n in 1:N_obs) {
    p1[n] <- alpha[kk_obs[n]] * theta[jj_obs[n]] - beta[kk_obs[n]]; // length N_obs, response model
    y[n] ~ bernoulli_logit(p1[n]);
  }
  for (n in 1:N_miss) {
    p2[n] <- alpha[kk_miss[n]] * theta[jj_miss[n]] - beta[kk_miss[n]]; // length N_miss, response model
    increment_log_prob(log_mix(inv_logit(p2[n]), bernoulli_logit_log(1, p2[n]), bernoulli_logit_log(0, p2[n])));
  }
  m ~ bernoulli_logit(q);
}
"


Bob Carpenter

Nov 20, 2015, 1:58:33 PM
to stan-...@googlegroups.com

> On Nov 20, 2015, at 11:46 AM, Chris Weber <christoph...@gmail.com> wrote:
>
> ...


> increment_log_prob(log_mix(inv_logit(p2[n]), bernoulli_logit_log(1, p2[n]), bernoulli_logit_log(0, p2[n])));

This can't be right. First, I doubt you meant to have p2 in both the mixing and
the outcomes. But even if the mixing proportion was different, say lambda, it
just reduces to 1, so doesn't do anything:

lambda * p(x) + (1 - lambda) (1 - p(x)) = 1

- Bob

Chris Weber

Nov 20, 2015, 2:25:10 PM
to Stan users mailing list
Originally, no, but Ben notes:

"If y[n,i] is missing, then you have two likelihood contributions, one of which is a mixture of success and failure weighted by the probability of success and the other is not a mixture"

And suggests the following:

p <- alpha[kk[n]] * theta[jj[n]] - beta[kk[n]];

if (m[n,i] == 0) {
  y[n,i] ~ bernoulli_logit(p[i]);
  m[n,i] ~ bernoulli_logit(something[i]); // I previously called something[i] q
}

if (m[n,i] == 1) {
  increment_log_prob(log_mix(p[i], bernoulli_logit_log(1, p[i]), bernoulli_logit_log(0, p[i])));
  m[n,i] ~ bernoulli_logit(something[i]);
}

with "something" corresponding to the conditional probability of missingness. Isn't the case of m == 1 the weighted probability that y = 1?

I guess I'm just really confused about what log_mix is doing and how it's supposed to be specified. If p is the IRT/response model that y=1, is it somehow something different when m==1?

Thanks

Bob Carpenter

Nov 20, 2015, 2:28:29 PM
to stan-...@googlegroups.com
The definition of log_mix is in the doc. It's just
a function. Whatever's going on with the model, you
can just eliminate this:

increment_log_prob(log_mix(p[i], bernoulli_logit_log(1,p[i]), bernoulli_logit_log(0, p[i])));

Because it reduces to a no-op.

- Bob

Chris Weber

Nov 20, 2015, 6:10:47 PM
to Stan users mailing list
Okay. Thank you. 

I realize it's just a function to estimate mixtures -- I've been through the manual a number of times. I guess I just don't see how it applies in the context of missing discrete data. The suggestion that
  • For each unit that is missing, you need a mixture of the two possibilities (y = 1 and y = 0) where the mixing weight is again p = Pr(y = 1 | x). So, that would be log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p)).
appears to be what Ben suggested and what is specified in his sample code. Is there a different mixing proportion I should be using? Should it be used someplace other than when m==1? Thanks. 

Ben Goodrich

Nov 20, 2015, 6:32:30 PM
to Stan users mailing list

No

Bob Carpenter

Nov 21, 2015, 12:13:29 AM
to stan-...@googlegroups.com
I still can't figure out what you're trying to do,
so I should leave this to Ben.

If you haven't, you should at least work through the
algebra to see why no matter what a in (0, 1) and
p in (0, 1) are, the expression

log_mix(a, log(p), log(1-p))

works out to 0 and hence doesn't do anything.

The chapter on latent discrete parameters shows how
to marginalize missing discrete parameters, as does
the mixture modeling sections of the manual.

- Bob

Chris Weber

Nov 21, 2015, 9:38:01 AM
to Stan users mailing list
I have an IRT model. As is often the case, I have a fair amount of missing data. In the past I've used JAGS with these models, and as you know, NAs are permitted there. So I'm trying to figure out something comparable here that doesn't require dropping the data.

The sample code I've included is just a variant of the IRT model in the manual. Rather than dropping the data, I'm trying to account for the missing data. The problem is that y is a discrete variable. One of the suggestions seems to be to use a mixture approach to do this. Ben had suggested that for y_miss, I use:

increment_log_prob(log_mix(p[i], bernoulli_logit_log(1,p[i]), bernoulli_logit_log(0, p[i]))); 

when m==1 (m is coded 0 when y is observed 1 when y is missing). p is the conditional probability of y=1.

I'd rather do something other than drop the data. If this is the wrong approach, then I'm honestly not sure what the correct approach should be.

At this point, I'm fine with assuming MAR. This is a large cross-sectional survey in which questions were randomized, so that seems reasonable to me. I'm aware of the chapters on latent discrete parameterizations, missing data, and mixtures. I was merely searching for some pointers on how such a model is set up with missing data. It seems to be a common issue, and it just wasn't clear to me how to implement it in Stan. Thanks




Ben Goodrich

Nov 21, 2015, 12:33:02 PM
to Stan users mailing list
On Saturday, November 21, 2015 at 9:38:01 AM UTC-5, Chris Weber wrote:
I have an IRT model. As is often the case, I have a fair amount of missing data. In the past, I've used JAGS with these models, and as you know, NAs are permitted. So, I'm trying to figure out something comparable here that doesn't require dropping the data.

The sample code I've included, is just a variant of the IRT model in the manual. Rather than dropping the data, I'm trying to account for the missing data. The problem is that y is a discrete variable. One of the suggestions seems to be to use a mixture approach to do this. Ben had suggested that for y_miss, I use:

increment_log_prob(log_mix(p[i], bernoulli_logit_log(1,p[i]), bernoulli_logit_log(0, p[i]))); 

when m==1 (m is coded 0 when y is observed 1 when y is missing). p is the conditional probability of y=1.

I'd rather do something other than drop the data. If this is the wrong approach, then I'm honestly not sure what the correct approach should be.

I think what Bob and I are saying is that if you believe, prior to seeing the data, that a binary random variable Y is distributed Bernoulli with some probability p (that may vary from observation to observation as a function of the IRT stuff), how could you possibly believe that the likelihood contribution for a missing outcome is anything other than

log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))

which is equivalent to but less stable than

log_mix(p, bernoulli_logit_log(1, eta), bernoulli_logit_log(0, eta))

where p = inv_logit(eta)? To believe anything else about a missing outcome would be contradictory. Just make eta a vector and modify the previous expression to

increment_log_prob(log_mix(inv_logit(eta[i]), bernoulli_logit_log(1, eta[i]), bernoulli_logit_log(0, eta[i])));

whenever m[i] == 1. If m[i] == 0, then you just have

increment_log_prob(bernoulli_logit_log(y[i], eta[i]));
 
At this point, I'm fine with assuming MAR. This is a large cross sectional survey, in which questions were randomized, so that seems reasonable to me.

With nothing else, the above would assume MAR conditional on the student's latent ability, but even for a dataset like PISA where some students are asked some questions but not others by the design of the research, you still have some missingness on questions a student was actually asked, which provides additional information on the student's (lack of) ability. The most obvious reasons why a student would not answer an asked question are that it seemed difficult and / or it was toward the end of the test and the student never got to it. Thus, it is pretty easy to construct another likelihood contribution for m, which could also be a logit model and does not require any mixing because m is observed for all students and all questions.

In the case where a student is not asked a question by design, that is not a random variable and does not contribute anything to the likelihood.

Ben

Bob Carpenter

Nov 21, 2015, 1:09:35 PM
to stan-...@googlegroups.com

> On Nov 21, 2015, at 12:33 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> On Saturday, November 21, 2015 at 9:38:01 AM UTC-5, Chris Weber wrote:
> I have an IRT model. As is often the case, I have a fair amount of missing data. In the past, I've used JAGS with these models, and as you know, NAs are permitted. So, I'm trying to figure out something comparable here that doesn't require dropping the data.
>
> The sample code I've included, is just a variant of the IRT model in the manual. Rather than dropping the data, I'm trying to account for the missing data. The problem is that y is a discrete variable. One of the suggestions seems to be to use a mixture approach to do this. Ben had suggested that for y_miss, I use:
>
> increment_log_prob(log_mix(p[i], bernoulli_logit_log(1,p[i]), bernoulli_logit_log(0, p[i])));
>
> when m==1 (m is coded 0 when y is observed 1 when y is missing). p is the conditional probability of y=1.
>
> I'd rather do something other than drop the data. If this is the wrong approach, then I'm honestly not sure what the correct approach should be.
>
> I think what Bob and I are saying

Not me. I'm still asking.

> is that if you believe, prior to seeing the data, that a binary random variable Y is distributed Bernoulli with some probability p (that may vary from observation to observation as a function of the IRT stuff), how could you possibly believe that the likelihood contribution for a missing outcome is anything other than
>
> log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))

This expression reduces to 0. In fact, so does the more
general

log_mix(q, bernoulli_log(1, p), bernoulli_log(0, p))

Why is the missingness related to the probability that the outcome
is one?

Also, in the IRT setting, do we get any information by modeling
the missingness? I suppose you would, because you could see a student
answering just a couple questions they knew the answers to and
leaving the rest blank, from which you don't want to infer that their
ability is +infinity. We should be working on some of these cases
for our IES grant!

- Bob

Ben Goodrich

Nov 21, 2015, 1:14:31 PM
to Stan users mailing list
On Saturday, November 21, 2015 at 12:33:02 PM UTC-5, Ben Goodrich wrote:
With nothing else, the above would assume MAR conditional on the student's latent ability, but even for a dataset like PISA where some students are asked some questions but not others by the design of the research, you still have some missingness on questions a student was actually asked, which provides additional information on the student's (lack of) ability. The most obvious reasons why a student would not answer an asked question are that it seemed difficult and / or it was toward the end of the test and the student never got to it. Thus, it is pretty easy to construct another likelihood contribution for m, which could also be a logit model and does not require any mixing because m is observed for all students and all questions.

In the case where a student is not asked a question by design, that is not a random variable and does not contribute anything to the likelihood.

Another way of saying the same thing is that the observed data of a student on a test question are one of
  1. Observed to be the correct answer
  2. Observed to be the incorrect answer
  3. Unobserved but the student would have given the correct answer to that question
  4. Unobserved but the student would have given the incorrect answer to that question

So, you need a likelihood function for that four-headed monster. The usual way of proceeding (the "selection" approach) is to factor this as


Pr(y, m | stuff) = Pr(y | stuff) * Pr(m | y, stuff)

In the situations where a student was not asked a particular question by design, then Pr(m | y, stuff) = Pr(m | stuff). In other words, whether the student would have gotten that question correct tells you nothing about the probability that the student would have an opportunity to answer the question (which was determined by the research design alone). That is the MAR case, and you don't have to mess with Pr(m | stuff).


Conversely, whether the student would have gotten the answer right or wrong tells you a lot about the probability that the student would answer a question that was, in fact, asked. In other words, Pr(m | y, stuff) != Pr(m | stuff), and you need a model for Pr(m | y, stuff). That could be anything but at least seems to depend on the difficulty of the question and the position of the question in the test.


All this is a bit more tedious in Stan than BUGS because HMC needs derivatives and thus cannot do discrete unknowns. So, when m = 1 (and thus y is missing), we have to write


Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff) +
                       Pr(y = 0 | stuff) * Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)

Also, HMC utilizes the logarithm of the posterior density, so if you take the logarithm of both sides, on the right-hand side you get a log of a sum of exponentials,

log(Pr(y, m = 1 | stuff)) = log(exp(log(Pr(y = 1 | stuff)) + log(Pr(y = 1 | stuff)) + log(Pr(m = 1 | y = 1, stuff))) +
                                exp(log(Pr(y = 0 | stuff)) + log(Pr(y = 0 | stuff)) + log(Pr(m = 1 | y = 0, stuff))))

which is what the log_mix() function facilitates by wrapping log_sum_exp() (which is why you may see both forms in the User Manual). And since Pr(y = 0 | stuff) = 1 - Pr(y = 1 | stuff), you only have to pass Pr(y = 1 | stuff) to the first argument of log_mix(). The second and third arguments to log_mix() are the two things being mixed together on the log scale, which are at first log(Pr(y = 1| stuff)) and log(Pr(y = 0 | stuff)). You could call log_mix() again to mix log(Pr(m = 1 | y = 1, stuff)) and log(Pr(m = 1 | y = 0, stuff)) in the NMAR case.

Ben

Ben Goodrich

Nov 21, 2015, 1:31:32 PM
to Stan users mailing list
On Saturday, November 21, 2015 at 1:09:35 PM UTC-5, Bob Carpenter wrote:
> is that if you believe, prior to seeing the data, that a binary random variable Y is distributed Bernoulli with some probability p (that may vary from observation to observation as a function of the IRT stuff), how could you possibly believe that the likelihood contribution for a missing outcome is anything other than
>
> log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))

This expression reduces to 0.  In fact, so does the more
general

  log_mix(q, bernoulli_log(1, p), bernoulli_log(0, p))

Ah, maybe I am perpetuating some of the confusion. I think we need (on the probability scale)

p * [p^1 * (1 - p)^(1 - 1)] + (1 - p) * [p^0 * (1 - p)^(1 - 0)] = p^2 + (1 - p)^2 = 1 - 2p + 2p^2

But log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p)) does not do that on the log scale?
 
Why is the missingness related to the probability that the outcome
is one?

What you said, basically that students skip questions they have a hard time answering or don't get to questions at the end because they took too long to answer the questions at the beginning.

 
Also, in the IRT setting, do we get any information by modeling
the missingness?  I suppose you would, because you could see a student
answering just a couple questions they knew the answers to and
leaving the rest blank, from which you don't want to infer that their
ability is +infinity.  We should be working on some of these cases
for our IES grant!

Yeah.

Ben

Chris Weber

Nov 23, 2015, 11:20:50 AM
to Stan users mailing list
Assuming that

  log_mix(lambda, log(p), log(1-p))

is

  lambda * p(x) + (1 - lambda) * (1 - p(x)) = 1

which it seems to be, and lambda and p(x) are between 0 and 1, the expression is not 1, except in certain cases (e.g., when both p(x) and lambda = 1)... but again, maybe this isn't actually what log_mix is doing.

In terms of the four-headed monster (great term... I completely agree), if we factor p(y, m) in the usual selection specification, so

Pr(y, m | stuff) = Pr(y | stuff) * Pr(m | y, stuff)

Are you sure the likelihood contribution when m == 1 is:

Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff) +
                       Pr(y = 0 | stuff) * Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)

It's not clear why there are two Pr(y = 1 | stuff) and Pr(y = 0 | stuff) terms....

Thanks

Bob Carpenter

Nov 23, 2015, 1:10:30 PM
to stan-...@googlegroups.com
I think I got the algebra wrong last time. Here's the
right thing:

log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p))

= log_mix(p, log(p), log(1 - p))

= log(p * exp(log(p)) + (1 - p) * exp(log(1-p)))

= log(p * p + (1 - p) * (1 - p))

= log(p^2 + 1 - 2p + p^2)

= log(2 * p^2 - 2p + 1)

So it doesn't reduce to a no-op. So this is missing
missingness information or something? I still don't understand
why p is used twice here.

- Bob
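(Editor's note: Bob's corrected algebra is easy to verify numerically. A quick sketch in Python, where `log_mix` is a stand-in re-implementation of Stan's function:)

```python
import math

def log_mix(theta, lp1, lp2):
    # Stand-in for Stan's log_mix(): log(theta * exp(lp1) + (1 - theta) * exp(lp2))
    return math.log(theta * math.exp(lp1) + (1.0 - theta) * math.exp(lp2))

checks = []
for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    lhs = log_mix(p, math.log(p), math.log(1.0 - p))  # the expression in question
    rhs = math.log(2.0 * p ** 2 - 2.0 * p + 1.0)      # Bob's closed form
    checks.append((lhs, rhs))

# lhs matches rhs at every p, and is strictly negative for p in (0, 1),
# so the expression is not a no-op (a no-op would contribute exactly 0).
```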

Chris Weber

Nov 23, 2015, 1:45:13 PM
to Stan users mailing list
Okay, good, thanks. I still don't know why p is used twice. It seems like it should be used once, in which case when missing==1, 

log_mix(pr(y), pr(miss|y=1), pr(miss|y=0)), right?

If MCAR, then 

log_mix(pr(y), pr(miss), pr(miss)), right?

If, 

Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff) +
                       Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)

The way it's written, pr(y) and 1 - pr(y) would be the two mixing weights, which subsequently weight the probability of missingness. This also seems to logically match the two forms of missingness:

  1. Unobserved but the student would have given the correct answer to that question
  2. Unobserved but the student would have given the incorrect answer to that question
Or no? Sorry for making this so confusing.

Ben Goodrich

Nov 23, 2015, 2:39:43 PM
to Stan users mailing list
On Monday, November 23, 2015 at 1:10:30 PM UTC-5, Bob Carpenter wrote:
  = log(2 * p^2 - 2p + 1)

So it doesn't reduce to a no-op.  So this is missing
missingness information or something?  I still don't understand
why p is used twice here.

The fact that an observation is missing is almost always known. Maybe try thinking of it like this: If we were using BUGS and y[i] were NA, at each iteration BUGS would draw a realization of the binary outcome from its full-conditional distribution, which is just Bernoulli(p). Now imagine drawing a lot of realizations of that binary outcome from a Bernoulli(p) and taking the log of the average Bernoulli PMF over the realizations. In R code that would be,

p <- 2 / 3 # or whatever depending on the student and the IRT stuff
y <- rbinom(n = 10^6, size = 1, prob = p)
log(mean(dbinom(y, size = 1, prob = p))) # about -0.588
log(1 - 2 * p + 2 * p^2) # basically also -0.588

In Stan, we can't draw from a Bernoulli in a model block, but we can evaluate log(1 - 2 * p + 2 * p^2) or equivalently log_mix(p, bernoulli_log(1, p), bernoulli_log(0, p)) to get a likelihood contribution for Pr(y | stuff) when y is missing. Then we need another likelihood contribution for the same observation(s) that is Pr(m = 1 | y, stuff), which would be another Bernoulli mixture that now marginalizes over the missing predictor (y) unless we made the assumption that Pr(m = 1 | y, stuff) = Pr(m = 1 | stuff).
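(Editor's note: the same simulation is easy to reproduce outside R; a sketch in Python using only the standard library, mirroring the R code above:)

```python
import math
import random

random.seed(20151123)
p = 2.0 / 3.0  # success probability, as in the R sketch above

# Draw many realizations of the unobserved Bernoulli outcome and average
# its PMF over the draws (the analogue of rbinom followed by dbinom).
n = 200_000
draws = [1 if random.random() < p else 0 for _ in range(n)]
mc = math.log(sum(p if y == 1 else 1.0 - p for y in draws) / n)

closed_form = math.log(1.0 - 2.0 * p + 2.0 * p ** 2)  # log(5/9), about -0.588
```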

Ben

Ben Goodrich

Nov 23, 2015, 2:49:16 PM
to Stan users mailing list
On Monday, November 23, 2015 at 11:20:50 AM UTC-5, Chris Weber wrote:
In terms of the four-headed monster (great term... I completely agree), if we factor p(y, m) in the usual selection specification, so

Pr(y, m | stuff) = Pr(y | stuff) * Pr(m | y, stuff)

Are you sure the likelihood contribution when m == 1 is:

Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff) +
                       Pr(y = 0 | stuff) * Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)

it's not clear why there are two pr(y=1|stuff) and pr(y=0|stuff) terms....

There are two such terms because one comes from the factorization

Pr(y, m | stuff) = Pr(y | stuff) * Pr(m | y, stuff)

and the other comes from the fact that we have to mix the above over the two possibilities for the unknown y. In other words, the mixing weight on the y = 1 branch is Pr(y = 1 | stuff) and the mixing weight on the y = 0 branch is Pr(y = 0 | stuff) = 1 - Pr(y = 1 | stuff). So, we get

Pr(y, m = 1 | stuff) = Pr(y = 1 | stuff) * y1_branch + (1 - Pr(y = 1 | stuff)) * y0_branch
                     = Pr(y = 1 | stuff) * [Pr(y = 1 | stuff) * Pr(m = 1 | y = 1, stuff)] +
                       Pr(y = 0 | stuff) * [Pr(y = 0 | stuff) * Pr(m = 1 | y = 0, stuff)]

Ben

Chris Weber

Nov 24, 2015, 1:42:16 PM
to Stan users mailing list
Of course -- I see, it now makes perfect sense. Thanks again.
