# Model for binary classification task with misclassification error

86 views

### Marko

Mar 6, 2017, 3:50:08 AM3/6/17
to Stan users mailing list
Dear Stan developers and experts,
I am new to Stan and don't really have anyone from whom to get feedback on such questions in my immediate environment. I am thankful for any feedback from the list and hope that my question is not embarrassingly stupid.

I want estimate P(x = 1) in a binary classification task which has some amount of misclassification error. For this I have two data sets.
1) The test data is a validation data set where for each unit a gold measure x (which I assume to represent the "true" values x) and the misclassified measure w is available. The test data set is *not* a representative sample of the population, so I cannot infer from the distribution of x in the test data to the distribution of x in the main data, but I am willing to assume that the misclassification process is the same in both data sets.
2) The main data is a representative sample of the population where only the misclassified measure w is available for each unit.

I started by simplifying the cervix bugs example (https://github.com/stan-dev/example-models/tree/master/bugs_examples/vol2/cervix) to get the model described below. The basic idea is to estimate the misclassification probabilities phi from x and w in the test data set and q (P(x = 1) in the main data) from the misclassification probabilities and w in the main data. It is also possible to specify more informative priors on q or phi.

From my layman's perspective, the model seems to be ok, but I never learned how to systematically evaluate such models. It would be great if someone could give me some feedback on whether the model is at all sensible.
If the model in general makes sense, I would like to generalize the model to classification tasks with more than two categories. I would appreciate any pointers to resources that might be relevant for this.

Thank you!
Marko

Stan model:
data {
int<lower=0> Nc; // Test data sample size (gold measure and misclassified measure are observed)
int<lower=0> Ni; // Main data sample size (only misclassified measure is observed)
int xc[Nc]; // gold measure in test data
int wc[Nc]; // misclassified measure in test data
int wi[Ni]; // misclassified measure in main data
}

parameters
{
real
<lower=0,upper=1> phi[2]; // misclassification probabilities: phi[P(w = 1 | x = 0), P(w = 1 | x = 1)]
real
<lower=0,upper=1> q; // Probability of gold = 1 in main data, P(x = 1) <- quantity of interest
}

model
{
for (n in 1:Nc) { // for each case in test data:
wc
[n] ~ bernoulli(phi[xc[n] + 1]); // estimate misclassification probabilities
}
for (n in 1:Ni) { // for each case with incomplete data:
wi
[n] ~ bernoulli(phi[1] * (1 - q) + phi[2] * q); // estimate q
}
// Priors on q and/or phi possible, but not necessary
// q ~ uniform(0, 1);
// phi[1] ~ normal(.2, .05);  // Priors for misclassification probabilities
// phi[2] ~ normal(.8, .05);  // Priors for misclassification probabilities
}

I also paste a simple simulation model in R to generate example data, in case this helps to evaluate the model:
library(rstan)
library(tidyverse); theme_set(theme_bw())

# Simulated data

# Sample sizes
Nc = 200
Ni = 1e3

# Misclassification
misclass
= function(x, misclassification_matrix){
x
= as.character(x)
sample
(colnames(misclassification_matrix), size = 1, prob = misclassification_matrix[, x]) %>% as.integer
}
mcm
= matrix(c(.7,.3,.1,.9), nrow = 2) # misclassification matrix (phi is in mcm[2, ])
dimnames
(mcm) = list(0:1,0:1)
mcm

xc
= rbinom(Nc, 1, .5) # gold measure in test data; distribution != main data
wc
= sapply(xc, misclass, misclassification_matrix = mcm) # misclassified measure in test data
xi
= rbinom(Ni, 1, .2) # gold measure in main data, not observed
wi
= sapply(xi, misclass, misclassification_matrix = mcm) # misclassified measure in main data

fm_stan
= stan(model_code = stan_model, data = list(Nc = Nc, Ni = Ni, xc = xc, wc = wc, wi = wi), control = list(adapt_delta = 0.95))

### Bob Carpenter

Mar 7, 2017, 11:44:24 AM3/7/17
For some hints on how to evaluate classifiers, you can check out
the case study on repeated binary trials.

http://mc-stan.org/documentation/case-studies/pool-binary-trials.html

We highly recommend testing on simulated data first. And
starting with simple models then generalizing a step at a time.

Usually for measurement error models, the generative model is
that there's a true value, then a noisy measurement. There are
examples in the measurement error chapter of the manual and in
the latent discrete parameters chapter for the Dawid and Skene
model for classification training data, which is probably closer
to what you want.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

### Marko Bachl

Mar 8, 2017, 12:14:15 PM3/8/17
Dear Bob,
Thank you for your helpful suggestions. Especially the Stan version of the Dawid and Skene model in the manual will be extremely helpful! I started googling "dawid skene stan" and came across your GitHub project (https://github.com/bob-carpenter/anno) and an older blog post (http://andrewgelman.com/2015/02/15/stan-3/). The paper with Becky Passonneau is really interesting and has motivated me to go in a similar direction.
I also looked at the code at GitHub and found somewhere hidden a reference to a file named "dawid-skene-hier.stan" - the file itself is not on GitHub. Would you be willing to share the code for your implementation of a hierarchical Dawid-Skene model?

Many thanks again!
Marko

> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+unsubscribe@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/G1nPD2TbZ08/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+unsubscribe@googlegroups.com.

### Marko

Mar 14, 2017, 1:16:12 PM3/14/17
to Stan users mailing list
Dear Bob, dear list,
Thank you again for the reference to the Dawid and Skene model. It provides a very helpful framework for my problem.
I work in communication research, where a typical content analysis usually works in two steps. First, you train the annotators and let them classify the same test units. The Dawid and Skene model can then be used to estimate the category proportions (pi_test in the following model code) and the misclassification matrices for all annotators (theta). Second, each unit in the actual study sample is only classified by one annotator.
My idea is to use the estimate of misclassification in the test data to correct the estimate of the category proportions in the main study (pi_study) - with the necessary assumption that misclassification in the test and in the actual study are the same. I have added the parts of the study data to the model and it seems to work well with simulated data. However, it is still quite likely that a made a stupid mistake somewhere along the way.

Any feedback on the model is highly appreciated.
Thank you!
Marko

Adapted Dawid and Skene model (from p. 220 of the manual, my changes in bold red):
data {
// test & study
int<lower=2> K; // number of categories
int<lower=1> J; // number of annotators
vector<lower=0>[K] beta[K]; // dirichlet prior on misclassification matrix theta
// test
int<lower=1> I_test; // number of coding units in test data
int<lower=1> N_test; // number of codings in test data
// Data set in long format with indices for annotator j and coding unit i; allows for unbalanced number of annotations per coding unit
int<lower=1,upper=K> y_test[N_test]; // codings in test data
int<lower=1,upper=I_test> ii_test[N_test]; // index coding units in test data
int<lower=1,upper=J> jj_test[N_test]; // index annotators in test data
vector<lower=0>[K] alpha_test; // dirichlet hyperprior on category prevalence in test data
// study; data for the main study where only on annotator annotates each coding unit; format is the same as above
int<lower=1> N_study; // number of codings in study data
int<lower=1,upper=N_study> ii_study[N_study]; // index coding units in study data
int<lower=1,upper=J> jj_study[N_study]; // index coder in study data
vector<lower=0>[K] alpha_study; // dirichlet prior on category prevalence in study data
// study
int<lower=1,upper=K> y_study[N_study]; // codings in study data
}
parameters {
simplex[K] pi_test;
simplex[K] theta[J, K];

simplex[K] pi_study_obs; // category proportions in study data without correction; estimated for comparison
simplex[K] pi_study_cor; // category proportions in study data corrected by misclassification theta
}
transformed parameters {
vector[K] log_q_z[I_test];
vector[K] log_q_z2[N_study]; // for study data

for (i in 1:I_test)
log_q_z[i] = log(pi_test);
for (n in 1:N_test)
for (k in 1:K)
log_q_z[ii_test[n], k] = log_q_z[ii_test[n], k] + log(theta[jj_test[n], k, y_test[n]]);

for (n in 1:N_study) // for study data; theta as in test data above
log_q_z2[n] = log(pi_study_cor);
for (n in 1:N_study)
for (k in 1:K)
log_q_z2[ii_study[n], k] = log_q_z2[ii_study[n], k] + log(theta[jj_study[n],k, y_study[n]]);
}
model {
pi_test ~ dirichlet(alpha_test);
pi_study_cor ~ dirichlet(alpha_study);
for (j in 1:J)
for (k in 1:K)
theta[j, k] ~ dirichlet(beta[k]);

for (i in 1:I_test)
target += log_sum_exp(log_q_z[i]);
for (n in 1:N_study)
target += log_sum_exp(log_q_z2[n]);

y_study ~ categorical(pi_study_obs);
}
generated quantities {
// Apply softmax to generate probabilities from log_q_z(2)
vector[K] softmax_lqz[I_test]; // probability of each k for each coding unit in test data
vector[K] softmax_lqz2[N_study]; // probability of each k for each coding unit in study data
for (i in 1:I_test)
softmax_lqz[i] = softmax(log_q_z[i, ]);
for (n in 1:N_study)
softmax_lqz2[n] = softmax(log_q_z2[n, ]);
}

dawid_skene_nopooling_model.stan

### Bob Carpenter

Mar 14, 2017, 2:23:28 PM3/14/17

> On Mar 14, 2017, at 1:16 PM, Marko <marko...@gmail.com> wrote:
>
> Dear Bob, dear list,
> Thank you again for the reference to the Dawid and Skene model. It provides a very helpful framework for my problem.
> I work in communication research, where a typical content analysis usually works in two steps. First, you train the annotators and let them classify the same test units. The Dawid and Skene model can then be used to estimate the category proportions (pi_test in the following model code) and the misclassification matrices for all annotators (theta). Second, each unit in the actual study sample is only classified by one annotator.
> My idea is to use the estimate of misclassification in the test data to correct the estimate of the category proportions in the main study (pi_study) - with the necessary assumption that misclassification in the test and in the actual study are the same.

You can do that. But even better, do full Bayesian inference
on all the data and parameters jointly. It's perfectly OK for
there to be only one (or even no) annotators per item. That's the only way
to properly propagate your uncertainty and it has the side
benefit of using all of your data efficiently.

> I have added the parts of the study data to the model and it seems to work well with simulated data. However, it is still quite likely that a made a stupid mistake somewhere along the way.

Maybe, but this kind of thing is almost impossible to debug just by
squinting at it.

You should be able to vectorize many of those loops. We still need
to go back and update all of our old model examples.

I need to write the Dawid and Skene model up with the Raykar et al.
JMLR model (it trains a logistic regression for prevalence rather
than just having a categorical distribution---or in machine learning
terms, trains a model for the problem and the annotation model at
the same time).

- Bob