86 views

Skip to first unread message

Mar 6, 2017, 3:50:08 AM3/6/17

to Stan users mailing list

Dear Stan developers and experts,

I am new to Stan and don't really have anyone from whom to get feedback on such questions in my immediate environment. I am thankful for any feedback from the list and hope that my question is not embarrassingly stupid.

I want estimate P(x = 1) in a binary classification task which has some amount of misclassification error. For this I have two data sets.

1) The test data is a validation data set where for each unit a gold measure x (which I assume to represent the "true" values x) and the misclassified measure w is available. The test data set is *not* a representative sample of the population, so I cannot infer from the distribution of x in the test data to the distribution of x in the main data, but I am willing to assume that the misclassification process is the same in both data sets.

2) The main data is a representative sample of the population where only the misclassified measure w is available for each unit.

I started by simplifying the cervix bugs example (https://github.com/stan-dev/example-models/tree/master/bugs_examples/vol2/cervix) to get the model described below. The basic idea is to estimate the misclassification probabilities phi from x and w in the test data set and q (P(x = 1) in the main data) from the misclassification probabilities and w in the main data. It is also possible to specify more informative priors on q or phi.

From my layman's perspective, the model seems to be ok, but I never learned how to systematically evaluate such models. It would be great if someone could give me some feedback on whether the model is at all sensible.

If the model in general makes sense, I would like to generalize the model to classification tasks with more than two categories. I would appreciate any pointers to resources that might be relevant for this.

Thank you!

Marko

Stan model:

`data {`

int<lower=0> Nc; // Test data sample size (gold measure and misclassified measure are observed)

int<lower=0> Ni; // Main data sample size (only misclassified measure is observed)

int xc[Nc]; // gold measure in test data

int wc[Nc]; // misclassified measure in test data

int wi[Ni]; // misclassified measure in main data

}

parameters {

real<lower=0,upper=1> phi[2]; // misclassification probabilities: phi[P(w = 1 | x = 0), P(w = 1 | x = 1)]

real<lower=0,upper=1> q; // Probability of gold = 1 in main data, P(x = 1) <- quantity of interest

}

model {

for (n in 1:Nc) { // for each case in test data:

wc[n] ~ bernoulli(phi[xc[n] + 1]); // estimate misclassification probabilities

}

for (n in 1:Ni) { // for each case with incomplete data:

wi[n] ~ bernoulli(phi[1] * (1 - q) + phi[2] * q); // estimate q

}

// Priors on q and/or phi possible, but not necessary

// q ~ uniform(0, 1);

// phi[1] ~ normal(.2, .05); // Priors for misclassification probabilities

// phi[2] ~ normal(.8, .05); // Priors for misclassification probabilities

}

I also paste a simple simulation model in R to generate example data, in case this helps to evaluate the model:

`library(rstan)library(tidyverse); theme_set(theme_bw())`

# Simulated data

# Sample sizes

Nc = 200

Ni = 1e3

# Misclassification

misclass = function(x, misclassification_matrix){

x = as.character(x)

sample(colnames(misclassification_matrix), size = 1, prob = misclassification_matrix[, x]) %>% as.integer

}

mcm = matrix(c(.7,.3,.1,.9), nrow = 2) # misclassification matrix (phi is in mcm[2, ])

dimnames(mcm) = list(0:1,0:1)

mcm

xc = rbinom(Nc, 1, .5) # gold measure in test data; distribution != main data

wc = sapply(xc, misclass, misclassification_matrix = mcm) # misclassified measure in test data

xi = rbinom(Ni, 1, .2) # gold measure in main data, not observed

wi = sapply(xi, misclass, misclassification_matrix = mcm) # misclassified measure in main data

fm_stan = stan(model_code = stan_model, data = list(Nc = Nc, Ni = Ni, xc = xc, wc = wc, wi = wi), control = list(adapt_delta = 0.95))

Mar 7, 2017, 11:44:24 AM3/7/17

to stan-...@googlegroups.com

For some hints on how to evaluate classifiers, you can check out

the case study on repeated binary trials.

http://mc-stan.org/documentation/case-studies/pool-binary-trials.html

We highly recommend testing on simulated data first. And

starting with simple models then generalizing a step at a time.

Usually for measurement error models, the generative model is

that there's a true value, then a noisy measurement. There are

examples in the measurement error chapter of the manual and in

the latent discrete parameters chapter for the Dawid and Skene

model for classification training data, which is probably closer

to what you want.

- Bob

> --

> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.

> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

> To post to this group, send email to stan-...@googlegroups.com.

> For more options, visit https://groups.google.com/d/optout.

the case study on repeated binary trials.

http://mc-stan.org/documentation/case-studies/pool-binary-trials.html

We highly recommend testing on simulated data first. And

starting with simple models then generalizing a step at a time.

Usually for measurement error models, the generative model is

that there's a true value, then a noisy measurement. There are

examples in the measurement error chapter of the manual and in

the latent discrete parameters chapter for the Dawid and Skene

model for classification training data, which is probably closer

to what you want.

- Bob

> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.

> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

> To post to this group, send email to stan-...@googlegroups.com.

> For more options, visit https://groups.google.com/d/optout.

Mar 8, 2017, 12:14:15 PM3/8/17

to stan-...@googlegroups.com

Dear Bob,

Thank you for your helpful suggestions. Especially the Stan version of the Dawid and Skene model in the manual will be extremely helpful! I started googling "dawid skene stan" and came across your GitHub project (https://github.com/bob-carpenter/anno) and an older blog post (http://andrewgelman.com/2015/02/15/stan-3/). The paper with Becky Passonneau is really interesting and has motivated me to go in a similar direction.

I also looked at the code at GitHub and found somewhere hidden a reference to a file named "dawid-skene-hier.stan" - the file itself is not on GitHub. Would you be willing to share the code for your implementation of a hierarchical Dawid-Skene model?

Many thanks again!

Marko

> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+unsubscribe@googlegroups.com.

> To post to this group, send email to stan-...@googlegroups.com.

> For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.

To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/G1nPD2TbZ08/unsubscribe.

To unsubscribe from this group and all its topics, send an email to stan-users+unsubscribe@googlegroups.com.

Mar 14, 2017, 1:16:12 PM3/14/17

to Stan users mailing list

Dear Bob, dear list,

Thank you again for the reference to the Dawid and Skene model. It provides a very helpful framework for my problem.

I work in communication research, where a typical content analysis usually works in two steps. First, you train the annotators and let them classify the same test units. The Dawid and Skene model can then be used to estimate the category proportions (pi_test in the following model code) and the misclassification matrices for all annotators (theta). Second, each unit in the actual study sample is only classified by one annotator.

My idea is to use the estimate of misclassification in the test data to correct the estimate of the category proportions in the main study (pi_study) - with the necessary assumption that misclassification in the test and in the actual study are the same. I have added the parts of the study data to the model and it seems to work well with simulated data. However, it is still quite likely that a made a stupid mistake somewhere along the way.

Any feedback on the model is highly appreciated.

Thank you!

Marko

Adapted Dawid and Skene model (from p. 220 of the manual, my changes in bold red):

`data {// test & study int<lower=2> K; // number of categories int<lower=1> J; // number of annotators vector<lower=0>[K] beta[K]; // dirichlet prior on misclassification matrix theta// test int<lower=1> I_test; // number of coding units in test data int<lower=1> N_test; // number of codings in test data`**// Data set in long format with indices for annotator j and coding unit i; allows for unbalanced number of annotations per coding unit**** int<lower=1,upper=K> y_test[N_test]; // codings in test data**** int<lower=1,upper=I_test> ii_test[N_test]; // index coding units in test data**** int<lower=1,upper=J> jj_test[N_test]; // index annotators in test data** vector<lower=0>[K] alpha_test; // dirichlet hyperprior on category prevalence in test data**// study; data for the main study where only on annotator annotates each coding unit; format is the same as above**** int<lower=1> N_study; // number of codings in study data**** int<lower=1,upper=N_study> ii_study[N_study]; // index coding units in study data**** int<lower=1,upper=J> jj_study[N_study]; // index coder in study data**** vector<lower=0>[K] alpha_study; // dirichlet prior on category prevalence in study data****// study**** int<lower=1,upper=K> y_study[N_study]; // codings in study data**}parameters { simplex[K] pi_test; simplex[K] theta[J, K];

** simplex[K] pi_study_obs; // category proportions in study data without correction; estimated for comparison**** simplex[K] pi_study_cor; // category proportions in study data corrected by misclassification theta**}transformed parameters { vector[K] log_q_z[I_test];** vector[K] log_q_z2[N_study]; // for study data**

for (i in 1:I_test) log_q_z[i] = log(pi_test); for (n in 1:N_test) for (k in 1:K) log_q_z[ii_test[n], k] = log_q_z[ii_test[n], k] + log(theta[jj_test[n], k, y_test[n]]);

** for (n in 1:N_study) // for study data; theta as in test data above**** log_q_z2[n] = log(pi_study_cor);**** for (n in 1:N_study)**** for (k in 1:K)**** log_q_z2[ii_study[n], k] = log_q_z2[ii_study[n], k] + log(theta[jj_study[n],k, y_study[n]]);**}model { pi_test ~ dirichlet(alpha_test);** pi_study_cor ~ dirichlet(alpha_study);** for (j in 1:J) for (k in 1:K) theta[j, k] ~ dirichlet(beta[k]); for (i in 1:I_test) target += log_sum_exp(log_q_z[i]);** for (n in 1:N_study)**** target += log_sum_exp(log_q_z2[n]);** ** y_study ~ categorical(pi_study_obs);**}**generated quantities {**

// Apply softmax to generate probabilities from log_q_z(2)** vector[K] softmax_lqz[I_test]; // probability of each k for each coding unit in test data**** vector[K] softmax_lqz2[N_study]; // probability of each k for each coding unit in study data**** for (i in 1:I_test)**** softmax_lqz[i] = softmax(log_q_z[i, ]);**** for (n in 1:N_study)**** softmax_lqz2[n] = softmax(log_q_z2[n, ]);****}**

Mar 14, 2017, 2:23:28 PM3/14/17

to stan-...@googlegroups.com

> On Mar 14, 2017, at 1:16 PM, Marko <marko...@gmail.com> wrote:

>

> Dear Bob, dear list,

> Thank you again for the reference to the Dawid and Skene model. It provides a very helpful framework for my problem.

> I work in communication research, where a typical content analysis usually works in two steps. First, you train the annotators and let them classify the same test units. The Dawid and Skene model can then be used to estimate the category proportions (pi_test in the following model code) and the misclassification matrices for all annotators (theta). Second, each unit in the actual study sample is only classified by one annotator.

> My idea is to use the estimate of misclassification in the test data to correct the estimate of the category proportions in the main study (pi_study) - with the necessary assumption that misclassification in the test and in the actual study are the same.

on all the data and parameters jointly. It's perfectly OK for

there to be only one (or even no) annotators per item. That's the only way

to properly propagate your uncertainty and it has the side

benefit of using all of your data efficiently.

> I have added the parts of the study data to the model and it seems to work well with simulated data. However, it is still quite likely that a made a stupid mistake somewhere along the way.

squinting at it.

You should be able to vectorize many of those loops. We still need

to go back and update all of our old model examples.

I need to write the Dawid and Skene model up with the Raykar et al.

JMLR model (it trains a logistic regression for prevalence rather

than just having a categorical distribution---or in machine learning

terms, trains a model for the problem and the annotation model at

the same time).

- Bob

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu