
May 19, 2017, 5:13:30 AM

to Stan users mailing list

Hi,

I am trying to implement a mixture model of Negative Binomial regression models. I want to use a Dirichlet process prior to estimate the number of mixture components.

I found this thread (https://groups.google.com/forum/#!topic/stan-users/d-R7J6MldkQ) with an example code for a hierarchical model, and together with the mixture model example in the docs, I came up with this code:

library(rstan)

model_string <- "
data {
  int<lower=1> K;        // number of mixture components
  int<lower=1> N;        // number of data points
  int<lower=1> M;        // number of columns in the model matrix
  matrix[N, M] X;        // model matrix
  int y[N];              // observations
  real<lower=0> alpha0;  // Dirichlet prior concentration
}
transformed data {
  vector<lower=0>[K] alpha0_vec;
  for (k in 1:K) {
    alpha0_vec[k] = alpha0;  // symmetric Dirichlet prior
  }
}
parameters {
  simplex[K] theta;        // mixing proportions
  vector[M] betas[K];
  real<lower=0> sigma[K];  // scales of mixture components
}
model {
  theta ~ dirichlet(alpha0_vec);
  sigma ~ lognormal(0, 2);
  for (k in 1:K)
    betas[k] ~ normal(0, 10);
  for (n in 1:N) {
    real lps[K];
    for (k in 1:K)
      lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | X[n] * betas[k], sigma[k]);
    target += log_sum_exp(lps);
  }
}"

# toy data
set.seed(123)
N <- 10000
phi <- 1.2
b0 <- -0.5
b1 <- 0.4
X <- data.frame(V = rep(1, N), V1 = runif(N), V2 = runif(N))
y <- rnbinom(N, size = phi, mu = exp(as.matrix(X) %*% c(1, b0, b1)))

# call stan
m <- stan(model_code = model_string,
          data = list(X = X, M = ncol(X), K = 3, y = y, N = nrow(X), alpha0 = 0.2),
          chains = 1, cores = 1)

The results from this model are:

> m
Inference for Stan model: eb789d3c1710fcbbff2117791762e0d8.
1 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.

                mean se_mean    sd      2.5%       25%       50%       75%     97.5% n_eff Rhat
theta[1]        1.00    0.00  0.01      0.99      1.00      1.00      1.00      1.00    34 1.06
theta[2]        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.01    27 1.04
theta[3]        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.00   127 1.01
betas[1,1]      2.77    0.01  0.08      2.62      2.71      2.77      2.83      2.92    31 1.02
betas[1,2]     -1.21    0.02  0.09     -1.38     -1.29     -1.22     -1.14     -1.02    31 1.00
betas[1,3]      0.93    0.02  0.11      0.71      0.85      0.95      1.01      1.15    39 1.04
betas[2,1]      8.64    1.04  5.76      0.32      4.03      7.58     12.53     20.06    31 1.02
betas[2,2]      4.04    1.67  8.19     -9.68     -1.00      3.06      7.95     24.66    24 1.01
betas[2,3]      3.73    1.28  7.68    -14.85     -0.38      3.65      8.71     18.40    36 1.00
betas[3,1]      8.56    0.67  5.72      0.62      4.20      7.67     12.02     22.33    72 1.01
betas[3,2]      4.16    0.85  7.04     -8.70     -0.43      3.70      8.21     18.67    69 1.01
betas[3,3]      2.99    0.73  7.12    -10.94     -1.32      2.48      7.08     18.79    96 1.00
sigma[1]        1.19    0.00  0.03      1.14      1.17      1.19      1.21      1.24    60 1.02
sigma[2]        7.06    1.52 16.44      0.04      0.39      1.51      6.55     60.25   116 1.00
sigma[3]       13.63    4.40 45.52      0.04      0.30      1.52      6.83    126.71   107 1.03
lp__       -21242.16    0.33  2.59 -21247.43 -21243.81 -21242.07 -21240.33 -21237.42    61 1.00

The traceplot looks a bit wobbly, but the Rhat is not so bad.

Is this model specification correct?

When looking at the thetas, theta[1] has a value of 1.0, while the other thetas are zero. Does that mean the optimal number of components is 1?

Actually that would be true here, as the data are generated from a simple NegBin.

How do I interpret the betas? Are the betas for components other than the first meaningless?!

The true sigma is 1.2 (called phi in the data generation) and the estimated sigma for component 1 is 1.19, so that is perfect.

The estimated betas are way off the truth, however.

Another question, how can I add ordering to avoid label switching? In the docs it is achieved by a line

ordered[K] mu; // locations of mixture components

but how can I do it here?

I also get tons of errors like this, but I assume I can safely ignore them?

The following numerical problems occurred the indicated number of times on chain 1

count

Exception thrown at line 31: neg_binomial_2_lpmf: Precision parameter is inf, but must be finite! 2

Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.000138117, but must be > 0 1

Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.00013909, but must be > 0 1

...

thanks

May 19, 2017, 1:27:38 PM

to Stan users mailing list

On Friday, May 19, 2017 at 5:13:30 AM UTC-4, Chris wrote:

Another question, how can I add ordering to avoid label switching? In the docs it is achieved by a line `ordered[K] mu; // locations of mixture components`, but how can I do it here?

Approximate Dirichlet process priors have been known to sort of work with a large number of components but it is more than a bit dicey for Stan. To answer the specific question above, do

parameters {
  positive_ordered[K] gamma;   // primitives of mixing proportions
  vector[M] betas[K];
  real<lower=0> sigma[K];      // scales of mixture components
}
transformed parameters {
  vector[K] theta = gamma / sum(gamma);
}
model {
  gamma ~ gamma(alpha0_vec, 1); // implies: theta ~ dirichlet(alpha0_vec)
  ...
}
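(A quick aside, not in the original reply: you can check the gamma-to-Dirichlet construction numerically in R, since normalizing independent Gamma(alpha, 1) draws yields a Dirichlet(alpha) vector.)

# Sanity check (illustrative): normalized Gamma(alpha0, 1) draws are Dirichlet(alpha0).
K <- 3
alpha0 <- 0.2
draws <- t(replicate(1e4, {
  g <- rgamma(K, shape = alpha0, rate = 1)
  g / sum(g)
}))
colMeans(draws)  # each component mean should be near 1/K for a symmetric prior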

But it still might not work too well for various reasons. Mike has a case study on identifying mixture models with tricks like this.

Ben

May 19, 2017, 8:02:44 PM

to stan-...@googlegroups.com

Ben: What do you mean by dicey? And how large? What's the problem?

Chris: see some suggestions inline below

- Bob

P.S. This list is shutting down soon and moving to Discourse. See: mc-stan.org/community

> On May 19, 2017, at 5:13 AM, Chris <wel...@gmail.com> wrote:

>

> Hi,

>

> I am trying to implement a mixture model of Negative Binomial regression models. I want to use the Dirichlet Process prior to estimate the number of mixture components.

> I found this thread (https://groups.google.com/forum/#!topic/stan-users/d-R7J6MldkQ) with an example code for a hierarchical model, and together with the mixture model example in the docs, I came up with this code:

>

>

> library(rstan)

>

> model_string <- "
> data {
>   int<lower=1> K; // number of mixture components
>   int<lower=1> N; // number of data points
>   int<lower=1> M; // number of columns in the model matrix
>   matrix[N, M] X; // model matrix
>   int y[N]; // observations
>   real<lower=0> alpha0; // Dirichlet prior concentration
> }
> transformed data {
>   vector<lower=0>[K] alpha0_vec;

This can just be

vector<lower=0>[K] alpha0_vec = rep_vector(alpha0, K);

>   for (k in 1:K) {
>     alpha0_vec[k] = alpha0; // symmetric Dirichlet prior
>   }
> }

> parameters {
>   simplex[K] theta; // mixing proportions
>   vector[M] betas[K];
>   real<lower=0> sigma[K]; // scales of mixture components
> }
> model {
>   theta ~ dirichlet(alpha0_vec);
>   sigma ~ lognormal(0, 2);
>   for (k in 1:K)
>     betas[k] ~ normal(0, 10);
>   for (n in 1:N) {
>     real lps[K];
>     for (k in 1:K)
>       lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | X[n] * betas[k], sigma[k]);
>     target += log_sum_exp(lps);
>   }
> }"

I don't know what the predictors in X[n] are, but it might make sense to have a hierarchical prior for the betas, either altogether or by predictor m in 1:M.
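A by-predictor version might look something like this (a minimal sketch of my own, not from the thread; the hyperprior scales are arbitrary illustrations):

parameters {
  vector[M] mu_beta;             // per-predictor population location
  vector<lower=0>[M] tau_beta;   // per-predictor population scale
  vector[M] betas[K];
  // ... remaining parameters as before
}
model {
  mu_beta ~ normal(0, 5);        // illustrative hyperprior
  tau_beta ~ normal(0, 2.5);     // half-normal via the lower=0 constraint
  for (k in 1:K)
    betas[k] ~ normal(mu_beta, tau_beta);
  // ... likelihood as before
}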

Otherwise, this looks fine as a negative binomial mixture. You're combining two hard-to-estimate pieces---negative binomial overdispersion and mixture models. Michael Betancourt wrote a case study recently on mixtures illustrating some of the difficulties (see mc-stan.org/docs).

If you run this for longer and the n_eff go up proportionally, you should be OK. You'll only find one mode, of course.

>

> Is this model specification correct?

>

> When looking at the thetas, theta[1] has a value of 1.0, while the other thetas are zero. Does that mean the optimal number of components is 1?

There's no optimization per se, but that's the posterior mean number of components. You could print this to more decimal places and maybe see that it's not precisely 1.0.
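For example (my illustration; rstan's print method for stanfit objects accepts pars and digits_summary arguments):

print(m, pars = "theta", digits_summary = 4)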

> Actually that would be true here, as the data are generated from a simple NegBin.

Well, that's reassuring at least.

You could also have a stronger prior (wouldn't approximate a DP, but it may save you from having some bins not have any data).

> How do I interpret the betas? Are the betas for components other than the first meaningless?!

You could run this out to more decimal places. Probably not exactly meaningless, but only making a weak contribution. That essentially means you're using very little data to train the components, so their posteriors should look a lot like your priors.

> The true sigma is 1.2 (called phi in the data generation) and the estimated sigma for component 1 is 1.19, so that is perfect.

> The estimated betas are way off the truth, however.

If you generated everything from one negative binomial, that's to be expected. Are the betas for the first component, the one that's being used, way off?

> Another question, how can I add ordering to avoid label switching? In the docs it is achieved by a line

> ordered[K] mu; // locations of mixture components

> but how can I do it here?

You'd have to write your own ordered simplex transform, which is a bit tricky. You could order one component of mu, like maybe an intercept. Again, I recommend Michael's case study to get some more understanding of how mixtures work.
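A minimal sketch of the ordered-intercept idea (my illustration, not from the thread; it assumes column 1 of X is the intercept column):

parameters {
  ordered[K] beta0;            // ordered intercepts, one per component
  vector[M - 1] beta_rest[K];  // remaining coefficients, unordered
  // ... other parameters as before
}
model {
  // In the likelihood loop, the linear predictor for component k becomes
  //   beta0[k] + X[n, 2:M] * beta_rest[k]
  // ... rest as before
}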

> I also get tons of errors like this, but I assume I can safely ignore them?

>

> The following numerical problems occurred the indicated number of times on chain 1

> count

> Exception thrown at line 31: neg_binomial_2_lpmf: Precision parameter is inf, but must be finite! 2

> Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.000138117, but must be > 0 1

> Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.00013909, but must be > 0 1

> ...

You should try upping the target acceptance rate (adapt_delta = 0.95 or 0.99 or something) and maybe starting with a lower stepsize and running more iterations.
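In rstan those are passed via the control argument, e.g. (illustrative values; stan_data stands in for the data list from above):

m <- stan(model_code = model_string, data = stan_data,
          chains = 4, iter = 4000,
          control = list(adapt_delta = 0.99, stepsize = 0.01))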

- Bob


May 19, 2017, 8:06:29 PM

to stan-...@googlegroups.com

Ben: What do you mean by dicey? And how large? What's the problem?

For more than two components there are a bunch more pathologies in the posterior that obstruct exploration than just the combinatorial multimodalities addressed in the case study.

Ultimately these Dirichlet mixtures suffer from the same problems as LDA, unsurprisingly. It’s near impossible to quantify all of the model configurations consistent with any given data set.

May 20, 2017, 3:21:47 PM

to stan-...@googlegroups.com

Oh, just that. I keep meaning to finish a blog post I wrote on the topic in my negative-Nellie series---that you can't do Bayesian inference on these models.

I just took that as a given. All the simulated-data studies I've seen have also shown that they're actually pretty bad at inferring the exact number of mixture components---I can't quite remember the pathologies that come up. Here's some elaboration from Andrew (with quotes from Aki Vehtari and Christian Robert [aka X]):

http://andrewgelman.com/2016/11/25/30605/

- Bob


May 20, 2017, 4:04:25 PM

to stan-...@googlegroups.com

Yeah, it turns out that exchangeability induces a horde of nastiness even for seemingly simple problems. And much of this is relatively new to me as I’ve been investigating these models (and supposed fixes) recently.


May 20, 2017, 4:50:40 PM

to Stan users mailing list

On Friday, May 19, 2017 at 5:13:30 AM UTC-4, Chris wrote:

I also get tons of errors like this, but I assume I can safely ignore them?

The following numerical problems occurred the indicated number of times on chain 1
count
Exception thrown at line 31: neg_binomial_2_lpmf: Precision parameter is inf, but must be finite! 2
Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.000138117, but must be > 0 1
Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.00013909, but must be > 0 1

I missed this the first time around, but hell no. You need to make the mean positive with

`lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | exp(X[n] * betas[k]), sigma[k]);`
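(Worth noting, though not mentioned in the thread: Stan also provides neg_binomial_2_log_lpmf, which takes the linear predictor directly on the log scale and avoids the explicit exp:)

lps[k] = log(theta[k]) + neg_binomial_2_log_lpmf(y[n] | X[n] * betas[k], sigma[k]);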

May 21, 2017, 6:03:34 AM

to Stan users mailing list

Thanks for all your input, especially Ben for catching the missing exp; that eluded me.

Here is the final model definition:

model_string <- "
data {
  int<lower=1> K;        // number of mixture components
  int<lower=1> N;        // number of data points
  int<lower=1> M;        // number of columns in the model matrix
  matrix[N, M] X;        // model matrix
  int y[N];              // observations
  real<lower=0> alpha0;  // Dirichlet prior concentration
}
transformed data {
  vector<lower=0>[K] alpha0_vec = rep_vector(alpha0, K); // symmetric Dirichlet prior
}
parameters {
  positive_ordered[K] gamma;   // primitives of mixing proportions
  vector[M] betas[K];
  real<lower=0> sigma[K];      // scales of mixture components
}
transformed parameters {
  vector[K] theta = gamma / sum(gamma);
}
model {
  gamma ~ gamma(alpha0_vec, 1); // implies: theta ~ dirichlet(alpha0_vec)
  sigma ~ lognormal(0, 2);
  for (k in 1:K)
    betas[k] ~ normal(0, 10);
  for (n in 1:N) {
    real lps[K];
    for (k in 1:K)
      lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | exp(X[n] * betas[k]), sigma[k]);
    target += log_sum_exp(lps);
  }
}"

I increased the iterations to 3000, and the model now not only finds the true number of components accurately, but the parameter estimates are also spot on:

> m
Inference for Stan model: 0f129c7ab3800a3be1c846ab001a3559.
3 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.

                mean se_mean    sd      2.5%       25%       50%       75%     97.5% n_eff Rhat
gamma[1]        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.00  2179 1.00
gamma[2]        0.00    0.00  0.01      0.00      0.00      0.00      0.00      0.01  1618 1.00
gamma[3]        0.61    0.02  0.80      0.00      0.08      0.32      0.81      2.75  2745 1.00
betas[1,1]     -1.15    0.18  9.69    -20.62     -7.42     -1.19      5.38     18.04  3000 1.00
betas[1,2]     -0.77    0.19  9.95    -20.54     -7.46     -1.01      6.02     19.01  2748 1.00
betas[1,3]     -0.52    0.19  9.97    -20.18     -7.45     -0.62      6.35     18.50  2783 1.00
betas[2,1]     -4.28    0.17  8.90    -22.30     -9.96     -4.42      0.79     14.98  2659 1.00
betas[2,2]     -2.02    0.17  9.17    -20.26     -8.11     -1.89      4.04     16.10  3000 1.00
betas[2,3]     -1.91    0.17  9.40    -20.60     -7.97     -1.80      4.16     16.99  2905 1.00
betas[3,1]      1.02    0.00  0.03      0.96      1.00      1.02      1.04      1.07  1930 1.00
betas[3,2]     -0.49    0.00  0.04     -0.57     -0.52     -0.49     -0.46     -0.42  2439 1.00
betas[3,3]      0.37    0.00  0.04      0.29      0.34      0.37      0.39      0.45  2234 1.00
sigma[1]        7.72    1.16 51.72      0.02      0.25      0.99      3.88     41.54  2005 1.00
sigma[2]        7.15    1.07 41.69      0.02      0.23      0.99      3.87     43.96  1506 1.00
sigma[3]        1.20    0.00  0.03      1.15      1.18      1.20      1.22      1.27  1693 1.00
theta[1]        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.00  1949 1.00
theta[2]        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.02   788 1.01
theta[3]        1.00    0.00  0.00      0.98      1.00      1.00      1.00      1.00   811 1.01
lp__       -21240.09    0.11  3.02 -21246.78 -21241.87 -21239.85 -21237.92 -21234.94   822 1.00

For comparison, the true sigma is 1.20 and the estimated sigma is also 1.20; the true betas are 1.0, -0.5, 0.4, and the estimates are 1.02, -0.49, 0.37.
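(If you want to make that comparison programmatically, the posterior means can be pulled straight from the fit; a small illustrative snippet:)

est <- summary(m, pars = c("betas", "sigma"))$summary[, "mean"]
round(est, 2)  # compare against the true values c(1.0, -0.5, 0.4) and phi = 1.2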

I uploaded the traceplots here if you're interested; they look perfect, no label switching.
