# dirichlet process mixture models

232 views

### Chris

May 19, 2017, 5:13:30 AM5/19/17
to Stan users mailing list
Hi,

I am trying to implement a mixture model of Negative Binomial regression models. I want to use the Dirichlet Process prior to estimate the number of mixture components.
I found this thread (https://groups.google.com/forum/#!topic/stan-users/d-R7J6MldkQ) with an example code for a hierarchical model, and together with the mixture model example in the docs, I came up with this code:

library(rstan)

model_string <-"
data {
int<lower=1> K; // number of mixture components
int<lower=1> N; // number of data points
int<lower=1> M; //the number of columns in the model matrix
matrix[N,M] X; //the model matrix
int y[N]; // observations
real<lower=0> alpha0 ; // dirichlet prior
}
transformed data {
vector<lower=0>[K] alpha0_vec;
for (k in 1:K){
alpha0_vec[k] = alpha0; //symmetric dirichlet prior
}
}
parameters {
simplex[K] theta; // mixing proportions
vector[M] betas[K];
real<lower=0> sigma[K]; // scales of mixture components
}
model {
theta ~ dirichlet(alpha0_vec);
sigma ~ lognormal(0, 2);
for (k in 1:K)
betas[k,] ~ normal(0,10);
for (n in 1:N) {
real lps[K];
for (k in 1:K)
lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | X[n] * betas[k], sigma[k]);
target += log_sum_exp(lps);
}
}"

# toy data
set.seed(123)
N <- 10000
phi <- 1.2
b0 <- -0.5
b1 <- 0.4
X <- data.frame(V=rep(1,N), V1=runif(N), V2=runif(N))
y <- rnbinom(N, size = phi, mu = exp(as.matrix(X) %*% c(1, b0, b1)))

# call stan
m <- stan(model_code = model_string, data = list(X=X, M=ncol(X), K=3, y = y, N = nrow(X), alpha0=0.2), chains=1, cores=1)

The results from this model are:

```> m
Inference for Stan model: eb789d3c1710fcbbff2117791762e0d8.
1 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=1000.

mean se_mean    sd      2.5%       25%       50%       75%     97.5% n_eff Rhat
theta        1.00    0.00  0.01      0.99      1.00      1.00      1.00      1.00    34 1.06
theta        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.01    27 1.04
theta        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.00   127 1.01
betas[1,1]      2.77    0.01  0.08      2.62      2.71      2.77      2.83      2.92    31 1.02
betas[1,2]     -1.21    0.02  0.09     -1.38     -1.29     -1.22     -1.14     -1.02    31 1.00
betas[1,3]      0.93    0.02  0.11      0.71      0.85      0.95      1.01      1.15    39 1.04
betas[2,1]      8.64    1.04  5.76      0.32      4.03      7.58     12.53     20.06    31 1.02
betas[2,2]      4.04    1.67  8.19     -9.68     -1.00      3.06      7.95     24.66    24 1.01
betas[2,3]      3.73    1.28  7.68    -14.85     -0.38      3.65      8.71     18.40    36 1.00
betas[3,1]      8.56    0.67  5.72      0.62      4.20      7.67     12.02     22.33    72 1.01
betas[3,2]      4.16    0.85  7.04     -8.70     -0.43      3.70      8.21     18.67    69 1.01
betas[3,3]      2.99    0.73  7.12    -10.94     -1.32      2.48      7.08     18.79    96 1.00
sigma        1.19    0.00  0.03      1.14      1.17      1.19      1.21      1.24    60 1.02
sigma        7.06    1.52 16.44      0.04      0.39      1.51      6.55     60.25   116 1.00
sigma       13.63    4.40 45.52      0.04      0.30      1.52      6.83    126.71   107 1.03
lp__       -21242.16    0.33  2.59 -21247.43 -21243.81 -21242.07 -21240.33 -21237.42    61 1.00```

The traceplot looks a bitt wobbly, but the Rhat is not so bad.

I this model specification correct?

When looking at the thetas, theta has a value of 1.0, where the other thetas have zero. Does that mean the optimal number of components is 1?
Actually that would be true here, as the data are generated from a simple NegBin.

How do I interpret the betas? betas other for components other than the first are meangingless?!

The true sigma is 1.2 (called phi in the data generation) and the estimated sigma for component 1 is 1.19, so that is perfect.
The estimated betas are way off the truth, however.

Another question, how can I add ordering to avoid label switching? In the docs it is achieved by a line
ordered mu[K]; // locations of mixture components
but how can I do it here?

I also get tons of errors like this, but I assume I can safely ignore them?

The following numerical problems occurred the indicated number of times on chain 1
count
Exception thrown at line 31: neg_binomial_2_lpmf: Precision parameter is inf, but must be finite!        2
Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.000138117, but must be >      1
Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.00013909, but must be > 0     1
...

thanks

### Ben Goodrich

May 19, 2017, 1:27:38 PM5/19/17
to Stan users mailing list
On Friday, May 19, 2017 at 5:13:30 AM UTC-4, Chris wrote:
Another question, how can I add ordering to avoid label switching? In the docs it is achieved by a line
ordered mu[K]; // locations of mixture components
but how can I do it here?

Approximate Dirichlet process priors have been known to sort of work with a large number of components but it is more than a bit dicey for Stan. To answer the specific question above, do

`parameters {  pos_ordered[K] gamma; // primitives of mixing proportions`
`   vector[M] betas[K];   real<lower=0> sigma[K]; // scales of mixture components}`
`transformed parameters {  vector[K] theta = gamma / sum(gamma);}model {  gamma ~ gamma(alpha_vec, 1); // implies: theta ~ dirichlet(alpha_vec)  ...}`

But it still might not work too well for various reasons. Mike has a case study on identifying mixture models with tricks like this.

Ben

### Bob Carpenter

May 19, 2017, 8:02:44 PM5/19/17
Ben: What do you mean by dicey? And how large? What's
the problem?

Chris: see some suggestions inline below

- Bob

P.S. This list is shutting soon and moving to discourse.
See: mc-stan.org/community

> On May 19, 2017, at 5:13 AM, Chris <wel...@gmail.com> wrote:
>
> Hi,
>
> I am trying to implement a mixture model of Negative Binomial regression models. I want to use the Dirichlet Process prior to estimate the number of mixture components.
> I found this thread (https://groups.google.com/forum/#!topic/stan-users/d-R7J6MldkQ) with an example code for a hierarchical model, and together with the mixture model example in the docs, I came up with this code:
>
>
> library(rstan)
>
> model_string <-"
> data {
> int<lower=1> K; // number of mixture components
> int<lower=1> N; // number of data points
> int<lower=1> M; //the number of columns in the model matrix
> matrix[N,M] X; //the model matrix
> int y[N]; // observations
> real<lower=0> alpha0 ; // dirichlet prior
> }
> transformed data {
> vector<lower=0>[K] alpha0_vec;

This can just be

vector<lower=0>[K] alpha0_vec = rep_vector(alpha0, K);

> for (k in 1:K){
> alpha0_vec[k] = alpha0; //symmetric dirichlet prior
> }
> }
> parameters {
> simplex[K] theta; // mixing proportions
> vector[M] betas[K];
> real<lower=0> sigma[K]; // scales of mixture components
> }
> model {
> theta ~ dirichlet(alpha0_vec);
> sigma ~ lognormal(0, 2);
> for (k in 1:K)
> betas[k,] ~ normal(0,10);
> for (n in 1:N) {
> real lps[K];
> for (k in 1:K)
> lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | X[n] * betas[k], sigma[k]);
> target += log_sum_exp(lps);
> }
> }"

I don't know what the predictors in X[n] are, but it might
make sense to have a hierarchical prior for the betas either
altogether or by predictor m in 1:M.

Otherwise, this looks fine as a negative binomial mixture.

You're combining two hard to estimate pieces---negative binomial
overidspersion and mixture models. Michael Betancourt wrote
a case study recently on mixtures illustrating some of the
difficulties (see mc-stan.org/docs).
If you run this for longer and the n_eff go up proportionally you should be OK.
You'll only find one mode, of course.

>
> I this model specification correct?
>
> When looking at the thetas, theta has a value of 1.0, where the other thetas have zero. Does that mean the optimal number of components is 1?

There's no optimization per se, but that's the posterior mean number
of components. You could print this to more decimal places and maybe
see that it's not precisely 1.0.

> Actually that would be true here, as the data are generated from a simple NegBin.

Well, that's reassuring at least.

You could also have a stronger prior (wouldn't approximate a DP, but
it may save you from having some bins not have any data).

> How do I interpret the betas? betas other for components other than the first are meangingless?!

You could run this out to more decimal places. Probably not exactly meaningless,
but only making a weak contribution. That essentially means you're using very
little data to train the components, so their posteriors should look a lot like

> The true sigma is 1.2 (called phi in the data generation) and the estimated sigma for component 1 is 1.19, so that is perfect.
> The estimated betas are way off the truth, however.

If you generated everything from one negative binomial, that's
to be expected. Are the betas for the first component that's being
used way off?

> Another question, how can I add ordering to avoid label switching? In the docs it is achieved by a line
> ordered mu[K]; // locations of mixture components
> but how can I do it here?

You'd have to do write your own ordered simplex transform, which
is a bit tricky. You could order one component of mu, like maybe
an intercept. Again, I recommend Michael's case study to get some
more understanding of how mixtures work.

> I also get tons of errors like this, but I assume I can safely ignore them?
>
> The following numerical problems occurred the indicated number of times on chain 1
> count
> Exception thrown at line 31: neg_binomial_2_lpmf: Precision parameter is inf, but must be finite! 2
> Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.000138117, but must be > 1
> Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.00013909, but must be > 0 1
> ...

You should try upping the target acceptance rate (adapt_delta = 0.95 or 0.99 or something)
and maybe starting with a lower stepsize and running more iterations.

- Bob

### Michael Betancourt

May 19, 2017, 8:06:29 PM5/19/17

Ben:  What do you mean by dicey?  And how large?  What's
the problem?

For more than two components there are a bunch more
pathologies in the posterior that obstruct exploration than
just the combinatorial multimodalities addressed in the
case study.

Ultimately these Dirichlet mixtures suffer from the same
problems as LDA, unsurprisingly.  It’s near impossible
to quantify all of the model configurations consistent
with any given data set.

### Bob Carpenter

May 20, 2017, 3:21:47 PM5/20/17
Oh, just that. I keep meaning to finish a blog post
I wrote on the topic in my negative-Nellie series---that
you can't do Bayesian inference on these models.

I just took that as a given. All the simulated data
studies I've seen have also shown that they're terrible
at actually pretty bad at inferring the exact number of
mixture components---I can't quite remember the pathologies
that come up. Here's some elaboration from Andrew (with
quotes from Aki Vehtari and Christian Robert [aka X]):

http://andrewgelman.com/2016/11/25/30605/

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

### Michael Betancourt

May 20, 2017, 4:04:25 PM5/20/17
Yeah, it turns out that exchangeability induces
a horde of nastiness even for seemingly simple
problems. And much of this is relatively new to
me as I’ve been investigating these models
(and supposed fixes) recently.

### Ben Goodrich

May 20, 2017, 4:50:40 PM5/20/17
to Stan users mailing list
On Friday, May 19, 2017 at 5:13:30 AM UTC-4, Chris wrote:
I also get tons of errors like this, but I assume I can safely ignore them?

The following numerical problems occurred the indicated number of times on chain 1
count
Exception thrown at line 31: neg_binomial_2_lpmf: Precision parameter is inf, but must be finite!        2
Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.000138117, but must be >      1
Exception thrown at line 31: neg_binomial_2_lpmf: Location parameter is -0.00013909, but must be > 0     1

I missed this the first time around, but hell no. You need to make the mean positive with

`lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | exp(X[n] * betas[k]), sigma[k]);`

### Chris

May 21, 2017, 6:03:34 AM5/21/17
to Stan users mailing list
thanks for all your input, especially Ben catching the missing exp, that eluded me.

Here is the final model definition:

model_string <-"
data {
int<lower=1> K; // number of mixture components
int<lower=1> N; // number of data points
int<lower=1> M; //the number of columns in the model matrix
matrix[N,M] X; //the model matrix
int y[N]; // observations
real<lower=0> alpha0 ; // dirichlet prior
}
transformed data {
vector<lower=0>[K] alpha0_vec = rep_vector(alpha0, K); //symmetric dirichlet prior
}
parameters {
positive_ordered[K] gamma; // primitives of mixing proportions
vector[M] betas[K];
real<lower=0> sigma[K]; // scales of mixture components
}
transformed parameters {
vector[K] theta = gamma / sum(gamma);
}
model {
gamma ~ gamma(alpha0_vec, 1); // implies: theta ~ dirichlet(alpha_vec)
sigma ~ lognormal(0, 2);
for (k in 1:K)
betas[k,] ~ normal(0,10);
for (n in 1:N) {
real lps[K];
for (k in 1:K)
lps[k] = log(theta[k]) + neg_binomial_2_lpmf(y[n] | exp(X[n] * betas[k]), sigma[k]);
target += log_sum_exp(lps);
}
}"

I increased the iterations to 3000, the model now not only finds the true number of components accurately, but also the parameter estimates are spot on

> m
Inference for Stan model: 0f129c7ab3800a3be1c846ab001a3559.
3 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.

mean se_mean    sd      2.5%       25%       50%       75%     97.5% n_eff Rhat
gamma        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.00  2179 1.00
gamma        0.00    0.00  0.01      0.00      0.00      0.00      0.00      0.01  1618 1.00
gamma        0.61    0.02  0.80      0.00      0.08      0.32      0.81      2.75  2745 1.00
betas[1,1]     -1.15    0.18  9.69    -20.62     -7.42     -1.19      5.38     18.04  3000 1.00
betas[1,2]     -0.77    0.19  9.95    -20.54     -7.46     -1.01      6.02     19.01  2748 1.00
betas[1,3]     -0.52    0.19  9.97    -20.18     -7.45     -0.62      6.35     18.50  2783 1.00
betas[2,1]     -4.28    0.17  8.90    -22.30     -9.96     -4.42      0.79     14.98  2659 1.00
betas[2,2]     -2.02    0.17  9.17    -20.26     -8.11     -1.89      4.04     16.10  3000 1.00
betas[2,3]     -1.91    0.17  9.40    -20.60     -7.97     -1.80      4.16     16.99  2905 1.00
betas[3,1]      1.02    0.00  0.03      0.96      1.00      1.02      1.04      1.07  1930 1.00
betas[3,2]     -0.49    0.00  0.04     -0.57     -0.52     -0.49     -0.46     -0.42  2439 1.00
betas[3,3]      0.37    0.00  0.04      0.29      0.34      0.37      0.39      0.45  2234 1.00
sigma        7.72    1.16 51.72      0.02      0.25      0.99      3.88     41.54  2005 1.00
sigma        7.15    1.07 41.69      0.02      0.23      0.99      3.87     43.96  1506 1.00
sigma        1.20    0.00  0.03      1.15      1.18      1.20      1.22      1.27  1693 1.00
theta        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.00  1949 1.00
theta        0.00    0.00  0.00      0.00      0.00      0.00      0.00      0.02   788 1.01
theta        1.00    0.00  0.00      0.98      1.00      1.00      1.00      1.00   811 1.01
lp__       -21240.09    0.11  3.02 -21246.78 -21241.87 -21239.85 -21237.92 -21234.94   822 1.00

for comparison, the true sigma is 1.20, the estimated sigma is also 1.20,
the true betas are 1.0, -0.5, 0.4, and the estimates are 1.02, -0.49, 0.37

I uploaded the traceplots here if you're interested, they look perfect, no label switching: