Ideas on fitting Dirichlet Process Mixture models in Stan


sepehr akhavan

unread,
Mar 27, 2016, 3:39:25 PM3/27/16
to Stan users mailing list
Hi everyone,

I read all the messages posted in the user group about Dirichlet Process models in Stan, and I completely understand that Stan uses HMC and hence is not able to sample discrete parameters yet. I needed to find a way to implement DP and DPM models, and I thought of two solutions:

  • idea 1: I wrote my own DP sampling code in R. I then combined my code with my Stan model in a Gibbs sampling fashion, where all parameters are sampled in Stan except beta0. Then, given the sampled parameters, I use my own code to sample beta0. I put this sampling into a loop from 1 to num-of-iterations, and in each iteration I call Stan with iter = 2 and warmup = 1 along with my own code. The code seems to work and converges. Would this idea make sense to you?


  • idea 2: The Stan manual has an example of using LDA for topic modeling where the authors propose marginalizing over the discrete parameter. With Latent Dirichlet Allocation we can choose a fixed number of mixture components to approximate the DP, so I used the same idea to approximate the DP. The issue is that my code is super slow. I fixed the number of mixture components at 30 and ran the code for 300 subjects (this basically leads to two nested loops with 30*300 iterations). As the manual highlights for mixture models, we should avoid vectorizing these models. Is there any way to make the code faster? What do you think? (A minimal Stan sketch of the marginalization follows below.)
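
For concreteness, here is roughly the marginalized finite mixture I have in mind, with subject intercepts drawn from one of K normal components. This is only a sketch: the names (theta, mu, beta1) are illustrative and not exactly my attached code.

data {
  int<lower=1> N;                // subjects
  int<lower=1> K;                // fixed number of components (e.g., 30)
  real<lower=0> alpha;           // DP concentration
  vector[N] x;
  vector[N] y;
}
parameters {
  simplex[K] theta;              // mixture weights
  vector[K] mu;                  // component means for the intercepts
  real beta1;
  real<lower=0> sigma;
}
model {
  theta ~ dirichlet(rep_vector(alpha / K, K));  // finite approximation to the DP
  mu ~ normal(0, 10);                           // base measure G0
  for (n in 1:N) {
    vector[K] lps = log(theta);
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k] + beta1 * x[n], sigma);
    target += log_sum_exp(lps);                 // marginalize the assignment
  }
}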

Attached are my .stan and my .R code for idea 2 and for the model below:

Y_i = beta0_i + beta_1 * X_i + epsilon_i
epsilon_i ~ Normal(0, sigma^2)

--- Priors:
beta0_i ~ G
G ~ DP(alpha, G0 = Normal(0, 10))

beta_1 ~ Normal()
sigma^2 ~ flat prior

Thanks very much for your help,
-Sepehr

simpleReg_DP.R
simpleReg_DP.stan

Bob Carpenter

unread,
Mar 27, 2016, 10:47:00 PM3/27/16
to stan-...@googlegroups.com

> On Mar 27, 2016, at 3:39 PM, sepehr akhavan <akhavan...@gmail.com> wrote:
>
> Hi everyone,
>
> I read all the messages posted in the user group about Dirichlet Process models in Stan, and I completely understand that Stan uses HMC and hence is not able to sample discrete parameters yet.

The problem isn't the discrete parameters, it's that the
DP has different numbers of parameters each iteration.

> I needed to find a way to implement DP and DPM models, and I thought of two solutions:
>
>
> • idea 1: I wrote my own DP sampling code in R. I then combined my code with my Stan model in a Gibbs sampling fashion, where all parameters are sampled in Stan except beta0. Then, given the sampled parameters, I use my own code to sample beta0. I put this sampling into a loop from 1 to num-of-iterations, and in each iteration I call Stan with iter = 2 and warmup = 1 along with my own code. The code seems to work and converges. Would this idea make sense to you?

You have to balance the number of iterations in Stan with
the autocorrelation time; too few iterations and you don't
move far enough, too many and you do too much work.


> • idea 2: The Stan manual has an example of using LDA for topic modeling where the authors propose marginalizing over the discrete parameter. With Latent Dirichlet Allocation we can choose a fixed number of mixture components to approximate the DP, so I used the same idea to approximate the DP. The issue is that my code is super slow. I fixed the number of mixture components at 30 and ran the code for 300 subjects (this basically leads to two nested loops with 30*300 iterations). As the manual highlights for mixture models, we should avoid vectorizing these models. Is there any way to make the code faster? What do you think?

This is a common approach (it's one of the later BUGS examples).
There's not a way to make the code faster.
And given that the posterior is so multimodal, you can't use
the usual convergence diagnostics.

You might try something like variational approximation
or even penalized maximum likelihood.

I'm afraid I didn't follow the combination of regression and
something labeled LDA in the program, or in the math below --- I
don't see how the betas connect to anything.

Are you trying to do a DP mixture of regressions?

- Bob


> Attached are my .stan and my .R code for idea 2 and for the model below:
>
> Y_i = beta0_i + beta_1 * X_i + epsilon_i
> epsilon_i ~ Normal(0, sigma^2)
>
> --- Priors:
> beta0_i ~ G
> G ~ DP(alpha, G0 = Normal(0, 10))
>
> beta_1 ~ Normal()
> sigma^2 ~ flat prior
>
> Thanks very much for your help,
> -Sepehr
>
>
> <simpleReg_DP.R><simpleReg_DP.stan>

Allen B. Riddell

unread,
Mar 28, 2016, 9:06:51 AM3/28/16
to stan-...@googlegroups.com
On 03/27, Bob Carpenter wrote:
>
> > On Mar 27, 2016, at 3:39 PM, sepehr akhavan <akhavan...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I read all the messages posted in the user group about Dirichlet Process models in Stan, and I completely understand that Stan uses HMC and hence is not able to sample discrete parameters yet.
>
> The problem isn't the discrete parameters, it's that the
> DP has different numbers of parameters each iteration.
>

A truncated DP or Pitman-Yor Process mixture would be easy (if slow) for
Stan, right? IIRC, both can be expressed with a Dirichlet distribution.
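
For the truncated DP I was picturing the stick-breaking construction, which in Stan might look something like this sketch (K is the truncation level, alpha the concentration; Beta stick-breaks rather than a single Dirichlet, if I recall the construction right; untested):

data {
  int<lower=1> K;                     // truncation level
  real<lower=0> alpha;                // DP concentration
}
parameters {
  vector<lower=0, upper=1>[K - 1] v;  // stick-breaking fractions
}
transformed parameters {
  simplex[K] theta;                   // truncated-DP mixture weights
  {
    real rest = 1;                    // length of stick remaining
    for (k in 1:(K - 1)) {
      theta[k] = v[k] * rest;
      rest = rest * (1 - v[k]);
    }
    theta[K] = rest;
  }
}
model {
  v ~ beta(1, alpha);                 // GEM(alpha) prior on the breaks
}

A Pitman-Yor version would, I believe, just replace the prior with v[k] ~ beta(1 - d, alpha + k * d) inside a loop.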

Bob Carpenter

unread,
Mar 28, 2016, 1:33:01 PM3/28/16
to stan-...@googlegroups.com
Right --- that's what we're talking about --- putting an upper
limit on the number of categories.

- Bob

sepehr akhavan

unread,
Mar 28, 2016, 1:44:24 PM3/28/16
to Stan users mailing list


On Sunday, March 27, 2016 at 7:47:00 PM UTC-7, Bob Carpenter wrote:

> On Mar 27, 2016, at 3:39 PM, sepehr akhavan <akhavan...@gmail.com> wrote:
>
> Hi everyone,
>
> I read all the messages posted in the user group about Dirichlet Process models in Stan, and I completely understand that Stan uses HMC and hence is not able to sample discrete parameters yet.

The problem isn't the discrete parameters, it's that the
DP has different numbers of parameters each iteration.

That's true. But I guess the number of components is itself a (discrete) parameter. If we could sample that parameter, implementing DP would be very easy.
 

> I needed to find a way to implement DP and DPM models, and I thought of two solutions:
>
>
>         • idea 1: I wrote my own DP sampling code in R. I then combined my code with my Stan model in a Gibbs sampling fashion, where all parameters are sampled in Stan except beta0. Then, given the sampled parameters, I use my own code to sample beta0. I put this sampling into a loop from 1 to num-of-iterations, and in each iteration I call Stan with iter = 2 and warmup = 1 along with my own code. The code seems to work and converges. Would this idea make sense to you?

You have to balance the number of iterations in Stan with
the autocorrelation time;  too few iterations and you don't
move far enough, too many and you do too much work. 

That's awesome. In my results I saw big autocorrelation in some parameters, and I knew I wanted to increase the number of iterations in the Stan model; however, I was reluctant, as I thought I might undermine the Markov chain theory by having more iterations on some parameters. I'll definitely give it a shot. Thanks for the idea.


>         • idea 2: The Stan manual has an example of using LDA for topic modeling where the authors propose marginalizing over the discrete parameter. With Latent Dirichlet Allocation we can choose a fixed number of mixture components to approximate the DP, so I used the same idea to approximate the DP. The issue is that my code is super slow. I fixed the number of mixture components at 30 and ran the code for 300 subjects (this basically leads to two nested loops with 30*300 iterations). As the manual highlights for mixture models, we should avoid vectorizing these models. Is there any way to make the code faster? What do you think?

This is a common approach (it's one of the later BUGS examples).
There's not a way to make the code faster.
And given that the posterior is so multimodal, you can't use
the usual convergence diagnostics.

You might try something like variational approximation
or even penalized maximum likelihood.

I'm afraid I didn't follow the combination of regression and
something labeled LDA in the program, or in the math below --- I
don't see how the betas connect to anything.

Are you trying to do a DP mixture of regressions?

This was a simple example that I wrote to show my corresponding Stan code. You may consider subjects with random intercepts (intercepts are subject-specific), as in linear mixed-effects models. Instead of putting a specific prior on these subject-specific intercepts (usually a normal distribution in LME models), I wanted to relax the distributional assumption. Therefore, I put a G prior on them, where G itself has a Dirichlet process prior. And yes, the result is basically a mixture of normal distributions mixed over the mean.

Bob Carpenter

unread,
Mar 28, 2016, 2:13:02 PM3/28/16
to stan-...@googlegroups.com

> On Mar 28, 2016, at 1:44 PM, sepehr akhavan <akhavan...@gmail.com> wrote:
>
>
>
> On Sunday, March 27, 2016 at 7:47:00 PM UTC-7, Bob Carpenter wrote:
>
> > On Mar 27, 2016, at 3:39 PM, sepehr akhavan <akhavan...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I read all the messages posted in the user group about Dirichlet Process models in Stan, and I completely understand that Stan uses HMC and hence is not able to sample discrete parameters yet.
>
> The problem isn't the discrete parameters, it's that the
> DP has different numbers of parameters each iteration.
>
> That's true. But I guess the number of components is itself a (discrete) parameter. If we could sample that parameter, implementing DP would be very easy.

The number of components isn't directly a parameter, but it's close.
Even the conjugate Gibbs samplers for these have problems with
multimodality, and Gibbs can be very slow to mix.


>
>
> > I needed to find a way to implement DP and DPM models, and I thought of two solutions:
> >
> >
> > • idea 1: I wrote my own DP sampling code in R. I then combined my code with my Stan model in a Gibbs sampling fashion, where all parameters are sampled in Stan except beta0. Then, given the sampled parameters, I use my own code to sample beta0. I put this sampling into a loop from 1 to num-of-iterations, and in each iteration I call Stan with iter = 2 and warmup = 1 along with my own code. The code seems to work and converges. Would this idea make sense to you?
>
> You have to balance the number of iterations in Stan with
> the autocorrelation time; too few iterations and you don't
> move far enough, too many and you do too much work.
>
> That's awesome. In my results I saw big autocorrelation in some parameters, and I knew I wanted to increase the number of iterations in the Stan model; however, I was reluctant, as I thought I might undermine the Markov chain theory by having more iterations on some parameters. I'll definitely give it a shot. Thanks for the idea.

You just have to be careful with detailed balance. But as long as
the parameters are being treated the same way each time, that's usually
easy. You can do all sorts of tricks with Gibbs like sampling some parameters
more frequently than others and still get the right answer asymptotically.

>
>
> > • idea 2: The Stan manual has an example of using LDA for topic modeling where the authors propose marginalizing over the discrete parameter. With Latent Dirichlet Allocation we can choose a fixed number of mixture components to approximate the DP, so I used the same idea to approximate the DP. The issue is that my code is super slow. I fixed the number of mixture components at 30 and ran the code for 300 subjects (this basically leads to two nested loops with 30*300 iterations). As the manual highlights for mixture models, we should avoid vectorizing these models. Is there any way to make the code faster? What do you think?
>
> This is a common approach (it's one of the later BUGS examples).
> There's not a way to make the code faster.
> And given that the posterior is so multimodal, you can't use
> the usual convergence diagnostics.
>
> You might try something like variational approximation
> or even penalized maximum likelihood.
>
> I'm afraid I didn't follow the combination of regression and
> something labeled LDA in the program, or in the math below --- I
> don't see how the betas connect to anything.
>
> Are you trying to do a DP mixture of regressions?
>
> This was a simple example that I wrote to show my corresponding Stan code. You may consider subjects with random intercepts (intercepts are subject-specific), as in linear mixed-effects models. Instead of putting a specific prior on these subject-specific intercepts (usually a normal distribution in LME models), I wanted to relax the distributional assumption. Therefore, I put a G prior on them, where G itself has a Dirichlet process prior. And yes, the result is basically a mixture of normal distributions mixed over the mean.

I see.

To get back to your earlier question, there's also the mixture of
mixtures. That'll let you fit up to a max number of components, but will
"select" the right number of components for you. There's a nice arXiv
paper I saw on this recently (referenced from Christian Robert's blog):

http://arxiv.org/pdf/1502.06241v1.pdf
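
The key ingredient there is just a sparse symmetric Dirichlet prior on the mixture weights, with concentration well below 1 so the extra components empty out. Roughly, in Stan (the value of e0 is illustrative):

data {
  int<lower=1> K;      // generous upper bound on the number of components
  real<lower=0> e0;    // small, e.g., 0.01; sparse when e0 < 1
}
parameters {
  simplex[K] theta;    // mixture weights
}
model {
  theta ~ dirichlet(rep_vector(e0, K));  // sparse symmetric Dirichlet
}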

- Bob

sepehr akhavan

unread,
Mar 31, 2016, 4:00:47 PM3/31/16
to Stan users mailing list
Hi Bob,
I'm following your suggestion to run part of my Gibbs sampler (the Stan portion) for more iterations. As a reminder of our conversation: I was trying to build a DP mixture model where part of the sampling happens in Stan and the DP sampling uses my own code. I had put the whole process into a loop from 1 to nIter, and you suggested that the Stan code could run for more iterations. Initially I had one iteration (i.e., iter = 2 and warmup = 1) per loop iteration. To show this "schematically", my model is something like:

for (i in 1:nIter) {

# 1) My own code samples a parameter from the "DP"!

# 2) Given the sampled "DP" parameter, run my Stan code for 1 iteration (i.e., iter = 2 and warmup = 1) with initial values = the sampled values from the last iteration

# 3) Update the initial values for DP and Stan to the current sampled values from (1) and (2)
}
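
Filled in a little more concretely, the loop is roughly the following R; sample_beta0_DP and last_draw stand for my own functions and are not real rstan calls:

library(rstan)

# first call compiles the model; later calls reuse it via fit =
fit <- stan("simpleReg_DP.stan",
            data = c(stan_data, list(beta0 = beta0)),
            chains = 1, iter = 2, warmup = 1)

for (i in 1:nIter) {
  # 1) my own R function: one draw of beta0 from the DP, given the rest
  beta0 <- sample_beta0_DP(beta0, pars, y, X)

  # 2) one Stan iteration for the remaining parameters, initialized
  #    at the previous iteration's draws
  fit <- stan(fit = fit,
              data = c(stan_data, list(beta0 = beta0)),
              chains = 1, iter = 2, warmup = 1,
              init = list(pars))

  # 3) carry the current draws forward as the next initial values
  pars <- last_draw(fit)  # placeholder: extract the final draw from fit
}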

You suggested that the Stan portion could be sampled for more than 1 iteration. Some examples are:

1) iter = 20, warmup = 1 (at every step of Gibbs, I sample from my DP once and from my Stan code 20 times)
2) iter = 20, warmup = 10
3) iter = 20, warmup = 19

I have some questions about what's being done under the hood in Stan during the warmup period. If I set warmup = 1, sampling is fast, but I get warnings about divergent samples. For bigger warmups (10 or 19), sampling is slower; however, it seems I get more converging samples. So:

a) Is the tuning of the sampler's tuning parameters done during warmup? (I think the answer is yes.)
b) Among the three options above, which one would you suggest for the current application?
c) When I pass the previous fit through the fit argument, I understand that Stan does not recompile the model and uses the already compiled one. But does Stan also use the same tuning parameters that were adapted in the previous call, or does every call to stan (regardless of compilation) tune its own?

Thanks very much for your help,
-Sepehr


Bob Carpenter

unread,
Mar 31, 2016, 5:22:20 PM3/31/16
to stan-...@googlegroups.com
Yes, tuning is done during warmup. We'll make this easier in the future by letting
you initialize the mass matrix as well as the step size.

And no, Stan does not (yet) reuse the tuning parameters when calling
again with fit.

Stan's not really set up to run the way you want right now, so you
have to run many more iterations. Given the way warmup works,
we really expect around 100 warmup iterations at a minimum --- with
much less, you're not going to get a good variance estimate.
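
If you want to experiment in the meantime, you can pull the adapted step size out of an earlier fit and pass it back in with adaptation turned off. A rough RStan sketch (there's no analogous way to pass the mass matrix yet):

# step size at the end of a previously adapted fit
sp <- get_sampler_params(fit, inc_warmup = FALSE)[[1]]
step <- tail(sp[, "stepsize__"], 1)

# reuse it, with adaptation off, in the short per-Gibbs-step calls
fit <- stan(fit = fit, data = stan_data, chains = 1,
            iter = 2, warmup = 1,
            control = list(adapt_engaged = FALSE, stepsize = step))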

- Bob