Anyone used Stan for implementing a Hierarchical Dirichlet Process mixture model?


Sandhya Prabhakaran

Jul 20, 2016, 11:28:53 AM
to Stan users mailing list
Hi all,

We are fairly new to Stan/RStan and would like to implement our Hierarchical Dirichlet Process Mixture model (HDPMM) in Stan.

Our current input data matrix is around 5000 cells (rows) and 1000 genes (columns). Our aim is to cluster these cells based on gene coexpression (i.e., gene covariances), where the covariances are cell-cluster specific. This makes it an iterative clustering and normalisation procedure.
The plate model we are referring to is on page 5 (Figure 4) here: http://jmlr.org/proceedings/papers/v48/prabhakaran16.pdf
(Also attached). The model is an HDPMM with a Gaussian likelihood. We further scale this Gaussian in both its moments using two scaling parameters, \alpha and \beta.

Questions.
1. Has anyone done a DPMM in Stan before? Is it complicated to write down our HDPMM in Stan? I see that the Stan manual (Section 14) discusses Gaussian process regression.

2. The current input data matrix of 5000 cells and 1000 genes takes around 4-5 hours to converge in naive R (or Matlab). If we can write our model in Stan, we are hoping to speed this up, since Stan uses Hamiltonian Monte Carlo. Is this reasoning right?

3. If we achieve the speed-up in (2), we then want to scale the model to handle around 100K cells and 20K genes. Would this scale-up be achievable via Stan?

4. We are also considering (Stan's) Variational Bayes. 

Our plan is to start writing the HDPMM in Stan as soon as possible, and we are soliciting any inputs/thoughts on what we need to look at or may have overlooked. Thank you very much.

Regards,
Sandhya
prabhakaran16.pdf

Bob Carpenter

Jul 20, 2016, 3:33:27 PM
to stan-...@googlegroups.com
I can never follow plate diagrams for mixtures.

We can't literally code DPs because Stan uses a fixed
number of parameters. Also, most applications of DPs have
very multimodal posteriors where Bayesian inference
fails (that is, you can't integrate over the posterior).

I doubt Stan's going to be able to scale up to a Dirichlet
of size 20K with 100K observations.

By scaling moments, you just mean normal(mu, sigma) instead
of normal(0, 1)?

- Bob

P.S. I'm very interested in this particular problem and
have some data from a collaborator at Columbia. But I'm
not going to be back in NY until 2 November.
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <prabhakaran16.pdf>

Sandhya Prabhakaran

Jul 20, 2016, 4:04:22 PM
to Stan users mailing list
Dear Bob,

Thank you for responding.

Would this be true with Stan even if we have posteriors that have closed-forms (i.e. they are analytically tractable)?

It is good to know that Stan might have issues in scaling. We are really interested in both speeding up and scaling our model. Would you then suggest directly trying Variational Bayes sans Stan?

Yes, and by scaling moments we mean normal(\alpha \cdot \mu, \beta \cdot \sigma).

Regards,
Sandhya

(We are at Columbia too so let us keep in touch. Does your data also involve iterative clustering and normalisation?)

Bob Carpenter

Jul 20, 2016, 4:49:58 PM
to stan-...@googlegroups.com

> normal(\alpha \cdot \mu, \beta \cdot \sigma).

That's not identified, so you won't be able to sample with it,
regardless of whether alpha, mu, beta, and sigma are scalars
or vectors.
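To spell out the non-identifiability: only the products \alpha\mu and \beta\sigma enter the density, so the factors can trade off freely. For any c > 0:

```latex
\mathrm{normal}\bigl(x \,\big|\, \alpha\mu,\; \beta\sigma\bigr)
  \;=\;
\mathrm{normal}\bigl(x \,\big|\, (c\alpha)(\mu/c),\; (c\beta)(\sigma/c)\bigr)
```

so the data cannot distinguish (\alpha, \mu, \beta, \sigma) from any rescaled version of it.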

Yes, if you need scalability and speed for this kind of problem,
you'll need a custom variational solution.

You might be able to use something like Edward:

http://edwardlib.org

which Dustin Tran and crew are developing. You might want to
just track Dustin down (don't know his summer plans) and talk
to him about it, or talk to Alp Kucukelbir or Dave Blei if you
can get ahold of him.

- Bob

terrance savitsky

Jul 20, 2016, 5:21:09 PM
to stan-...@googlegroups.com
DPs (and associated mixtures with a DP measure) may be parameterized by unique "locations" and cluster assignments that avoid the multimodality to which you refer, and they are the bread-and-butter of modern Bayesian estimation (along with other related BNP approaches, such as predictor-indexed DPs and other normalized continuous random measures). bread. and. butter. I've successfully implemented a version of a truncated DP mixture in Stan that marginalizes over the mixture indicators, as suggested in the Stan manual, and assigns a Dirichlet(alpha/K, ..., alpha/K) prior to the K x 1 vector of mixture probabilities. This construction contracts on a DP mixture in the limit as K grows. I've also constructed an HDP mixture in a similar fashion, but wasn't sure I performed the math correctly.
Anyway, if one's inferential focus is on the parameters on which the sparsity-inducing mixture prior is imposed, then this truncated clustering prior approach works well.
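For concreteness, the truncated construction described above can be sketched in Stan. This is a minimal univariate version with Gaussian kernels and illustrative names, not the actual HDP code:

```stan
data {
  int<lower=1> N;
  int<lower=1> K;              // truncation level
  real y[N];
  real<lower=0> alpha;         // DP concentration parameter
}
parameters {
  simplex[K] pi_z;             // mixture probabilities
  ordered[K] mu;               // ordered to mitigate label switching
  real<lower=0> sigma[K];
}
model {
  // symmetric Dirichlet(alpha/K, ..., alpha/K) prior;
  // contracts on a DP mixture as K grows
  pi_z ~ dirichlet(rep_vector(alpha / K, K));
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 2.5);
  // mixture indicators marginalized out
  for (n in 1:N) {
    real lp[K];
    for (k in 1:K)
      lp[k] = log(pi_z[k]) + normal_lpdf(y[n] | mu[k], sigma[k]);
    target += log_sum_exp(lp);
  }
}
```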

terrance.

--
Thank you, Terrance Savitsky

Michael Betancourt

Jul 20, 2016, 5:39:49 PM
to stan-...@googlegroups.com
There’s also theoretical work showing that such truncations are
arbitrarily accurate, so they should work well in practice.

Andrew Gelman

Jul 20, 2016, 8:03:26 PM
to stan-...@googlegroups.com
Sounds like it could be worth including in the manual, as questions about this model come up from time to time.
A

On Jul 20, 2016, at 5:21 PM, terrance savitsky <tds...@gmail.com> wrote:

Bob Carpenter

Jul 21, 2016, 12:48:50 AM
to stan-...@googlegroups.com
If someone gives me a working example, I'm happy to
include it in the manual. Or anyone can add it with
a pull request.

- Bob

Sandhya Prabhakaran

Jul 21, 2016, 12:02:01 PM
to Stan users mailing list
Thank you everyone for sharing your insights.

We shall work then on Variational Bayes and also get in touch with Dustin regarding Edward.

Regards,
Sandhya

terrance savitsky

Jul 21, 2016, 2:58:41 PM
to stan-...@googlegroups.com
I've got one.  I just have to find it and then I will submit it.

Andrew Gelman

Jul 21, 2016, 8:19:29 PM
to stan-...@googlegroups.com
Hi, just to be clear:  if you write the model in Stan you can fit it right away using NUTS (i.e., full Bayes), and I recommend you start with that, i.e. start with a model that is not too large and fit full Bayes.  This should be faster and more scalable than any Gibbs sampler implementation you might have.  At that point you can run Stan using ADVI (that is, variational Bayes); no additional programming is required.  But I recommend you first do the full Bayes on a reasonable-sized problem, then do ADVI and compare.  You can do it all within Stan.
A

Andrew Gelman

Jul 21, 2016, 10:46:10 PM
to stan-...@googlegroups.com, terrance savitsky
Perhaps Terrance Savitsky can supply the example in detail, as he says he's successfully implemented it?
A

Sandhya Prabhakaran

Dec 21, 2016, 11:27:12 AM
to Stan users mailing list
Hi all,

I now have a Stan implementation of the Hierarchical DPMM based on the paper here: http://jmlr.org/proceedings/papers/v48/prabhakaran16.pdf
The code works for data having 3K observations (rows) and 25 features (cols). If I increase the features beyond 25, I get this error message:

Initialization partially from source failed. Error transforming variable Sigma: factor_cov_matrix failed. Error in sampler$call_sampler(c(args, dotlist)) :

The R version I work with is 3.3.2.
I am also attaching the .stan code (BISCUIT_ADVI_mv5_1.stan) and the R script that calls it (BISCUIT_STAN_vb_6_1.R).

Is there something I have overlooked while installing the software, or is this a coding-related error?

Regards,
Sandhya
BISCUIT_ADVI_mv5_1.stan
BISCUIT_STAN_vb_6_1.R

Bob Carpenter

Dec 22, 2016, 10:13:49 AM
to stan-...@googlegroups.com
You want to use a Cholesky factor parameterization of
a covariance matrix. Then it'll require some footwork to
do the modifications you do:

GaussLogLike[k] = log (pi_z[k]) + multi_normal_lpdf(X[n]|
((alpha[n]*sum_alpha)*mu[k])/sum(alpha),
((beta[n]*sum_beta)*Sigma[k])/sum(beta));

The more I read these messages, the more I think our users
must have defective space keys on their keyboards! I'd write this as:

gauss_ll[k] = log(pi_z[k])
  + multi_normal_lpdf(X[n] | alpha[n] * sum_alpha / sum(alpha) * mu[k],
                             beta[n] * sum_beta / sum(beta) * Sigma[k]);

The reason I reordered is that it's better to do all the scalar
arithmetic together to reduce the number of scalar * matrix multiplies.
Arithmetic operators are all left associative
(i.e., (a * b / c) == ((a * b) / c)).

Presumably, contrary to the naming, sum_alpha != sum(alpha) --- otherwise
just drop them all. You'll have to figure out how to calculate the Cholesky
factor of beta[n] * sum_beta * Sigma[k] / sum(beta) from the Cholesky factor
of Sigma.
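One way to see it (assuming L_Sigma denotes the Cholesky factor of Sigma[k]): for a positive scalar c, c * Sigma = (sqrt(c) * L)(sqrt(c) * L)', so the scaled Cholesky factor is just sqrt(c) times the original. A sketch:

```stan
// Assumed names: L_Sigma is the Cholesky factor of Sigma[k],
// so Sigma[k] = L_Sigma * L_Sigma'.
real c = beta[n] * sum_beta / sum(beta);    // positive scalar scale
matrix[D, D] L_scaled = sqrt(c) * L_Sigma;  // Cholesky factor of c * Sigma[k]
```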

I don't understand why you'd want to overparameterize like this, but the
above is what you'll need to do to make it efficient.

We generally use a scaled correlation matrix with an LKJ prior on the correlation
matrix.

- Bob

Sandhya Prabhakaran

Jan 10, 2017, 3:53:49 PM
to Stan users mailing list
Dear Bob,
Thank you for these comments. The updated Stan file is attached.

1. I have tidied up the GaussLogLike[k], as per your suggestion. We wish to account for cell (=observation)-wise scalings and therefore have alphas and betas as scalings for the mean and covariance of the Gaussian, respectively.

2. sum_alpha and sum(alpha) are different variables: sum_alpha is fixed, whereas sum(alpha) is the sum of all current alphas.

3. I have now Cholesky decomposed (beta[n] * sum_beta * Sigma[k] / sum(beta)) to chol_Sigma_k. And Sigma[k] is constructed from a correlation matrix (=H_prime_inv_corr) in the 'transformed parameters' block. I was not successful in applying the LKJ prior over H_prime_inv_corr and therefore I pass H_prime_inv_corr into Stan (via the data{}).

The code now handles 1000 cells and more than 25 genes. No errors, and there is convergence. Currently I have this running on 1000 cells and 50 genes. Would you have any suggestions for optimising the current code?
My next step is to get the code to handle 3000 cells and 500 genes.

Regards,
Sandhya
BISCUIT_ADVI_mv5_2.stan

Bob Carpenter

Jan 12, 2017, 5:20:45 PM
to stan-...@googlegroups.com
Aha! Since you posted this, I've been studying your paper with Azizi, Carr, and Pe'er
because I'm also looking into modeling cell-level RNA-seq.

More below now that I understand the model.


> On Jan 10, 2017, at 3:53 PM, Sandhya Prabhakaran <sandhyapr...@gmail.com> wrote:
>
> Dear Bob,
> Thank you for these comments. The updated Stan file is attached.
>
> 1. I have tidied up the GaussLogLike[k], as per your suggestion. We wish to account for cell (=observation)-wise scalings and therefore have alphas and betas as scalings for the mean and covariance of the Gaussian, respectively.

Now I get what you're trying to do. You have different counts
in the correlated multinomial you're trying to approximate.

The way I would go about this (other than working on the natural
scale rather than this approximation), would be to have a single
scaling fixed by the data that applies to both the mean and
the covariance matrix.

> 2. Both sum_alpha and sum(alpha) are different variables. sum_alpha is fixed and sum(alpha) is the sum of all current alphas.
>
> 3. I have now Cholesky decomposed (beta[n] * sum_beta * Sigma[k] / sum(beta)) to chol_Sigma_k. And Sigma[k] is constructed from a correlation matrix (=H_prime_inv_corr) in the 'transformed parameters' block. I was not successful in applying the LKJ prior over H_prime_inv_corr and therefore I pass H_prime_inv_corr into Stan (via the data{}).

What happened? You want to use a Cholesky factor of a correlation matrix
as the data type, then give it an LKJ-Cholesky prior, then scale the
mean and the Cholesky factor, and you should be good to go. That way,
you work on the standard deviation scale, which has the same units as the mean.

Otherwise, you seem to have too many degrees of freedom to fit this well.


> The code now takes in 1000 cells and more than 25 genes. No errors, there is convergence. Currently I have this running on 1000 cells and 50 genes. Would you have any suggestions for optimising the current code?

Neat. Can you send the whole current program?

How long does it take and what's the resulting n_eff? Do multiple
chains converge to the same solution?

- Bob

Bob Carpenter

Jan 12, 2017, 5:24:17 PM
to stan-...@googlegroups.com
Never mind --- saw that you attached the program at the bottom.
I'll take a look and get back to you as soon as I can.

- Bob

> On Jan 10, 2017, at 3:53 PM, Sandhya Prabhakaran <sandhyapr...@gmail.com> wrote:
>
> <BISCUIT_ADVI_mv5_2.stan>

Bob Carpenter

Jan 13, 2017, 12:48:54 PM
to stan-...@googlegroups.com
I was talking to Michael Betancourt, who was suggesting that
you use independent Poisson approximations to the large-count
multinomial rather than a multinormal, because the covariance
vanishes in the multinomial as the count goes up (intuitively, when
there are only a few counts, you get negative correlations by
construction, which dampen as you get more independent counts).

Apparently, it's what physicists have been doing for years. There
are a gazillion hits for a search on Google, but I don't know which
one to recommend---some look very theoretical. This one seems clear:

https://www.jstor.org/stable/3314676

If N is the number of trials and theta the simplex of category
probabilities, and we have

X ~ multinomial(N, theta)

then

E[X[k]] = N * theta[k]

and we can approximate

multinomial(X | N, theta)

~approx~

PROD_k Poisson(X[k] | N * theta[k])

In vectorized Stan notation, that'd just be

X ~ poisson(N * theta);

The other advantage is that it's much more efficient to compute K independent
Poissons than a multivariate normal, so the scaling should be much better.
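Put together as a minimal Stan program (a sketch for a single cell; names are illustrative):

```stan
data {
  int<lower=1> K;          // number of categories (genes)
  int<lower=0> X[K];       // observed counts for one cell
  int<lower=1> N_trials;   // total count for the cell
}
parameters {
  simplex[K] theta;
}
model {
  // approximates X ~ multinomial(N_trials, theta)
  X ~ poisson(N_trials * theta);
}
```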

- Bob

Sandhya Prabhakaran

Jan 13, 2017, 2:24:59 PM
to stan-...@googlegroups.com
Dear Bob,

Thanks for getting back.

Why we used the Gaussian likelihood: Although we get a count matrix (with cells as rows and genes as columns), we log transform the data so that we can work in the continuous (Gaussian) space. This was further vetted by using the Lilliefors test (to confirm the Gaussianity assumption) on real data.

The reason we use separate scalings for the moments was also dictated by the data. (In our paper, Section 2, paragraph titled ‘Motivations for cell-specific scalings’). The \alphas that scale the means capture the library-size variation (rowsums of the data matrix) whereas the \betas that scale the covariances capture larger noise in cells that have a smaller library size.

Regarding the Poisson likelihood: I shall go through the 'Poisson approximation to multinomial' paper. So here you mean that we use a Poisson likelihood instead of a multivariate Gaussian. Yes, I agree and am up for converting the model; this seems easier as well. However, our current aim is to get at least the model we presented at ICML to run on the data of 3000 cells x 500 genes in Stan. We do have R/Matlab code that implements the MCMC/Gibbs sampler, but it was taking much longer to converge, and therefore we decided to go down the Stan route.

Runtime: 1000 cells x 15 genes takes less than an hour. The inferred labels are quite close to the true labels. By ‘multiple chains converging’ did you mean ‘multiple independent runs’? If so, yes, the independent runs do give similar clustering results.
I am currently running 3000 cells x 25 genes and this has been running for > 12 hours.
Do you think it is going to be hard to scale the current model to handle 500 or more genes? I understand that we can scale the number of cells using the subsampling that ADVI supports (via the minibatch scaling factor). I have not tried this yet.

Let me code up the model again with the LKJ prior.




Sandhya Prabhakaran

Jan 15, 2017, 12:39:12 PM
to Stan users mailing list
Some more updates.

Please find here the entire code: https://github.com/sandhya212/BISCUIT_ICML_2016

I have modified the previous .stan file - check BISCUIT_ADVI.stan. I have tested the code on 1000 cells x 20 genes. Clusters seem to be identified OK. I have not been able to run the code on the entire data (3000 cells and 500 genes); it takes > 24 hours, so I end up killing the run.

My major concern at this point is that I have not implemented the sampling of the covariances correctly. I would really appreciate suggestions on this.

Here is what I do now. I input into Stan’s data{} the input data’s covariance (Cholesky-factorised): L_Sigma_dprime, and its correlation: Sigma_dprime_corr. In the transformed parameters{}, each cluster's covariance (Sigma) is the Cholesky-decomposed covariance constructed from the correlation matrix. Neither Sigma_dprime_corr nor sigma_prime is sampled, so the Sigma per cluster does not vary.

I should sample the Sigmas from a Wishart, but for huge matrices (500 x 500 and bigger) I believe I must reparametrise the Wishart as given in the 'Multivariate reparameterisations’ section (pages 326-328). Am I right?

Bob Carpenter

Jan 23, 2017, 4:07:04 PM
to stan-...@googlegroups.com

> On Jan 13, 2017, at 2:24 PM, Sandhya Prabhakaran <sandhyapr...@gmail.com> wrote:
>
> Dear Bob,
>
> Thanks for getting back.
>
> Why we used the Gaussian likelihood: Although we get a count matrix (with cells as rows and genes as columns), we log transform the data so that we can work in the continuous (Gaussian) space. This was further vetted by using the Lilliefors test (to confirm the Gaussianity assumption) on real data.
>
> The reason we use separate scalings for the moments was also dictated by the data. (In our paper, Section 2, paragraph titled ‘Motivations for cell-specific scalings’). The \alphas that scale the means capture the library-size variation (rowsums of the data matrix) whereas the \betas that scale the covariances capture larger noise in cells that have a smaller library size.

I saw that in the paper, but I'd think that if it were a size-related effect,
there'd only be a single size parameter controlling it.

> Regarding the Poisson likelihood: I shall go through the 'Poisson approximation to multinomial' paper. So here you mean to say that we use a Poisson likelihood instead of a multivariate Gaussian.

Yes, it should be a more reasonable multinomial approximation.

But it sounds like something else is going on, because of the
scaling.

> Yes, I agree and am up for converting the model. This should seem easier as well. Although, our current aim is to get, at least the model we presented in ICML to run with the data of 3000 cells x 500 genes in Stan. We do have R/Matlab code that implements the MCMC/Gibbs. This was taking a much longer time to converge and therefore we decided to go down the Stan route.
>
> Runtime: 1000 cells x 15 genes takes less than an hour. The inferred labels are quite close to the true labels. By ‘multiple chains converging’ did you mean ‘multiple independent runs’? If so, yes, the independent runs do give similar clustering results.

That's fantastic.

> I am currently running 3000 cells x 25 genes and this has been running for > 12 hours.
> Do you think it is going to be hard to scale the current model to handle 500 or more genes. I understand that we can scale the number of cells using the subsampling that ADVI can use (via minibatch scaling factor). I have not tried this yet.

We don't know where ADVI will work. But I'd expect the scaling
from 25 to 500 genes would take 20 times as long and I doubt
you'll want to wait two weeks for answers.

How many iterations are you running? If convergence is relatively
fast, you probably don't need 2000 iterations per chain.

> Let me code up the model again with the LKJ prior.

That should be a lot faster.

- Bob

Bob Carpenter

Jan 23, 2017, 4:17:22 PM
to stan-...@googlegroups.com

> On Jan 15, 2017, at 12:39 PM, Sandhya Prabhakaran <sandhyapr...@gmail.com> wrote:
>
> Some more updates.
>
> Please find here the entire code: https://github.com/sandhya212/BISCUIT_ICML_2016
>
> I have modified the previous .stan file - check BISCUIT_ADVI.stan. I have tested the code 1000 cells x 20 genes. Clusters seems to be identified ok. I have not been able to run the code on the entire data (3000 cells and 500 genes); it takes >24 hours so I end up killing the run.
>
> My major concern at this point is that I have not implemented the sampling of the covariances correctly. I would really appreciate suggestions on this.
>
> Here is what I do now. I input into Stan’s data{} the input data’s covariance (cholesky_factorised) : L_Sigma_dprime and correlation: Sigma_dprime_corr. In the transformed parameters{}, each clusters' covariance (Sigma) is the cholesky decomposed covariance constructed from the correlation matrix.

There's no way to do Wisharts efficiently in Stan. What we like
to do for both statistical and computational reasons is:

* LKJ prior on a parameter representing the Cholesky factor of the correlation matrix

* independent priors on the scale vector

* multiply the two to get the Cholesky factor for the covariance matrix

There's an example in the manual in the regression chapter.
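A sketch of that recipe with one covariance per mixture component (names illustrative; assumes D dimensions and K clusters declared in data):

```stan
parameters {
  cholesky_factor_corr[D] L_Omega[K];  // correlation Cholesky factor, per cluster
  vector<lower=0>[D] tau[K];           // per-dimension scales, per cluster
}
model {
  for (k in 1:K) {
    L_Omega[k] ~ lkj_corr_cholesky(2);  // > 1 concentrates near unit correlation
    tau[k] ~ cauchy(0, 2.5);
  }
  // Cholesky factor of cluster k's covariance:
  //   diag_pre_multiply(tau[k], L_Omega[k])
  // for use with multi_normal_cholesky(mu[k], diag_pre_multiply(tau[k], L_Omega[k]))
}
```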

But the huge inefficiency in what you're doing is here:


for (k in 1:K) {
  mu[k] ~ multi_normal_cholesky(mu_prime, L_Sigma_prime);
}

This should just be

mu ~ multi_normal_cholesky(mu_prime, L_Sigma_prime);

No idea what the prime business is all about, or why there's a mu
on both the left and right hand side here, but the above is the way to code it.

And for efficient and stable matrix arithmetic, you never ever want to
invert anything.

I'm also confused by your naming and why this:

sum_alpha/sum(alpha)

doesn't just work out to 1.

And there's no way to do that mixture efficiently that you have under
"MV Gaussian loglikelihood".

> Neither Sigma_dprime_corr nor sigma_prime are sampled, so the Sigma per cluster does not vary.
>
> I should sample the Sigmas from a Wishart but for huge matrices (500 x 500 and bigger), I believe I must reparametrise the Wishart as given in the 'Multivariate reparameterisations’ section (pages 326-328). Am I right?

We don't have a Cholesky-parameterized version of the Wishart, so there's
no way to make sampling for it efficient. Also, a scaled LKJ is more
natural (lets you concentrate the prior around the unit correlation matrix).

- Bob

Sandhya Prabhakaran

Jan 27, 2017, 6:15:43 PM
to stan-...@googlegroups.com, Elham Azizi
Hi Bob,

Thank you for the comments.

1. As per your suggestion, I have removed the Wishart and now have placed the LKJ prior over the cholesky factor of the correlation matrix. I have attached the code.

2. I am not sure of #1 though. Currently Lcorr is being sampled and has nothing to do with the empirical covariance (i.e., cov(data)). Is there a way to ‘link’ the empirical covariance to Lcorr? (We had used the empirical covariance in the ICML model to sample Sigma per cluster.) Am I wrong here?

3. By computing Sigma using Omega, as in the attached code, we just get one Sigma and not a Sigma per cluster. Is it not possible to infer a Sigma per cluster? I have tried declaring Omega and Sigma as matrix[D,D] Omega[K] and matrix[D,D] Sigma[K] but even if this were possible/allowed, Lcorr is still the same.

I believe my above doubts all relate to the fact that I do not see how Lcorr is related to the empirical covariance.
 
4. Having said this, the current code converges for 1000 cells and 15 genes. The MAP-based cluster assignments look good too.
For dimensions > 15, I get the following error when I call vb():

Initialization partially from source failed. Error transforming variable Lcorr: lub_free: Correlation variable is nan, but must be in the interval [-1, 1]. Error in sampler$call_sampler(c(args, dotlist)) :

Please advise.

BISCUIT_ADVI_mv5_7.stan

Bob Carpenter

Jan 27, 2017, 6:47:07 PM
to stan-...@googlegroups.com, Elham Azizi
This is giving the correlation matrix
a uniform prior:

Lcorr ~ lkj_corr_cholesky(1);

Increasing the argument > 1 concentrates prior mass around the unit
correlation matrix.

for (n in 1:N) {
  real GaussLogLike[K];
  for (k in 1:K)
    GaussLogLike[k]
      = log(pi_z[k])
        + multi_normal_cholesky_lpdf(X[n] | alpha[n] * sum_alpha / sum(alpha) * mu[k],
                                            beta[n] * sum_beta / sum(beta) * diag_pre_multiply(sigma, Lcorr));
  target += log_sum_exp(GaussLogLike);
}

You can make this multi-normal mixture much more efficient by pulling up
the shared calculations:

vector[N] a = alpha * sum_alpha / sum(alpha);
vector[N] b = beta * sum_beta / sum(beta);
real GaussLogLike[K];
for (n in 1:N) {
  for (k in 1:K)
    GaussLogLike[k]
      = log(pi_z[k])
        + multi_normal_cholesky_lpdf(X[n] | a[n] * mu[k],
                                            b[n] * diag_pre_multiply(sigma, Lcorr));
  target += log_sum_exp(GaussLogLike);
}

I don't know what cov(data) is. The point is that Sigma is
being estimated here from the data, the same way as if you
have multivariate y[n] ~ multi_normal(mu, Sigma), you
estimate mu and Sigma from the y[n]. It's tied together
through the likelihood. You can't let mu and Sigma vary by n,
because then there won't be any data to estimate them. But
that has nothing to do with Cholesky vs. Wishart.

I don't know enough about how VB works to provide help with
numerical issues there. I'm also not sure where lub_free
comes into play---that should be mapping a variable in
some finite range (l, u) back to unconstrained space. Maybe
it's used in a transform somewhere.

- Bob

Sandhya Prabhakaran

Jan 27, 2017, 9:02:21 PM
to stan-...@googlegroups.com, Elham Azizi
Thank you again, Bob.

I clubbed the scalings as you mentioned. I believe you suggested this:

GaussLogLike[k] = log(pi_z[k])
  + multi_normal_cholesky_lpdf(X[n] | a[n] * mu[k],
                               b[n] * diag_pre_multiply(sigma, Lcorr));

By cov(data), I was just referring to the covariance of the input data. And yes, I do see now how the lkj_corr_cholesky() works.




Bob Carpenter

Jan 31, 2017, 2:20:58 PM
to stan-...@googlegroups.com, Elham Azizi

> On Jan 27, 2017, at 9:01 PM, Sandhya Prabhakaran <sandhyapr...@gmail.com> wrote:
>
> Thank you again, Bob.
>
> I clubbed the scalings like you mentioned.
> I believe you suggested this: GaussLogLike[k] = log (pi_z[k]) + multi_normal_cholesky_lpdf(X[n]| a[n] * mu[k], b[n] * diag_pre_multiply(sigma, Lcorr));
>
> By cov(data), I was just referring to the covariance of the input data. And yes, I do see now how the lkj_corr_cholesky() works.

It's unusual to put statistics of the data into the model
unless they form some kind of sufficient statistic you can
exploit to simplify computations. If you make a data-dependent
prior, you lose the nice asymptotic guarantees of Bayesian
inference because you don't have a fixed prior.

- Bob