How well does Stan fit an LDA model now?


Zichao Li

Apr 24, 2016, 1:50:33 PM
to Stan users mailing list
I am trying to implement the LDA model using this example code: https://github.com/stan-dev/example-models/blob/master/misc/cluster/lda/lda.stan. But the progress bar always stops at 0%, even though the data set I use is just a small 100 * 500 document-term matrix. Have you tried this code with a real data set? And what should the data look like? Thank you in advance!

Bob Carpenter

Apr 24, 2016, 2:20:14 PM
to stan-...@googlegroups.com
That's not going to work on any reasonable real data because it
assumes each of the documents has exactly the same number of tokens.

If everything's very dense, then it's more efficient to just
take the logs once.
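
A minimal sketch of what "take the logs once" could look like, assuming the usual marginalized LDA likelihood in which each token contributes log sum_k theta[d][k] * phi[k][w]. The function names and the 0-based indexing are illustrative choices, not taken from the example model.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def lda_log_lik(theta, phi, w, doc):
    """Marginal log likelihood of the tokens, with topics summed out.

    theta[d][k]: topic proportions of document d
    phi[k][v]:   word distribution of topic k
    w[n], doc[n]: word id and document id of token n (0-based here)
    """
    # Take the logs once, outside the loop over tokens.
    log_theta = [[math.log(p) for p in row] for row in theta]
    log_phi = [[math.log(p) for p in row] for row in phi]
    num_topics = len(phi)
    total = 0.0
    for word, d in zip(w, doc):
        total += log_sum_exp(
            [log_theta[d][k] + log_phi[k][word] for k in range(num_topics)]
        )
    return total
```

Without the precomputation, every token would recompute one log per topic for theta and one for phi, which adds up when the corpus has many tokens.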

But overall, Stan's not going to be superfast at fitting an
LDA model and you can't really fit an LDA model in a Bayesian
way anyway. So we don't really recommend Stan for any kind of
large-scale LDA work.

- Bob


Bob Carpenter

Apr 24, 2016, 2:21:49 PM
to Bob Carpenter, stan-...@googlegroups.com
I looked more closely at that, and it's not actually making
the assumption that every doc's the same size. But it's
still not going to be very efficient.

- Bob

Dustin Tran

Apr 24, 2016, 2:42:14 PM
to stan-...@googlegroups.com, Bob Carpenter
We’ve gotten ADVI to somewhat work, in that it progresses and converges to a reasonable approximate posterior mean. But the bottleneck to using LDA in Stan in practice over, say, lda-c, is that there’s no distinction between global and local latent variables. This means it carries around all the associated latent variables during each iteration of inference, which is too slow.
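
A rough sketch of the scale involved, assuming the standard LDA parameterization with one topic-word simplex per topic (global) and one document-topic simplex per document (local). The helper name and the corpus sizes are hypothetical, chosen only for illustration.

```python
def lda_param_counts(num_docs, num_topics, vocab_size):
    """Count simplex entries for global (phi) and local (theta) variables."""
    return {
        "global_phi": num_topics * vocab_size,  # shared across all documents
        "local_theta": num_docs * num_topics,   # one block per document
    }

# Illustrative sizes (assumed, not from a real run).
counts = lda_param_counts(num_docs=100_000, num_topics=10, vocab_size=20_000)
print(counts)
```

The local count grows with the number of documents, so an algorithm that updates every local variable on every iteration does more work as the corpus grows, even when each step only needs a few documents.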

Dustin

Zichao Li

Apr 25, 2016, 7:36:58 AM
to stan-...@googlegroups.com
Hi Bob and Dustin,

Thanks for the reply!

But what if I'd like to test it on a toy data set with, say, just 100 documents and 1000 terms? Why did you say that the code mentioned above assumes each document has the same number of tokens? As I read it, it just loops over all the word instances in one long list rather than over a doc-term matrix. Thanks!



Vivek Kulkarni

Apr 25, 2016, 9:52:29 AM
to Stan users mailing list
Hi Everyone,


Does that imply that ADVI in Stan is not recommended for LDA on large datasets (even when I use stochastic gradient updates)? I have a slight variant of LDA I want to implement, and Stan's ADVI support looked perfect for that.

I tried to fit LDA on a dataset of 100K documents with a vocabulary of 20K words. I used the minibatch approach to ADVI described here (https://github.com/stan-dev/stan/blob/adsvi/how_to_ADSVI.md), but even after running for 50K iterations with tol_rel_obj=0.0001, and with the algorithm reporting convergence, the document-topic proportions (the thetas) are pretty much uniform, hence the question above. For reference, I used a batch size of 100 and asked for 10 topics.

./lda_adsvi variational iter=50000 tol_rel_obj=0.0001 subsample=1 data file=lda_real_minibatch_5.data output file=lda_real_minibatch_5.csv

Best,
Vivek

Bob Carpenter

Apr 25, 2016, 10:19:19 AM
to stan-...@googlegroups.com
Sorry for the initial confusion --- I tried to correct
it. You're right that it uses the long form data, so doesn't
assume every doc is the same length.
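
As a hedged illustration of the long-form layout, here is a small sketch that flattens a dense document-term count matrix into one entry per token. The key names M, V, N, w, and doc are assumed to match the data block of the example model; the model would also need the number of topics and the prior vectors, which are not built here.

```python
def to_long_form(dtm):
    """Flatten a document-term count matrix into per-token arrays.

    Ids are 1-based because Stan arrays are indexed from 1.
    """
    w, doc = [], []
    for d, row in enumerate(dtm, start=1):
        for v, count in enumerate(row, start=1):
            w.extend([v] * count)    # word id, repeated once per occurrence
            doc.extend([d] * count)  # document id for each of those tokens
    return {"M": len(dtm), "V": len(dtm[0]), "N": len(w), "w": w, "doc": doc}

# Two documents of different lengths over a three-word vocabulary.
dtm = [[2, 0, 1],
       [0, 3, 0]]
data = to_long_form(dtm)
```

Because each token carries its own document id, documents can have different lengths without any padding.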

No idea about how well ADVI is going to work. I didn't even
know there was a way for users to get at the minibatch code.

- Bob
