online LDA and topic drift

Seth Boyles

Sep 17, 2012, 3:55:06 PM
to gen...@googlegroups.com
In the LDA tutorial (http://radimrehurek.com/gensim/wiki.html#latent-sematic-analysis), it says "LDA is not truly online (the name of the [3] article notwithstanding), as the impact of later updates on the model gradually diminishes." I'm not sure what this means, since the tutorial only goes on to explain that topic drift will cause the algorithm to take longer. Does it actually affect the correctness of the model, compared with running batch LDA? If so, how?

Thanks,
Seth

Radim Řehůřek

Sep 17, 2012, 4:11:42 PM
to gensim
Hello Seth,

What this means is that the "online LDA" algorithm by Hoffman et al.
(which is implemented in gensim) is "online" only in the sense of
"incremental". You can update the model incrementally with new
documents, which refines the model and moves it toward convergence.

But later batches have gradually less and less effect on the model, so
if you build a model over 100k docs and later update it with another
100k docs, the impact of the second batch will be much smaller. If
documents in both batches come from the same distribution (i.e. no
topic drift), this is fine, and the order of the two batches doesn't
matter much (essentially not at all, in the limit of complete
convergence). But if there is topic drift, the order matters: the
resulting model will reflect statistical patterns from the first batch
more prominently than those from the second.

For more info, see the `kappa` and `tau` parameters from the article
"Online Learning for Latent Dirichlet Allocation".

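To make the role of those two parameters concrete: in the Hoffman et al.
paper, the t-th mini-batch is blended into the topic parameters lambda
with weight rho_t = (tau_0 + t)^(-kappa), where kappa in (0.5, 1] is
required for convergence. The following is a minimal plain-Python sketch
of that schedule, not gensim's actual code; the function names and the
example values tau0=1.0, kappa=0.7 are illustrative assumptions. Recent
gensim versions appear to expose these as the `decay` (kappa) and
`offset` (tau_0) arguments of `LdaModel`, but check the documentation
for your version.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Weight rho_t given to the t-th mini-batch (t = 0, 1, 2, ...)."""
    return (tau0 + t) ** (-kappa)


def online_update(lam, lam_hat, t, tau0=1.0, kappa=0.7):
    """Blend the current parameters `lam` with the estimate `lam_hat`
    computed from mini-batch t alone:
        lam <- (1 - rho_t) * lam + rho_t * lam_hat
    Both arguments are plain lists of floats in this toy version.
    """
    rho = step_size(t, tau0, kappa)
    return [(1.0 - rho) * old + rho * new for old, new in zip(lam, lam_hat)]


if __name__ == "__main__":
    # The weight of each new batch shrinks as more batches are seen.
    for t in (0, 9, 99, 999):
        print(t, step_size(t))
```

A larger kappa makes the weights shrink faster, and a larger tau_0
damps the influence of the earliest batches.
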
HTH,
Radim

