On Feb 20, 8:37 am, PaulR <p...@rudin.co.uk> wrote:
> What does the proportion of documents converging tell me when trying
> to train an LDA model?
The online mini-batch training doesn't wait for complete
convergence. Rather, it iterates only until the variational parameters
stop changing much, where "much" is by default
self.VAR_THRESH==0.001. In degenerate cases this could take a long
time, so there is also a "force switch" that stops inference after
self.VAR_MAXITER==50 iterations, even if gamma is still changing.
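To make the two knobs concrete, here is a toy sketch of the per-document stopping rule (not gensim's actual inference code; `converge_doc` and the fixed-point updates are illustrative stand-ins):

```python
import numpy as np

VAR_THRESH = 0.001   # stop once gamma changes less than this per iteration
VAR_MAXITER = 50     # "force switch": give up after this many iterations

def converge_doc(update_step, gamma0):
    """Iterate a per-document update until gamma stops changing "much",
    or VAR_MAXITER iterations pass. Returns (gamma, converged_flag)."""
    gamma = gamma0
    for _ in range(VAR_MAXITER):
        new_gamma = update_step(gamma)
        if np.mean(np.abs(new_gamma - gamma)) < VAR_THRESH:
            return new_gamma, True  # this document counts as "converged"
        gamma = new_gamma
    return gamma, False  # hit the force switch, counts as "not converged"

# A well-behaved update (a contraction) converges quickly...
gamma, ok = converge_doc(lambda g: 0.5 * g + 1.0, np.array([10.0]))
# ...while a slowly drifting one trips the force switch instead.
gamma2, ok2 = converge_doc(lambda g: g + 0.01, np.array([0.0]))
```

The "651/1000 converged" message is simply counting how many documents in the batch returned with the flag set to True.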
> At the end of a run (200k documents) I see "651/1000 documents
> converged with 50 iterations". Presumably this is not a good thing -
> we should be hoping for 1000/1000 or something close?
Yes.
> Would tweaking some of the parameters help? Does the convergence
> failure indicate something about my data? Would just using more data
> help?
Yes. Yes. And yes. :)
You can 1) increase self.VAR_MAXITER -- just set it to a higher value:
`lda = LdaModel(corpus=None, ..); lda.VAR_MAXITER = 100;
lda.update(corpus)`. The longer your documents are (= the more unique
words they contain), the higher you can set self.VAR_MAXITER. Or 2),
like you say, just give it more training data.
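A quick toy model of option 1), assuming each document's per-iteration change shrinks geometrically at some document-specific rate (slower rates playing the role of longer documents); the function names and rates are made up for illustration:

```python
import numpy as np

def fraction_converged(rates, var_maxiter, var_thresh=0.001, start=1.0):
    """Toy batch: document d's per-iteration parameter change shrinks by
    factor rates[d]. Count how many documents drop below var_thresh
    within var_maxiter iterations -- the "651/1000" number in the log."""
    converged = 0
    for r in rates:
        change = start
        for _ in range(var_maxiter):
            change *= r
            if change < var_thresh:
                converged += 1
                break
    return converged / len(rates)

rng = np.random.default_rng(0)
rates = rng.uniform(0.6, 0.99, size=1000)  # slow rates ~ long documents

frac_50 = fraction_converged(rates, var_maxiter=50)
frac_100 = fraction_converged(rates, var_maxiter=100)
```

Raising the iteration cap lets the slower (longer) documents finish, so `frac_100` comes out higher than `frac_50` -- the same effect you'd hope to see in the real training log.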
Both 1) and 2) actually work in a similar way. Seeing more documents
with a similar structure via 2) is a bit like spending extra time on
the same documents via 1).
But 2) is more flexible: if the slowly converging batch is an
outlier in the overall online data stream, you won't waste so much
time on it.
HTH,
Radim