Meaning of LDA parameters: chunksize, passes, update_every


Victor

Oct 4, 2012, 7:39:40 AM10/4/12
to gen...@googlegroups.com
Hello, 
I am using the LDA implementation in Gensim and wondering what the parameters chunksize, passes and update_every mean (nothing is said about them in the LDA API reference). What are reasonable values for them? I've studied Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010, but I don't see an exact correspondence between the original parameters and these. I guess chunksize should be the size of the mini-batch S, or the total number of documents D, in the article's notation...

Also, am I right that decay is equivalent to kappa in the article? If so, it should lie in the (0.5, 1] interval for the method to converge, but its default value in Gensim is 0.5.

Thanks in advance,
Victor.

Radim Řehůřek

Oct 4, 2012, 9:10:54 AM10/4/12
to gensim
Hello Victor,

On Oct 4, 1:39 pm, Victor <vki...@mail.ru> wrote:
> Hello,
> I am using the LDA implementation in Gensim and wondering what the
> parameters chunksize, passes and update_every mean (nothing is said
> about them in the LDA API reference). What are reasonable values for them?

The default ones :)

The model will print out what it's going to do, for example:

--
using serial LDA version on this node
running online LDA training, 100 topics, 1 passes over the supplied
corpus of 3931787 documents, updating model once every 10000 documents
--

See http://radimrehurek.com/gensim/wiki.html for a little more
discussion on these params.
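If you want to reproduce that output, here's a minimal sketch (the corpus and dictionary file names are made up, substitute your own):

--
import logging
from gensim import corpora, models

# print gensim's progress messages, including the LDA training summary above
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# hypothetical file names -- load your own serialized corpus and dictionary
corpus = corpora.MmCorpus('my_corpus.mm')
id2word = corpora.Dictionary.load('my_dictionary.dict')

# train with the default chunksize, passes and update_every
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=100)
--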


> Also, am I right that decay is equivalent to kappa in the article? If so,
> it should lie in the (0.5, 1] interval for the method to converge, but
> its default value in Gensim is 0.5.


The distinction between kappa = 0.5 exactly and kappa just inside the open
interval (0.5, 1] is moot when evaluating expressions on real machines,
due to limited floating-point precision.

The particular value of 0.5 was chosen because it performed best (see
Hoffman: Online Learning for Latent Dirichlet Allocation).
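For reference, in the paper's notation each new mini-batch is weighted by the step size

--
rho_t = (tau_0 + t)^(-kappa)
--

and it is this kappa, required to lie in (0.5, 1] for the convergence guarantee, that corresponds to gensim's `decay`.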

Regards,
Radim

Victor

Oct 10, 2012, 8:10:19 AM10/10/12
to gen...@googlegroups.com
Thank you for the explanation. Still, the argument "passes" is not clear to me. In Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010, they update lambda (the topic-generating parameter) after each mini-batch, so only chunksize seems informative...



On Thursday, October 4, 2012 at 15:39:41 UTC+4, Victor wrote:

Radim

Oct 10, 2012, 8:24:38 AM10/10/12
to gensim
Hello Victor,

On Oct 10, 2:10 pm, Victor <vki...@mail.ru> wrote:
> Thank you for the explanation. Still, the argument "passes" is not clear
> to me. In Hoffman, Blei, Bach: Online Learning for Latent Dirichlet
> Allocation, NIPS 2010, they update lambda (the topic-generating parameter)
> after each mini-batch, so only chunksize seems informative...

`passes` is the number of training passes through the corpus. For
example, if the training corpus has 50,000 documents, chunksize is
10,000, and passes is 2, then online training is done in 10 updates
(there's a code sketch after the schedule):

#1 documents 0-9,999
#2 documents 10,000-19,999
#3 documents 20,000-29,999
#4 documents 30,000-39,999
#5 documents 40,000-49,999
#6 documents 0-9,999
#7 documents 10,000-19,999
#8 documents 20,000-29,999
#9 documents 30,000-39,999
#10 documents 40,000-49,999
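In code, the example above is just (a sketch, assuming `corpus` and `id2word` are already loaded):

--
from gensim.models import LdaModel

# 50,000 training documents, 10,000-document chunks, 2 full passes:
# 5 chunks per pass * 2 passes = 10 online updates in total
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=100,
               chunksize=10000, passes=2)
--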

HTH,
Radim

Victor

Oct 11, 2012, 1:47:56 AM10/11/12
to gen...@googlegroups.com
Sorry, I meant another parameter, update_every, not passes. Passes is clear, but I don't understand the meaning of update_every. In the original article the update is made after each mini-batch of documents is processed, so update_every is always 1. If we update less frequently and ignore some chunks of documents, then lambda is estimated less accurately.


On Wednesday, October 10, 2012 at 16:24:38 UTC+4, Radim wrote:

Radim Řehůřek

Oct 11, 2012, 10:56:57 AM10/11/12
to gensim
In gensim, the chunksize (how many documents to load into memory at
once) is decoupled from the LDA batch size. So you can process the
training corpus with chunksize=10000, but with update_every=2, the
maximization step of EM is done once every 2*10000=20000 documents.
This gives the same updates as chunksize=20000 with update_every=1,
but uses less memory.

By default, update_every=1, so that the update happens after each
batch of `chunksize` documents.

In case you're running the distributed version, be aware that
`update_every` refers to one worker: with chunksize=10000,
update_every=1 and 4 nodes, the model update is done once every
10000*1*4=40000 documents.
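To make the trade-off concrete, a sketch of the two equivalent set-ups (again assuming `corpus` and `id2word` are already loaded):

--
from gensim.models import LdaModel

# M-step once every 20,000 documents, holding 20,000 documents in memory:
lda_a = LdaModel(corpus=corpus, id2word=id2word, num_topics=100,
                 chunksize=20000, update_every=1)

# same update schedule, but only 10,000 documents in memory at a time:
lda_b = LdaModel(corpus=corpus, id2word=id2word, num_topics=100,
                 chunksize=10000, update_every=2)
--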

HTH,
Radim