First of all I just wanted to thank Radim for the awesome library. My research into topic modeling landed me a great new job that I started last week, so I am forever in your debt.
The people at my company are looking into using topic modeling for document similarity calculations, so I naturally suggested gensim. My first project is to run some experiments with it: I'm comparing document similarity under LDA against similarity under plain TF-IDF as a benchmark of effectiveness. I have about 6,000 documents that I used to train my TF-IDF and LDA models. Each document belongs to one of 25 categories.
The way I'm testing the comparison algorithms is by picking N random categories, and then M random documents within each category. I compare each chosen document's similarity against every other chosen document, then compute the average similarity between documents in the same category and the average similarity between documents in different categories. The ratio of these two averages provides a measure of how well TF-IDF and LDA distinguish categories: if LDA's within-to-between ratio is higher than TF-IDF's, then LDA comparisons outperform TF-IDF comparisons.
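In case it helps to be concrete, the ratio I'm computing looks roughly like this (a minimal sketch with a hypothetical helper on toy vectors, using cosine similarity — not my actual gensim pipeline):

```python
import numpy as np

def category_separation_ratio(vectors, labels):
    """Average within-category cosine similarity divided by the average
    cross-category similarity. Higher means the representation separates
    categories better. (Hypothetical helper, not part of gensim.)"""
    vecs = np.asarray(vectors, dtype=float)
    # Normalize rows so a plain dot product equals cosine similarity.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Exclude each document's similarity to itself.
    np.fill_diagonal(same, False)
    diff = labels[:, None] != labels[None, :]
    return sims[same].mean() / sims[diff].mean()

# Toy example: two tight clusters of 2-D "documents".
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
ratio = category_separation_ratio(vectors, labels)
# Within-category pairs are nearly parallel and cross-category pairs are
# nearly orthogonal, so the ratio comes out well above 1.
```

In the real experiment the vectors would be the TF-IDF or LDA representations of the sampled documents.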
However, I've noticed that LDA performs significantly worse with a smaller chunk size (1k compared to 10k, where both use update_every=1). I don't understand why, because I thought chunk size only controls how many documents are held in memory at a time and has no effect on the quality of the algorithm's output. I'm not that familiar with the math underlying LDA, so I could be wrong about that. Can someone please explain what chunksize is, and whether it should affect the accuracy of my models? Thanks
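My current mental model of chunksize is just corpus batching, something like the sketch below (illustrative pure Python, not gensim's actual internals) — where, with update_every=1, each chunk would trigger one model update, so a smaller chunksize means more frequent updates from fewer documents each:

```python
def iter_chunks(corpus, chunksize):
    """Yield the corpus in consecutive batches of at most `chunksize`
    documents. (Illustration of my understanding, not gensim code.)"""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]

corpus = list(range(25))  # stand-in for 25 documents
chunks = list(iter_chunks(corpus, chunksize=10))
sizes = [len(chunk) for chunk in chunks]   # [10, 10, 5]
num_updates = len(chunks)                  # with update_every=1: 3 updates
```

Is that understanding right, or does the per-chunk update actually change what the model converges to?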