First of all I just wanted to thank Radim for the awesome library. My research into topic modeling landed me a great new job that I started last week, so I am forever in your debt.
The people at my company are looking into using topic modeling for document similarity calculations, so I naturally suggested gensim. My first project is to run some experiments with it: I'm comparing document similarity under LDA against similarity under plain TF-IDF as a benchmark of effectiveness. I have about 6,000 documents that I used to train my TF-IDF and LDA models. Each document belongs to one of 25 categories.
The way I'm testing the comparison algorithms is by picking N random categories, and then M random documents within each category. I compare each chosen document's similarity against every other chosen document, then compute the average similarity between documents in the same category and the average similarity between documents in different categories. The ratio of these two averages provides a measure of how well TF-IDF and LDA distinguish categories: if LDA's within-to-between ratio is higher than TF-IDF's, then LDA comparisons outperform TF-IDF comparisons.
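In case it helps to be concrete, the ratio I'm computing looks roughly like this (a minimal sketch with a hypothetical helper on toy vectors, using cosine similarity — not my actual gensim pipeline):

```python
import numpy as np

def category_separation_ratio(vectors, labels):
    """Average within-category cosine similarity divided by the average
    cross-category similarity. Higher means the representation separates
    categories better. (Hypothetical helper, not part of gensim.)"""
    vecs = np.asarray(vectors, dtype=float)
    # Normalize rows so a plain dot product equals cosine similarity.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Exclude each document's similarity to itself.
    np.fill_diagonal(same, False)
    diff = labels[:, None] != labels[None, :]
    return sims[same].mean() / sims[diff].mean()

# Toy example: two tight clusters of 2-D "documents".
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
ratio = category_separation_ratio(vectors, labels)
# Within-category pairs are nearly parallel and cross-category pairs are
# nearly orthogonal, so the ratio comes out well above 1.
```

In the real experiment the vectors would be the TF-IDF or LDA representations of the sampled documents.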
However, I've noticed that LDA performs significantly worse with a smaller chunk size (1k compared to 10k, where both use update_every=1). I don't understand why, because I thought chunk size only controls how many documents are held in memory at a time and has no effect on the quality of the algorithm's output. I'm not that familiar with the math underlying LDA, so I could be wrong about that. Can someone please explain what chunksize is, and whether it should affect the accuracy of my models? Thanks
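My current mental model of chunksize is just corpus batching, something like the sketch below (illustrative pure Python, not gensim's actual internals) — where, with update_every=1, each chunk would trigger one model update, so a smaller chunksize means more frequent updates from fewer documents each:

```python
def iter_chunks(corpus, chunksize):
    """Yield the corpus in consecutive batches of at most `chunksize`
    documents. (Illustration of my understanding, not gensim code.)"""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]

corpus = list(range(25))  # stand-in for 25 documents
chunks = list(iter_chunks(corpus, chunksize=10))
sizes = [len(chunk) for chunk in chunks]   # [10, 10, 5]
num_updates = len(chunks)                  # with update_every=1: 3 updates
```

Is that understanding right, or does the per-chunk update actually change what the model converges to?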