Dictionary size for LSA and LDA


Philipp Dowling

Dec 13, 2014, 2:39:11 AM12/13/14
to gen...@googlegroups.com
Hi, 

quick question: am I correct in assuming that LSA and LDA are both very limited in the size of the dictionary they are given? Word2Vec can handle millions of words easily, but training times for LSA seem to increase significantly for higher dimensional input. I already went from 100k words to 200k, but I'm talking in the range of 1-2 million. Is that infeasible? 

I realize I could try this out to verify, but I'd have to spend 9 hours of computing time building a dictionary and serializing Wikipedia first, so I figured I'd ask here.

Cheers,
Philipp

Radim Řehůřek

Dec 13, 2014, 9:44:15 AM12/13/14
to gen...@googlegroups.com
Hi Philipp,

no, it's the same thing. In all of LDA/LSA/word2vec, the final "model" is essentially a `num_words x num_topics` matrix. So for 2x as many words, you'll need 2x as much memory.

Maybe the confusion comes from "number of words" as in unique words = size of vocabulary, vs. "number of words" as in total words in the corpus.

The memory is determined by the former (unique word types). Gensim doesn't care about the latter (the algorithms are online and can process an arbitrary number of documents/words in constant RAM).
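To put some (hypothetical) numbers on that model-matrix rule of thumb, here's a back-of-envelope sketch assuming float64 entries (8 bytes each) and 400 topics:

```python
# Back-of-envelope memory estimate: the trained LSA/LDA/word2vec model
# is essentially a num_words x num_topics matrix.

BYTES_PER_FLOAT = 8  # assuming float64 entries


def model_bytes(num_words, num_topics):
    """Approximate size of the num_words x num_topics model matrix."""
    return num_words * num_topics * BYTES_PER_FLOAT


small = model_bytes(100_000, 400)    # 100k vocabulary, 400 topics
large = model_bytes(2_000_000, 400)  # 2M vocabulary, same topics

print(f"100k words: {small / 2**20:.0f} MiB")  # ~305 MiB
print(f"2M words:   {large / 2**30:.1f} GiB")  # ~6.0 GiB
print(f"ratio: {large / small:.0f}x")          # 20x words -> 20x memory
```

Corpus length never enters the formula; only vocabulary size and topic count do.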

HTH,
Radim

Philipp Dowling

Dec 14, 2014, 3:01:45 AM12/14/14
to gen...@googlegroups.com
Thanks! I guess I assumed so because in the Wikipedia examples, the dict is trimmed to something like 100k words.

So if I'm able to train a Word2Vec model at 1.9 million words, I should be able to train LSA and LDA on the same vocabulary (save for maybe lowering the chunk size some)?

Cheers,
Philipp

Radim Řehůřek

Dec 14, 2014, 4:58:25 AM12/14/14
to gen...@googlegroups.com
On Sunday, December 14, 2014 9:01:45 AM UTC+1, Philipp Dowling wrote:
> Thanks! I guess I assumed so because in the Wikipedia examples, the dict is trimmed to something like 100k words.
>
> So if I'm able to train a Word2Vec model at 1.9 million words, I should be able to train LSA and LDA on the same vocabulary (save for maybe lowering the chunk size some)?

Yes.
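The reason corpus size doesn't matter is that gensim accepts any iterable that yields one bag-of-words document at a time. A toy sketch of such a stream (hypothetical tokenizer and data, not gensim's own code):

```python
# Minimal streaming-corpus sketch: documents are produced one at a time,
# so total corpus size never affects memory -- only the vocabulary does.
# Gensim can consume any iterable of bag-of-words documents like this.
from collections import Counter


def stream_bow(lines, token2id):
    """Yield one bag-of-words document (list of (word_id, count)) per line."""
    for line in lines:
        counts = Counter(
            token2id[tok] for tok in line.lower().split() if tok in token2id
        )
        yield sorted(counts.items())


# Toy data standing in for a Wikipedia-sized stream.
vocab = {"the": 0, "cat": 1, "sat": 2}
docs = ["The cat sat", "the the cat"]
print(list(stream_bow(docs, vocab)))
# [[(0, 1), (1, 1), (2, 1)], [(0, 2), (1, 1)]]
```

In practice `lines` would be a lazy reader over the dump file, so only one document is ever held in memory at a time.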

By the way, why do you need so many words, what's your app?

Best,
Radim

Philipp Dowling

Dec 14, 2014, 5:27:32 AM12/14/14
to gen...@googlegroups.com
Similarity of short phrases for an MT evaluation system - I realize LDA and LSA are not ideal for the task, but I'm comparing different models for my thesis. Since we are comparing a lot of different, unseen fragments of sentences, we need very high vocabulary coverage.

We've so far found Word2Vec to work very well for this task; I'm now trying to get some results out of LSA and LDA that are at least a fair comparison. By the way, Gensim has made this whole thing much easier for me, so thanks for putting so much effort into it!