java exception with LDA mallet

suvir

May 30, 2014, 11:46:29 AM

to gen...@googlegroups.com

Hi,

As I understand from other posts, with MALLET's LDA we can increase the Java heap size with:

set MALLET_MEMORY=1G

But is it possible to increase the heap size through the gensim LDA wrapper?

I ask because with a large corpus, or with a large number of topics on a large corpus, I'm getting a Java exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at cc.mallet.topics.ParallelTopicModel.getSortedWords(ParallelTopicModel.java:1150)
 at cc.mallet.topics.ParallelTopicModel.displayTopWords(ParallelTopicModel.java:1200)
 at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:790)
 at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:272)

I guess other people have run into a similar Java error.

Regards

Suvir

Radim Řehůřek

May 30, 2014, 1:08:52 PM

to gen...@googlegroups.com

Hello Suvir,

On Friday, May 30, 2014 5:46:29 PM UTC+2, suvir wrote:

Hi,

As I understand from other posts, with MALLET's LDA we can increase the Java heap size with:
set MALLET_MEMORY=1G

But is it possible to increase the heap size through the gensim LDA wrapper?

No.

I personally simply increase the memory directly in the `MALLET_HOME/bin/mallet` script, by setting "MEMORY=8g".
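For anyone who would rather script that edit than do it by hand, here is a minimal sketch in Python. It assumes the launcher script contains a line of the form `MEMORY=<value>`, as described above; the function name `set_mallet_memory` is made up for illustration and is not part of gensim or MALLET.

```python
import re


def set_mallet_memory(script_text, memory="8g"):
    """Return script_text with its first `MEMORY=...` assignment set to `memory`.

    Assumption: the launcher script has a line that starts with `MEMORY=`
    (optionally indented). Lines such as `MALLET_MEMORY=...` are not touched.
    """
    pattern = re.compile(r"^([ \t]*MEMORY=)\S*", flags=re.MULTILINE)
    # A function is used as the replacement so `memory` is inserted literally.
    new_text, count = pattern.subn(lambda m: m.group(1) + memory, script_text, count=1)
    if count == 0:
        raise ValueError("no MEMORY=... line found in script")
    return new_text
```

To apply it, read the contents of your own `MALLET_HOME/bin/mallet` file, pass the text through the function, and write the result back. Keep a backup of the original script first.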

If you find a clean and robust way of doing this dynamically from the wrapper, you can submit a pull request :)

Best,

Radim

suvirbhargav

May 31, 2014, 2:12:01 PM

to gen...@googlegroups.com

Thanks Radim, that worked.

I was wondering if it's a good idea to remove unwanted topics from the list of topics.

Out of 300 topics, some are clearly irrelevant, yet they still contribute to the similarity measure.

These bad topics affect the whole document similarity. I'm still using the same Hellinger distance for similarity.

Suvir


Radim Řehůřek

Jun 1, 2014, 5:22:03 AM

to gen...@googlegroups.com

Hi Suvir,

On Saturday, May 31, 2014 8:12:01 PM UTC+2, suvir wrote:

Thanks Radim, that worked.

I was wondering if it's a good idea to remove unwanted topics from the list of topics.
Out of 300 topics, some are clearly irrelevant, yet they still contribute to the similarity measure.

These bad topics affect the whole document similarity. I'm still using the same Hellinger distance for similarity.

Sure. The easiest way is to post-process the topic vectors: remove the unwanted topics and renormalize what is left to a probability distribution, as another transformation on top of LDA.

Or you can look at the unwanted topics' words and remove the words from the vocabulary (=add them to "stop words") and retrain LDA.
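The first option (post-processing) could look roughly like the sketch below. It assumes each document's topic vector is a sparse list of `(topic_id, probability)` pairs, which is the format gensim models return; `drop_topics` and `hellinger` are illustrative helpers written here in plain Python, not gensim functions.

```python
import math


def drop_topics(doc_topics, unwanted):
    """Remove unwanted topic ids from a sparse topic vector and renormalize.

    doc_topics: list of (topic_id, probability) pairs.
    unwanted:   set of topic ids judged irrelevant (chosen by hand).
    """
    kept = [(t, p) for t, p in doc_topics if t not in unwanted]
    total = sum(p for _, p in kept)
    if total == 0:
        # The document consisted only of unwanted topics; nothing to renormalize.
        return []
    return [(t, p / total) for t, p in kept]


def hellinger(vec1, vec2):
    """Hellinger distance between two sparse probability vectors."""
    d1, d2 = dict(vec1), dict(vec2)
    topics = set(d1) | set(d2)
    squared = sum(
        (math.sqrt(d1.get(t, 0.0)) - math.sqrt(d2.get(t, 0.0))) ** 2
        for t in topics
    )
    return math.sqrt(0.5 * squared)


# Made-up example vectors; topic 0 is treated as the "bad" topic.
doc_a = drop_topics([(0, 0.5), (1, 0.25), (2, 0.25)], unwanted={0})
doc_b = drop_topics([(0, 0.2), (1, 0.4), (2, 0.4)], unwanted={0})
distance = hellinger(doc_a, doc_b)
```

After filtering, the distance is computed only over the topics you kept, so the irrelevant topics no longer influence it. Documents dominated by removed topics end up with an empty vector and need separate handling.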

HTH,

Radim
