Java exception with LDA MALLET


suvir

May 30, 2014, 11:46:29 AM
to gen...@googlegroups.com
Hi,

As I gathered from other posts, with MALLET's LDA we can increase the heap size with
set MALLET_MEMORY=1G

But is it possible to increase the heap size in the gensim LDA wrapper?
I ask because with a large corpus, or with a large number of topics on a large corpus, I'm getting a Java exception.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at cc.mallet.topics.ParallelTopicModel.getSortedWords(ParallelTopicModel.java:1150)
    at cc.mallet.topics.ParallelTopicModel.displayTopWords(ParallelTopicModel.java:1200)
    at cc.mallet.topics.ParallelTopicModel.estimate(ParallelTopicModel.java:790)
    at cc.mallet.topics.tui.TopicTrainer.main(TopicTrainer.java:272)

I guess other people have faced a similar Java error.

Regards
Suvir

Radim Řehůřek

May 30, 2014, 1:08:52 PM
to gen...@googlegroups.com
Hello Suvir,


On Friday, May 30, 2014 5:46:29 PM UTC+2, suvir wrote:
> But is it possible to increase the heap size in the gensim LDA wrapper?

No.

I personally simply increase the memory directly in the `MALLET_HOME/bin/mallet` script, by setting "MEMORY=8g".
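For anyone who would rather script that edit than do it by hand, here is a minimal sketch. It assumes the launcher script contains a single top-level line of the form `MEMORY=<value>`; the function name `set_mallet_memory` and the sample script text are hypothetical, made up for illustration, and not part of gensim or MALLET.

```python
import re


def set_mallet_memory(script_text: str, new_value: str = "8g") -> str:
    """Return script_text with its first `MEMORY=...` line replaced.

    Assumes the script holds a line starting with `MEMORY=` (hypothetical
    layout; check your own copy of the launcher script first).
    """
    pattern = re.compile(r"^MEMORY=.*$", flags=re.MULTILINE)
    updated, count = pattern.subn(f"MEMORY={new_value}", script_text, count=1)
    if count == 0:
        raise ValueError("no MEMORY= line found in the script text")
    return updated


# In-memory stand-in for the launcher script (made-up content, not the real file).
sample_script = '#!/bin/bash\nMEMORY=1g\necho "heap size: $MEMORY"\n'
patched = set_mallet_memory(sample_script, "8g")
print(patched)
```

To apply this to the real file, you would read the script's text, pass it through the function, and write the result back, ideally after making a backup copy.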

If you find a clean and robust way of doing this dynamically from the wrapper, you can submit a pull request :)

Best,
Radim

suvirbhargav

May 31, 2014, 2:12:01 PM
to gen...@googlegroups.com
Thanks Radim, that worked.

I was wondering whether it's a good idea to remove unwanted topics from the list of topics.
Out of 300 topics, some are clearly irrelevant, yet they still contribute to the similarity measure.
These bad topics affect the whole document similarity. I'm still using the same Hellinger distance for similarity.


Suvir



Radim Řehůřek

Jun 1, 2014, 5:22:03 AM
to gen...@googlegroups.com
Hi Suvir,


On Saturday, May 31, 2014 8:12:01 PM UTC+2, suvir wrote:
> I was wondering whether it's a good idea to remove unwanted topics from the list of topics.


Sure. The easiest way is to post-process the topic vectors, removing the unwanted topics and renormalizing to a probability distribution, as another transformation on top of LDA.
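A minimal sketch of that post-processing step, assuming each document's topic vector is a sparse list of `(topic_id, weight)` pairs (the usual shape of gensim's sparse vectors). The helper names `drop_topics` and `hellinger`, the example vectors, and the unwanted topic id are all made up for illustration.

```python
import math


def drop_topics(doc_topics, unwanted):
    """Remove unwanted topic ids from a sparse vector and renormalize to sum to 1."""
    kept = [(tid, w) for tid, w in doc_topics if tid not in unwanted]
    total = sum(w for _, w in kept)
    if total == 0:
        return []  # all of the document's mass was on unwanted topics
    return [(tid, w / total) for tid, w in kept]


def hellinger(p, q):
    """Hellinger distance between two sparse probability vectors."""
    p, q = dict(p), dict(q)
    topics = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(t, 0.0)) - math.sqrt(q.get(t, 0.0))) ** 2 for t in topics
    )
    return math.sqrt(squared / 2.0)


# Made-up topic vectors; topic 7 plays the role of an irrelevant topic.
doc_a = [(0, 0.5), (3, 0.25), (7, 0.25)]
doc_b = [(0, 0.4), (3, 0.4), (7, 0.2)]
unwanted = {7}

a = drop_topics(doc_a, unwanted)
b = drop_topics(doc_b, unwanted)
print(a, b, hellinger(a, b))
```

Because the filtered vectors are renormalized, they are again probability distributions, so the Hellinger distance stays in the range 0 to 1 and remains comparable across document pairs.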

Or you can look at the unwanted topics' words, remove those words from the vocabulary (i.e. add them to the "stop words"), and retrain LDA.
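The stop-word route can be sketched in plain Python, assuming the corpus is available as lists of tokens. The function name `remove_stop_words`, the example documents, and the word list are hypothetical; in practice the words would come from inspecting the top words of the topics you want to get rid of.

```python
def remove_stop_words(tokenized_docs, extra_stop_words):
    """Return a copy of the tokenized documents without the given words."""
    stop = set(extra_stop_words)
    return [[tok for tok in doc if tok not in stop] for doc in tokenized_docs]


# Made-up documents and made-up "bad topic" words, for illustration only.
docs = [
    ["please", "send", "invoice", "thanks"],
    ["thanks", "for", "the", "report"],
]
bad_topic_words = {"please", "thanks"}

cleaned = remove_stop_words(docs, bad_topic_words)
print(cleaned)
```

The cleaned token lists would then be used to rebuild the dictionary and bag-of-words corpus before retraining LDA. gensim's `Dictionary.filter_tokens` may offer a similar route at the dictionary level; check the documentation for your version.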

HTH,
Radim

 

