blog post: mallet@gensim tutorial

Radim Řehůřek

Mar 20, 2014, 4:05:59 PM
to gen...@googlegroups.com
Hello all,

I posted a short tutorial on how to use the new MALLET wrapper in gensim: http://radimrehurek.com/2014/03/tutorial-on-mallet-in-python/

Let me know if you run into any issues!

Radim

janre...@gmail.com

Mar 23, 2014, 11:02:37 AM
to gen...@googlegroups.com
Thanks for that feature, it looks interesting, so I gave it a try. Everything worked, with one oddity at the end: after my script (see below) finished, Python kept working on something for about 10-15 minutes. I suppose it was deleting temporary files or something to that effect, as the LDA corpus and model had already been saved at that point.
I used a sample bag-of-words corpus (800 MB) and have 4 GB of RAM on my PC. During the later steps all 4 GB were in use, and the memory was gradually freed after the script finished. Maybe some logging output towards the end, telling the user that memory is being cleared, would be good?

from datetime import datetime
from gensim import corpora, models
import logging

start = datetime.now()
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load the bag-of-words corpus and its dictionary from disk
corpus = corpora.MmCorpus('corpus.mm')
dictionary = corpora.Dictionary.load('dict.dict')

# train the model through the MALLET wrapper
path_to_mallet = 'C:/mallet/bin/mallet'
lda = models.LdaMallet(path_to_mallet, corpus, num_topics=100, id2word=dictionary)

# infer topics for the whole corpus, then save the result and the model
corpus_lda = lda[corpus]
corpora.MmCorpus.serialize('corpusldamallet.mm', corpus_lda)
lda.save('ldamodelmallet.lda')

difference = datetime.now() - start
print difference

suvir

Mar 24, 2014, 9:12:31 AM
to gen...@googlegroups.com
In case someone runs into a Java-related error: keep OpenJDK 7 as both the Java that MALLET uses and the system default.



suvir

Mar 25, 2014, 11:52:47 AM
to gen...@googlegroups.com
Hi,

I'm running the MALLET wrapper.
After getting the topics, I sorted them to see which have the highest weight for a matched document.
These are my topics and their corresponding weights:

print model[bow]

I sorted the output on the second column
and finally got this result:


 
 (3425, 0.008556263269638363),
 (3447, 0.008556263269638363),
 (4477, 0.008556263269638363),
 (704, 0.009971691436658773),
 (1809, 0.01138711960367918),
 (2055, 0.01704883227176081),
 (1608, 0.01775654635527102),
 (1974, 0.01846426043878122),
 (1526, 0.02341825902335264),
 (4944, 0.034033970276005694)]

Is there a way I can print the topics themselves side by side? That would be helpful for inspecting the output when the number of topics is large.

Regards
Suvir
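
For readers following along, the sorting step described above can be sketched in a few lines. This is only an illustration: the weights are copied from the output above, and the helper name `top_topics` is hypothetical, not part of gensim.

```python
# (topic_id, weight) pairs copied from the output quoted above.
doc_topics = [
    (3425, 0.008556263269638363),
    (3447, 0.008556263269638363),
    (4477, 0.008556263269638363),
    (704, 0.009971691436658773),
    (1809, 0.01138711960367918),
    (2055, 0.01704883227176081),
    (1608, 0.01775654635527102),
    (1974, 0.01846426043878122),
    (1526, 0.02341825902335264),
    (4944, 0.034033970276005694),
]


def top_topics(pairs, topn=10):
    """Hypothetical helper: return the topn pairs with the largest weight, highest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:topn]


for topic_id, weight in top_topics(doc_topics, topn=3):
    print("topic %d: %.4f" % (topic_id, weight))
```

Sorting a copy with `sorted()` leaves the original list untouched, which is handy if the same `model[bow]` result is reused later.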


Radim Řehůřek

Mar 25, 2014, 6:35:41 PM
to gen...@googlegroups.com
Hello Suvir,

The wrapper doesn't read in the topic-word matrix, so there's currently no way to "print a topic".

That is, the topic-word information is stored by MALLET as a text file, but the gensim wrapper doesn't parse this file back.

That would actually be very nice functionality to add: it would enable `model.show_topics()` too.

Please consider sending a patch; the wrapper code is not complicated at all: https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamallet.py

Best,
Radim
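
As a rough idea of what parsing that text file back could look like, here is a minimal sketch. It is not the wrapper's code. It assumes the file follows the layout of MALLET's word-topic counts output, with one line per word of the form `<word index> <word> <topic>:<count> ...`; check that against your own file. The sample data and the function names `load_word_topic_counts` and `show_topic` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample in the ASSUMED layout: "<word index> <word> <topic>:<count> ..."
SAMPLE_WORD_TOPIC_COUNTS = """\
0 graph 1:7 0:2
1 tree 1:5
2 user 0:9 1:1
3 system 0:4
"""


def load_word_topic_counts(lines):
    """Return {topic_id: {word: count}} parsed from the assumed text layout."""
    topics = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word = fields[1]
        for pair in fields[2:]:
            topic_id, count = pair.split(":")
            topics[int(topic_id)][word] = int(count)
    return dict(topics)


def show_topic(topics, topic_id, topn=10):
    """Return the topn (weight, word) pairs for one topic, highest weight first.

    Weights are raw counts normalised to sum to 1 within the topic; no smoothing is applied.
    """
    counts = topics[topic_id]
    total = float(sum(counts.values()))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(count / total, word) for word, count in ranked[:topn]]


topics = load_word_topic_counts(SAMPLE_WORD_TOPIC_COUNTS.splitlines())
for weight, word in show_topic(topics, 0, topn=3):
    print("%.3f %s" % (weight, word))
```

To read a real file, pass an open file object instead of the sample lines. The (weight, word) ordering mirrors the `beststr` tuples that appear later in this thread.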

suvir

Mar 26, 2014, 6:03:46 PM
to gen...@googlegroups.com
Thanks.
Today I played with the topic-word text file, and I can now print the top 10 topics sorted by weight, printing each weight along with its topic.

For the patch, should the desired behavior be to print the topics on the right-hand side of the output of the command: print model[bow]
Or should I just keep the same behavior as 'model.show_topics()' in gensim's LDA?

Regards
Suvir

Radim Řehůřek

Mar 27, 2014, 5:00:56 AM
to gen...@googlegroups.com

On Wednesday, March 26, 2014 11:03:46 PM UTC+1, suvir wrote:
> Thanks.
> Today I played with the topic-word text file, and I can now print the top 10 topics sorted by weight, printing each weight along with its topic.
>
> For the patch, should the desired behavior be to print the topics on the right-hand side of the output of the command: print model[bow]
> Or should I just keep the same behavior as 'model.show_topics()' in gensim's LDA?

Mimic `show_topics()`.

`model[bow]` is a core API and already has a set meaning in gensim; we cannot change that.

Thanks,
Radim

suvir

Mar 30, 2014, 1:29:30 PM
to gen...@googlegroups.com
I ran into an issue: I'm not able to import numpy at all.

----> 1 model.show_topic(1)

/home/test/software/gensim_dev/gensim/gensim/models/ldamallet.py in show_topic(self, topicid, topn)
    150         bestn = np.argsort(topic)[::-1][:topn]
    151         beststr = [(topic[id], self.id2word[id]) for id in bestn]
--> 152         return beststr
    153
    154     def convert_input(self, corpus, infer=False):

NameError: global name 'numpy' is not defined

I tried both import numpy and import numpy as np.
My setup is:

git clone the develop branch
make changes
sudo python setup.py develop

Should I declare numpy in some other way so that it is accepted?
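
A general Python note that may help here: a NameError like this means the name used inside the function is not bound in the module where the function is defined. Importing the library in the calling script or session does not change that, and `import numpy as np` binds only `np`, not `numpy`. The sketch below shows the effect with the standard-library `math` module standing in for numpy, so it runs anywhere. The function names are hypothetical.

```python
import math as m  # binds the module to the name "m" only; "math" itself stays undefined


def area_with_wrong_name(radius):
    # "math" was never bound in this module, so this raises NameError when called.
    return math.pi * radius ** 2


def area_with_alias(radius):
    # Uses the name that the import statement actually bound.
    return m.pi * radius ** 2


try:
    area_with_wrong_name(1.0)
except NameError as err:
    print("NameError: %s" % err)

print(area_with_alias(1.0))
```

By analogy, the import would need to sit at the top of ldamallet.py itself, and every reference in that file would need to use the same name that the import binds.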

Frank Hedler

Mar 9, 2015, 2:02:07 PM
to gen...@googlegroups.com
Hi Radim,

After some hiccups (e.g. setting the MALLET_HOME environment variable, writing my string variable to disk to stream it into the corpus, etc.), I got it to run, but it returns the following:

 0 5
1 5
2 5
3 5
4 5
5 5
6 5
7 5
8 5
9 5

Infinite value after topic 0 0
<1000> LL/token: �

Total time: 30 seconds

Do you have any idea what the problem could be?

Any help would be greatly appreciated!
Thanks,
Frank