classify unseen/new document using LDA


kiran surya

Jun 21, 2016, 10:42:18 AM
to gensim
Hi,

I trained an LDA model with 600 topics and a ~50k vocabulary. To classify a new document I'm using the following code:

vec = dictionary.doc2bow(data.split())

topics_list = lda[vec]


However, the above code takes 20 seconds to classify a new document. Is there any way to speed it up?


Regards,

Kiran.



Lev Konstantinovskiy

Jun 21, 2016, 12:43:08 PM
to gensim
Are you using the Multicore version of LDA? How many cores are you using?

Radim Řehůřek

Jun 21, 2016, 9:47:00 PM
to gensim
On Wednesday, June 22, 2016 at 1:43:08 AM UTC+9, Lev Konstantinovskiy wrote:
Are you using the Multicore version of LDA? How many cores are you using?

Transforming new documents to LDA space doesn't use multiple cores, so that's irrelevant.

20 seconds to transform a single vector is extremely unusual. My initial guess would be that you're passing some nonsensical input, such as `data` being an entire gigabyte corpus, or something like that.

Kiran, can you post some stats about your model and data? (#train docs, #features, #topics, #`data` size etc).

Cheers,
Radim

kiran surya

Jun 21, 2016, 9:47:51 PM
to gensim
I'm using normal LDA, calling Mallet through the LdaMallet wrapper. Training uses multiple threads, but classifying a new document uses only one core.

kiran surya

Jun 22, 2016, 3:06:15 AM
to gensim
Hi Radim,

Here `data` is a single document of ~500 words. Total number of docs = 1.3 million, 50k features, 1000 topics, each doc ~500 words.

Regards,
Kiran.

kiran surya

Jun 22, 2016, 6:33:59 AM
to gensim
Hi Radim,

I trained LDA (Mallet LDA through the genesis wrapper) again just now. Number of docs: ~600k, number of topics: ~600, number of features: 50k, each doc ~500 words.

Following are the timings:

loading time: 0:00:00.107246
pre-processing time: 0:00:00.064976
doc2bow conversion time: 0:00:00.043155
length of the new document vector [len(dictionary.doc2bow(rec.split()))], where rec is each document: 6
testing time: 0:00:29.283126


Regards,

Kiran.

kiran surya

Jun 22, 2016, 6:49:35 AM
to gensim

dictionary = gensim.corpora.Dictionary.load('ldadictionary.dict')

lda = gensim.models.wrappers.LdaMallet.load('ldamodel')
topics = lda[new_vec]  # this line is taking 29 sec

Regards,
Kiran.

Radim Řehůřek

Jun 22, 2016, 9:35:05 PM
to gensim
Hi Kiran,

the library is really called gensim -- though I like your "genesis" too :-)

Regarding the long transform time: Mallet is an external tool, written in Java. Gensim talks to Mallet by launching it as an external process, much like you would from the command line.

As you can imagine, this is a pretty expensive operation. For each transformation:

1. gensim serializes the input to disk,
2. calls Mallet with the serialization filename,
3. Mallet loads its LDA model from disk,
4. then loads the serialization file,
5. computes the result (which is super fast!),
6. serializes the result to disk,
7. and finally gensim reads the result from disk and returns it to you.

As you can see, the transformation itself is fast, but there's a lot of overhead around it!

I think this is the core of why it takes so long. To verify, you can launch Mallet manually on the same input and time it.

Your best bet is to amortize this overhead by transforming many documents at once. That way, the extra overhead steps will only happen once, no matter how many documents you're transforming. 
That is, instead of `topics = model[new_vec]`, run `topics = model[sequence_of_new_vecs]`.

Another option is to convert the Mallet model (Java) to a gensim model (Python), so you avoid all the process and serialization overhead completely. This is really just initializing LdaModel params (fitted alpha, beta matrices...) from the trained LdaMallet params. I think we had some utility Python function for this conversion somewhere.

HTH,
Radim

kiran surya

Jun 23, 2016, 4:43:30 AM
to gensim
Hi Radim,

Thanks for the quick reply :)  Could you please point us to the utility to convert from a Mallet model to a genesis model? We couldn't find it. Our use case is real-time, so the batch option might not work for us.

Regards,
Kiran.

Lev Konstantinovskiy

Jun 24, 2016, 12:03:06 AM
to gensim
Hi Kiran,

I don't think we have a function to convert a Mallet model to gensim, but it should be easy to write. If you write it, it would be a very welcome contribution to gensim as a pull request.

Regards
Lev

Devashish Deshpande

Jun 28, 2016, 10:45:10 AM
to gensim
Hey Kiran,

I think you can do this simply by:

tm1 = LdaMallet('/home/devashish/mallet-2.0.8RC3/bin/mallet', corpus=corpus, num_topics=2, id2word=dictionary)  # or however you have initialized it
tm2 = LdaModel(corpus=corpus, num_topics=2, id2word=dictionary, alpha=tm1.alpha, passes=tm1.iterations)  # equating alphas
tm2.eta = tm1.wordtopics  # impose asymmetric prior

and then you can simply use the LdaModel instead. However, since LdaMallet uses Gibbs sampling while gensim's (not genesis :-) ) LdaModel uses variational Bayes to approximate the posterior, the results can differ. I would wait for Radim to verify this.

Thanks and regards,
Devashish

Radim Řehůřek

Jun 29, 2016, 12:49:49 AM
to gensim
Close.

I think the following should do what you need:

model_gensim = gensim.models.LdaModel(id2word=mallet_model.id2word, num_topics=mallet_model.num_topics, alpha=mallet_model.alpha, iterations=100)
model_gensim.expElogbeta[:] = mallet_model.wordtopics

That is, you copy the trained model weights (alpha, beta...) from a trained Mallet model into the gensim model. Then you don't have to train the gensim model any more; you just use it.

Note I didn't test the code above, so a sanity check on a few documents may be a good idea, making sure the mallet/gensim models return similar results.

Best,
Radim

kiran surya

Jul 1, 2016, 6:10:57 AM
to gensim
Hi Radim/Devashish,

Thanks for the reply. I'm getting different results with the mallet and gensim models (topic 76 with mallet, topic 0 with gensim). I ran it on a test instance 10 times. Following are the results:

mallet model time: 0:02:53.144997

mallet probabilities: [0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113, 0.32368686868687113]

mallet topics: [76, 76, 76, 76, 76, 76, 76, 76, 76, 76]

gensim model time: 0:00:00.438506

gensim probabilities: [0.3296713681035431, 0.32937899769094003, 0.3290911680265532, 0.33037637275912024, 0.3308946749415768, 0.3219509847307517, 0.330096109579114, 0.33630419366304487, 0.33079761620347337, 0.33273683751942856]

gensim topics: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


However, timing-wise, there is a significant improvement.

Lev Konstantinovskiy

Jul 1, 2016, 6:42:38 AM
to gen...@googlegroups.com
Hi Surya,

I would like to help with your query - could you please post the code that you are running? I need it to understand the question.

Thanks
Lev


Devashish Deshpande

Jul 1, 2016, 12:58:43 PM
to gen...@googlegroups.com
Hmm, strange... when I was testing the code, the mapping was being done correctly...

kiran surya

Jul 5, 2016, 3:50:32 AM
to gensim
Hi Lev,

Following is the code:

import datetime
import gensim

# `preprocess` and `line` are defined elsewhere in the script
dictionary = gensim.corpora.Dictionary.load('ldadictionary.dict')
lda = gensim.models.wrappers.LdaMallet.load('ldamodel')

m_vals = []
m_inds = []
a = datetime.datetime.now()
for i in range(0, 10):
    processed_line = preprocess(line.encode('utf-8'))
    new_vec = dictionary.doc2bow(processed_line.split())
    topic_dist = lda[new_vec]
    temp = [float(x[1]) for x in topic_dist]
    max_val = max(temp)
    m_vals.append(max_val)
    index = temp.index(max_val)
    m_inds.append(index)
b = datetime.datetime.now()
print b - a
print "above is mallet model time"
print m_vals
print "above is probabilities"
print m_inds
print "above is topics"

model_gensim = gensim.models.LdaModel(id2word=lda.id2word, num_topics=lda.num_topics, alpha=lda.alpha, iterations=200)
model_gensim.expElogbeta[:] = lda.wordtopics

g_vals = []
g_inds = []
a = datetime.datetime.now()
for i in range(0, 10):
    processed_line = preprocess(line.encode('utf-8'))
    new_vec = dictionary.doc2bow(processed_line.split())
    topic_dist = model_gensim[new_vec]
    temp = [float(x[1]) for x in topic_dist]
    max_val = max(temp)
    g_vals.append(max_val)
    index = temp.index(max_val)
    g_inds.append(index)
b = datetime.datetime.now()
print b - a
print "above is gensim model time"
print g_vals
print "above is probability values"
print g_inds
print "above is topics"

Devashish Deshpande

Jul 7, 2016, 6:59:16 AM
to gensim
Hey Kiran,

We recently added a new function to ldamallet.py called "malletmodel2ldamodel" which lets you convert an LdaMallet model to a gensim LdaModel. You can check it out here.

Regards,

Devashish

kiran surya

Aug 24, 2016, 6:04:09 AM
to gensim
How can I set the same number of inference iterations in both the mallet and gensim models?

kiran surya

Aug 25, 2016, 12:12:31 PM
to gensim
We are getting different results from the gensim and mallet models using this new function. Should both models give the same topic, or can there be a difference?

Regards,
Kiran

Radim Řehůřek

Aug 25, 2016, 9:26:53 PM
to gensim
Hello Kiran,

there can be a difference (the two use different inference algorithms).

However, the difference should not be substantial (not completely different topics). It should just give slightly different numbers, especially on very short documents and/or with a small number of inference iterations.

What difference exactly are you seeing? Do you have any examples?

Best,
Radim

kiran surya

Aug 26, 2016, 3:59:36 AM
to gensim
Hi Radim,

Thanks for the quick turn-around. I found the issue: the gensim model displays only the top few topics, whereas the mallet model displays all the topics. Is there any way to display all topics?

Regards,
Kiran.

jayant jain

Aug 26, 2016, 9:18:33 AM
to gensim
Hi Kiran, simply setting num_topics in a call to either show_topics or print_topics should work.

model.print_topics(num_topics=50)

Here's the documentation. Setting num_topics=-1 would display all topics.