Speed of classifying new documents with get_document_topics

Caleb Fleming

Mar 13, 2018, 4:05:08 PM
to gensim
Hello-

I'm trying to train a model on 40k - 50k documents and then apply it to 500k - 5m unseen documents.

I'm finding that applying the model (using get_document_topics over a full corpus or in a loop) takes significantly longer than training it. For example, training the model described below on 40k documents took approximately 10 seconds, whereas applying it to a new set of documents of the same size took close to 40 seconds. I'm wondering why the apply step takes so much longer than training; the equivalent step in sklearn also finishes in a fraction of the time.

I'm using the multicore algorithm with the following parameters:
  • num_topics: 2
  • alpha: 0.1
  • eta: 0.01
  • passes: 2
  • corpus size: 40k
  • dictionary size: 1,200 words
I've pasted the apply step code below:

from operator import itemgetter

import pandas as pd

# Lazily transform the whole corpus into per-document topic distributions
topics = model.get_document_topics(corpus, minimum_probability=0.0)

# For each document, keep the (topic, probability) pair with the highest probability
maxprobs = [max(doc, key=itemgetter(1)) for doc in topics]
k_assignments = [i[0] for i in maxprobs]
prb = [i[1] for i in maxprobs]

modeled = pd.DataFrame({'iid': df.index, 'cluster': k_assignments, 'prb': prb})

I understand that get_document_topics uses lazy evaluation, but the timing still seems odd relative to training. I read in this post that inference on unseen documents isn't parallelized, but does that explain the whole difference, or is something else going on?

Ivan Menshikh

Mar 13, 2018, 11:15:31 PM
to gensim
Hello Caleb,

When I need to apply LDA faster, I parallelize it myself: spawn several processes and score different parts of the corpus in parallel.
To answer "why is X faster than Y" one would need to spend some time reading and benchmarking both implementations. I'm not familiar with sklearn's implementation, so I can't answer that, sorry.

We improved LDA performance in the latest release; try updating to 3.4.0 (it should be faster).

Caleb Fleming

Mar 14, 2018, 10:50:08 AM
to gensim
Hi Ivan,

Thank you for your reply. Manually parallelizing (generalized code below to get cluster assignments from a corpus) sped up the apply step significantly.

import multiprocessing
from operator import itemgetter

import pandas as pd
from joblib import Parallel, delayed

def parallelize_loop(i, corpus):
    # Score a single document and keep its most probable topic
    topics = model.get_document_topics(corpus[i], minimum_probability=0.0)
    maxTuple = max(topics, key=itemgetter(1))

    # Return the local max as a dict
    return {'id': i, 'prb': maxTuple[1], 'cluster': maxTuple[0]}

# Apply across all cores, one task per document
num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores)(
    delayed(parallelize_loop)(i, corpus) for i in range(len(corpus))
)

# Join the list of dicts into one DataFrame
modeled = pd.DataFrame(results)



Subham Biswas

Jul 4, 2022, 9:23:49 PM
to Gensim

I have a doubt regarding the corpus. When we use a saved model to get topic distributions on unseen data, won't the unseen corpus need some reference to the previous corpus (e.g., be built with the same dictionary)?