How to apply the LDA model to new unseen documents?


Dixon Daniel

Apr 4, 2017, 9:21:09 AM
to gensim
Hi,

I was able to train an LDA model on a corpus, and I now want to apply that model to unseen documents. The gensim models page shows the LDA model being created as follows:

lda = LdaModel(corpus, num_topics=10)

The model is then used like this:

doc_lda = lda[doc_bow]

The model is applied to doc_bow, which I believe is a document's bag of words. So how do I get a document's bag of words?
I tried the code below, but Dictionary and the doc2bow function don't seem to accept the unseen document as input. I get the error: TypeError: doc2bow expects an array of unicode tokens on input, not a single string. This error wouldn't occur if I had a series of documents, but I want the bag of words for an individual document.

So, how do I get the bag of words for an individual document and then apply the LDA model to that bag of words?

dictionary_1 = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary_1.doc2bow(doc) for doc in doc_clean]

My ultimate goal is to determine the similarity between the corpus and the unseen documents.

Any suggestion would be very helpful.
Thanks!!!
Have a great day!

Jason King

Apr 4, 2017, 2:14:37 PM
to gensim
Dixon,

Don't forget that you will need to tokenize your documents before feeding them into the doc2bow method. Assuming you have a function called tokenize that turns a document into a list of tokens, try the following:

doc_term_matrix = [dictionary_1.doc2bow(tokenize(doc)) for doc in doc_clean]
doc_lda = lda[doc_term_matrix]

Hope that helps,
Jason

Guilherme Passero

Apr 4, 2017, 10:43:50 PM
to gensim
Besides tokenizing, as Jason mentioned, you should also repeat the text preprocessing steps you applied to the LDA model's training corpus (e.g. lowercasing, stemming, etc.).
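For instance, if the training corpus was lowercased and stripped of punctuation and stop words, the unseen document needs the identical pipeline. A minimal sketch (the stop list and example sentence are made up for illustration):

```python
import string

stop = {"the", "a", "of", "and"}        # must match the training stop list
exclude = set(string.punctuation)       # must match the training punctuation set

def preprocess(text):
    # same order of operations as used when building the training corpus
    text = text.lower()
    text = "".join(ch for ch in text if ch not in exclude)
    return [tok for tok in text.split() if tok not in stop]

print(preprocess("The response-time of the SYSTEM, and its users!"))
# -> ['responsetime', 'system', 'its', 'users']
```

If the pipelines differ, the unseen document's tokens may simply not be found in the dictionary, and the bag of words will silently come out near-empty.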

Dixon Daniel

Apr 6, 2017, 8:34:13 AM
to gensim
So this is how I am converting each document to its bag of words:

tokens = tokenizer.tokenize(data)

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in tokens]

There are no issues with the bag-of-words model I get from the code above. The problem is that each document gets tokenized into a list of lists [[word1, word2, ...], [word1, word2, ...], ...], where each inner list is a sentence of the document. This is processed without trouble by the following code:

dictionary_1 = corpora.Dictionary(doc_clean) 

But I actually want each document to be a single list [word1, word2, ...], so I combine the list of lists like this:

combined = [item for sublist in doc_clean for item in sublist]
dictionary_1 = corpora.Dictionary(combined) 

So now, when I use this to build the bag-of-words model, I get the error: TypeError: doc2bow expects an array of unicode tokens on input, not a single string.

I am not sure how to overcome this issue.

Jason King

Apr 6, 2017, 10:30:49 AM
to gensim
Actually, Dixon, scratch the code I wrote above. It doesn't remove punctuation, so the output will be different. The big takeaway is that your clean function joins the document's tokens back into a single string. Examine doc_clean and I think you'll see it isn't doing what you think it's doing.
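One quick sanity check (plain Python, with illustrative data) for confirming that doc_clean has the list-of-token-lists shape that Dictionary expects:

```python
# expected structure: a list of documents, each a list of string tokens
doc_clean = [["graph", "trees"], ["user", "interface", "time"]]  # illustrative

assert all(isinstance(doc, list) for doc in doc_clean)
assert all(isinstance(tok, str) for doc in doc_clean for tok in doc)
print("doc_clean is a list of token lists")
```

If either assertion fails on your real doc_clean, the preprocessing step that produced it is where to look.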

Jason

Tolga SAGLIK

Mar 22, 2019, 12:12:29 PM
to Gensim
Hi Jason,
Would you elaborate on how doc_lda can be used?
Is there a way to extract information from this object, such as perplexity, coherence, or a visualized topic map?