Predicting New Corpus

deepti patil

Mar 15, 2018, 11:19:26 PM

to gensim

I have an LDA model trained on a particular topic (1000 articles on one topic). Is it possible to predict topics for a new corpus/articles based on that model?

Ivan Menshikh

Mar 19, 2018, 10:42:31 AM

to gensim

Hello,

Can you describe this more concretely?

1. Did you train LDA only with 1 topic?

2. What do you mean by "predict new corpus/articles"? Do you want to generate a sequence of tokens? If yes, this is possible.

deepti patil

Mar 23, 2018, 6:48:54 AM

to gensim

Hi,

I collected 1000 news articles on "Drug" and created an LDA model. Now I want to infer topics for a new, unseen article using my trained LDA model.

What would the Python code for that be?

Should the corpus for the new article be generated with a new dictionary built from that document, or with the old dictionary?

This is part of my code:

from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
# docs is a list of 1000 articles, already pre-processed (stopwords removed, lemmatized, etc.).
dictionary = Dictionary(docs)

# Bag-of-words representation of the documents (the corpus for the 1000 articles).
corpus = [dictionary.doc2bow(doc) for doc in docs]

Now, to generate the corpus for the new, unseen article:

# Create a dictionary representation of docs4, which holds the unseen article.
dictionary4 = Dictionary(docs4)

# Bag-of-words representation of docs4, using the dictionary built from docs4 to create the new corpus (corpus4).
corpus4 = [dictionary4.doc2bow(doc4) for doc4 in docs4]

# Use it to get topics for the new corpus.
# ldamodel is the LDA model trained on the old corpus (corpus); here it is applied to the new corpus.
new_topic = ldamodel[corpus4]

for a in new_topic:
    print(a)

[(1, 0.098211573818414402), (4, 0.028076146702543749), (8, 0.028542374478413981), (10, 0.18255508495305284), (15, 0.015144587592186069), (16, 0.18098460371101405), (26, 0.072777604705255586), (30, 0.012641000304970024), (33, 0.073507649419200294), (41, 0.088433736869442225), (45, 0.078896131665663172), (47, 0.085055118744042049), (48, 0.046267072369662939)]

Is this correct? I am generating 50 topics.

My model is not accurate. In what ways can I improve it?

Ivan Menshikh

Mar 24, 2018, 5:13:37 PM

to gensim

Hello,

1. You do not need to create the new dictionary "dictionary4"; you should use the old one, "dictionary", to transform the new corpus.

2. Your way of inferring topics is correct. You don't see all topics in the output because some of the topics are not relevant to your document (almost zero probability).

deepti patil

Mar 25, 2018, 10:43:03 AM

to gen...@googlegroups.com

Can you help me understand the logic of using the old dictionary for the new corpus? How does it use the old dictionary? Please excuse the question; I am still learning. Is there any material I can read to better understand what happens behind the scenes in the dictionary? Thanks

deepti patil

Mar 26, 2018, 12:29:15 AM

to gensim

So I just need to change dictionary4 to dictionary, and the rest will remain the same?

# Bag-of-words representation of docs4, using the original dictionary to create the new corpus (corpus4).
corpus4 = [dictionary.doc2bow(doc4) for doc4 in docs4]

# Use it to get topics for the new corpus.
# ldamodel is the LDA model trained on the old corpus (corpus); here it is applied to the new corpus.
new_topic = ldamodel[corpus4]

for a in new_topic:
    print(a)

Ivan Menshikh

Mar 26, 2018, 2:08:18 AM

to gensim

Yes, you are right.

The general logic is simple: you fit your Dictionary (the token <-> id mapping) only once, and afterwards you use it for all algorithms.
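To make the reason concrete: the trained model only knows words by their integer ids, so the ids must mean the same thing at training time and at inference time. Below is a rough pure-Python sketch of the idea. It is not gensim's implementation; build_token2id and doc2bow here are simplified, hypothetical stand-ins (gensim's Dictionary may assign ids in a different order), and the documents are toy data.

```python
from collections import Counter


def build_token2id(docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(token2id, doc):
    """Count known tokens as sorted (id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


train_docs = [["drug", "trial", "patient"], ["drug", "price", "market"]]
new_doc = ["market", "drug", "drug", "vaccine"]

old_ids = build_token2id(train_docs)  # the mapping the model was trained with
new_ids = build_token2id([new_doc])   # a fresh mapping built from the new article

bow_old = doc2bow(old_ids, new_doc)
bow_new = doc2bow(new_ids, new_doc)

print(old_ids)  # {'drug': 0, 'trial': 1, 'patient': 2, 'price': 3, 'market': 4}
print(bow_old)  # [(0, 2), (4, 1)]
print(new_ids)  # {'market': 0, 'drug': 1, 'vaccine': 2}
print(bow_new)  # [(0, 1), (1, 2), (2, 1)]
```

With the old mapping, id 0 still means "drug", exactly as the model learned it, and the unseen word "vaccine" is simply ignored. With the fresh mapping, id 0 means "market", so a model trained on the old ids would read the article as a different set of words.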
