update LdaMulticore/lda model with new document


Yuval Shachaf

Dec 7, 2016, 3:21:38 AM
to gensim
Hi there,

While trying to add a new document that includes new terms (ones that do not exist in the original dictionary), I realized that using add_documents will not work, because the model itself no longer agrees with the dictionary on the number of terms. (Am I wrong here? I think I would prefer this approach.)

However, while trying to use HashDictionary with default params in a simple LDA example (https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html), I ran into the following error:

C:\Python27\lib\site-packages\gensim\models\ldamodel.py:545: RuntimeWarning: overflow encountered in exp2
  (perwordbound, numpy.exp2(-perwordbound), len(chunk), corpus_words))

[(0, '0.000*"set([u\'mother\'])" + 0.000*"set([u\'brother\'])" + 0.000*"set([u\'drive\'])"'), (1, '0.000*"set([u\'blood\'])" + 0.000*"set([u\'caus\'])" + 0.000*"set([u\'tension\'])"'), (2, '0.000*"set([u\'brocolli\'])" + 0.000*"set([u\'eat\'])" + 0.000*"set([u\'good\'])"')]

And print_topics gives me all zeros.


The only line I have changed in that code is
dictionary = corpora.Dictionary(texts)
which I replaced with
dictionary = corpora.hashdictionary.HashDictionary(texts)


What am I doing wrong here?
My goal is to train an LDA multicore model (which is already done and working nicely) and then update the model sequentially with new docs.


Many thanks
Yuval





Yuval Shachaf

Dec 12, 2016, 1:29:04 AM
to gensim
Any ideas please?
thanks

Lev Konstantinovskiy

Dec 31, 2016, 11:54:49 AM
to gensim
Hi Yuval,

I am unable to reproduce it with the corpus that you linked, even with HashDictionary.

We have resolved overflow issues in a different part of the code (the word2vec sigmoid function), but it is not immediately obvious how to resolve them in a simple exp2 call. Are you on a 64-bit platform? What corpus are you using?
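
To make the overflow concrete: the warning line shows the per-word bound being turned into a perplexity via exp2(-perwordbound). A float64 can only hold powers of two below 2 ** 1024, so a very negative bound overflows. The helper below is a hypothetical, library-free sketch of one way to guard such a call; it is not gensim's code and not a proposed patch.

```python
import math
import sys


def perplexity_from_bound(per_word_bound):
    """Return 2 ** (-per_word_bound), or inf when it cannot fit in a float64."""
    exponent = -per_word_bound
    # sys.float_info.max_exp is 1024 for IEEE 754 doubles: 2.0 ** 1024 overflows.
    if exponent >= sys.float_info.max_exp:
        return math.inf
    return 2.0 ** exponent


print(perplexity_from_bound(-10.0))    # an ordinary bound gives a finite perplexity
print(perplexity_from_bound(-2000.0))  # an extreme bound is reported as inf
```

Returning inf only hides the warning; a bound that extreme usually means something upstream (for example the vocabulary size) deserves a look.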

Regards
Lev

Yuval Shachaf

Jan 1, 2017, 3:58:46 AM
to gen...@googlegroups.com
Hello Lev,
Many thanks for the reply.
I will put it more simply.
My understanding so far is that unless I use a HashDictionary (which I would prefer not to), I can neither update a model with, nor infer topics for, a new doc containing words that do not exist in the model.
In my project this is very likely to happen, since I'm doing LDA on tweets.
Is this correct?
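
As a rough illustration of why a hashed dictionary can accept unseen words at all, here is a standard-library sketch of the hashing trick. The names hashed_id and doc2bow_hashed are hypothetical, and the id space of 32000 with adler32 hashing is an assumption about HashDictionary's defaults, not something confirmed in this thread.

```python
import zlib

ID_RANGE = 32000  # assumed size of the hashed id space


def hashed_id(token, id_range=ID_RANGE):
    """Map any token, seen or unseen, to an integer id in [0, id_range)."""
    return zlib.adler32(token.encode("utf-8")) % id_range


def doc2bow_hashed(tokens, id_range=ID_RANGE):
    """Build a sorted bag-of-words [(id, count), ...] from hashed token ids."""
    counts = {}
    for token in tokens:
        token_id = hashed_id(token, id_range)
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


training_doc = ["brocolli", "good", "eat"]
new_doc = ["brocolli", "kale"]  # "kale" was never seen during training

bow_train = doc2bow_hashed(training_doc)
bow_new = doc2bow_hashed(new_doc)  # still valid: "kale" simply hashes to some id
print(bow_train)
print(bow_new)
```

Because every token gets some id, new words never change the number of terms the model expects; the price is that distinct words can collide on one id. If the id space really is that large, a tiny corpus spreads probability over thousands of mostly unused ids, which might be why the topics in the first post print as 0.000, though that is only a guess.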

Regards
Yuval


Lev Konstantinovskiy

Jan 1, 2017, 11:04:16 AM
to gensim
Hi Yuval,

Gensim LDA is a fixed-vocabulary technique. Once the model is trained, there is no way to increase the vocabulary. However, you can filter out the new out-of-vocabulary (OOV) words using VocabTransform.
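
To show what filtering OOV words amounts to, here is a plain-Python sketch that needs no library. It is a stand-in for the idea, not for VocabTransform itself; build_vocab and to_bow are hypothetical helper names. As far as I know, gensim's Dictionary.doc2bow drops unknown tokens in a similar way by default.

```python
def build_vocab(texts):
    """Assign consecutive integer ids to tokens in first-seen order."""
    vocab = {}
    for doc in texts:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return (bag_of_words, dropped_tokens) for one tokenized document."""
    counts = {}
    dropped = []
    for token in tokens:
        if token in vocab:
            token_id = vocab[token]
            counts[token_id] = counts.get(token_id, 0) + 1
        else:
            dropped.append(token)  # out of vocabulary: the model cannot use it
    return sorted(counts.items()), dropped


vocab = build_vocab([["mother", "brother", "drive"], ["blood", "tension"]])
bow, dropped = to_bow(["mother", "selfie", "mother", "blood"], vocab)
print(bow)      # only ids that exist in the training vocabulary
print(dropped)  # tokens that were filtered out
```

Keeping the list of dropped tokens is a deliberate choice here: on tweets it lets you track how much of each new document the fixed vocabulary actually covers.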

There is research on infinite vocabulary LDA but it is not implemented in Gensim.

Gensim algos that support vocabulary extension are two word embeddings: the FastText wrapper and word2vec. What is your use case? For example, you can use either of them to find similar documents by taking an average of the word vectors to represent a document.
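
A minimal sketch of the averaging idea, using made-up 3-dimensional vectors in place of a trained word2vec or FastText model. The toy_vectors values and the helper names doc_vector and cosine are hypothetical, and skipping unknown tokens is just a choice made for this toy.

```python
import math

# Hypothetical toy vectors; a trained embedding model would supply real ones.
toy_vectors = {
    "eat": [1.0, 0.0, 0.0],
    "good": [0.0, 1.0, 0.0],
    "blood": [0.0, 0.0, 1.0],
}


def doc_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_a = doc_vector(["eat", "good", "selfie"], toy_vectors)  # "selfie" is skipped
doc_b = doc_vector(["good", "eat"], toy_vectors)
doc_c = doc_vector(["blood"], toy_vectors)

print(cosine(doc_a, doc_b))  # same known words, so maximally similar
print(cosine(doc_a, doc_c))  # no shared direction in this toy space
```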

Let me know if you have further questions.

Regards
Lev

Unnati Arora

Mar 20, 2017, 3:07:09 AM
to gensim
Hey!
I have trained my LDA model with a few documents. Now I want to update my training set as I get new documents. How can I do that?

Lev Konstantinovskiy

Mar 27, 2017, 7:35:10 PM
to gensim
Hi Unnati,

You can update the model by calling `lda_model.update`. Please note that once the model is trained, there is no way to increase the vocabulary, so you need to filter out the new out-of-vocabulary (OOV) words using VocabTransform.

Regards
Lev