Gensim.Similarity Add document or Live training


Nitheen Rao T

Jan 23, 2018, 5:07:01 PM
to gensim
Hi Everyone,

A little background about this project: I have a corpus where each document has an identifier and text, e.g. { name: "sports-football", text: "Content related to football sports" }.

I need to find the best match for a given text input within this corpus.

I was able to achieve this, more or less, using gensim's Similarity class with LDA and LSI models.

My question is: how do I update the gensim Similarity index with a new document? The idea is to keep training the model live.

Here are the steps I followed.

QueryText = "Guardiola moved Lionel Messi to the No 9 role so that he didn't have to come deep and I think Aguero drops back into deeper positions too often."


Note: some of the code below is just pseudocode.

The index is created using:
similarities.Similarity(indexpath, model, topics)

1. Create a dictionary:
dictionary = Dictionary(QueryText)

2. Create a corpus:
corpus = Corpus(QueryText, dictionary)

3. Create an LDA model:
LDAModel = ldaModel(corpus, dictionary)

Then update the existing dictionary, model, and index:

Update existing dictionary
existing_dictionary.add_documents(dictionary)

Update existing LDA model
existing_lda_model.update(corpus)

Update existing Similarity index
existing_index.add_documents(LDAModel[corpus])

Apart from the warning below, the update seems to have worked:
gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
  perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

Let's run the similarity for the query text:
vec_bow = dictionary.doc2bow(QueryText)
vec_model = existing_lda_model[vec_bow]
sims = existing_index[vec_model]

However, it failed with the error below.

Similarity index with 723 documents in 1 shards (stored under ..\Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under ..\Files\models\lda_model)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-32-dd0e855dc48a> in <module>()
---> 51 sims = lda_index[vec_model]
     52 sims = sorted(enumerate(sims), key=lambda item: -item[1])

~\Anaconda3\envs\lf\lib\site-packages\gensim\similarities\docsim.py in __getitem__(self, query)
--> 319         self.close_shard()  # no-op if no documents added to index since last query

~\Anaconda3\envs\lf\lib\site-packages\gensim\similarities\docsim.py in close_shard(self)
    265         if issparse:
    266             index = SparseMatrixSimilarity(
--> 267                 self.fresh_docs, num_terms=self.num_features, num_docs=len(self.fresh_docs), num_nnz=self.fresh_nnz
    268             )

~\Anaconda3\envs\lf\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_features, num_terms, num_docs, num_nnz, num_best, chunksize, dtype, maintain_sparsity)
    691         self.index = matutils.corpus2csc(
    692             corpus, num_terms=num_terms, num_docs=num_docs, num_nnz=num_nnz,
--> 693             dtype=dtype, printprogress=10000
    694         ).T

~\Anaconda3\envs\lf\lib\site-packages\gensim\matutils.py in corpus2csc(corpus, num_terms, dtype, num_docs, num_nnz, printprogress)
---> 96             assert posnow == num_nnz, "mismatch between supplied and computed number of non-zeros"

AssertionError: mismatch between supplied and computed number of non-zeros


I'd really appreciate any help with this.

Looking forward to awesome replies.

Thanks,
Nitheen 

Nitheen Rao T

Jan 23, 2018, 6:05:45 PM
to gensim
Here is a new error I am getting after I pass a larger text.

Similarity index with 723 documents in 1 shards (stored under \Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under \Files\models\lda_model)
\gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
  perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-8fe711724367> in <module>()
     45 trigram = Trigram.apply_trigram_model(queryText, bigram, trigram)
     46 vec_bow = dictionry.doc2bow(trigram)
---> 47 vec_model = lda_model[vec_bow]
     48 print(vec_model)

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in __getitem__(self, bow, eps)
-> 1105         return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in get_document_topics(self, bow, minimum_probability, minimum_phi_value, per_word_topics)
--> 946         gamma, phis = self.inference([bow], collect_sstats=per_word_topics)

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
--> 444                 expElogbetad = self.expElogbeta[:, ids]

IndexError: index 718 is out of bounds for axis 1 with size 713

Ivan Menshikh

Jan 24, 2018, 2:03:08 AM
to gensim
Hello Nitheen,

The problem happens here:
 
Update existing dictionary, model, and index

Update existing dictionary
existing_dictionary.add_document(dictionary)

Update existing LDA Model
existing_lda_model.update(corpus)

The LDA model has a "fixed" vocabulary, i.e. you shouldn't update it after training.
If you want to feed new documents to the LDA model, use the dictionary you already passed when creating it (there is no need to update it).

Nitheen Rao T

Jan 24, 2018, 9:06:17 AM
to gensim
Hey Ivan,

As per your suggestion, I tried the following; note that this time I haven't updated the existing LDA model.

new_dictionary = Dictionary(QueryText)
new_corpus = Corpus(QueryText, dictionary)
new_modelLDA = lda.create_model(new_dictionary, new_corpus)

existing_dictionary.add_documents(new_dictionary)
exiting_lda_index.add_documents(new_modelLDA[new_corpus])

vec_bow = existing_dictionary.doc2bow(trigram)
vec_model = exiting_lda_model[vec_bow]
sims = exiting_lda_index[vec_model]

The error I am getting:

gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
<gensim.interfaces.TransformedCorpus object at 0x000002A501258240>
Traceback (most recent call last):
  File "NLPServer\test.py", line 46, in <module>
    vec_model = exiting_lda_model[vec_bow]
  File "\gensim\models\ldamodel.py", line 1105, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "\gensim\models\ldamodel.py", line 946, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "\gensim\models\ldamodel.py", line 444, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 713 is out of bounds for axis 1 with size 713


Ivan Menshikh

Jan 24, 2018, 9:10:19 AM
to gensim
Can you share the full version of the code? I think the problem is that you update the dictionary and then use it with an LDA model that was created with the old, non-updated dictionary.

Nitheen Rao T

Jan 24, 2018, 10:21:36 AM
to gensim
Sure, I will post it this afternoon. Thank you so much for helping here.

Nitheen Rao T

Jan 24, 2018, 5:16:56 PM
to gensim
Hi Ivan,

I have created a Git repo with all the code. It also runs if you comment out the section below in the code.

# ========================================
# Train the new document with existing one
# ========================================
GensimUtil.add_doc_to_dictionary(dictionary, new_dictionary)
lda_model.update(new_corpus)
lda_index.add_documents(new_ldaModel[new_corpus])


GitHub repo path.

Thanks in advance. I hope you can solve this issue.

Nitheen Rao T

Jan 25, 2018, 11:12:19 AM
to gensim
Hi Ivan,

Did you get a chance to look at my GitHub repo?

Thanks,
Nitheen

Ivan Menshikh

Jan 26, 2018, 2:27:13 AM
to gensim
I looked into it.

https://github.com/nithinshiriya/NLPLive/blob/e3e2b7aba87524dd3a89a8fe77602800d6ee2ca8/server.py#L24: you don't need lines 24-27; you shouldn't construct a new model/vocab there, because you already have them.

You only need to update `lda_model`, using `dictionary` for doc2bow, and add `lda_model[...]` to `lda_index`.

Nitheen Rao T

Jan 26, 2018, 8:27:35 AM
to gensim
Hi Ivan,

Thanks for the input.

However, I tried it as per your suggestion and still had no luck. Please see what I did.

TrainingPhrase ="Knowing about the progress and performance of a model, as we train them, could be very helpful in understanding it’s learning process and makes it easier to debug and optimize them. In this notebook, we will learn how to visualize training statistics for LDA topic model in gensim. To monitor the training, a list of Metrics is passed to the LDA function call for plotting their values live as the training progresses."
documents = []
documents.append(TrainingPhrase)
texts = [[word for word in document.lower().split()] for document in documents]
new_doc2bow = [dictionary.doc2bow(text) for text in texts]
lda_model.update(new_doc2bow)
lda_index.add_documents(lda_model[new_doc2bow])


Traceback (most recent call last):
  File "NLPLive\server.py", line 48, in <module>
    sims = lda_index[vec_model]
  File "\gensim\similarities\docsim.py", line 319, in __getitem__
    self.close_shard()  # no-op if no documents added to index since last query
  File "\gensim\similarities\docsim.py", line 267, in close_shard
    self.fresh_docs, num_terms=self.num_features, num_docs=len(self.fresh_docs), num_nnz=self.fresh_nnz
  File "\gensim\similarities\docsim.py", line 693, in __init__
    dtype=dtype, printprogress=10000
  File "\gensim\matutils.py", line 92, in corpus2csc
    indices[posnow: posnext] = [feature_id for feature_id, _ in doc]
ValueError: cannot copy sequence with size 2 to array axis with dimension 1

Ivan Menshikh

Jan 29, 2018, 2:42:39 AM
to gensim
Can you share all of the files that you load? I'll try to reproduce your problem.

Nitheen Rao T

Jan 29, 2018, 10:30:35 AM
to gensim
Hi Ivan,

I have shared all the code, as well as the models, in the GitHub repo. Let me know if you have any issues accessing it.


Thanks,
Nitheen

Ivan Menshikh

Jan 30, 2018, 2:07:40 AM
to gensim
Unfortunately, you didn't add the model files to the repo, so I can't run it:

Traceback (most recent call last):
  File "server.py", line 34, in <module>
    lda_index.add_documents(new_ldaModel[new_corpus])
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/similarities/docsim.py", line 226, in add_documents
    self.reopen_shard()
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/similarities/docsim.py", line 283, in reopen_shard
    last_index = last_shard.get_index()
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/similarities/docsim.py", line 113, in get_index
    self.index = self.cls.load(self.fullname(), mmap='r')
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/utils.py", line 281, in load
    obj = unpickle(fname)
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/utils.py", line 930, in unpickle
    with smart_open(fname, 'rb') as f:
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 169, in smart_open
    parsed_uri = ParseUri(uri)
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 432, in __init__
    raise NotImplementedError("unknown URI scheme %r in %r" % (self.scheme, uri))
NotImplementedError: unknown URI scheme 'e' in 'E:\\Leaflet\\Development\\Python\\NLPServer\\Files\\models/lda_model.0'

Radim Řehůřek

Jan 30, 2018, 5:28:56 AM
to gensim
Hi Nitheen and anyone following this thread,

if you mean to use NLP similarities for any serious work, your best bet is our commercial similarity engine, scaletext.com. It takes care of stuff like adding and deleting documents, index versioning and revisions, sharding and distributed model building etc.

Building a resilient, high-performance similarity engine is not as trivial as it may sound at first.

Best,
Radim