Gensim.Similarity Add document or Live training


Nitheen Rao T

Jan 23, 2018, 5:07:01 PM
to gensim
Hi Everyone,

A little background about this project: I have a corpus where each document has an identifier and text, e.g. { name: "sports-football", text: "Content related to football sports" }.

I need to find the best match for a given text input within this corpus.

I was able to achieve this, more or less, using gensim's Similarity class with LDA and LSI models.

My question is: how do I update the gensim Similarity index with a new document? The idea is to keep training the model live.

Here are the steps I followed.

QueryText = "Guardiola moved Lionel Messi to the No 9 role so that he didn't have to come deep and I think Aguero drops back into deeper positions too often."


Note: some of the code below is just pseudocode.

The index is created using:
similarities.Similarity(indexpath, model, topics)

1. Create a dictionary:
dictionary = Dictionary(QueryText)

2. Create a corpus:
corpus = Corpus(QueryText, dictionary)

3. Create an LDA model:
LDAModel = ldaModel(corpus, dictionary)

Then update the existing dictionary, model, and index:

Update existing dictionary
existing_dictionary.add_documents(dictionary)

Update existing LDA model
existing_lda_model.update(corpus)

Update existing Similarity index
existing_index.add_documents(LDAModel[corpus])

Apart from the warning below, the update seems to have worked:
gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
  perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

Let's run the similarity for the query text:
vec_bow = dictionary.doc2bow(QueryText)
vec_model = existing_lda_model[vec_bow]
sims = existing_index[vec_model]

However, it failed with the error below.

Similarity index with 723 documents in 1 shards (stored under ..\Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under ..\Files\models\lda_model)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-32-dd0e855dc48a> in <module>()
---> 51 sims = lda_index[vec_model]
     52 sims = sorted(enumerate(sims), key=lambda item: -item[1])

~\Anaconda3\envs\lf\lib\site-packages\gensim\similarities\docsim.py in __getitem__(self, query)
--> 319         self.close_shard()  # no-op if no documents added to index since last query

~\Anaconda3\envs\lf\lib\site-packages\gensim\similarities\docsim.py in close_shard(self)
    265         if issparse:
    266             index = SparseMatrixSimilarity(
--> 267                 self.fresh_docs, num_terms=self.num_features, num_docs=len(self.fresh_docs), num_nnz=self.fresh_nnz
    268             )

~\Anaconda3\envs\lf\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_features, num_terms, num_docs, num_nnz, num_best, chunksize, dtype, maintain_sparsity)
    691         self.index = matutils.corpus2csc(
    692             corpus, num_terms=num_terms, num_docs=num_docs, num_nnz=num_nnz,
--> 693             dtype=dtype, printprogress=10000
    694         ).T

~\Anaconda3\envs\lf\lib\site-packages\gensim\matutils.py in corpus2csc(corpus, num_terms, dtype, num_docs, num_nnz, printprogress)
---> 96             assert posnow == num_nnz, "mismatch between supplied and computed number of non-zeros"

AssertionError: mismatch between supplied and computed number of non-zeros


I'd really appreciate any help with this.

Looking forward to awesome replies.

Thanks,
Nitheen 

Nitheen Rao T

Jan 23, 2018, 6:05:45 PM
to gensim
Here is a new error I am getting after I pass a larger text.

Similarity index with 723 documents in 1 shards (stored under \Files\models\lda_model)
Similarity index with 725 documents in 0 shards (stored under \Files\models\lda_model)
\gensim\models\ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
  perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-8fe711724367> in <module>()
     45 trigram = Trigram.apply_trigram_model(queryText, bigram, trigram)
     46 vec_bow = dictionry.doc2bow(trigram)
---> 47 vec_model = lda_model[vec_bow]
     48 print(vec_model)

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in __getitem__(self, bow, eps)
-> 1105         return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in get_document_topics(self, bow, minimum_probability, minimum_phi_value, per_word_topics)
--> 946         gamma, phis = self.inference([bow], collect_sstats=per_word_topics)

~\Anaconda3\envs\lf\lib\site-packages\gensim\models\ldamodel.py in inference(self, chunk, collect_sstats)
--> 444                 expElogbetad = self.expElogbeta[:, ids]

IndexError: index 718 is out of bounds for axis 1 with size 713

Ivan Menshikh

Jan 24, 2018, 2:03:08 AM
to gensim
Hello Nitheen,

The problem happens here:
 
Update existing dictionary, model, and index

Update existing dictionary
existing_dictionary.add_document(dictionary)

Update existing LDA Model
existing_lda_model.update(corpus)

The LDA model has a "fixed" vocabulary, i.e. you shouldn't update it after training.
If you want to feed new documents to the LDA model, use the dictionary you already passed when creating it (there is no need to update it).

Nitheen Rao T

Jan 24, 2018, 9:06:17 AM
to gensim
Hey Ivan,

As per your suggestion, I tried the following; note that this time I haven't updated the existing LDA model.

new_dictionary = Dictionary(QueryText)
new_corpus = Corpus(QueryText, dictionary)
new_modelLDA = lda.create_model(new_dictionary, new_corpus)

existing_dictionary.add_documents(new_dictionary)
exiting_lda_index.add_documents(new_modelLDA[new_corpus])

vec_bow = existing_dictionary.doc2bow(trigram)
vec_model = exiting_lda_model[vec_bow]
sims = exiting_lda_index[vec_model]

The error I am getting:

gensim\utils.py:862: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
<gensim.interfaces.TransformedCorpus object at 0x000002A501258240>
Traceback (most recent call last):
  File "NLPServer\test.py", line 46, in <module>
    vec_model = exiting_lda_model[vec_bow]
  File "\gensim\models\ldamodel.py", line 1105, in __getitem__
    return self.get_document_topics(bow, eps, self.minimum_phi_value, self.per_word_topics)
  File "\gensim\models\ldamodel.py", line 946, in get_document_topics
    gamma, phis = self.inference([bow], collect_sstats=per_word_topics)
  File "\gensim\models\ldamodel.py", line 444, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 713 is out of bounds for axis 1 with size 713


Ivan Menshikh

Jan 24, 2018, 9:10:19 AM
to gensim
Can you share the full version of the code? I think the problem is that you update the dictionary and then use it with an LDA model that was created with the old, non-updated dictionary.

Nitheen Rao T

Jan 24, 2018, 10:21:36 AM
to gensim
Sure, I will post it this afternoon. Thank you so much for helping here.

Nitheen Rao T

Jan 24, 2018, 5:16:56 PM
to gensim
Hi Ivan,

I have created a Git repo with all the code. It also runs if you comment out the section below in the code.

# ========================================
# Train the new document with existing one
# ========================================
GensimUtil.add_doc_to_dictionary(dictionary, new_dictionary)
lda_model.update(new_corpus)
lda_index.add_documents(new_ldaModel[new_corpus])


GitHub repo path.

Thanks in advance. I hope you can solve this issue.

Nitheen Rao T

Jan 25, 2018, 11:12:19 AM
to gensim
Hi Ivan,

Did you get a chance to look at my GitHub repo?

Thanks,
Nitheen

Ivan Menshikh

Jan 26, 2018, 2:27:13 AM
to gensim
I looked into it.

https://github.com/nithinshiriya/NLPLive/blob/e3e2b7aba87524dd3a89a8fe77602800d6ee2ca8/server.py#L24: you don't need lines 24-27; you shouldn't construct a new model/vocab there, because you already have them.

You only need to update `lda_model`, using `dictionary` for doc2bow, and add `lda_model[...]` to `lda_index`.

Nitheen Rao T

Jan 26, 2018, 8:27:35 AM
to gensim
Hi Ivan,

Thanks for the input.

However, I tried it as per your suggestion and still had no luck. Please see what I did.

TrainingPhrase ="Knowing about the progress and performance of a model, as we train them, could be very helpful in understanding it’s learning process and makes it easier to debug and optimize them. In this notebook, we will learn how to visualize training statistics for LDA topic model in gensim. To monitor the training, a list of Metrics is passed to the LDA function call for plotting their values live as the training progresses."
documents = []
documents.append(TrainingPhrase)
texts = [[word for word in document.lower().split()] for document in documents]
new_doc2bow = [dictionary.doc2bow(text) for text in texts]
lda_model.update(new_doc2bow)
lda_index.add_documents(lda_model[new_doc2bow])


Traceback (most recent call last):
  File "NLPLive\server.py", line 48, in <module>
    sims = lda_index[vec_model]
  File "\gensim\similarities\docsim.py", line 319, in __getitem__
    self.close_shard()  # no-op if no documents added to index since last query
  File "\gensim\similarities\docsim.py", line 267, in close_shard
    self.fresh_docs, num_terms=self.num_features, num_docs=len(self.fresh_docs), num_nnz=self.fresh_nnz
  File "\gensim\similarities\docsim.py", line 693, in __init__
    dtype=dtype, printprogress=10000
  File "\gensim\matutils.py", line 92, in corpus2csc
    indices[posnow: posnext] = [feature_id for feature_id, _ in doc]
ValueError: cannot copy sequence with size 2 to array axis with dimension 1

Ivan Menshikh

Jan 29, 2018, 2:42:39 AM
to gensim
Can you share all of the files that you load? I'll try to reproduce your problem.

Nitheen Rao T

Jan 29, 2018, 10:30:35 AM
to gensim
Hi Ivan,

I have shared all the code, as well as the models, in the GitHub repo. Let me know if you have any issues accessing it.


Thanks,
Nitheen

Ivan Menshikh

Jan 30, 2018, 2:07:40 AM
to gensim
Unfortunately, you didn't add the model files to the repo, so I can't run it:

Traceback (most recent call last):
  File "server.py", line 34, in <module>
    lda_index.add_documents(new_ldaModel[new_corpus])
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/similarities/docsim.py", line 226, in add_documents
    self.reopen_shard()
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/similarities/docsim.py", line 283, in reopen_shard
    last_index = last_shard.get_index()
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/similarities/docsim.py", line 113, in get_index
    self.index = self.cls.load(self.fullname(), mmap='r')
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/utils.py", line 281, in load
    obj = unpickle(fname)
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/gensim/utils.py", line 930, in unpickle
    with smart_open(fname, 'rb') as f:
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 169, in smart_open
    parsed_uri = ParseUri(uri)
  File "/home/ivan/.virtualenvs/p3/lib/python3.6/site-packages/smart_open/smart_open_lib.py", line 432, in __init__
    raise NotImplementedError("unknown URI scheme %r in %r" % (self.scheme, uri))
NotImplementedError: unknown URI scheme 'e' in 'E:\\Leaflet\\Development\\Python\\NLPServer\\Files\\models/lda_model.0'

Radim Řehůřek

Jan 30, 2018, 5:28:56 AM
to gensim
Hi Nitheen and anyone following this thread,

if you mean to use NLP similarities for any serious work, your best bet is our commercial similarity engine, scaletext.com. It takes care of stuff like adding and deleting documents, index versioning and revisions, sharding and distributed model building etc.

Building a resilient, high-performance similarity engine is not as trivial as it may sound at first.

Best,
Radim