WARNING:gensim.similarities.docsim:scanning corpus to determine the number of features (consider setting `num_features` explicitly)


Jeet Dadhich

Oct 26, 2016, 6:38:13 AM
to gensim
Hi Team,

Please help me out here.
I don't understand what is happening in the backend: why does this warning appear, and why has the program made no further progress after 16 hours?
The corpus has 501,513 documents, and the system has 16 GB of RAM.

# Code
lda = models.LdaModel(corpus, num_topics=250,
                      id2word=dictionary,
                      update_every=1000,
                      chunksize=10000,
                      passes=1, random_state=50513,
                      minimum_probability=0,
                      decay=0.5)
lda.save('lda_step1.model')
print(lda)
tac()  # user-defined timing helper
print("FF")
print(lda.gamma_threshold)
model_lda_step1 = models.LdaModel.load("lda_step1.model")
print(model_lda_step1.id2word)
print("FFk")
print(model_lda_step1.show_topics())
print("FFi")
print(model_lda_step1.minimum_probability)
print(model_lda_step1.alpha)
print(model_lda_step1)
tic()  # user-defined timing helper
index1 = similarities.MatrixSimilarity(model_lda_step1[corpus])  # this line takes a very long time to complete
The Python console shows this message: WARNING:gensim.similarities.docsim:scanning corpus to determine the number of features (consider setting `num_features` explicitly)

JD

New_complete_Phase1.py

Lev Konstantinovskiy

Oct 28, 2016, 3:24:38 AM
to gensim
Hi Jeet,

How many words are in your vocabulary?
Have you tried the Similarity class as shown in this tutorial? Maybe your machine is running out of RAM.

If you are looking for KNN, then I suggest adding an LDA interface to our Annoy integration. That library is very fast.

Regards
Lev

Jeet Dadhich

Nov 9, 2016, 5:14:22 AM
to gen...@googlegroups.com
Thanks Lev,
I have made the change and now pass the parameters as below:
index1 = similarities.docsim.MatrixSimilarity(vec_Lda, num_features=300, chunksize=10000)
Now my code gives a memory error on the line below:
sims = index1[lda[corpus]]

I have 16 GB of RAM on the system, running Windows 10 with Anaconda Spyder on the latest Python version.


Lev Konstantinovskiy

Nov 9, 2016, 5:30:26 AM
to gensim
Hi Jeet,

There is a warning in this tutorial that might be helpful.

Warning: The class similarities.MatrixSimilarity is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2GB of RAM in a 256-dimensional LSI space, when used with this class. Without 2GB of free RAM, you would need to use the similarities.Similarity class. This class operates in fixed memory, by splitting the index across multiple files on disk, called shards. It uses similarities.MatrixSimilarity and similarities.SparseMatrixSimilarity internally, so it is still fast, although slightly more complex.

Regards
Lev

Jeet Dadhich

Nov 9, 2016, 5:57:58 AM
to gen...@googlegroups.com
Hi Lev,
Thanks, I have read that, but it is quite difficult to understand. Could you please share a small code example showing how to implement it?
Regards
Jitendra

Jeet Dadhich

Nov 10, 2016, 3:24:44 AM
to gen...@googlegroups.com
Hi Lev,

index1 = similarities.docsim.MatrixSimilarity(lda[corpus_enron], num_features=100, chunksize=256)  # takes a long time

sims = index1[lda[corpus]]
The process fails on this line when I use 100,000 records with the parameters above, while the same code runs over 10,000 documents in a few minutes. Quite strange, since I have 16 GB of RAM on the system, which should be good enough per the gensim warning.

Any suggestion please
Additional Information :

INFO:gensim.models.ldamodel:-8.019 per-word bound, 259.5 perplexity estimate based on a held-out corpus of 100 documents with 14102 words

INFO:gensim.corpora.dictionary:built Dictionary(219787 unique tokens: ['tzurich', 'mediaplex', 'sistent', 'spptrcu', 'rvnssoaxrtwxxw']...) from 100000 documents (total 19823040 corpus positions)
INFO:gensim.corpora.dictionary:discarding 119787 tokens: [('smalll', 1), ('tooo', 1), ('photo_of_day', 2), ('homefield', 1), ('deadening', 1), ('bossing', 1), ('labrum', 1), ('tclupinski', 1), ('dpdenver', 1), ('jshana', 1)]...
INFO:gensim.corpora.dictionary:keeping 100000 tokens which were in no less than 0 and no more than 80000 (=80.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(100000 unique tokens: ['tzurich', 'mediaplex', 'sistent', 'spptrcu', 'mucus']...)

regards
JD

Jeet Dadhich

Nov 24, 2016, 12:14:15 AM
to gen...@googlegroups.com
Hi Lev,


Gensim creates the index with all the similarities; however, while performing the query it raises a MemoryError, and when I checked, it points to a numpy matrix issue.

Code which shows the error:

sims = index1[model_lda_step1[wtcprpus]]  # perform a similarity query against the corpus
The line above is what fails.

After your suggestion I tried the Similarity class with a shard size of 30,000; it runs well and creates shards 0-14.
If I check the index, 15 × 30,000 = 450,000 is what I get from the index matrix.
Then I tried with 60K; it creates shards 0-7, and 8 × 60,000 = 480,000.
It is not covering the entire data set, and I don't know where I am missing something.
Could you please suggest how I can cover the entire corpus with this Similarity class.

This is what I am using to create the index:

index1 = Similarity("50k_index.index",model_lda_step1[wtcprpus], num_features=model_lda_step1.num_terms,shardsize=50000)

JD

