Hi Lev,
index1 = similarities.docsim.MatrixSimilarity(lda[corpus_enron], num_features=100, chunksize=256)  # takes a long time
sims = index1[lda[corpus_enron]]
The process fails on this line when I use 100,000 records with the parameters above, while the same code runs over 10,000 documents in a few minutes. Quite strange, since I have 16 GB RAM on the system, which should be good enough per the gensim warning.
Any suggestions, please?
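For what it's worth, a rough memory estimate (this is an assumption about the failure mode, not a confirmed diagnosis): querying the index with the entire transformed corpus at once, as in sims = index1[lda[corpus]], asks gensim to materialize a full n_docs x n_docs similarity matrix, on top of the n_docs x num_features index itself. At 100,000 documents the result matrix alone would not fit in 16 GB, while at 10,000 documents it easily does:

```python
# Back-of-envelope memory estimate for MatrixSimilarity queried with the
# whole corpus. Assumes float32 (4 bytes) per similarity value, which is
# what gensim uses for dense similarity indexes.
n_docs = 100_000
num_features = 100
bytes_per_float = 4

index_bytes = n_docs * num_features * bytes_per_float   # the index itself
result_bytes = n_docs * n_docs * bytes_per_float        # all-pairs similarities

print(f"index:  {index_bytes / 2**30:.2f} GiB")   # ~0.04 GiB, fits easily
print(f"result: {result_bytes / 2**30:.2f} GiB")  # ~37 GiB, exceeds 16 GB RAM

# For comparison, the 10,000-document run that succeeded:
small_result = 10_000 * 10_000 * bytes_per_float
print(f"10k result: {small_result / 2**30:.2f} GiB")  # ~0.37 GiB
```

If this estimate is right, querying in smaller batches (or one document at a time) instead of `index1[lda[corpus]]` would keep only a slice of the result matrix in memory at once.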
Additional Information :
INFO:gensim.models.ldamodel:-8.019 per-word bound, 259.5 perplexity estimate based on a held-out corpus of 100 documents with 14102 words
INFO:gensim.corpora.dictionary:built Dictionary(219787 unique tokens: ['tzurich', 'mediaplex', 'sistent', 'spptrcu', 'rvnssoaxrtwxxw']...) from 100000 documents (total 19823040 corpus positions)
INFO:gensim.corpora.dictionary:discarding 119787 tokens: [('smalll', 1), ('tooo', 1), ('photo_of_day', 2), ('homefield', 1), ('deadening', 1), ('bossing', 1), ('labrum', 1), ('tclupinski', 1), ('dpdenver', 1), ('jshana', 1)]...
INFO:gensim.corpora.dictionary:keeping 100000 tokens which were in no less than 0 and no more than 80000 (=80.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(100000 unique tokens: ['tzurich', 'mediaplex', 'sistent', 'spptrcu', 'mucus']...)
regards
JD