KL divergence in doc_sim module of new gensim 0.8.0


Shivani

Aug 15, 2011, 12:18:01 PM
to gensim
The similarities class uses cosine similarity only. What about
Hellinger distance or KL-divergence-based measures for comparing LDA
documents? I know this question was asked before, but my question is
about implementing them as part of gensim's Similarity class.

I scanned the code of the Similarity class and could not really find
the place where the cosine similarity is calculated.

Thanks in advance for your help,
Shivani

Radim

Aug 16, 2011, 9:17:58 AM
to gensim
The Similarity class doesn't support this. The whole class is built
around cosine similarity (=matrix multiplications).
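To illustrate what "cosine similarity = matrix multiplications" means, here is a rough standalone sketch in plain numpy (toy data, not gensim's actual internals): normalize the index rows and the query to unit length, and a single matrix-vector product yields all cosine similarities at once.

import numpy

# Toy dense index: one row per document, one column per feature.
index = numpy.array([[1.0, 2.0, 0.0],
                     [0.0, 1.0, 1.0],
                     [3.0, 0.0, 1.0]])
query = numpy.array([1.0, 1.0, 0.0])

# L2-normalize rows and the query; one matrix-vector product then gives all cosines.
index_norm = index / numpy.linalg.norm(index, axis=1)[:, None]
query_norm = query / numpy.linalg.norm(query)
cosine_sims = index_norm.dot(query_norm)   # shape: (num_documents,)
print(cosine_sims)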

There used to be code in gensim that allowed you to use an arbitrary
function to determine similarity. It just did a linear scan, calling
the supplied metric function on each index document and the query.

It was removed because it was too slow (and trivial). You can just do
`sims = [(document, my_sim_fnc(document, query)) for document in
index]`.
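For example, a minimal runnable sketch of that linear-scan pattern, with a hypothetical my_sim_fnc (here a smoothed negative KL divergence on toy dense topic vectors; none of this is gensim API):

import numpy

def my_sim_fnc(p, q, eps=1e-12):
    """Hypothetical metric: negative KL divergence D(p || q), smoothed so log(0) never occurs."""
    p = numpy.asarray(p, dtype=float) + eps
    q = numpy.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return -numpy.sum(p * numpy.log(p / q))

# toy "index": one dense topic distribution per document; `query` is one more
index = [numpy.array([0.7, 0.2, 0.1]),
         numpy.array([0.1, 0.1, 0.8]),
         numpy.array([0.3, 0.4, 0.3])]
query = numpy.array([0.6, 0.3, 0.1])

sims = [(docno, my_sim_fnc(document, query)) for docno, document in enumerate(index)]
print(sorted(sims, key=lambda item: -item[1]))  # least divergent (most similar) first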

Best,
Radim

Shivani

Aug 17, 2011, 12:26:50 PM
to gensim
Hello Radim,

That makes sense. KL divergence is asymmetric and messy to calculate.

I just wanted to know if this is possible and if this will work at
all...

According to notes on KL divergence for retrieval purposes:

http://times.cs.uiuc.edu/course/410s11/kldir.pdf

For ranking purposes, the KL divergence comes down to nothing but a
matrix product, since up to a constant that depends only on the query,

-KL(Q||D) = const(Q) + \sum_w p(w|Q) log p(w|D)

If this is the case, then given the topic representations of two
documents in the LDA model (which are probability distributions over
topics), KL divergence for retrieval would amount to a matrix product,
but only after taking a log of the document-topic matrix.

Do you think this would speed up calculations?
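Concretely, the idea would look something like this toy numpy sketch (illustrative only, not gensim's Similarity class): take the log of the document-topic matrix once, and each query's scores become a single matrix-vector product.

import numpy

# toy document-topic matrix: rows are documents, columns are topic probabilities p(t|D)
doc_topics = numpy.array([[0.7, 0.2, 0.1],
                          [0.1, 0.1, 0.8],
                          [0.3, 0.4, 0.3]])
query = numpy.array([0.6, 0.3, 0.1])    # p(t|Q)

eps = 1e-12                             # avoid log(0) for zero topic weights
log_docs = numpy.log(doc_topics + eps)  # precomputed once for the whole index

# score documents by sum_t p(t|Q) * log p(t|D); ranking by this is the same as
# ranking by -KL(Q||D), because the query's entropy term is constant
scores = log_docs.dot(query)            # one matrix-vector product per query
print(scores.argsort()[::-1])           # document indices, most similar first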

Shivani

Radim

Aug 18, 2011, 5:15:17 PM
to gensim
Yes, that would definitely speed up calculations! In fact, if you
managed to formulate the similarity metric in terms of matrix
products, you could still use the Similarity class and all the
scaffolding that comes with it.

If you manage to code it up, please consider contributing the code
back to gensim. Don't worry about cleaning up & optimizing the code
too much. As soon as you have working code and some examples to go
with it, we can all collectively improve on it.

Best,
Radim

Clint P. George

Jan 24, 2013, 12:13:23 PM
to gen...@googlegroups.com
Hello Radim: 

I see a conversation on including KL divergence as part of the similarity class. Is there any update on this? Thanks! 

Clint

Radim Řehůřek

Jan 26, 2013, 6:41:28 AM
to gensim
Hello Clint,

On Jan 24, 6:13 pm, "Clint P. George" <clin...@gmail.com> wrote:
>
> I see a conversation on including KL divergence as part of the similarity
> class. Is there any update on this? Thanks!

'Fraid not. The issue is still open: https://github.com/piskvorky/gensim/issues/64, contributions welcome.

Best,
Radim

Clint P. George

Feb 7, 2013, 11:14:46 AM
to gen...@googlegroups.com
Hello Radim: 

I implemented the negative-KL (based on http://sifaka.cs.uiuc.edu/course/498cxz04f/kldir.pdf) for the similarity index; it can be seen in my branch (https://github.com/clintpgeorge/gensim). I modified utils.py and similarities/docsim.py. In the case of the SparseMatrixSimilarity class the change was a little tricky, because I had to take a log of the scipy.sparse.csc_matrix. I'm glad to improve this code if you have a better idea of how to do it. Thanks!
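A minimal sketch of one common way to take an elementwise log of a csc_matrix (a guess at the kind of workaround involved, not necessarily the actual change): transform only the stored nonzero values, since the sparse format's implicit zeros cannot represent log(0) = -inf.

import numpy
from scipy import sparse

# toy sparse matrix in CSC format; only nonzero entries are stored
mat = sparse.csc_matrix(numpy.array([[0.5, 0.0],
                                     [0.2, 0.9],
                                     [0.3, 0.1]]))

# an elementwise numpy.log of the whole sparse matrix isn't straightforward,
# but the stored nonzero values live in `.data`, so log just those
log_mat = mat.copy()
log_mat.data = numpy.log(log_mat.data)

print(log_mat.toarray())   # implicit zeros stay zero rather than becoming -inf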

-- Clint

Radim Řehůřek

Feb 7, 2013, 3:44:38 PM
to gensim
Hello Clint,

Cool! Do you have a test suite with known (correct) similarity
results? Something to compare your implementation against.

I see some inefficiencies there, but as long as the results are
correct, that can be worked on later, no problem :)

Radim


On Feb 7, 5:14 pm, "Clint P. George" <clin...@gmail.com> wrote:
> Hello Radim:
>
> I implemented the negative-KL (based on http://sifaka.cs.uiuc.edu/course/498cxz04f/kldir.pdf) for the similarity

suvir

Apr 10, 2014, 5:40:24 AM
to gen...@googlegroups.com
Hi,

I was testing KL distance on a small corpus. I added this patch to the gensim 0.9 branch. I'm getting some nan in the result with the Negative_KL case.

2014-04-10 11:25:08,301 : INFO : topic #1 (0.050): 0.142*scene + 0.063*film + 0.032*time + 0.031*prison + 0.029*school + 0.021*part + 0.021*prisoner + 0.021*type + 0.021*death + 0.021*deal
2014-04-10 11:25:08,301 : INFO : topic diff=0.134907, rho=0.316228
2014-04-10 11:25:08,313 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2014-04-10 11:25:08,401 : INFO : creating matrix for 111 documents and 20 features
/home/test/software/gensim_dev/gensim/gensim/similarities/docsim.py:561: RuntimeWarning: divide by zero encountered in log
  result = -numpy.dot(numpy.log(self.index), query.T).T # return #queries x #index
[(0, nan), (1, nan), (2, nan), (3, nan), (4, nan), (5, nan), (6, nan), (7, nan), (8, nan), (9, nan)]

This is how I'm calling it:

# smaller dataset
corpus = MyCorpus('/home/test/unsup_pre_small')  # create a dictionary
dictionary = corpus.dictionary
corpora.MmCorpus.serialize('testing_corpus.mm', corpus)
corpus = corpora.MmCorpus('testing_corpus.mm')

model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, update_every=1, chunksize=10000, passes=10, alpha=None)

doc = read_texts()  # some document
bow = dictionary.doc2bow(utils.simple_preprocess(doc))
vec_lda = model[bow]

#index = similarities.MatrixSimilarity(model[corpus], similarity_type=utils.SimilariyType.COSINE)
index = similarities.MatrixSimilarity(model[corpus], similarity_type=utils.SimilariyType.Negative_KL)
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims[:10])

The cosine similarity case works as expected. All other code is untouched.
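(Diagnosis, for what it's worth: the divide-by-zero warning means the index contains exact zeros, so numpy.log yields -inf and the dot product turns into NaN. Below is a minimal sketch of one common workaround, smoothing with a small epsilon before the log; this is illustrative only, not the actual patch.)

import numpy

index = numpy.array([[0.7, 0.3, 0.0],      # note the exact zero: log(0) = -inf
                     [0.2, 0.5, 0.3]])
query = numpy.array([0.6, 0.4, 0.0])

eps = 1e-12
smoothed = index + eps
smoothed /= smoothed.sum(axis=1)[:, None]  # renormalize rows to sum to 1

# same scoring expression as the docsim.py line above, but without -inf/NaN
result = -numpy.dot(numpy.log(smoothed), query.T).T
print(result)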

Suvir

Radim Řehůřek

Apr 10, 2014, 9:08:34 AM
to gen...@googlegroups.com, cli...@gmail.com
On Thursday, April 10, 2014 11:40:24 AM UTC+2, suvir wrote:
Hi,

I was testing KL distance on a small corpus. I added this patch to the gensim 0.9 branch. I'm getting some nan in the result with the Negative_KL case.

That's not "some NaNs", that's "all NaNs" :)

Clint, what was the status on this feature? I don't remember seeing a pull request, although the functionality is cool.

Best,
Radim

suvir

Apr 10, 2014, 10:21:48 AM
to gen...@googlegroups.com, cli...@gmail.com
There are no editing options for old posts in Google Groups, so adding this here.

I already have a basic KL_func that works OK for two documents, but I couldn't extend it to work against the whole corpus, i.e. given the corpus and a query, return similar documents.
def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) for discrete distributions
    Parameters
    ----------
    p, q : array-like, dtype=float, shape=n
        Discrete probability distributions.
    thanks to gist : https://gist.github.com/larsmans/3104581
    """
    p = np.asarray(p, dtype=np.float)
    q = np.asarray(q, dtype=np.float)
    sum_pq = np.sum(np.where(p != 0, p * np.log(p / q), 0))
    sum_qp = np.sum(np.where(q != 0, q * np.log(q / p), 0))
    return (sum_pq + sum_qp) / 2  # symmetric

# first doc
doc = read_texts(doc1)
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
vec_1 = model[bow]  # 'model' is used in mallet
vec_p = np.array(vec_1[0])
# second doc
doc = read_texts(doc2)
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(doc))
vec_2 = model[bow]
vec_q = np.array(vec_2[0])

KL = kl(vec_p[:,1], vec_q[:,1])
print KL


I also liked the other approach, mentioned by Radim earlier in this thread:

sims = [(document, my_sim_fnc(document, query)) for document in index]

Here query = vec_p or vec_q, but how do I get the index?

Suvir

Clint P. George

Apr 10, 2014, 10:38:14 AM
to m...@radimrehurek.com, gen...@googlegroups.com, suvir
Hey Radim,

I do not have a test suite for the changes I made to gensim, but you
can see the changes implementing KL divergence at
https://github.com/clintpgeorge/gensim.

I used this file, https://github.com/clintpgeorge/gensim/blob/develop/gensim_test.py,
for testing Negative KL.

Best,

Clint

Skipper Seabold

Apr 10, 2014, 10:48:14 AM
to gensim
On Thu, Apr 10, 2014 at 10:21 AM, suvir <hitl...@gmail.com> wrote:
> No editing options in old posts in google groups, so adding here.
>
> I already have a basic KL_func that works ok for two documents. but couldn't
> extended it to make it work against the whole corpus.
> i. e given the corpus and query, return similar documents.
> def kl(p, q):
> """Kullback-Leibler divergence D(P || Q) for discrete distributions
> Parameters
> ----------
> p, q : array-like, dtype=float, shape=n
> Discrete probability distributions.
> thanks to gist : https://gist.github.com/larsmans/3104581
> """
> p = np.asarray(p, dtype=np.float)
> q = np.asarray(q, dtype=np.float)
> sum_pq = np.sum(np.where(p != 0, p * np.log(p / q), 0))
> sum_qp = np.sum(np.where(q != 0, q * np.log(q / p), 0))
> return (sum_pq+sum_qp)/2 # symmetric
>

This isn't KL divergence anymore when you make it symmetric, right?
FWIW, if you want an information-theoretic measure that is also a
proper distance metric, you may want to use the (normalized) Hellinger
distance.

https://en.wikipedia.org/wiki/Hellinger_distance

Also, you can compute relative entropy using scipy.stats.entropy
(entropy is additive, so just sum the results for 2-D inputs). It may be
a little better at guarding against potential issues. See also
scipy.special.xlogy.
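For example, rough sketches of both suggestions on toy 1-D probability vectors (illustrative only):

import numpy as np
from scipy.stats import entropy
from scipy.special import xlogy

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# relative entropy (KL divergence) D(p || q); entropy() normalizes its inputs
kl_pq = entropy(p, q)

# the same quantity via xlogy, which returns 0 where p == 0 instead of nan
kl_manual = np.sum(xlogy(p, p) - xlogy(p, q))

# normalized Hellinger distance: a proper metric, bounded in [0, 1]
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(kl_pq, kl_manual, hellinger)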

Radim Řehůřek

Apr 10, 2014, 11:15:12 AM
to gen...@googlegroups.com


On Thursday, April 10, 2014 4:48:14 PM UTC+2, jseabold wrote:
> This isn't KL divergence anymore when you make it symmetric, right?

Yep.

There's an open GitHub issue for adding other similarity measures, with a useful link there.

Suvir: here `index` is simply an iterable of documents = a corpus. For example, for Hellinger distance, you'd do:

hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
sims = [(docno, hellinger(matutils.sparse2full(query, num_topics), matutils.sparse2full(doc, num_topics))) for docno, doc in enumerate(index_corpus)]

Now that's a naive "pairwise" implementation; optimizations are another matter.

HTH,
Radim






suvir

Apr 11, 2014, 11:36:10 AM
to gen...@googlegroups.com

Suvir: here `index` is simply an iterable of documents = a corpus. For example, for Hellinger distance, you'd do:

hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
sims = [(docno, hellinger(matutils.sparse2full(query, num_topics), matutils.sparse2full(doc, num_topics))) for docno, doc in enumerate(index_corpus)]

Now that's a naive "pairwise" implementation; optimizations are another matter.

HTH,
Radim


Thanks Radim. That worked!
For a small corpus (100 docs) it works. For a bigger corpus (50k), num_topics is not enough; I guess it can be replaced by the number of items in the dictionary. On the bigger corpus, the following works:

sims = [(docno, hellinger(matutils.sparse2full(vec_lda[0], 30000), matutils.sparse2full(doc, 30000))) for docno, doc in enumerate(corpus)]  # vec_lda[0] because lda_mallet returns the result wrapped in an extra [ ]

where 30000 is the number of dictionary items. I had to hard-code that number, because

len(corpus.dictionary.token2id.keys())

doesn't work. I used the dictionary size since I guess it's the same as num_features (from docsim).
Here is the error when using lda.num_topics on the big corpus:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-191-9d387738e53b> in <module>()
----> 1 execfile('hellinger_test.py')

/home/test/code/similarity/hellinger_test.py in <module>()
     57
     58 hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())
---> 59 sims = [(docno, hellinger(matutils.sparse2full(vec_lda, lda.num_topics), matutils.sparse2full(doc, lda.num_topics))) for docno, doc in enumerate(corpus)]
     60 #sims_hl = sorted(enumerate(sims_hl), key=lambda item: item[1])
     61 sims = sorted(sims, key=lambda x: x[1])

/home/test/software/gensim_dev/gensim/gensim/matutils.pyc in sparse2full(doc, length)
    191     doc = dict(doc)
    192     # overwrite some of the zeroes with explicit values
--> 193     result[list(doc)] = list(itervalues(doc))
    194     return result
    195

IndexError: index 2052 is out of bounds for size 200



I also tried Clint's test file for KL distance. That works, but when extending it to the bigger corpus, the values become inf. Log:


Regards
Suvir




 

Radim Řehůřek

Apr 11, 2014, 1:36:37 PM
to gen...@googlegroups.com
Hello Suvir,

On Friday, April 11, 2014 5:36:10 PM UTC+2, suvir wrote:
Thanks Radim. That worked!
For small corpus(100 docs), it works. For bigger corpus(50k), num_topics is not enough. i guess it can be replaced by number of items in dictionary. On bigger corpus, following works:
sims = [(docno, hellinger(matutils.sparse2full(vec_lda[0], 30000), matutils.sparse2full(doc, 30000))) for docno, doc in enumerate(corpus)]#vec_lda[0] as lda_mallet returns result with extra [ ].
where 30000 is number of dictionary items. I have to write hard coded number as 

this number has to match the #features of your `query` = #features of your `corpus`. Don't just put whatever number there :)

I.e., for LDA models, with `query = lda_model[bow]` and `corpus = lda_model[bow_corpus]`, this is `lda_model.num_topics`.

For BOW/TF-IDF models, with `query = tfidf_model[bow]` and `corpus = tfidf_model[bow_corpus]`, it is `len(dictionary)`.

In each case, the dimensionality of your query must match the corpus you're comparing it against... otherwise it's a conceptual mismatch. You can't directly compare an LDA vector to a BOW vector. Maybe this is the cause of your issues? Can you check both `query` and `corpus` were produced by the same pipeline?
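To make the two cases concrete, a small self-contained sketch (toy corpus; the point is only that the query and the index come out of the same model, with the matching num_features):

from gensim import corpora, models, matutils

# toy corpus, just to make both cases concrete
texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system"],
         ["graph", "minors", "trees"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
bow = dictionary.doc2bow(["human", "computer", "graph"])

# LDA space: query and index both go through the LDA model,
# so the shared dimensionality is the number of topics
lda_model = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2)
num_features = lda_model.num_topics
query_vec = matutils.sparse2full(lda_model[bow], num_features)
index_vecs = [matutils.sparse2full(doc, num_features) for doc in lda_model[bow_corpus]]

# TF-IDF space: query and index both go through the TF-IDF model,
# so the shared dimensionality is the vocabulary size
tfidf_model = models.TfidfModel(bow_corpus)
num_features = len(dictionary)
query_vec = matutils.sparse2full(tfidf_model[bow], num_features)
index_vecs = [matutils.sparse2full(doc, num_features) for doc in tfidf_model[bow_corpus]]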

 
I also tried Clint's test file for KL distance. that works but when extending it to bigger corpus, values becomes inf. Log:

I don't recall Clint's code anymore. I think having different similarity measures would be a great feature though.

If you have time, please check/fix the code, so we can merge it into gensim. It's on my backlog too, but god knows when I'll have the time... if you need it, it's best if you do it :)

Cheers,
Radim


 




Radim Řehůřek

Apr 11, 2014, 2:21:36 PM
to gen...@googlegroups.com
#vec_lda[0] as lda_mallet returns result with extra [ ].


This was an unrelated bug in LdaMallet. Thanks for letting me know, Suvir!

`mallet_model[vector]` always returned a corpus, even when its input was a single vector.


I also added unittests for LdaMallet (run only if the MALLET_HOME env var is set).

Cheers,
Radim

suvir

Apr 14, 2014, 9:12:21 AM
to gen...@googlegroups.com

this number has to match the #features of your `query` = #features of your `corpus`. Don't just put whatever number there :)

I.e., for LDA models, with `query = lda_model[bow]` and `corpus = lda_model[bow_corpus]`, this is `lda_model.num_topics`.

For BOW/TF-IDF models, with `query = tfidf_model[bow]` and `corpus = tfidf_model[bow_corpus], it is `len(dictionary)`.

In each case, the dimensionality of your query must match the corpus you're comparing it against... otherwise it's a conceptual mismatch. You can't directly compare an LDA vector to a BOW vector. Maybe this is the cause of your issues? Can you check both `query` and `corpus` were produced by the same pipeline?

Thanks Radim for explaining the dimensionality of the comparison. I think I have it correct now, as both vec_lda (from the query doc) and lda_model[doc] belong to the same 200-dimensional (num_topics) space.
Below is the code.

import logging
import os
import numpy
from gensim import corpora, models, utils, matutils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# query read function
def read_texts():
    file = open('/home/sample/sample.txt', 'r')
    df = file.read()
    return df

# reading the corpus
def iter_documents(reuters_dir):
    """Iterate over Reuters documents, yielding one document at a time."""
    for fname in os.listdir(reuters_dir):
        # read each document as one big string
        document = open(os.path.join(reuters_dir, fname)).read()
        # parse document into a list of utf8 tokens
        yield utils.simple_preprocess(document)

class ReutersCorpus(object):
    def __init__(self, reuters_dir):
        self.reuters_dir = reuters_dir
        self.dictionary = corpora.Dictionary(iter_documents(reuters_dir))
        self.dictionary.filter_extremes(no_below=2, keep_n=30000)  # remove stopwords etc.; the default no_below is 5

    def __iter__(self):
        for tokens in iter_documents(self.reuters_dir):
            yield self.dictionary.doc2bow(tokens)

# set up the streamed corpus
corpus = ReutersCorpus('/home/test/unsup_pre/')  # corpus of 1k documents

# train LDA topics using MALLET
mallet_path = '/home/test/software/mallet/bin/mallet'
lda_model = models.LdaMallet(mallet_path, corpus, num_topics=200, id2word=corpus.dictionary)  # with 200 topics

query_text = read_texts()
bow = corpus.dictionary.doc2bow(utils.simple_preprocess(query_text))
vec_lda = lda_model[bow]

hellinger = lambda vec1, vec2: numpy.sqrt(0.5 * ((numpy.sqrt(vec1) - numpy.sqrt(vec2))**2).sum())

# for LDA models, with `query = lda_model[bow]` and `corpus = lda_model[bow_corpus]`, this is `lda_model.num_topics`
sims = [(docno, hellinger(matutils.sparse2full(vec_lda, lda_model.num_topics), matutils.sparse2full(lda_model[doc], lda_model.num_topics))) for docno, doc in enumerate(corpus)]

sims = sorted(sims, key=lambda x: x[1])
print sims[:10]


To verify, I added my query doc to the corpus of 100 docs and got the expected distance of 0.0 for the same query doc when querying against the whole corpus.

[(936, 0.0),
 (325, 0.66297567852570571),
 (485, 0.66387883058569441),
 (188, 0.66500773755211373),
 (823, 0.66597174894280486),
 (796, 0.67776506777695011),
 (618, 0.67842412143836928),
 (881, 0.68023703272641944),
 (661, 0.68083020321217136),
 (867, 0.68430156575065204)]

With logging enabled, I can actually see how slow the pairwise document comparison is and the need for optimization here. If I manage to optimize it, I will post an update here.
Is there any doc/link I should look at for this kind of optimization work?

Radim Řehůřek

Apr 14, 2014, 10:04:44 AM
to gen...@googlegroups.com
Great, glad to hear, Suvir!


With logging enabled, i can actually see the slow speed of pair wise document comparison and the need of optimization here. If i manage to optimize it, i will update here.
Is there any doc/link i should look to for this kind of optimization work.

How large is your corpus? How many queries do you want to run (against the same index corpus)?

If you have a static corpus and many queries to compare against it, I think optimization #1 would be pre-computing the numpy arrays in advance. That way you don't have to call lda[doc] and sparse2full and sqrt every time: precomputed_vecs = sqrt(corpus2dense(lda[corpus])). That alone should speed up your queries by orders of magnitude.

This is also what MatrixSimilarity does, for cosine similarity. You can look there for inspiration.

Let us know how it goes Suvir, I for one would be happy to help!
Radim



Radim Řehůřek

Apr 14, 2014, 10:20:53 AM
to gen...@googlegroups.com
In case I wasn't clear: if your corpus is small (fits in RAM) and static, you can precompute

index = numpy.sqrt(corpus2dense(lda[corpus], lda.num_topics).T)

and then for queries:

q = numpy.sqrt(sparse2full(lda[query], lda.num_topics))
sims = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))

(not tested, but that's the idea)

-rr

suvir

Apr 14, 2014, 10:34:19 AM
to gen...@googlegroups.com
On Monday, April 14, 2014 4:20:53 PM UTC+2, Radim Řehůřek wrote:
In case I wasn't clear: if your corpus is small (fits in RAM) and static, you can precompute

index = numpy.sqrt(corpus2dense(lda[corpus], lda.num_topics).T)

and then for queries:

q = numpy.sqrt(sparse2full(lda[query], lda.num_topics))
sims = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))

(not tested, but that's the idea)

-rr

 
Thanks. I'm going to test the idea.
The corpus is 50K docs, about 25 MB in size. I ran the similarity test on 1k docs and it took around 30-40 min. Now I'm running it on the 50K docs (all preprocessed) and will see when I get the results (maybe 10+ hours or so). The corpus is static, at least for now and the near future.

Radim Řehůřek

Apr 14, 2014, 11:15:24 AM
to gen...@googlegroups.com
With a preprocessed corpus of 50k docs, you can expect ±1 s per Hellinger similarity query (depending on your `num_topics`).

-rr

suvir

Apr 15, 2014, 7:56:50 AM
to gen...@googlegroups.com
On Monday, April 14, 2014 4:20:53 PM UTC+2, Radim Řehůřek wrote:
In case I wasn't clear: if your corpus is small (fits in RAM) and static, you can precompute

index = numpy.sqrt(corpus2dense(lda[corpus], lda.num_topics).T)

and then for queries:

q = numpy.sqrt(sparse2full(lda[query], lda.num_topics))
sims = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))

OK, so this works in a fast way, but the result is not correct:

index = numpy.sqrt(matutils.corpus2dense(lda_model[corpus], lda_model.num_topics).T)
q = numpy.sqrt(matutils.sparse2full(vec_lda, lda_model.num_topics))
sims_opt = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))
sims_opt = sorted(sims_opt, key=lambda x: x)
print sims_opt[:10]

[0.076513745,
 0.66584373,
 0.66920334,
 0.67540497,
 0.67766148,
 0.67878348,
 0.68208182,
 0.68379384,
 0.68432146,
 0.68683475]

This works in the old slow way, but at least it allows a way to save the index, and the result is correct:

index = numpy.sqrt(matutils.corpus2dense((lda_model[doc] for docno, doc in enumerate(corpus)), lda_model.num_topics).T)
q = numpy.sqrt(matutils.sparse2full(vec_lda, lda_model.num_topics))
sims_opt = numpy.sqrt(0.5 * numpy.sum((q - index)**2, axis=1))
sims_opt = sorted(sims_opt, key=lambda x: x)
print sims_opt[:10]

[0.0,
 0.66941172,
 0.66976386,
 0.67768055,
 0.67840725,
 0.68287355,
 0.68313229,
 0.6832425,
 0.68588322,
 0.68718135]


And the code that makes the whole fast/slow difference is (using MALLET LDA):

for docno, doc in enumerate(corpus):
    some_temp = lda_model[doc]      # <-- slow but correct
                 vs
some_temp = lda_model[corpus]       # <-- fast but not correct

All other code is as mentioned in the previous post.
Unrelated to this post, the original experiment with the slow-version Hellinger distance on 50k docs is still running from yesterday, so nothing to report yet.

Suvir

Radim Řehůřek

Apr 15, 2014, 8:14:13 AM
to gen...@googlegroups.com
ok, so this works in fast way but result is not correct:


Both versions you posted use the "fast" precomputed index array.

So I take it my untested code worked :)



for docno, doc in enumerate(corpus):
    some_temp = lda_model[doc]      # <-- slow but correct
                 vs
some_temp = lda_model[corpus]       # <-- fast but not correct


If MALLET returns something else when inferring a whole corpus vs. inferring one document at a time from that same corpus, then that's a bug.

I'm not sure why that happens, and I won't have time this week to investigate. Can you open an issue on github, so I don't forget?

Of course, if you yourself get a chance to find why the MALLET wrapper (or MALLET itself?) does that, that would be perfect.

Cheers,
Radim

suvirbhargav

Apr 15, 2014, 9:34:31 AM
to gen...@googlegroups.com

If MALLET returns something else when inferring a whole corpus vs. inferring one document at a time from that same corpus, then that's a bug.

I'm not sure why that happens, and I won't have time this week to investigate. Can you open an issue on github, so I don't forget?

Of course, if you yourself get a chance to find why the MALLET wrapper (or MALLET itself?) does that, that would be perfect.

Cheers,
Radim
 
At least the logs look the same for both cases: LDA on the whole corpus vs. on individual docs.

In [433]: some_temp = lda_model[corpus]
2014-04-15 15:04:49,104 : INFO : serializing temporary corpus to /tmp/3cbda2_corpus.txt
2014-04-15 15:04:51,456 : INFO : converting temporary corpus to MALLET format with /home/test/software/mallet/bin/mallet import-file --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/3cbda2_corpus.txt --output /tmp/3cbda2_corpus.mallet.infer --use-pipe-from/tmp/3cbda2_corpus.mallet
 
Rewriting extended pipe from /tmp/3cbda2_corpus.mallet
  
Instance ID = b97159eb87b2eb32:366d52c0:14564908883:-7ff9
2014-04-15 15:04:52,578 : INFO : inferring topics with MALLET LDA '/home/test/software/mallet/bin/mallet infer-topics --input /tmp/3cbda2_corpus.mallet.infer --inferencer /tmp/3cbda2_inferencer.mallet --output-doc-topics /tmp/3cbda2_doctopics.txt.infer --num-iterations 100'

In [434]: %paste
for docno, doc in enumerate(corpus):
    some_temp = lda_model[doc]
## -- End pasted text --
2014-04-15 15:05:28,265 : INFO : serializing temporary corpus to /tmp/3cbda2_corpus.txt
2014-04-15 15:05:28,266 : INFO : converting temporary corpus to MALLET format with /home/test/software/mallet/bin/mallet import-file --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/3cbda2_corpus.txt --output /tmp/3cbda2_corpus.mallet.infer --use-pipe-from/tmp/3cbda2_corpus.mallet
 
Rewriting extended pipe from /tmp/3cbda2_corpus.mallet
  
Instance ID = b97159eb87b2eb32:366d52c0:14564908883:-7ff9
2014-04-15 15:05:29,172 : INFO : inferring topics with MALLET LDA '/home/test/software/mallet/bin/mallet infer-topics --input /tmp/3cbda2_corpus.mallet.infer --inferencer /tmp/3cbda2_inferencer.mallet --output-doc-topics /tmp/3cbda2_doctopics.txt.infer --num-iterations 100'
2014-04-15 15:05:29,897 : INFO : serializing temporary corpus to /tmp/3cbda2_corpus.txt
2014-04-15 15:05:29,898 : INFO : converting temporary corpus to MALLET format with /home/test/software/mallet/bin/mallet import-file --keep-sequence --remove-stopwords --token-regex '\S+' --input /tmp/3cbda2_corpus.txt --output /tmp/3cbda2_corpus.mallet.infer --use-pipe-from/tmp/3cbda2_corpus.mallet
 
Rewriting extended pipe from /tmp/3cbda2_corpus.mallet
  
Instance ID = b97159eb87b2eb32:366d52c0:14564908883:-7ff9
2014-04-15 15:05:30,687 : INFO : inferring topics with MALLET LDA '/home/test/software/mallet/bin/mallet infer-topics --input /tmp/3cbda2_corpus.mallet.infer --inferencer /tmp/3cbda2_inferencer.mallet --output-doc-topics /tmp/3cbda2_doctopics.txt.infer --num-iterations 100'
...and it goes on like this for the rest of the 1k docs in the corpus.
Ctrl-C: KeyboardInterrupt

Time to look into its code; maybe some kind of approximation is happening when inferring for the whole corpus (at least both results are close).

Can you open an issue on github, so I don't forget?
 
ok.

suvir

Apr 16, 2014, 7:20:56 AM
to gen...@googlegroups.com
By the way, after running the similarity with Hellinger distance on the 50K docs with the slow version (it took about 50 hours on a Core i5 laptop), the result is worth it :)

Suvir

suvir

Apr 16, 2014, 7:23:32 AM
to gen...@googlegroups.com
2014-04-14 14:58:32,278 : (start)
2014-04-16 09:15:55,953 : (end)
Around 42 hours.

suvir

Apr 16, 2014, 11:26:01 AM
to gen...@googlegroups.com
Keeping everything else the same and using the gensim LDA model, the result is different.

lda_model = models.ldamodel.LdaModel(corpus=corpus, id2word=corpus.dictionary, num_topics=200, update_every=1, chunksize=10000, passes=100, alpha='auto')  # tried between 20 and 100 passes for the 1k-doc corpus

For the top 10 documents, the distance goes up to 0.81:

In [90]: sims[:10]
Out[90]:
[(936, 5.0610560625263156e-06),  # this should have been 0, but it's OK with a small value as well
 (591, 0.49706860375251177),
 (68, 0.64225190296834078),
 (575, 0.65321187258382452),
 (883, 0.76119358837155626),
 (921, 0.79946005153274213),
 (847, 0.80047784035037362),
 (974, 0.80508864690650472),
 (199, 0.8072571326632092),
 (97, 0.81616722148694387)]

In the previous post with MALLET LDA, the distance between the query doc and the top 10 similar docs stays between 0.66 and 0.68.

Radim Řehůřek

Apr 16, 2014, 3:38:54 PM
to gen...@googlegroups.com
And are the top 10 docs related? What kind of numbers are you expecting here?

The trouble with unsupervised learning is always how to evaluate it :)

(BTW, 200 topics may be too much for a 1k corpus.)

-rr