Example of using gensim.Similarity?


Michael Beyer

unread,
Aug 11, 2015, 3:22:42 PM8/11/15
to gensim
Can someone provide an example of using gensim.Similarity? 

Radim Řehůřek

unread,
Aug 12, 2015, 12:53:54 AM8/12/15
to gensim
Hello Michael,

examples are in the documentation:

and some more info in the tutorial:

What are you looking for specifically?

Best,
Radim

Michael Beyer

unread,
Aug 12, 2015, 1:04:22 AM8/12/15
to gensim
It wasn't clear where you specify the method to use to convert the corpus to a vector-space representation. However, I think you are using corpus to refer to the post-analysis (post LDA, LSI, word2vec etc) vector space model of the corpus, not the raw text. The genserver tool seemed to be along the lines of what I was thinking as a "soup to nuts" solution, but it's not supported as an open source application.

Radim Řehůřek

unread,
Aug 12, 2015, 5:50:54 AM8/12/15
to gensim
That's right -- the input is a corpus = sequence of sparse vectors. This can be an LDA corpus, LSI corpus etc.

Just make sure you specify the correct number of features in the Similarity constructor (vector dimensionality) -- for topic models, this is the number of topics.
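To illustrate the point about num_features, here is a minimal numpy sketch of what such an index conceptually does (not gensim's actual implementation): the index holds one row per document, each of length num_features, and a query is a matrix-vector product against the normalized rows. The corpus values below are made up for illustration.

```python
import numpy as np

# Hypothetical topic-model corpus with num_topics = 3,
# so the similarity index must use num_features = 3.
num_features = 3
corpus = np.array([
    [0.9, 0.1, 0.0],   # document 0's topic weights
    [0.0, 0.8, 0.2],   # document 1
    [0.1, 0.1, 0.8],   # document 2
])

# Store each document as a unit vector; querying is a dot product.
index = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[0] / np.linalg.norm(corpus[0])
sims = index @ query
print(sims)  # document 0 is most similar to itself, with similarity 1
```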

HTH,
Radim

Michael Beyer

unread,
Aug 12, 2015, 11:20:49 AM8/12/15
to gensim
Thanks Radim. Now that I'm using 32-bit Python 2.7, all the analyses are working great. However, I noticed that the cosine similarity doesn't appear to be normalized. Note below that the similarity of the first document in the corpus with itself is not 1. Since I'm new to gensim, I could easily be doing something wrong or interpreting the results incorrectly, but I usually think of cosine similarity as a normalized measure.

import gensim

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
sentences = [i.lower().split() for i in documents]
model = gensim.models.Word2Vec(sentences, min_count=1, size=10, iter=25)
vocab = list(model.vocab)
modelvect = model[vocab]
index = gensim.similarities.MatrixSimilarity(corpus=modelvect, num_features=10)
print(index[modelvect[0]])

[ 0.14651574 -0.00235683 -0.02302131  0.12491421 -0.01356048  0.01349149
 -0.01606552 -0.10249705  0.07451458 -0.02230292 -0.0505394  -0.03286091
  0.02436696  0.07158656 -0.00208322 -0.11807165  0.12597062  0.0387757
 -0.04656203  0.04421109 -0.07506854 -0.12972374 -0.09153353 -0.13700466
 -0.18830855  0.09221681  0.05833372  0.13714077  0.13272005 -0.02978346
 -0.04873197  0.00358995 -0.01243957 -0.05738515 -0.07390147 -0.01589836
  0.04276333 -0.00409395 -0.05644933  0.05689331 -0.13340497 -0.01412689]

Michael Beyer

unread,
Aug 12, 2015, 11:39:27 AM8/12/15
to gensim
Ah... I see you actually mentioned this in your source code:

Lines: 521-523
# individual documents in fact may be in numpy.scipy.sparse format as well.
# it's not documented because other it's not fully supported throughout.
# the user better know what he's doing (no normalization, must
# explicitly supply num_features etc).

When I test the type of input __init__ would see, I get:

print(isinstance(modelvect[0], numpy.ndarray))
True

In this case, would your code just skip over normalization? As per:

for docno, vector in enumerate(corpus):
    if docno % 1000 == 0:
        logger.debug("PROGRESS: at document #%i/%i", docno, corpus_len)
    # individual documents in fact may be in numpy.scipy.sparse format as well.
    # it's not documented because other it's not fully supported throughout.
    # the user better know what he's doing (no normalization, must
    # explicitly supply num_features etc).
    if isinstance(vector, numpy.ndarray):
        pass
    elif scipy.sparse.issparse(vector):
        vector = vector.toarray().flatten()
    else:
        vector = matutils.unitvec(matutils.sparse2full(vector, num_features))
    self.index[docno] = vector

If so, how could I have fed my matrix to Similarity to get it to normalize it?

Many thanks for your patience with me. Topic models are something that I am getting into here at work, and I'd like to use gensim since it's a nice Python package.

Michael Beyer

unread,
Aug 12, 2015, 11:52:36 AM8/12/15
to gensim
Would it make sense to change Lines 525-526 (in MatrixSimilarity.__init__) to read:

if isinstance(vector, numpy.ndarray):
    pass
elif scipy.sparse.issparse(vector):
    vector = vector.toarray().flatten()
else:
    pass
vector = matutils.unitvec(matutils.sparse2full(vector, num_features))
self.index[docno] = vector

It seems like only vectors that are neither numpy.ndarray nor sparse get normalized to unit vectors... I don't understand why this is the case.

Michael Beyer

unread,
Aug 12, 2015, 12:01:41 PM8/12/15
to gensim
Sorry for all the posts, but I realized that the proposed edit would not work. However, the minor change below gave me the cosine similarities I expected:

Lines 529-531 of docsim.py:

change:
else:
vector = matutils.unitvec(matutils.sparse2full(vector, num_features))
self.index[docno] = vector

to

else:
vector = matutils.sparse2full(vector, num_features)
self.index[docno] = matutils.unitvec(vector)

This way, we handle different formats in the conditionals (if they are not dense arrays), then normalize before adding to the index.
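The logic of that change can be sketched in plain numpy (a minimal illustration, not gensim's code; unitvec here is a hypothetical stand-in for gensim.matutils.unitvec): whatever format a document arrives in, it is converted to a dense array first, and normalization happens exactly once at the point where the row is stored in the index.

```python
import numpy as np

def unitvec(vec):
    # minimal stand-in for gensim.matutils.unitvec:
    # scale a vector to unit L2 norm (leave zero vectors alone)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Sketch of the indexing loop: normalize every document
# once, at assignment time, regardless of input format.
corpus = [np.array([3.0, 4.0]), np.array([1.0, 0.0])]
index = np.empty((len(corpus), 2))
for docno, vector in enumerate(corpus):
    index[docno] = unitvec(vector)

print(index[0])  # [0.6 0.8] -- every stored row has unit length
```

With every row unit-length, querying the index with a unit-length vector yields true cosine similarities, and a document's similarity with itself is 1.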

Your thoughts?

Thanks again! :-)

Michael Beyer

unread,
Aug 12, 2015, 12:52:39 PM8/12/15
to gensim
Verified that with my edit, I can get what I expect for an entire corpus. See example code:

import gensim
import numpy


documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]


sentences = [i.lower().split() for i in documents]

model = gensim.models.Word2Vec(sentences, min_count=1, size=10, iter=25)

vocab = list(model.vocab)

modelvect = model[vocab]

index = gensim.similarities.MatrixSimilarity(corpus=modelvect, num_features=10)

X = numpy.array([gensim.matutils.unitvec(i) for i in modelvect])

print(index[X])


[[ 1.         -0.02032484 -0.14800596 ...,  0.23107025 -0.73747933
  -0.06363575]
 [-0.02032484  1.          0.33982417 ..., -0.03141022  0.06709643
  -0.27975121]
 [-0.14800596  0.33982417  0.99999994 ..., -0.11238585  0.63590127
  -0.26275131]
 ..., 
 [ 0.23107025 -0.03141022 -0.11238585 ...,  1.         -0.10223977
   0.85874605]
 [-0.73747933  0.06709643  0.63590127 ..., -0.10223977  0.99999994
  -0.02586098]
 [-0.06363575 -0.27975121 -0.26275131 ...,  0.85874605 -0.02586098  1.        ]]

So if I feed it a normalized array, it works, but if I feed it:

print(index[modelvect])

I get:

[[ 0.14651577 -0.00297791 -0.02168521 ...,  0.03385543 -0.10805234
  -0.00932364]
 [-0.00235684  0.115959    0.03940568 ..., -0.00364229  0.00778043
  -0.03243967]
 [-0.02302131  0.05285733  0.15554313 ..., -0.01748085  0.09891008
  -0.04086917]
 ..., 
 [ 0.05689333 -0.00773371 -0.02767126 ...,  0.24621657 -0.02517313
   0.21143754]
 [-0.13340497  0.01213728  0.11503019 ..., -0.01849448  0.18089315
  -0.00467807]
 [-0.01412686 -0.06210357 -0.05832968 ...,  0.19063795 -0.00574102
   0.22199573]]

This now makes sense, since we see the dot product of each (unnormalized) input vector with the set of normalized index vectors.
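That interpretation can be checked with a small numpy sketch (an illustration of the behavior described above, not gensim's code): if the index rows are unit vectors but the query is not re-normalized, the result for a document against itself is its own norm, not 1; normalizing the query first recovers the cosine.

```python
import numpy as np

# Random made-up "document vectors" for illustration.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 10))

# Index rows are normalized to unit length.
index = docs / np.linalg.norm(docs, axis=1, keepdims=True)

raw = index @ docs[0]                               # unnormalized query
cos = index @ (docs[0] / np.linalg.norm(docs[0]))   # normalized query

# raw[0] equals ||docs[0]||, while cos[0] equals 1.
print(raw[0], cos[0])
```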