Get a similarity matrix from word2vec in python (Gensim)


MMM

Nov 7, 2017, 10:54:22 AM
to gensim
I am using the following Python code to generate a similarity matrix of word vectors (my vocabulary size is 77).

similarity_matrix = []
index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0))

for sims in index:
    similarity_matrix.append(sims)
similarity_array = np.array(similarity_matrix)

The dimensionality of similarity_array is 300 x 300. However, as I understand it, it should be 77 x 77 (since my vocabulary size is 77).


i.e.,
        word1  word2  ...  word77
word1   0.2    0.8    ...  0.9
word2   0.1    0.2    ...  1.0
...     ...    ...    ...  ...
word77  0.9    0.8    ...  0.1

Please let me know what is wrong in my code.


Moreover, what is the order of the vocabulary (word1, word2, ..., word77) used to calculate this similarity matrix?

Can I obtain this order from model.wv.index2word? Please help me!

Ivan Menshikh

Nov 8, 2017, 1:05:48 AM
to gensim
Hi,

I wrote a small example for you; I hope it is helpful.

from gensim.models import Word2Vec
import numpy as np
from scipy.spatial.distance import cdist

sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
model = Word2Vec(sentences=sentences, size=10, window=1, iter=2000, min_count=1)

word2idx = {word: idx for (idx, word) in enumerate(model.wv.index2word)}  # map word -> row index
assert len(word2idx) == 6

similarity = 1 - cdist(model.wv.syn0, model.wv.syn0, metric='cosine')
assert similarity.shape == (6, 6)

similarity[word2idx["cat"], word2idx["dog"]]  # similarity between 'cat' and 'dog'


Radim Řehůřek

Nov 8, 2017, 9:49:43 AM
to gensim
Hi MMM,



On Tuesday, November 7, 2017 at 4:54:22 PM UTC+1, MMM wrote:
I am using the following Python code to generate a similarity matrix of word vectors (my vocabulary size is 77).

similarity_matrix = []
index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0))

for sims in index:
    similarity_matrix.append(sims)
similarity_array = np.array(similarity_matrix)

The dimensionality of similarity_array is 300 x 300. However, as I understand it, it should be 77 x 77 (since my vocabulary size is 77).


That's weird. Can you post the log (at INFO or DEBUG level)?
 


i.e.,
        word1  word2  ...  word77
word1   0.2    0.8    ...  0.9
word2   0.1    0.2    ...  1.0
...     ...    ...    ...  ...
word77  0.9    0.8    ...  0.1

Please let me know what is wrong in my code.


Moreover, what is the order of the vocabulary (word1, word2, ..., word77) used to calculate this similarity matrix?

Can I obtain this order from model.wv.index2word? Please help me!


Yes, you can get the words in the same order from model.wv.index2word. What exactly is the problem? The word at index `i` is `model.wv.index2word[i]`.
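To make the lookup concrete, here is a minimal sketch; the vocab list is a made-up stand-in for `model.wv.index2word`, not output from a real model:

```python
# vocab stands in for model.wv.index2word (the model's ordered vocabulary);
# row/column i of the similarity matrix corresponds to vocab[i].
vocab = ['cute', 'say', 'cat', 'meow', 'dog', 'woof']  # stand-in for model.wv.index2word
word2idx = {word: i for i, word in enumerate(vocab)}

# Look up a similarity-matrix cell by word pair, e.g.:
# similarity_array[word2idx['cat'], word2idx['dog']]
```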

HTH,
Radim


Volka

Nov 16, 2017, 11:03:09 PM
to gensim
Hi Radim,

Thanks a lot for your reply. My INFO log is below:

2017-11-17 14:26:51,207 : INFO : loading Word2Vec object from od_w2v_2
2017-11-17 14:26:51,227 : INFO : loading wv recursively from od_w2v_2.wv.* with mmap=None
2017-11-17 14:26:51,227 : INFO : setting ignored attribute syn0norm to None
2017-11-17 14:26:51,229 : INFO : setting ignored attribute cum_table to None
2017-11-17 14:26:51,232 : INFO : loaded od_w2v_2
2017-11-17 14:26:51,304 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2017-11-17 14:26:51,317 : INFO : creating matrix with 300 documents and 74 features

Please let me know where I made a mistake.

Ivan Menshikh

Nov 17, 2017, 1:00:19 AM
to gensim
Your logging looks OK.
Try replacing

index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0))

with

index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0.T))
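Why the transpose helps, in a small numpy sketch (the shapes below are the sizes from this thread, standing in for the real model): Dense2Corpus treats columns as documents by default, so passing syn0 directly makes the 300 vector dimensions the "documents".

```python
import numpy as np

# syn0 stands in for model.wv.syn0: one row per word, one column per dimension.
syn0 = np.zeros((77, 300))        # 77-word vocab, 300-dimensional vectors

n_docs_without_T = syn0.shape[1]  # columns of syn0 -> 300 "documents" (wrong)
n_docs_with_T = syn0.T.shape[1]   # columns of syn0.T -> 77 documents, one per word
```

Alternatively, passing `documents_columns=False` to `gensim.matutils.Dense2Corpus` should have the same effect as transposing.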

Volka

Nov 17, 2017, 8:32:17 AM
to gensim
Thanks a lot Ivan, that works :) Can you please tell me what the `.T` in `model.wv.syn0.T` does?

By the way, if I want to build a cosine distance matrix, is it correct to do as follows (or is there any special way to calculate cosine distance for word vectors)?

from sklearn.metrics.pairwise import cosine_distances
word_cosine = cosine_distances(model.wv.syn0)

Thanks a lot once again!

Ivan Menshikh

Nov 20, 2017, 4:18:52 AM
to gensim
It transposes the original matrix, that's all: Dense2Corpus treats columns as documents by default, so the rows of syn0 (the word vectors) need to become columns.

Your variant with sklearn is correct too.
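A quick check with random stand-in vectors (not a real model): sklearn's `cosine_distances` is 1 minus cosine similarity, so it agrees with the cdist-based similarity computed earlier in this thread.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_distances

vectors = np.random.RandomState(0).rand(6, 10)  # stand-in for model.wv.syn0

similarity = 1 - cdist(vectors, vectors, metric='cosine')  # cosine similarity
distance = cosine_distances(vectors)                       # cosine distance

agree = np.allclose(similarity, 1 - distance)  # the two routes match
```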

Volka

Nov 21, 2017, 8:30:40 AM
to gensim
Thanks a lot Ivan for your valuable feedback :)

tedo.v...@gmail.com

Mar 14, 2018, 4:06:37 PM
to gensim
@Ivan

What if I want a similarity matrix between sentences, i.e. I have a 2D list of lists, where each sentence is a list of words? How do I build a similarity matrix between the sentences?
Right now I'm using:

dim = len(sentences)
w2v_simmatrix = [[1.0 for x in range(dim)] for y in range(dim)]
for i in range(dim):
    for j in range(dim):
        if i > j:
            w2v_simmatrix[i][j] = w2v_simmatrix[j][i]  # matrix is symmetric, reuse
        else:
            w2v_simmatrix[i][j] = w2v_model.wv.n_similarity(sentences[i], sentences[j])

I don't think this is optimal.
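One way to vectorize the loop above, as a sketch with made-up vectors standing in for the trained model: build one vector per sentence (the mean of its word vectors, unit-normalized, which mirrors how `n_similarity` combines vectors), then a single matrix product gives every pairwise cosine similarity at once.

```python
import numpy as np

# word_vecs is a made-up stand-in for looking words up in the trained model.
rng = np.random.RandomState(1)
word_vecs = {w: rng.rand(10) for w in ['cute', 'cat', 'dog', 'say', 'meow', 'woof']}
sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]

def sentence_vector(words):
    mean = np.mean([word_vecs[w] for w in words], axis=0)  # average the word vectors
    return mean / np.linalg.norm(mean)                     # unit-normalize the result

S = np.array([sentence_vector(s) for s in sentences])
w2v_simmatrix = S.dot(S.T)  # n_sentences x n_sentences cosine similarities
```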

Ivan Menshikh

Mar 14, 2018, 9:27:15 PM
to gensim
Hello,

The simplest way: construct a vector for each sentence, then apply the pairwise approach you already used for words.

tedo.v...@gmail.com

Mar 15, 2018, 5:25:46 PM
to gensim
Sorry for bothering you, but how do I construct a vector for each sentence? Maybe by summing the word vectors and then normalizing? Do I have to normalize?

Will that give the same (or a very similar) result as what I get now?

Ivan Menshikh

Mar 15, 2018, 10:34:50 PM
to gensim
Hello,

how to construct a vector for each sentence? Maybe by summing the word vector and then normalizing it?

Yes, those amount to the same thing (you can "mix" the word vectors in any available way; averaging is the simplest), and this is how `n_similarity` works.
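A small check of the "sum vs. average, normalize or not" question (with random stand-in vectors): cosine similarity ignores vector length, so summing and averaging the word vectors give identical similarities, and normalizing the sentence vector afterwards is optional for the same reason.

```python
import numpy as np

rng = np.random.RandomState(0)
sentence = rng.rand(4, 10)   # 4 word vectors of one made-up sentence
other = rng.rand(10)         # some other made-up sentence vector

def cos(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The sum is just 4x the mean, and cosine is scale-invariant.
same = np.isclose(cos(sentence.sum(axis=0), other), cos(sentence.mean(axis=0), other))
```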



On Friday, March 16, 2018 at 2:25:46 AM UTC+5, tedo.v...@gmail.com wrote:
Sorry for bothering you, but how do I construct a vector for each sentence? Maybe by summing the word vectors and then normalizing? Do I have to normalize?
