Using Word2vec for Clustering (with Scikit Learn): How to get the features

Dale Cooper

Aug 10, 2017, 11:35:24 AM
to gensim
Dear Gensim-Community,

I am currently trying to use the vectors from my word2vec model for k-means clustering with Scikit Learn. I have trouble deciding what to use as X, the input for KMeans().fit(X).

First I went with this site

There they set X = model.wv.syn0

But with this approach my results are not really satisfying; the clusters don't really fit.

Then I found another approach here

In this code you can find:

word2vec_dict={}
for i in model.wv.vocab.keys():
    try:
        word2vec_dict[i]=model[i]
    except:    
        pass

X = np.array([i.T for i in word2vec_dict.itervalues()])


With this as X my results are reasonably good, even across multiple runs.

But I still don't understand the difference between the two versions of X. The ndarray and the vectors have the same shape, but the values are different.
Could somebody with more brainpower explain this to me? =)


Ivan Menshikh

Aug 11, 2017, 3:02:09 AM
to gensim
Hi Dale,

These two approaches are the same. I tried to reproduce your problem, but everything works for me.
I used this code:

from gensim.models import Word2Vec
import numpy as np 

text = [["a", "b", "b", "a"],
        ["a", "b", "a", "c", "a"],
        ["a"] * 4,
        ["b"] * 4]

model = Word2Vec(sentences=text, size=30, negative=2, window=1, iter=500, min_count=1)

word2vec_dict = {}
words = model.wv.index2word  # order from model.wv.syn0

for i in words:
    word2vec_dict[i] = model[i]

X = np.array([word2vec_dict[i].T for i in words])

assert model.wv.syn0.shape == X.shape
np.testing.assert_almost_equal(X, model.wv.syn0)

Please add a code example where your values are different.

Dale Cooper

Aug 15, 2017, 5:29:54 AM
to gensim
Hello Ivan,

thank you for your answer and sorry for my late response. I wasn't able to use my computer until now.

When I try your code it also works for me, but here is an example of my code:

word2vec_dict={}
for i in modelC.wv.vocab.keys():
    try:
        word2vec_dict[i]=modelC.wv[i]
    except:
        pass

X = np.array([i.T for i in six.itervalues(word2vec_dict)])
Y = modelC.wv.syn0

When I look into the arrays they are different:

X[0]
Out[43]:
array([ 0.18384653, -0.02168597,  0.16721378, ...,  0.50958902,
       -0.70173872, -0.02091845], dtype=float32)

Y[0]
Out[44]:
array([-0.00280283,  0.00300408,  0.00404751, ...,  0.00016425,
       -0.00118467,  0.00060632], dtype=float32)

But since I use the same model I don't know why.





Ivan Menshikh

Aug 15, 2017, 8:03:01 AM
to gensim
Hi Dale,

It looks like the rows are in a different order. A dict in Python doesn't store the order of its keys, so to compare the two arrays you should use the same order that is used in the syn0 matrix. That's why I used model.wv.index2word in my example.
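
To illustrate the ordering issue, here is a minimal sketch that does not use gensim at all. The names index2word, syn0 and vocab are small hand-made stand-ins for the model attributes mentioned in this thread, and the values are made up for illustration only.

```python
# Stand-ins for the gensim attributes discussed above (hypothetical values).
index2word = ["a", "b", "c"]                 # row order of the matrix
syn0 = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]  # one row per word, same order

# A vocab-like dict whose iteration order differs from the row order.
vocab = {"c": 2, "a": 0, "b": 1}             # word -> row index

# Rows collected by iterating the dict: same shape, different row order.
X_dict_order = [syn0[vocab[w]] for w in vocab]

# Rows collected by iterating index2word: same order as the matrix.
X_index_order = [syn0[vocab[w]] for w in index2word]

rows_match_dict = X_dict_order == syn0
rows_match_index = X_index_order == syn0
print(rows_match_dict, rows_match_index)
```

Both lists contain the same vectors, so the shapes agree, but only the one built from index2word lines up row by row with the matrix.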

Dale Cooper

Aug 15, 2017, 8:11:27 AM
to gensim
You are right. So if I want to go on like this:

kmeans = KMeans().fit(X)
labels = kmeans.labels_
vocab = list(modelC.wv.vocab)
clusters = [list(a) for a in zip(vocab, labels)]



I have to use X = model.wv.syn0, because otherwise it does not match the order in vocab, right?
Which is awkward, because the clusters made more sense to me the other way.


Ivan Menshikh

Aug 15, 2017, 8:26:02 AM
to gensim
Yes, you are correct, but please use model.wv.index2word instead of model.wv.vocab, because model.wv.vocab is a dictionary too (so you have no guarantee about the key order).
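
A minimal sketch of pairing cluster labels with words in matrix order follows. The word list and labels below are made-up stand-ins for model.wv.index2word and for kmeans.labels_ (assumed to come from fitting on model.wv.syn0), so treat it as an illustration under those assumptions, not as the exact code from this thread.

```python
from collections import defaultdict

# Hypothetical stand-ins: in the thread these would be
# model.wv.index2word and kmeans.labels_ (fitted on model.wv.syn0).
index2word = ["king", "queen", "apple", "pear"]
labels = [0, 0, 1, 1]

# Row i of the matrix belongs to index2word[i], so zipping the two
# sequences keeps every word aligned with its own cluster label.
clusters = defaultdict(list)
for word, label in zip(index2word, labels):
    clusters[label].append(word)

for label in sorted(clusters):
    print(label, clusters[label])
```

Grouping by label like this makes it easier to eyeball whether each cluster contains related words.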

Dale Cooper

Aug 15, 2017, 9:14:27 AM
to gensim
Now I know why my results were bad. Thank you, Ivan, you helped me a lot.

Raj kumar

Jan 17, 2018, 11:53:17 AM
to gensim
Hey Dale, can you share your code?