Gensim Word2Vec changing the input sentence order?


sam

unread,
Apr 24, 2016, 7:39:08 PM4/24/16
to gensim

In gensim's documentation, window size is defined as:

window is the maximum distance between the current and predicted word within a sentence.

which should mean that when looking at context, it doesn't go beyond the sentence boundary. Right?


What I did was create a document with several thousand tweets, select a word (q1), and then find the words most similar to q1 (using model.most_similar('q1')). But if I randomly shuffle the tweets in the input document and then repeat the same experiment (without changing any word2vec parameters), I get a different set of most_similar words for q1.


I can't really understand why that happens if all it looks at is sentence-level information. Can anyone please explain this?

Gordon Mohr

unread,
Apr 24, 2016, 8:30:40 PM4/24/16
to gensim
The Word2Vec/Doc2Vec algorithms make use of randomness, and unless you take very careful (and performance-limiting) steps, runs are not completely deterministic. But even eliminating that randomness, the stochastic-gradient-descent isn't going to wind up in exactly the same end-state if the text examples are provided in a different order. 

If you've got enough data and are doing enough iterations, the results from run to run should be largely similar, though not identical. 

See some other discussion about relevant factors in the prior thread https://groups.google.com/forum/#!msg/gensim/7eiwqfhAbhs/qC0pmbw5HwAJ – the considerations are the same with Word2Vec or Doc2Vec. 

- Gordon

sam

unread,
Apr 25, 2016, 2:27:01 AM4/25/16
to gensim
Thanks a lot for the reply.

I understand about the randomness, but if I run word2vec with the same parameter settings on a given dataset, I get the same output over any number of runs. This is due to having the same seed value at the start, if I'm correct?

>>> model1 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5)
>>> model1.most_similar("show")
[('tweets', 0.2673164904117584), ('song', 0.26358550786972046), ('added', 0.2462688833475113), ('7pm', 0.24363331496715546), ('ma', 0.23817510902881622), ('found', 0.2252378612756729), ("they're", 0.22347548604011536), ('season', 0.22232073545455933)] .....


>>> model2 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5)
>>> model2.most_similar("show")
[('tweets', 0.2673164904117584), ('song', 0.26358550786972046), ('added', 0.2462688833475113), ('7pm', 0.24363331496715546), ('ma', 0.23817510902881622), ('found', 0.2252378612756729), ("they're", 0.22347548604011536), ('season', 0.22232073545455933)] .....



So now, when I change the order of the tweets in the input document, the only thing I can see changing is the context(?). But the documentation says (as I mentioned in the previous post) that contexts don't overlap between sentences: window is the maximum distance between the current and predicted word within a sentence.

Had this NOT been the case, it would be clear to me why the weights change when you change the order of the tweets in the input document: simply because the contexts of the first and last few words of a given sentence (tweet) change when you change the order, and thus the output weights would differ.

Is the documentation incorrect there? 

Gordon Mohr

unread,
Apr 25, 2016, 5:40:51 AM4/25/16
to gensim
The same `seed` won't necessarily be enough to get identical results, unless other factors are also controlled: running in a single thread (as your test code does), and (in Python 3) ensuring any PYTHONHASHSEED randomization is the same between runs. (Two runs without relaunching the interpreter will be stable in this respect, but running the same code in the next interpreter launch could iterate over vocabulary keys in a different order, so words are at different indexes, and thus randomly-sampled at different times, etc.)
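To illustrate the PYTHONHASHSEED point with a minimal sketch (plain Python, no gensim; the word "show" is just an example): a fresh interpreter launch only reproduces the same string hashes, and hence the same hash-dependent vocabulary iteration order, when PYTHONHASHSEED is pinned.

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(word, seed):
    # Launch a brand-new Python interpreter with a fixed PYTHONHASHSEED
    # and report what hash(word) evaluates to in that process.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True,
    )
    return int(out.stdout)

# With the same PYTHONHASHSEED, string hashes repeat across interpreter
# launches; with hash randomization left on, they generally don't.
assert hash_in_fresh_interpreter("show", "0") == hash_in_fresh_interpreter("show", "0")
```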

The contexts don't overlap between sentences. (The highlighted documentation is correct.)

The stochastic gradient descent training process isn't oblivious to the order of training examples. Having a different example first means different errors are backpropagated first, which changes all future predictions and errors, and thus results in a different end-state. With enough data and passes, the final results (in terms of relative-distances between words) should be very similar. But not identical. 
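A toy sketch of that order-sensitivity, using plain one-parameter SGD (the loss and learning rate here are invented for illustration, nothing gensim-specific):

```python
def sgd(examples, w=0.0, lr=0.5):
    # One pass of stochastic gradient descent minimizing (w - x)^2 for
    # each example in turn; updates are applied sequentially, so the
    # order of examples affects the final weight.
    for x in examples:
        w -= lr * 2 * (w - x)  # gradient of (w - x)^2 w.r.t. w is 2*(w - x)
    return w

# The same examples, presented in a different order, end at a different weight:
print(sgd([1.0, 2.0]))  # 2.0
print(sgd([2.0, 1.0]))  # 1.0
```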

- Gordon

shweta tiwari

unread,
Apr 25, 2016, 7:00:10 AM4/25/16
to gen...@googlegroups.com
Hi Sam,
This may be off-topic for your post, but could you tell me how you shuffled the tweets in the input document? I am new to gensim and to Python as well. Could you please help?
Thanks

Regards
Shweta 


--
You received this message because you are subscribed to the Google Groups "gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

sam

unread,
Apr 26, 2016, 11:52:46 PM4/26/16
to gensim
Thanks, Gordon, for taking the time to answer.

@ssh26: I used the "shuf" tool in ubuntu to randomize the input.
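In case a pure-Python version is useful, here is a sketch equivalent to `shuf`, assuming the tweets are held one per line in a list (the sample tweets are made up):

```python
import random

def shuffled_lines(lines, seed=None):
    # Return a shuffled copy of the input lines, like Ubuntu's `shuf`;
    # pass a seed to make the shuffle reproducible across runs.
    rng = random.Random(seed)
    out = list(lines)
    rng.shuffle(out)
    return out

tweets = ["tweet one", "tweet two", "tweet three", "tweet four"]
# Same lines, different order; a fixed seed makes the order repeatable.
assert sorted(shuffled_lines(tweets, seed=1)) == sorted(tweets)
assert shuffled_lines(tweets, seed=1) == shuffled_lines(tweets, seed=1)
```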

shweta tiwari

unread,
Apr 27, 2016, 5:15:50 AM4/27/16
to gen...@googlegroups.com
Thanks, @sam.

Regards
Shweta Tiwari


sam

unread,
Apr 28, 2016, 12:50:34 AM4/28/16
to gensim

Hi Gordon,

I have attached a graph that I drew to understand how changing the order of the input sentences affects the most similar words found for a given word.

model parameters used: model1 = word2vec.Word2Vec(sents1, size=100, window=5, min_count=5, iter=n_iter, sg=0)

Graph: To draw the graph, I ran word2vec with the above parameters on the original document (D) and the shuffled document (D'), took the top 10 and top 20 (the two bars) most_similar('q') words for a specific query word q, and calculated the Jaccard similarity score between the two sets of words for iter = 1, 10, 100.
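For reference, a minimal version of that Jaccard computation (the word lists here are invented for illustration; Jaccard similarity is 1.0 for identical sets, and Jaccard distance is 1 minus the similarity):

```python
def jaccard_similarity(words_a, words_b):
    # |intersection| / |union| of the two word sets:
    # 1.0 when the sets are identical, 0.0 when they are disjoint.
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)

top_d  = ["song", "7pm", "season", "found"]  # top words from D (made up)
top_dp = ["song", "7pm", "added", "ma"]      # top words from D' (made up)
print(jaccard_similarity(top_d, top_dp))  # 2 shared / 6 total = 0.3333333333333333
```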

 
It seems that as the number of iterations increases, there are fewer and fewer words in common between the two sets obtained from running word2vec on D and D'.

I would really expect quite the opposite of this graph. I would have imagined that as the number of iterations increases, the word vectors would stabilize, and hence you would see more words in common between D and D'. 

Can I get your take on this, or on why it is happening? 

Gordon Mohr

unread,
Apr 28, 2016, 2:15:52 AM4/28/16
to gensim
I, too, would expect the similarity to grow with more iterations. 

Are you sure you're calculating jaccard-similarity (where most-similar is 1.0) rather than jaccard-distance (where most-similar is 0.0)? 

When you check individual real words (not just the abstract 'q'), do the most-similar results seem reasonable in both permutations? Do they seem more or less reasonable after more iterations? 

How big is your dataset? (Toy-sized datasets often don't give meaningful results, though using more iterations may be able to make up for that. I believe if the dataset is small compared to the model parameters, you can wind up with an overfit model that's 'great' at the prediction task but yields word-vectors that are semantically useless, because they've become more like keys-into-an-arbitrary-lookup-table than abstractions in a shared continuous space.) Can you demonstrate the same effect with a public dataset, such as the 'text8' or 'text9' word-runs from Wikipedia?

Do you see the same results in both CBOW and skip-gram modes? With both hierarchical softmax and negative sampling? 

- Gordon