In gensim's documentation, the window size is defined as:
"window is the maximum distance between the current and predicted word within a sentence."
That should mean that when looking at context, it doesn't go beyond the sentence boundary, right?
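To make that definition concrete, here is a minimal sketch in plain Python of how (current word, context word) pairs could be drawn so that the window never crosses a sentence boundary. It is only an illustration, not gensim's actual code: the helper name context_pairs and the toy tweets are hypothetical, and it ignores details such as min_count filtering and any randomness used during training.

```python
def context_pairs(sentences, window=5):
    """Yield (current_word, context_word) pairs within each sentence only."""
    for sentence in sentences:
        for i, current in enumerate(sentence):
            # Clamp the window to the start and end of this sentence.
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield current, sentence[j]


# Hypothetical toy input: each tweet is one tokenized sentence.
tweets = [["great", "show", "tonight"], ["new", "song", "added"]]

pairs = set(context_pairs(tweets, window=5))
# Same tweets in a different order, as in a shuffled input document.
shuffled_pairs = set(context_pairs(tweets[::-1], window=5))
```

In this toy setup the set of pairs is identical before and after reordering the tweets, so under these assumptions shuffling changes only the order in which the pairs are seen, not which pairs exist.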
What I did was create a document with several thousand tweets, select a word (q1), and then retrieve the words most similar to q1 (using model.most_similar('q1')). But when I randomly shuffled the tweets in the input document and repeated the same experiment (without changing the word2vec parameters), I got a different set of most_similar words for q1.
I can't really understand why that happens if all it looks at is sentence-level information. Can anyone please explain this?
>> model1 = word2vec.Word2Vec( sents1, size=100, window=5, min_count=5 )
>> model1.most_similar("show")
>> [('tweets', 0.2673164904117584), ('song', 0.26358550786972046), ('added', 0.2462688833475113), ('7pm', 0.24363331496715546), ('ma', 0.23817510902881622), ('found', 0.2252378612756729), ("they're", 0.22347548604011536), ('season', 0.22232073545455933)] .....
>> model2 = word2vec.Word2Vec( sents1, size=100, window=5, min_count=5 )
>> model2.most_similar("show")
>> [('tweets', 0.2673164904117584), ('song', 0.26358550786972046), ('added', 0.2462688833475113), ('7pm', 0.24363331496715546), ('ma', 0.23817510902881622), ('found', 0.2252378612756729), ("they're", 0.22347548604011536), ('season', 0.22232073545455933)] .....
most_similar('q') words to a specific query word q, and calculated the Jaccard similarity score between the two sets of words for iter=1, 10, and 100.
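A comparison like this can be sketched in a few lines of plain Python. This is only a rough illustration: jaccard and neighbour_words are hypothetical helper names, and the two result lists are made-up placeholder values in the (word, score) shape that most_similar returns, not real output from any model.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Convention chosen here: two empty sets count as identical.
    return len(a & b) / len(a | b)


def neighbour_words(most_similar_result):
    """Keep only the words from a list of (word, score) tuples."""
    return [word for word, _score in most_similar_result]


# Placeholder results for two runs (hypothetical values, for illustration only).
run_a = [("tweets", 0.27), ("song", 0.26), ("added", 0.25)]
run_b = [("song", 0.30), ("season", 0.24), ("tweets", 0.22)]

score = jaccard(neighbour_words(run_a), neighbour_words(run_b))
```

Only the word sets are compared here, so two runs that return the same neighbours with different scores or in a different order still get a Jaccard score of 1.0.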