You may want to look at the `predict_output_word()` method (available in recent gensim versions).
However, even though making the NN incrementally better at predicting words is how a Word2Vec model is built, the applications of Word2Vec word-vectors are typically *not* in predicting words. (Some other kind of model might be better at that task.) Rather, the word-vectors that fall out of the Word2Vec NN are exported and used for other purposes where they sometimes perform well, matching human intuitions about how words are related.
The training process doesn't actually form complete interpretable probabilities for a particular context/output. (And indeed in the original word2vec.c Google code, and earlier versions of gensim Word2Vec, there was no API like `predict_output_word()` to return specific word predictions.)
Instead, for each training-example, the model is just nudged to be slightly more compliant with that one example. So in your illustrative example: given a target of `sentences` and a context of `to model * using word2vec`, training doesn't form any absolute idea of the probability of `sentences` – it just takes whatever the model currently predicts, and nudges it slightly more in the direction of predicting `sentences`. It's only through the repetition of that nudge, over many contrasting examples that sometimes offset and other times reinforce each other, that the model ends up with a useful final many-dimensional relative arrangement of neural-network weights and word-vectors.
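To make "nudged" concrete, here's a toy numpy sketch (not gensim's actual code) of a single negative-sampling-style positive update: one gradient step pushes a context vector and a target word's output weights together, so the predicted probability for that target rises only slightly per example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 8
context_vec = rng.normal(scale=0.1, size=dim)  # averaged context input vector
target_out = rng.normal(scale=0.1, size=dim)   # target word's output weights
alpha = 0.025                                  # learning rate

before = sigmoid(context_vec @ target_out)

# One SGD nudge: the gradient of log-sigmoid for a positive example
# moves both vectors so the model is slightly more likely to predict
# this target in this context.
g = (1.0 - before) * alpha
context_vec, target_out = (context_vec + g * target_out,
                           target_out + g * context_vec)

after = sigmoid(context_vec @ target_out)
print(before, after)  # 'after' is only slightly higher than 'before'
```

Only many thousands of such tiny, partially-conflicting nudges leave the weights in a useful overall arrangement.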
To actually form a specific probability, the `predict_output_word()` function performs the quite-expensive calculation of testing the NN's outputs for *every* possible target vocabulary word, then normalizes those to sum to 1.0 (100%), then sorts all those probabilities and returns the top-N candidates. As I mentioned: expensive! Further caveats about this method's operation:
* For now, `predict_output_word()` is only implemented for Word2Vec models using negative-sampling, where there's a clear natural way to check each possible word's individual outputs. (The interpretation of the output-nodes of a hierarchical-softmax (`hs=1`) network, as probabilities for more-than-one possible prediction, is a bit murkier – though perhaps possible someday.)
* `predict_output_word()` is not applying the same sort of de-facto discounting of words further to the sides of the `window` (as happens during training), and may not be simulating skip-gram predictions correctly. (It does a CBOW-like summing of the window words, even if the model was trained in skip-gram mode. The net effect might be similar, though; I'm not sure.)
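Putting the above together, the computation is roughly the following (a simplified numpy sketch, not gensim's actual implementation; `syn1neg` is my stand-in name for the negative-sampling output weights, and the vectors are made-up):

```python
import numpy as np

def predict_output_word_sketch(context_vecs, syn1neg, vocab, topn=3):
    """Rough sketch: CBOW-style mean of the context vectors, scored
    against *every* word's output weights, normalized to sum to 1.0."""
    mean = np.mean(context_vecs, axis=0)   # CBOW-like combination
    scores = syn1neg @ mean                # one score per vocab word: expensive!
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # normalize to 1.0 (100%)
    top = np.argsort(-probs)[:topn]        # sort, return top-N candidates
    return [(vocab[i], float(probs[i])) for i in top]

# Tiny made-up example
rng = np.random.default_rng(0)
vocab = ["to", "model", "sentences", "using", "word2vec"]
syn1neg = rng.normal(size=(len(vocab), 4))      # output weights, one row per word
context_vecs = rng.normal(size=(2, 4))          # input vectors of the context words
print(predict_output_word_sketch(context_vecs, syn1neg, vocab))
```

The full pass over the vocabulary is what makes the real method costly for large vocabularies.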
- Gordon