You may want to look at the `predict_output_word()` method (available in recent gensim versions).
However, even though making the NN incrementally better at predicting words is how a Word2Vec model is built, the applications of Word2Vec word-vectors are typically *not* in predicting words. (Some other kind of model might be better at that task.) Rather, the word-vectors that fall out of the Word2Vec NN are exported and used for other purposes where they sometimes perform well, matching human intuitions about how words are related.
The training process doesn't actually form complete interpretable probabilities for a particular context/output. (And indeed in the original word2vec.c Google code, and earlier versions of gensim Word2Vec, there was no API like `predict_output_word()` to return specific word predictions.)
Instead, for each training-example, the model is just nudged to be slightly more compliant with that one example. So in your illustrative example: given a target of `sentences` and a context of `to model * using word2vec`, training doesn't form any absolute idea of the probability of `sentences` – it just takes whatever the model currently predicts, and nudges it slightly more in the direction of predicting `sentences`. It's only through the repetition of that nudge, over many contrasting examples that sometimes offset and other times reinforce each other, that the model ends up with a useful final many-dimensional relative arrangement of neural-network weights and word-vectors.
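To make "nudged" concrete, here's a toy numpy sketch (not gensim's actual code) of a single negative-sampling-style positive update: one gradient step pushes a context vector and a target word's output weights together, so the predicted probability for that target rises only slightly per example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim = 8
context_vec = rng.normal(scale=0.1, size=dim)  # averaged context input vector
target_out = rng.normal(scale=0.1, size=dim)   # target word's output weights
alpha = 0.025                                  # learning rate

before = sigmoid(context_vec @ target_out)

# One SGD nudge: the gradient of log-sigmoid for a positive example
# moves both vectors so the model is slightly more likely to predict
# this target in this context.
g = (1.0 - before) * alpha
context_vec, target_out = (context_vec + g * target_out,
                           target_out + g * context_vec)

after = sigmoid(context_vec @ target_out)
print(before, after)  # 'after' is only slightly higher than 'before'
```

Only many thousands of such tiny, partially-conflicting nudges leave the weights in a useful overall arrangement.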
To actually form a specific probability, the `predict_output_word()` function performs the quite-expensive calculation of testing the NN's outputs for *every* possible target vocabulary word, then normalizes those to sum to 1.0 (100%), then sorts all those probabilities and returns the top-N candidates. As I mentioned: expensive! Further caveats about this method's operation:
* For now, `predict_output_word()` is only implemented for Word2Vec models using negative-sampling, where there's a clear natural way to check each possible word's individual outputs. (The interpretation of the output-nodes of a hierarchical-softmax (`hs=1`) network, as probabilities for more-than-one possible prediction, is a bit murkier – though perhaps possible someday.)
* `predict_output_word()` is not applying the same sort of de-facto discounting of words further to the sides of the `window` (as happens during training), and may not be simulating skip-gram predictions correctly. (It does a CBOW-like summing of the window words, even if the model was trained in skip-gram mode. The net effect might be similar, though; I'm not sure.)
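Putting the above together, the computation is roughly the following (a simplified numpy sketch, not gensim's actual implementation; `syn1neg` is my stand-in name for the negative-sampling output weights, and the vectors are made-up):

```python
import numpy as np

def predict_output_word_sketch(context_vecs, syn1neg, vocab, topn=3):
    """Rough sketch: CBOW-style mean of the context vectors, scored
    against *every* word's output weights, normalized to sum to 1.0."""
    mean = np.mean(context_vecs, axis=0)   # CBOW-like combination
    scores = syn1neg @ mean                # one score per vocab word: expensive!
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # normalize to 1.0 (100%)
    top = np.argsort(-probs)[:topn]        # sort, return top-N candidates
    return [(vocab[i], float(probs[i])) for i in top]

# Tiny made-up example
rng = np.random.default_rng(0)
vocab = ["to", "model", "sentences", "using", "word2vec"]
syn1neg = rng.normal(size=(len(vocab), 4))      # output weights, one row per word
context_vecs = rng.normal(size=(2, 4))          # input vectors of the context words
print(predict_output_word_sketch(context_vecs, syn1neg, vocab))
```

The full pass over the vocabulary is what makes the real method costly for large vocabularies.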
- Gordon