Probability of a word

Mitchell Eccles

Mar 19, 2017, 8:15:10 AM

to gensim

Hi,

I'm just starting out on my journey with word2vec. I have a fairly large corpus (~6m lines/sentences) of text that I am using to train a word2vec model.

I have an idea that I can use the model to infer words from a string of untokenized text, e.g., given "astringoftext", I want to generate "a string of text" as the most likely tokenization of the string. The way I see this working is to evaluate every possible tokenization of a string and keep the one with the best score. For example, "astrin gof te xt" should not score as well as "a string of text".

The way I plan to calculate the score is to calculate the probability of every word/token given a context. So, for example, the probability of "string" given the context ["a", "of", "text"], the probability of "string" given ["a", "oftext"], etc. I would then keep the tokens with the highest probability that make a valid tokenization of the input string.
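
A minimal sketch of the search described above, assuming a pluggable scoring function. The names `segmentations`, `unigram_log_score`, `best_segmentation` and the `TOY_COUNTS` table are hypothetical, and the scorer here is a simple unigram stand-in, not a word2vec context probability; a context-based scorer could be swapped in through the `score` argument.

```python
import math
from typing import Callable, Dict, Iterator, List

def segmentations(s: str) -> Iterator[List[str]]:
    """Yield every way of splitting s into non-empty contiguous tokens."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        for rest in segmentations(s[i:]):
            yield [head] + rest

# Hypothetical toy counts; a real system would take these from the corpus.
TOY_COUNTS: Dict[str, int] = {
    "a": 50, "of": 40, "in": 30, "as": 10, "text": 8, "string": 5, "ring": 2,
}
TOTAL = sum(TOY_COUNTS.values())

def unigram_log_score(tokens: List[str]) -> float:
    """Sum of log unigram probabilities; unseen tokens get a length-based penalty."""
    score = 0.0
    for token in tokens:
        count = TOY_COUNTS.get(token)
        if count is None:
            # Assumed penalty: probability 1 / (TOTAL * 10 ** len(token)).
            score += -math.log(TOTAL) - len(token) * math.log(10)
        else:
            score += math.log(count / TOTAL)
    return score

def best_segmentation(
    s: str, score: Callable[[List[str]], float] = unigram_log_score
) -> List[str]:
    """Return the highest-scoring segmentation of s."""
    return max(segmentations(s), key=score)

print(best_segmentation("astringoftext"))
```

A string of n characters has 2^(n-1) segmentations, so exhaustive enumeration is only practical for short inputs; longer strings would need dynamic programming or pruning.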

I figure word2vec can help me with this problem, because of the way word2vec trains - it calculates probabilities of words given a context.

However, I'm struggling to work out how I can use the trained model to give me the probability of a word given a context... I get that a word2vec model can identify synonyms, related concepts, and analogies. But I guess my question boils down to, how do I access the trained hidden layer(?), so I can compute probabilities of potentially previously unseen words, and thus the probability of a word given a context? - if that makes sense?

Thanks

Leo Vogels

Mar 19, 2017, 1:58:10 PM

to gensim

Hello

In chapter 20 of the free ebook "Data Science from Scratch" you will find the complete algorithm for calculating (conditional) word probabilities.

Leo Vogels

On Sunday, 19 March 2017 at 13:15:10 UTC+1, Mitchell Eccles wrote:

Lev Konstantinovskiy

Mar 27, 2017, 7:40:04 PM

to gensim

Hi Mitchell,

Predicting a word from a context is implemented in the GitHub version of Gensim and will be included in the next release.

model.predict_output_word(['put-down-D', 'stack-C-B'])
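
In released versions the call takes a list of context words plus an optional `topn` argument and returns (word, probability) pairs. As far as I understand it, for a CBOW model trained with negative sampling this amounts to averaging the input vectors of the context words and taking a softmax of that hidden vector against the output weights. A rough pure-Python sketch of that computation follows; the vocabulary, the toy weights and the function name `predict_word_probabilities` are all hypothetical, not Gensim's own code.

```python
import math
from typing import Dict, List

# Hypothetical toy model: 2-dimensional vectors for a 3-word vocabulary.
VOCAB = ["string", "of", "text"]
INPUT_VECTORS = {"string": [1.0, 0.0], "of": [0.0, 1.0], "text": [1.0, 1.0]}
OUTPUT_VECTORS = {"string": [0.5, -0.5], "of": [-0.5, 0.5], "text": [1.0, 1.0]}

def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict_word_probabilities(context: List[str]) -> Dict[str, float]:
    """Softmax over the vocabulary given the mean input vector of the context."""
    # Out-of-vocabulary context words have no vector, so they are skipped.
    known = [w for w in context if w in INPUT_VECTORS]
    if not known:
        return {}
    dim = len(INPUT_VECTORS[known[0]])
    # Hidden layer: the mean of the context words' input vectors.
    hidden = [sum(INPUT_VECTORS[w][d] for w in known) / len(known) for d in range(dim)]
    # Unnormalised score for each candidate word, then normalise to sum to 1.
    scores = {w: math.exp(dot(hidden, OUTPUT_VECTORS[w])) for w in VOCAB}
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()}

probs = predict_word_probabilities(["string", "of"])
print(sorted(probs.items(), key=lambda kv: kv[1], reverse=True))
```

Note that only words in the trained vocabulary can be scored this way; a token that never appeared in training has no output vector, so it gets no probability at all.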

Regards

Lev
