Equivalent function for predict_output_word in gensim.models.fasttext?


Shonda W.

Feb 26, 2020, 9:53:38 PM
to Gensim
Hello,

I am trying to predict a target word given context words in gensim.models.fasttext, but I don't see a function to do so.
Curiously, it seems there was such a function in the now-deprecated version of fasttext (https://radimrehurek.com/gensim/models/deprecated/fasttext.html):
>> predict_output_word(context_words_list, topn=10)

>> Report the probability distribution of the center word given the context words as input to the trained model.


Unfortunately, the documentation does not list an equivalent function in the current version of fasttext, only stating that this has been "deprecated since version 3.3.0" and to use gensim.models.fasttext instead.
I am wondering if there is even an implementation of predict_output_word in the current version of gensim fasttext, and if not, are there plans to include it soon?

Thanks in advance

Gordon Mohr

Feb 27, 2020, 12:58:50 PM
to Gensim
Your best bet would be to study the implementation of that `predict_output_word()` method where it still exists, and adapt that code to work on your FastText model, from outside the model, to meet your needs. For example, you can see the source for that method on `Word2Vec` at:


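In case it helps, here's a rough, untested sketch of how that logic might be adapted to run against a FastText model from the outside. It assumes a gensim 3.x-era model trained with negative sampling and the usual attribute layout (`model.wv.vocab`, `model.wv.index2word`, output-layer weights in `model.trainables.syn1neg`); the function name and details here are mine, so adjust them to whatever your gensim version actually exposes:

    import numpy as np

    def predict_center_word(ft_model, context_words, topn=10):
        """Rough CBOW-style center-word 'prediction', adapted from the
        Word2Vec.predict_output_word() logic, run against a FastText model.
        Assumes negative-sampling training (output weights in syn1neg)."""
        # keep only in-vocabulary context words (no special OOV handling here)
        in_vocab = [w for w in context_words if w in ft_model.wv.vocab]
        if not in_vocab:
            return None

        # CBOW-style input: mean of the (subword-inclusive) context word-vectors
        l1 = np.mean([ft_model.wv[w] for w in in_vocab], axis=0)

        # hidden -> output layer, then a softmax over the whole vocabulary
        scores = ft_model.trainables.syn1neg.dot(l1)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()

        top = np.argsort(probs)[::-1][:topn]
        return [(ft_model.wv.index2word[i], float(probs[i])) for i in top]

    # e.g.: predict_center_word(model, ['the', 'quick', 'fox', 'jumps'], topn=5)
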
But it's important to note:

* these models aren't necessarily very good at such center-word predictions. While something like such a prediction is their nominal training goal, training doesn't involve a rigorous, full prediction (which would be very expensive). Rather, there are just repeated 'sparse' nudges to make the model a little better at whatever training example it's currently looking at. It's the interesting side-effect – useful arrangements in distance/direction of word-vectors – that's the point, not ever being actually good at such predictions. (It's likely that even more-simple cooccurrence tables could give better predictions.)

* where this method is implemented, it only works for models with 'negative' sampling; it uses a CBOW-like process even in skip-gram trained models; and it doesn't overweight nearby words the way training does (instead treating all words in the full `window` equally). So it's best considered an experimental curiosity, rather than a true reading of the training-time pseudo-predictions. Many (most?) word2vec implementations don't provide such a utility method at all. 

- Gordon

Shonda W.

Feb 27, 2020, 4:01:57 PM
to Gensim
Thank you very much for your reply and for the link to the source of that method. I will definitely take a look.

However, I do hope you can help clarify one thing I am confused about:

>> But it's important to note:
>> * these models aren't necessarily very good at such center-word predictions.

Quite interesting. To my understanding, the point of choosing the CBOW or SG algorithm was how the word vectors would be generated (best explained with the famous "the quick brown fox jumps over the lazy dog" example and a window size of 2):
SG: the input is the center word, while the outputs are the context words surrounding it
EXP: the quick brown fox jumps over the lazy dog -> (brown, the); (brown, quick); (brown, fox); (brown, jumps)

CBOW: the inputs are the context words surrounding the target word, which is the output
EXP: the quick brown fox jumps over the lazy dog -> (the quick fox jumps, brown)

Unless I am missing something, isn't that why we specify either CBOW or SG during training, and why it should be good at predicting a target word?
Otherwise, why does gensim's own word2vec still include it if, by your own admission, "It's likely that even more-simple cooccurrence tables could give better predictions"? I understand that without a significant number of samples the target-word predictions wouldn't be perfect, but it still seems useful to include, since fasttext is supposed to be an extension of word2vec. That leaves me curious as to why predict_output_word still exists in one, but was removed from the other...

Gordon Mohr

Feb 27, 2020, 8:57:43 PM
to Gensim
On Thursday, February 27, 2020 at 1:01:57 PM UTC-8, Shonda W. wrote:
Thank you very much for your reply and for the link to the source of that method. I will definitely take a look.

However, I do hope you can help clarify one thing I am confused about:

>> But it's important to note:
>> * these models aren't necessarily very good at such center-word predictions.

Quite interesting. To my understanding, the point of choosing the CBOW or SG algorithm was how the word vectors would be generated (best explained with the famous "the quick brown fox jumps over the lazy dog" example and a window size of 2):
SG: the input is the center word, while the outputs are the context words surrounding it
EXP: the quick brown fox jumps over the lazy dog -> (brown, the); (brown, quick); (brown, fox); (brown, jumps)

CBOW: the inputs are the context words surrounding the target word, which is the output
EXP: the quick brown fox jumps over the lazy dog -> (the quick fox jumps, brown)

Unless I am missing something, isn't that why we specify either CBOW or SG during training, and why it should be good at predicting a target word?

Yes, the choice of either SG or CBOW affects what particular training 'micro-examples' are created and presented to the shallow neural-network, and how the neural-network's weights are gently corrected.

In SG training, the NN will be presented with input that's single context-word-vectors. In CBOW training, the NN will be presented with input that's an average of multiple context-word-vectors. In both cases, the NN's outputs will be (loosely, 'sparsely') compared to outputs that indicate specific output words, and nudged to do slightly better on that one example, before moving on to the next micro-example.
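As a toy illustration of that difference (plain Python, not gensim internals), using your own sentence with a window of 2 around 'brown':

    sentence = "the quick brown fox jumps over the lazy dog".split()
    window, center_pos = 2, 2                    # center word: 'brown'
    center = sentence[center_pos]
    context = [sentence[i]
               for i in range(max(0, center_pos - window),
                              min(len(sentence), center_pos + window + 1))
               if i != center_pos]               # ['the', 'quick', 'fox', 'jumps']

    # Skip-gram: one micro-example per context word; the NN input each time
    # is the single word-vector for 'brown'.
    sg_examples = [(center, c) for c in context]

    # CBOW: one micro-example whose NN input is the *average* of the context
    # word-vectors, nudged toward the center word at the output.
    cbow_example = (context, center)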

But at no point during training or typical use of the word2vec algorithm is any actual full prediction – of a single most-likely word, or of a ranked list of words by their relative probability – made. To do so during training would be so expensive as to be impractical - requiring calculation of the NN's output for every output node, not just the tiny handful of nodes checked/updated in actual 'sparse' training. 
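To put rough numbers on that, with a hypothetical 100,000-word vocabulary and gensim's default of 5 negative samples:

    vocab_size = 100_000        # hypothetical vocabulary size
    negative = 5                # gensim's default number of negative samples

    full_prediction_cost = vocab_size      # output nodes scored for a real softmax
    sparse_training_cost = 1 + negative    # target node + sampled negative nodes

    print(full_prediction_cost / sparse_training_cost)   # ~16,667x more work per example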

And, none of the word2vec papers promote such shallow networks as an objectively "good" way to predict target words. And, the original `word2vec.c` code released by the Google researchers had no function/utility for doing exact predictions, during or after training.

The point of word2vec is obtaining word-vectors that happen to be useful for many other applications. The training process that pokes-and-prods the NN in ways to make it gradually more predictive of bulk text is just a useful way to force those vectors as a by-product, even if that NN is never very good at those predictions, nor even ever fully evaluated for its individual predictions. 

Otherwise, why does gensim's own word2vec still include it if, by your own admission, "It's likely that even more-simple cooccurrence tables could give better predictions"? I understand that without a significant number of samples the target-word predictions wouldn't be perfect, but it still seems useful to include, since fasttext is supposed to be an extension of word2vec. That leaves me curious as to why predict_output_word still exists in one, but was removed from the other...

Someone contributed the `predict_output_word()` method to `Word2Vec`, even with all its caveats & limitations. For a while, the way the class hierarchy worked, FastText inherited it for free – though whether it was ever tested or fit for any purpose, in Word2Vec or elsewhere, I don't know. 

Then someone else reorganized the class hierarchy such that the method wasn't shared by both classes any more. (There wasn't any conscious decision to "remove" it. Just that, as an idiosyncratic, incomplete, not-often-used and not-often-useful feature, it wasn't maintained for `FastText` across other changes.) That's where we are now. 

If you adapt the ~16 lines of code from the `Word2Vec` implementation to run against your `FastText` model, you'll get as good a result as ever worked in gensim. But for all the reasons previously mentioned, I wouldn't expect such results to be especially meaningful or valuable. And, if your real end goal is word-prediction, simpler methods, including a strict look-up from a table of all training-text cooccurrences, might perform better than something compressed through the Word2Vec/FastText process.
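For what it's worth, that count-based baseline is only a few lines itself. A sketch, assuming you have the same tokenized sentences you trained on and the same window-of-2 idea (names here are just for illustration):

    from collections import Counter, defaultdict

    def build_cooccurrence(sentences, window=2):
        """For each observed word, count which center words it co-occurs with."""
        table = defaultdict(Counter)
        for sent in sentences:
            for i, center in enumerate(sent):
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        table[sent[j]][center] += 1
        return table

    def predict_from_counts(table, context_words, topn=10):
        """Sum the per-context-word counts and return the most frequent centers."""
        totals = Counter()
        for w in context_words:
            totals.update(table.get(w, Counter()))
        return totals.most_common(topn)

    # table = build_cooccurrence(sentences)
    # predict_from_counts(table, ['the', 'quick', 'fox', 'jumps'])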

- Gordon