Word2Vec: OOV (Out of Vocabulary) options and normalized word vectors

Ilias Chalkidis

Apr 24, 2017, 7:32:59 AM
to gensim
Hi there,

Word2vec model (Gensim 0.13.3)

I've been using the gensim word2vec implementation for almost a year. I trained my most recent word2vec model with gensim 0.13.3 and saved it using save_word2vec_format() in binary format.

Updating to Gensim 2.0.0

I recently updated my system to gensim 2.0.0 and started using the KeyedVectors class to load and use my word embeddings, as a simple dictionary, as usual.

These days I'm researching some optimisations for my neural networks, and I've started looking into ways of handling OOV (Out of Vocabulary) words; until now I've just been using random word embeddings. I considered gensim's similarity functions given the context words of the OOV word, but they don't look like such a good idea when I look at some specific cases and their printouts.

Looking at the actual code for the Word2Vec and KeyedVectors classes, I found a function in the Word2Vec class that looks like a much more reasonable solution. It's called predict_output_word(self, context_words_list, topn=10).

So I thought maybe I could use this function, given the previous/next 2 words of my OOV word, to get the most probable center word.

The problem here is that this function lives in the Word2Vec class, while in gensim 2.0.0 we have to load word2vec models using the KeyedVectors class, so we cannot call it directly.

Can you think of any possible solution to this? Maybe it's obvious and I'm just missing something...


Considering normalized word vectors

While I was looking at the gensim code, I also found the function word_vec(word, use_norm=False); so instead of using KeyedVectors as a simple dictionary, one can use this function to get normalized word vectors, which according to the literature provide better results in some cases.

If you load your word2vec model with load_word2vec_format(), and try to call word_vec('greece', use_norm=True), you get an error message that self.syn0norm is NoneType. 

Does this only happen with word2vec models trained using older versions of gensim, where syn0norm probably did not exist, or is this an actual bug (todo, whatever)?

I could probably resolve all of this on my own by coding these functionalities myself, but that is not the point; I would rather follow the gensim library API.

Thanks for any answer!

Gordon Mohr

Apr 24, 2017, 1:19:09 PM
to gensim
The new `predict_output_word()` method requires a full trained model, with extra internal weights that are not saved in the final-vectors-only format used by `save_word2vec_format()`. If you save a model using gensim's native `save(filename)`, then reload it via `Word2Vec.load(filename)`, you'll have a fully-populated Word2Vec model against which you can use `predict_output_word()`.
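
For example, something along these lines (an untested sketch against the gensim 2.0-era API, with a toy corpus and placeholder filename):

    from gensim.models import Word2Vec

    sentences = [['the', 'quick', 'brown', 'fox'], ['jumps', 'over', 'the', 'lazy', 'dog']]  # toy corpus

    model = Word2Vec(sentences, size=200, min_count=1)
    model.save('my_model.w2v')              # native save keeps syn1neg and the vocabulary info

    model = Word2Vec.load('my_model.w2v')   # a fully-populated Word2Vec model again
    # most-probable center words for a two-words-before/two-words-after context
    print(model.predict_output_word(['quick', 'brown', 'jumps', 'over'], topn=5))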

The `syn0norm` normalized vectors are not saved with the vectors, since they can always be recalculated as needed. The built-in similarity methods ensure that's pre-calculated with a call to `init_sims()` – so if you're planning to use `word_vec(key, use_norm=True)` directly, you should also call `init_sims()` after loading the raw, non-normalized vectors, so that `syn0norm` exists.  
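
Roughly (a sketch, with a placeholder filename):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format('my_vectors.bin', binary=True)
    wv.init_sims()                                    # populates syn0norm from the raw vectors

    raw_vec = wv.word_vec('greece')                   # raw (un-normalized) vector
    norm_vec = wv.word_vec('greece', use_norm=True)   # unit-length vector, now that syn0norm exists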

- Gordon

Ilias Chalkidis

Apr 24, 2017, 1:56:47 PM
to gensim
Dear Gordon, thanks a lot for your reply; it's really informative. Some extra comments for further discussion and understanding:

About predict_output_word()

So, if I get it right, in my case I have to retrain a word2vec model and save it using the save() function instead of save_word2vec_format().

Some additional info about that:
  • In this case the word2vec model works as an actual classifier and can predict output words from context words? As far as I can see (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py), the function first computes a centroid vector from the context words' vectors (this is effectively the first hidden layer's output) and then tries to predict the output word using the weight matrix self.syn1neg (the only thing we are missing right now); see the rough sketch after this list.
  • Do you consider this a reasonable solution for OOV words? It seems so to me, compared to completely random embeddings.
  • Is the saved model going to be much larger than the one in binary format? Right now the model is 417MB with a vocabulary of 500K x 200 dimensions, so memory space is a consideration.
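
For my own understanding, this is roughly what I read the method as doing (a rough numpy paraphrase of the gensim source, for intuition only, not a drop-in replacement):

    import numpy as np

    def rough_predict_output_word(model, context_words, topn=10):
        """Rough paraphrase of Word2Vec.predict_output_word(), for intuition only."""
        # indices of the context words that are in the vocabulary
        word_indices = [model.wv.vocab[w].index for w in context_words if w in model.wv.vocab]
        if not word_indices:
            return None

        # "hidden layer": sum (or mean, for CBOW-mean models) of the context input vectors
        l1 = np.sum(model.wv.syn0[word_indices], axis=0)
        if model.cbow_mean:
            l1 /= len(word_indices)

        # score every vocabulary word against the output weights syn1neg, softmax, take top-n
        scores = np.exp(np.dot(l1, model.syn1neg.T))
        probs = scores / scores.sum()
        best = np.argsort(probs)[::-1][:topn]
        return [(model.wv.index2word[i], float(probs[i])) for i in best]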
About normalised vectors 

This seems pretty straightforward; I'm going to try it soon as a first step, using init_sims(). In case we have saved the actual Word2Vec object, can we use normalized vectors? In other words, would self.syn0norm already be pre-computed? This is also a function of KeyedVectors, which means we can use it through the Word2Vec object, right?


Thanks a lot, I appreciate it! It seems there is always more we can do to improve even the early input of our neural models :)

Ilias

Gordon Mohr

Apr 25, 2017, 4:21:59 PM
to gensim
You will need a full model, as might be saved by `save()`, to use `predict_output_word()`.

Note it's likely to be very slow – and more so the larger your vocabulary. The method essentially checks the model's output-weights for *every* word in the vocabulary, to find the most-predicted words. Also, I hadn't looked closely at its implementation before – it seems to assume CBOW mode, even if skip-gram training occurred, which may or may not be a reasonable approach.

Whether this offers any benefit in the case of OOV words would depend on what you're using those deduced OOV word vectors for. I doubt it would help in the case where those OOV vectors are again being used to model the source context/sentence meaning – the exercise of deducing the OOV vector has still just used that limited surrounding context, which was already in the surrounding words. It might do OK if trying to deduce near-synonyms of the OOV word. In either case, synthesizing the OOV-vector from multiple new-contexts of its appearance, or the top-N of the model's predictions for those contexts, might do better. 

I would expect ignoring OOV words entirely would be better than supplying either random vectors, or some shared `_UNK_` vector. (For words about which there are few examples, not much can be said... and trying to say much may just be injecting noise. But eliding such words can sometimes, by shrinking the de facto contexts, improve other words during training, and might also improve downstream goals.)

Methods like FastText take advantage of the fact that word-fragments (char-ngrams) may themselves in some languages be suggestive of meaning. By learning vectors for those fragments, and forcing the learned word-vectors to be combinations of the vectors of their fragments, the fragments can also be used, post-training, to make educated guesses at reasonable vectors for previously-unseen words. It does better than random, especially if the OOV words are just word-form-variants or typos of well-trained words, but you'd have to test it in your setup. (Gensim can call-out to the FastText package to do that training, and load the resulting vectors for evaluation/use.)
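
If you want to try that route, something roughly like this should work with the gensim wrapper of that era (the fasttext binary path and corpus file are placeholders for your own setup):

    from gensim.models.wrappers import FastText

    # call out to the external fasttext binary for training...
    ft_model = FastText.train('/path/to/fasttext', corpus_file='corpus.txt', size=200)
    # ...or load a model trained outside gensim:
    # ft_model = FastText.load_fasttext_format('my_fasttext_model')

    print('greeece' in ft_model.wv.vocab)   # False: an OOV word (here, a typo)...
    print(ft_model['greeece'][:5])          # ...still gets a vector built from its char n-grams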

The saved model with full internal weights and retained vocabulary info will likely be a little more than double the size of the vectors-only save. 

`syn0norm` is only populated when a built-in method (or your code) calls `init_sims()`. It's not saved even with `save()`, because it can be recalculated from the saved raw vectors. Word2Vec/KeyedVectors work the same way in this regard. 
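
So after a native load it is the same two-step pattern, just through the model's `wv` attribute (sketch, placeholder filename):

    from gensim.models import Word2Vec

    model = Word2Vec.load('my_model.w2v')             # full model saved earlier with save()
    model.wv.init_sims()                              # syn0norm is computed here, not at save time
    vec = model.wv.word_vec('greece', use_norm=True)  # normalized vector via the KeyedVectors API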

- Gordon

Ilias Chalkidis

Apr 25, 2017, 6:24:22 PM
to gensim
I was discussing OOV-word replacement methods for missing words during neural-net training and prediction. In the case of sequence classification using RNNs, one can completely remove the missing word and everything will be fine.

In the case of sequence labelling (POS tagging, NER), there is no such option. You need to provide a word representation of some kind (e.g. a random embedding, an 'UNK' embedding or predict_output_word([Wt-2, Wt-1, Wt+1, Wt+2])) for every single word in your sequence. As I already mentioned, right now I just provide a random embedding and hope that the neural net absorbs the noise.

In my new word2vec model, the first step will be to replace unknown words not with a single UNK token, but with multiple UNKs according to the POS tag of each word. So I will learn multiple UNK embeddings related to part-of-speech tags/roles. Then, during neural-net training and prediction, I will replace each OOV word with the related UNK (e.g. UNK_V, UNK_NP, UNK_NNP, etc.).
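
The replacement step itself is simple; a minimal sketch (the tagger and vocabulary are whatever you already use, the names here are just illustrative):

    def replace_oov(tagged_sentence, vocab):
        """Replace OOV words with a POS-specific UNK token, e.g. UNK_NNP, UNK_VBD."""
        return [word if word in vocab else 'UNK_%s' % pos
                for word, pos in tagged_sentence]

    # the same mapping is applied when building the word2vec training corpus
    # and when looking up embeddings for the tagging/NER network
    sentence = [('Yesterday', 'NN'), ('Blorptex', 'NNP'), ('rose', 'VBD')]
    print(replace_oov(sentence, vocab={'Yesterday', 'rose'}))
    # -> ['Yesterday', 'UNK_NNP', 'rose']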

I was just wondering whether predict_output_word([Wt-2, Wt-1, Wt+1, Wt+2]) is also a good choice. But quality is not the only factor in choosing such a practice; there are other important considerations:
  • As you already mentioned, the word2vec model will be (2+)x the current size, close to 1GB.
  • Also, in the rare situation of an OOV word (approx. 1-3% in my case), calling predict_output_word([Wt-2, Wt-1, Wt+1, Wt+2]) might add too much latency; you also mentioned this important point.
In production systems like those I aim to support, as opposed to academic ones, time and memory are the real problems; in many cases they are more important than some borderline improvement of 1-2% in the metrics.

Anyway, I can try all these different approaches and sort out the best available option, but I think a better-trained (bigger corpus) word2vec model with multiple UNKs and normalized vectors would probably give a borderline improvement without time delays or excessive memory use.

I recently read an article on rare-technologies.com (https://rare-technologies.com/fasttext-and-gensim-word-embeddings/), which shows that FastText is better in cases where morpho-syntactic information is important, while word2vec is still the best available choice when we need semantic coverage. My tasks are mostly named-entity recognition and sentence classification, so I really need good semantics; I also use POS-tag embeddings for syntactic information, also trained using word2vec.

But at the end of the day, char-level CNN papers are multiplying like mad out there... Right now I consider those approaches a huge overkill, with a borderline improvement in the best case. Still, word-level RNNs seem like good options for most NLP tasks/cases.

Derek Thomson

Jun 16, 2017, 8:02:12 AM
to gensim
Hello,
I'm also trying to use predict_output_word() within the Word2Vec class (initially for performance evaluation).
It's a fresh installation, and the functions within the KeyedVectors class run successfully.

The steps taken so far:

1. Starting from GoogleNews-vectors-negative300.bin, saved into text format and trimmed down to a more manageable 1/8th size, with the header count adjusted to match the contents.

2. This file is re-saved using model.save('filename'), giving two files, the second ending in .syn0.npy.
(I assume that the format the model was loaded from (txt) doesn't make a difference to the gensim save.
The top line of the saved word2vec model file mentions 'gensim.models.keyedvectors'.)

3. In another script, trying to reload the word2vec model gives the error:

    model = gensim.models.Word2Vec.load('filename')
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1400, in load
    if model.negative and hasattr(model.wv, 'index2word'):
AttributeError: 'KeyedVectors' object has no attribute 'negative'

I've had a look at the source, but I'm still a bit of a Python newbie...
Are there any other steps required, or parameters needed, when creating or loading the word2vec model?

Best regards,

Derek

Gordon Mohr

Jun 16, 2017, 12:39:16 PM
to gensim
You won't be able to use the `predict_output_word()` function with the GoogleNews vector set – that's just the learned vectors, while `predict_output_word()` needs a full NN model (with hidden->output weights, which aren't in the GoogleNews pretrained vectors).

(`predict_output_word()` will work with models trained in gensim, that are just native-`save()`d. But note even there it currently only works with negative-sampling models, and is quite slow. Despite improved-predictions being the *training goal* during word2vec vector creation, such predictions are not the usual true goal of the word-vectors – and many word2vec implementations, including the original Google word2vec.c, don't even have an API for reading the ranked word-predictions of a model.)

Specifically what's happening is that:

* when you use `load_word2vec_format()`, you're getting back a KeyedVectors object, to reflect that your model is vectors-only

* thus when you re-`save()`, it's still just a KeyedVectors object on-disk

* trying to re-`load()` that as a Word2Vec object triggers the error you're seeing – because `model` is not a Word2Vec object. (Gensim should probably emit a more descriptive error when this is attempted.)
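
So if you want to keep working with that vector set, reload it under its actual class (sketch, placeholder filename); just note that `predict_output_word()` will still be unavailable, since there is no full model on disk:

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load('filename')       # works: the pickled object really is a KeyedVectors
    print(wv.most_similar('king', topn=3))   # the usual vectors-only operations are fine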

- Gordon

Radim Řehůřek

Jun 17, 2017, 1:18:47 AM
to gensim
* trying to re-`load()` that as a Word2Vec object triggers the error you're seeing – because `model` is not a Word2Vec object. (Gensim should probably emit a more descriptive error when this is attempted.)

This is a good idea -- Ivan, can we add a little check to the `load` method, to emit a warning if the loaded object is not an instance of the class that `load` was called on? (In this case, a KeyedVectors is not an instance of Word2Vec.)
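
Something like this is what I have in mind (just a sketch of the check, not the final placement or wording):

    import logging
    from gensim import utils
    from gensim.models import Word2Vec

    logger = logging.getLogger(__name__)

    def checked_load(cls, fname):
        """Sketch: warn if the unpickled object is not an instance of the requested class."""
        loaded = utils.SaveLoad.load(fname)
        if not isinstance(loaded, cls):
            # e.g. a pickled KeyedVectors loaded through Word2Vec.load()
            logger.warning("loaded object is a %s, not a %s as requested",
                           type(loaded).__name__, cls.__name__)
        return loaded

    # e.g. checked_load(Word2Vec, 'filename') would warn when the pickle holds a KeyedVectors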

Cheers,
Radim

Derek Thomson

Jun 17, 2017, 4:32:05 AM
to gensim
Gordon,

Thank you very much for the prompt and comprehensive reply, that really clears things up. The top line of the file should have given me a clue.

It's an amazing technology (and I'm pleased to have been able to try it, thanks in large part to the forums, documentation, and the efforts of you both), but alas, probably not a fit for my purposes. The project I'm working on is an assistive tool which predicts the next word(s) and disambiguates sparse user input. It currently uses an n-gram model, but I'm on the lookout for any compatible lower-perplexity solutions (skip-grams, genre/topic detection) to overcome the locality issues of n-grams.

Best regards,

Derek T