updating vocabulary for word2vec: no training for new words


Andrey Kutuzov

Jan 30, 2016, 9:04:15 PM
to gen...@googlegroups.com
Hi,

For some tasks I need to update word2vec models with new data, including
adding new words to the model.

There is no such functionality in Gensim as of now (there is PR#435 on
Github, but it is inactive and outdated as far as I understand).

So I wrote my own function to extract new words from new training
corpus, add these words to the existing models' vocabulary and then
continue training.

The problem is that newly added words do not train. I mean, their
vectors in syn0 remain identical to those initialized randomly before
training. At the same time, vectors for `old' words do change. It is as
if new words are not listed somewhere, but I fail to understand where.

Are there any restrictions on which words get their vectors updated, apart
from the word having to be present in the vocabulary and its sample_int
having to be larger than model.random.rand() * 2**32? I am rather puzzled
by this.

I attach the function itself. It works for CBOW with negative sampling
and downsampling disabled (sample=0), and takes as input an existing model,
an iterator over the new data, the number of sentences in the new data,
and a min_count for the new data.
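
In case the attachment does not come through, below is a simplified sketch
of what the function does. It is not the attached file itself; it assumes
the current gensim API (model.vocab, model.index2word, model.syn0,
model.syn1neg, model.syn0_lockf), and the function and variable names are
only placeholders:

    from collections import Counter
    import numpy as np
    from numpy import zeros, ones, vstack, float32 as REAL
    from gensim.models.word2vec import Vocab

    def add_new_words(model, sentences, min_count=5):
        """Add words unseen by `model` from `sentences` and grow its weights."""
        counts = Counter(w for sentence in sentences for w in sentence)
        new_words = [w for w, c in counts.items()
                     if c >= min_count and w not in model.vocab]

        # register the new words in the existing vocabulary
        for word in new_words:
            v = Vocab(count=counts[word])
            v.index = len(model.vocab)
            # copy sample_int from an existing word (all equal, since sample=0)
            v.sample_int = model.vocab[model.index2word[0]].sample_int
            model.vocab[word] = v
            model.index2word.append(word)

        # grow the weight matrices to the new vocabulary size
        added = len(new_words)
        new_rows = (np.random.rand(added, model.layer1_size).astype(REAL) - 0.5) \
                   / model.layer1_size
        model.syn0 = vstack([model.syn0, new_rows])
        model.syn1neg = vstack([model.syn1neg,
                                zeros((added, model.layer1_size), dtype=REAL)])
        model.syn0_lockf = ones(len(model.vocab), dtype=REAL)

        # rebuild the negative-sampling table for the enlarged vocabulary
        model.make_cum_table()
        return added

After that I simply continue with model.train() on the new data.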


--
Solve et coagula!
Andrey
update_model.py

Gordon Mohr

Feb 1, 2016, 8:23:47 PM
to gensim
The most sound approach in such a case would be to merge your new text examples with the original ones, and train a new model from scratch with all the vectors of interest – then use those vectors from then on. 

Any incremental approach raises murky questions: should the preexisting-word vectors continue to adjust based on the new text examples, what is the appropriate amount of training to balance the influence of the original and new sessions, and so forth. Whether the end-results would remain useful for your purpose would be an open question. 

Looking at your code:

* You don't show how the initial model was set up, but that will affect how this incremental training proceeds. (For example, while `newmodel` has a `sample=0` parameter, if the original `model` used sampling that would affect how training proceeds and what `sample_int` value *all* of the imported words inherit from `model.vocab[model.index2word[0]].sample_int`.)

* Replacing the learned-so-far `model.syn0neg` with an all-zeros array may not be what you want. (It definitely needs to grow to match the new predicted-vocabulary size, but whether starting-from-zeros or from-what-was-learned is better is one of the 'open questions' of this kind of exercise.)

* That there's no change at all by your `train()` indicates that perhaps the `data` isn't providing examples at all by then. Of course if *old* vectors are changing there must be some training happening. But, are you sure `data` is a restartable Iterable as opposed to a one-time Iterator/Generator? (A small illustration of the difference is below.) Does logging output suggest the expected number of new examples are seen in training?
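
For illustration (generic Python, not your code, and the file name is just a placeholder): a generator is exhausted after one pass, while an object with its own `__iter__` can be iterated over repeatedly – which is what `train()` needs if the stream has already been consumed once, for example while building the vocabulary:

    def sentences_gen(path):
        # one-shot generator: after a full pass it yields nothing more
        for line in open(path):
            yield line.split()

    class RestartableSentences(object):
        # restartable iterable: each call to __iter__ starts a fresh pass
        def __init__(self, path):
            self.path = path
        def __iter__(self):
            for line in open(self.path):
                yield line.split()

    data = sentences_gen('new_corpus.txt')
    print(sum(1 for _ in data))   # some positive count
    print(sum(1 for _ in data))   # 0 -- the generator is already spent

    data = RestartableSentences('new_corpus.txt')
    print(sum(1 for _ in data))   # the same count
    print(sum(1 for _ in data))   # the same count again -- safe for train()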

- Gordon

Andrey Kutuzov

Feb 1, 2016, 10:31:19 PM
to gen...@googlegroups.com
Hi Gordon,

Thanks for the answer.
I agree that an ideal approach would be to retrain the model from
scratch including new texts in the training corpus. But sometimes it is
impossible due to time limits, especially if the initial corpus is very
large. Also, tracking how distributional models develop after
additional training is an interesting research problem in itself.
This is why I try to update a model with new data, not simply retrain
from scratch.

Considering your other questions:
1) The original model hyperparameters precisely mimic those of the new
model, including `sample=0'. In this case, I do not use downsampling,
because stop words were removed from the training corpus beforehand.
It means that `sample_int' is the same for all words in the model, if I
understand correctly.
2) I tried to retain the original syn1neg and simply append new zero rows
to it to match the new vocabulary size. It didn't change anything.
3) The `data' certainly provides examples. It is an instance of
LineSentence on a simple gzipped text file (see the snippet below).
Logging outputs proper progress, with the right number of words in the end.
And the corpus from which `data' is loaded surely contains my test
words. When I train a model from scratch on this corpus, these words get
quite good and meaningful vectors. So, I expected that this would hold
when I feed this new data to the original model: that it would acquire
these new words and learn vectors for them.
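
A minimal sketch of how the data is set up (the file name here is just a
placeholder):

    from gensim.models.word2vec import LineSentence

    # LineSentence is a restartable iterable and reads the gzipped file directly
    data = LineSentence('new_corpus.txt.gz')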

Why the vectors stay unchanged is a big puzzle for me. Maybe you could
try my code with your own data?

Gordon Mohr

Feb 2, 2016, 10:42:43 PM
to gensim
I can reproduce a similar effect, but why the vectors aren't being further adjusted is still a mystery for me, too. (The `train()` is apparently iterating over the examples in the incrementally-provided corpus, and the model state looks to me like it should enable normal word and syn0-index lookups, and error-updates.) 

One difference in what I'm seeing, versus your initial-message report, is that after the incremental-training, I'm not seeing 'old' words change, either. 

Also: when I disable the cython-optimized routines to use the pure-python paths, the vectors *do* change with more training. (This is true both in the current 0.12.4/develop code, and the 0.12.3 release – so it's not a bug in any recent optimizations.)

So, still investigating.

- Gordon

Andrey Kutuzov

Feb 3, 2016, 10:38:42 AM
to gen...@googlegroups.com
Hm, no, I confirm that here 'old' words do change their vectors after
incremental training (of course, if the new data is small, this change
can be microscopic).

By the way, how can one temporarily disable the cython-optimized routines
for Gensim word2vec? What is the correct way to do this?

Gordon Mohr

Feb 3, 2016, 2:15:38 PM
to gensim
To disable the cython routines, it's enough to just skip the loading of the compiled word2vec_inner routines. To force that:

You can edit the word2vec.py file so the import doesn't happen and the "except ImportError:" pure-Python alternative code always runs – see the try/except block around the word2vec_inner import near the top of that file.

(I've sometimes just put a "raise ImportError()" as the first line inside the try to temporarily force this effect.)

Or, you can rename aside the `word2vec_inner.so` compiled-library, so it's not found. 

However, as this can make training 100x slower (or more), it's not very attractive even as a temporary workaround... just a debugging aid.
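
A quick way to confirm which path is active (assuming the usual module layout):

    from gensim.models import word2vec

    # FAST_VERSION is set by the try/except import of word2vec_inner:
    # -1 means the cython routines were not loaded and the slow pure-Python
    # training path will be used; any other value means cython is active.
    print(word2vec.FAST_VERSION)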

- Gordon

Andrey Kutuzov

Feb 3, 2016, 3:27:40 PM
to gen...@googlegroups.com
I confirm: without the cython routines, vectors for the new words do change
after incremental training.
Does this mean there is some bug in the Cython implementation of the
word2vec algorithms?

Gordon Mohr

Feb 4, 2016, 6:46:13 PM
to gensim
Quite possibly it's a bug – the cython routines are (generally) supposed to be work-alikes. 

Though, as this is a particular extra function (expanding the vocabulary) that wasn't yet designed-for or tested-for, there may yet be a way to get it to work through both paths without patching the gensim code. (Without yet knowing the reason for the discrepancy, hard to say...)

- Gordon

Radim Řehůřek

Mar 15, 2016, 11:29:52 PM
to gensim
Wasn't there some "lock" variable that affected whether/how much each word vector ought to change?

This locking was added later on; I don't remember the details, but I'm wondering whether it may be related to the mystery here.

Although, that still wouldn't explain why the Cython/Python versions differ.

-rr

Andrey Kutuzov

Mar 16, 2016, 7:18:48 AM
to gen...@googlegroups.com
Hi Radim,

No, I explicitly switch this variable to the ON state:
model.syn0_lockf = ones(len(model.vocab), dtype=REAL)

So, this is not the issue.
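
As I understand the code, syn0_lockf[i] is a per-word multiplier on the
update applied to syn0[i]: 1.0 means the vector trains normally, 0.0
freezes it. So with all ones, as above, nothing should be locked. If one
wanted the opposite – to train only the newly added words – it would be a
sketch like this, with n_old standing for the size of the old vocabulary:

    from numpy import zeros, ones, concatenate, float32 as REAL

    # freeze the pre-existing vectors, let only the appended ones train
    model.syn0_lockf = concatenate([zeros(n_old, dtype=REAL),
                                    ones(len(model.vocab) - n_old, dtype=REAL)])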