Creating a Word2vec model with a vocabulary that includes unused words, to accommodate training the model on words that are not in the initial set but appear in additional training sentences


Stephen Firrincieli

Oct 5, 2016, 1:58:25 PM
to gensim
Hi, from https://groups.google.com/d/msg/gensim/CBPl4aXN7Ao/4ITyOWvxDgAJ I see that it is not possible to add new words to an existing model through the train() method.

In our use case we are creating several models based on sentences from particular years, each of which is cumulative. So the 1998 model also includes 1997, 1996, etc., all the way back to the beginning of our data. We would like to train each new year's model on the new sentences using the train() method, but we need each new year's model to include the new words that appear in that year's sentences. The first year we have data for is relatively sparse in terms of vocab, so a model trained only on that year's sentences misses a lot of the vocab that appears in later years. Currently we are creating a new Word2vec model object for each year using every sentence from every year up to that point, but this seems unnecessary.

What we are thinking of doing is initializing an empty Word2vec model and calling build_vocab() on it with a list of every sentence from every year as the argument. Then, for each year, we would call train() on the model with that year's sentences and save the model, repeating for each subsequent year.

It would look something like this:

import gensim

# load all sentences from all years into this variable
all_sentences = ...

cumulative_model = gensim.models.Word2Vec()
cumulative_model.build_vocab(all_sentences)

cumulative_model.train(sentences_1992)
cumulative_model.save('1992_model')

cumulative_model.train(sentences_1993)
cumulative_model.save('1993_model')

# and so on

So I am wondering if you could offer insight on whether this will work the way we want it to. We expect it will leave a lot of words with all-zero vectors, which would make all of those unused words appear exactly similar to each other, so we would need to check for that. But otherwise, will the unused words have an effect on the vectors of the used words?

Thanks in advance.

Andrey Kutuzov

Oct 5, 2016, 2:49:34 PM
to gen...@googlegroups.com
Hi,

This will work, but only if at training start you already have all the
subsequent training corpora (to extract the vocabulary from them). It
will not work if you expect new, unseen texts to come in the future.

Your 'unused' words will not have zeroes for their vectors: the vectors
are initialized randomly when the vocabulary is built. So they will not
be similar to each other, but rather uniformly distributed across the
vector space.
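
For example (a rough sketch; the toy corpora are placeholders, and this
uses the gensim API of that era, where model['word'] returns a vector;
newer versions use model.wv['word'] and require explicit
total_examples/epochs in train()):

import gensim

# Toy corpora: 'queen' never appears in the early batch.
early = [['king', 'rules', 'the', 'land']] * 50
later = [['queen', 'rules', 'the', 'land']] * 50

model = gensim.models.Word2Vec(min_count=1)
model.build_vocab(early + later)  # vocabulary covers both batches
model.train(early)                # train on the early batch only

# 'queen' was never trained, but its vector is not all zeros:
# it keeps the random initialization assigned by build_vocab().
print(model['queen'])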

--
Solve et coagula!
Andrey

Stephen Firrincieli

Oct 5, 2016, 4:37:01 PM
to gensim
Thanks for the response. Our corpus is fixed so we do not need to worry about any new words if we set up the vocabulary in the beginning.

One thing I noticed is that calling the most_similar() method on models generated with the original approach (creating a new model for each year) returns slightly different results than calling it on models built with the whole vocabulary and many unused words.

I'm guessing this is related to the random vector initialization that you mentioned. Would you expect this to be a problem? I only performed this test on the first few years, for which there is not a lot of data in our set, so I am thinking the results will converge as the text used for training grows in the later years.

Lev Konstantinovskiy

Oct 6, 2016, 12:49:45 AM
to gensim
Hi Stephen,

There is a new feature called Online word2vec, to be included in the next release, that allows vocabulary expansion. We would appreciate your feedback on it.


Regards
Lev

Stephen Firrincieli

Oct 6, 2016, 8:17:25 AM
to gensim
Hi, thanks, that looks like exactly what we need. I'm guessing the "online" part of the name is just a reference to the fact that it's useful for building models from newly created or changed text from the web?

I'd be happy to try this out, but we are using this for research with a very large corpus (48bn raw words). Is Online word2vec complete and tested enough that we would be okay to use it in this way?

Lev Konstantinovskiy

Oct 6, 2016, 9:23:06 AM
to gensim
Hi Stephen,

Online word2vec should be OK. It has been tested with unit tests and on Wikipedia.

"Online" stands for "online training", where the model changes as new data is fed into it.

Looking forward to your feedback,
Lev

Stephen Firrincieli

Oct 6, 2016, 11:09:26 AM
to gensim
Great - thank you for your help. We are going to go forward with online Word2Vec. 

One thing I am wondering about is how much randomness there is in these models. Is Online Word2Vec equivalent to generating a new model from the sentences currently in the model plus the sentences added to the vocab and trained, with any difference we see in word similarities between the two ways of creating the models just down to the randomness of Word2Vec?

Lev Konstantinovskiy

Oct 6, 2016, 8:52:07 PM
to gensim
Hi Stephen,

The difference in random initialisation of vectors will make the models different, but given the size of your data it should be negligible.
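
If you want to reduce that variation while you compare approaches, you can pin the randomness down (a sketch; for fully reproducible runs gensim also needs a single worker thread and a fixed PYTHONHASHSEED, because words are hashed when their vectors are seeded):

import gensim

# A fixed seed makes the random vector initialization repeatable;
# a single worker thread avoids ordering jitter from multithreading.
model = gensim.models.Word2Vec(seed=42, workers=1)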

By the way, you might be interested in Procrustes alignment of word2vec models for different time periods in https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf
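
The core of that alignment is an orthogonal Procrustes solve: find the rotation that best maps one model's vectors onto the other's over their shared vocabulary. A minimal sketch of the idea (not the gist's code; model_a and model_b are two trained models, accessed through the model.vocab attribute of that era):

import numpy as np

def align(model_a, model_b):
    # Rotate model_b's vectors into model_a's space (orthogonal Procrustes).
    # In practice you would length-normalize all vectors first.
    shared = [w for w in model_a.vocab if w in model_b.vocab]
    a = np.vstack([model_a[w] for w in shared])
    b = np.vstack([model_b[w] for w in shared])
    # The SVD of the cross-covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(b.T.dot(a))
    rotation = u.dot(vt)
    return {w: model_b[w].dot(rotation) for w in shared}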

Gordon Mohr

Oct 7, 2016, 5:38:09 PM
to gensim
For now, I disagree about the appropriateness of the recent work that's been called "Online Word2Vec". 

"Online" is a bit of a misnomer - "online" often implies a model that can always take any increment of new input, and tends to gives meaningful (and improving) results for every increment. This recent work instead lets you supply a big new batch of examples to grow the known vocabulary. So a better term for this feature would be "vocabulary expansion". 

To then train with just the new examples might improve *or deteriorate* the quality of word-vectors, and while the model will technically let you compare words in the new batch with words in the old batch, the quality of such comparisons is hard to know, and probably gets *worse* the more a later batch is trained, against the usual expectation that more training always helps. (Word-vectors are only comparable to the extent they were trained against each other; all training on a later batch is improving the words, with respect to only those later examples, at the likely cost of worsening them, with respect to the earlier examples.)

The testing that's occurred with this new feature has really only verified that new tokens are available with at-a-glance somewhat-meaningful vectors. The effect on existing tokens, or relations with tokens that don't appear in later training batches, hasn't been evaluated. (I'm also not sure it's doing the best thing with respect to features like frequent-word downsampling.)

I don't know of any project write-ups on the right way to choose a new `alpha` learning-rate decay, or relative number of passes, for meaningful results. It is likely the right choices for these values will vary a lot based on the relative sizes and vocabulary-overlap of your incremental batches. Unrealistic expectations for what this vocab-expansion feature achieves, given that reasonable/best practices are not yet known, may encourage wasteful fumbling trying to use it, with no or negative benefit.
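
For reference, the knobs in question in the gensim of that era are the model's learning-rate attributes, set before each incremental train() call (a sketch only; model and new_batch are placeholders, and the values shown are arbitrary, since choosing them well is exactly the open question):

# 'model' is an already-trained Word2Vec model; 'new_batch' is a later
# year's sentences. No good general-purpose schedule is known.
model.alpha = 0.01        # lower starting learning rate for the later batch
model.min_alpha = 0.0001  # the rate decays to this value over the batch
model.train(new_batch)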

I would only recommend people call Word2Vec/Doc2Vec `train()` with incremental subsets of data if they…

(1) …have a good understanding of what's happening behind the scenes and thus the limitations on strong interpretations of results; and…
(2) …have effective project-specific evaluation mechanisms to check whether this feature, using various parameter choices, are helping or hurting the resulting vectors. 

(And if these do apply to anyone reading this who manages to profitably apply this feature, by all means please write up and share what you learn!)

Stephen, if your actual goal is to study the evolution of word meanings over time, I would also suggest looking at possible techniques for creating mappings between different models, based on known-comparable words, as discussed (with references including the 'alignment' code pointer Lev shared) in our feature wishlist: https://github.com/RaRe-Technologies/gensim/wiki/Word2Vec-&-Doc2Vec-Wishlist#implement-translation-matrix-of-exploiting-similarities-among-languages-for-machine-translation

Another option might be to create a merged overall corpus that's expanded with synthetic words representing a word-in-a-single-year. For example, an original sentence in 1993 that's…

"The apple doesn't fall very far from the tree"

…might be included both as-is and with multiple era-specific transformations…

"The apple^1993 doesn't fall^1993 very far^1993 from the^1993 tree"
"The^1993 apple doesn't^1993 fall very^1993 far from^1993 the tree^1993"
…etc…

You'd then shuffle all data together, and ensure before training that 'apple', 'apple^1993', 'apple^1994', etc. are all equivalently initialized. In that way, all words (era-oblivious and era-specific) are trained against each other, in a mixture of era-specific and era-oblivious contexts. The differences in era-specific words are then more likely to yield meaningful comparisons. (Though I'd try to validate this assumption against words that are perhaps so old/common they're not expected to change, or against random subsets of words for which any indicated drift must just be an artifact of the subsetting.)
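
A sketch of that transformation (a hypothetical helper; the alternating pattern follows the example above):

def era_variants(tokens, year):
    # Yield the sentence as-is, then copies where alternating words are
    # tagged with the era, so each word occurs both tagged and untagged.
    yield list(tokens)
    for offset in (0, 1):
        yield ['%s^%d' % (tok, year) if i % 2 == offset else tok
               for i, tok in enumerate(tokens)]

All the variants of all sentences would then be shuffled together into the single training corpus.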

- Gordon

Lev Konstantinovskiy

Oct 11, 2016, 2:07:27 AM
to gensim
Hi Gordon,

I've added proper testing of Online word2vec to the Proposals page.

Regards
Lev

Gordon Mohr

Oct 11, 2016, 3:13:36 PM
to gensim
To avoid creating/raising unrealistic expectations, can we also stop referring to it in issues/release-notes/etc as 'online'? 

- Gordon

Stephen Firrincieli

Oct 20, 2016, 1:29:49 PM
to gensim
Hi Gordon,

Thank you so much for the long explanation - that information is very helpful. I apologize for taking so long to reply; I had not seen your message until now.

What we are trying to do is analyze the trends in Word2Vec similarities over time, with a focus (for now) on one particular term and similarities of other terms to it.

If I am understanding you correctly, you are saying that for each new set of sentences we would be better off generating a new model based on the entire collection up to that point, rather than using online Word2Vec or building the first model with the entire vocabulary and calling `.train()` with each new set of sentences (which might behave in unexpected ways)?

The papers referenced in that link look interesting, I will try to figure out if we can use that in our implementation.

Gordon Mohr

Oct 20, 2016, 6:17:58 PM
to gensim
Yes, anything which incrementally calls `train()` with a different subset of data brings up many murky issues about the relative influence of examples and stability-of-comparisons (based on learning-rates, number-of-examples, similarity-of-examples, ordering-of-training, number-of-iterations, etc.).

I've not seen a good exploration of these trade-offs written up anywhere, but unless/until they're well characterized, the interpretability of cross-model (or cross-training-epoch) comparisons based on incremental training would be suspect. 

- Gordon