Doc2Vec: reset_weights(), hs vs negative, DBOW & window

Gordon Mohr

May 10, 2016, 6:17:48 PM
to gensim
`reset_weights()` is an internal method which allocates/initializes the model's big arrays before training starts. It's done automatically for you after the vocabulary is discovered and finalized (as for example at the end of `build_vocab()`). In most cases you wouldn't want to ever call it yourself, but it could make sense if you want to try re-training a model (from fresh array values) after tweaking a few of the parameters that don't require re-discovering the vocabulary. There's no official description of exactly when such a shortcut would be OK versus when it is insufficient – only reading and understanding the source code will allow that decision. The safe, supported thing is to create a fresh model in the usual way for each new set of parameters you try.
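
For concreteness, here's a minimal sketch of that safe route (a fresh model per parameter combination), with a toy corpus standing in for real data; exact parameter names and the train() signature can vary a bit between gensim versions:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # tiny stand-in corpus; substitute your own iterable of TaggedDocument objects
    corpus = [
        TaggedDocument(words=['doc2vec', 'learns', 'document', 'vectors'], tags=['doc_0']),
        TaggedDocument(words=['gensim', 'implements', 'doc2vec', 'and', 'word2vec'], tags=['doc_1']),
    ]

    for neg in (5, 10):
        # the safe, supported route: a completely fresh model per parameter set;
        # build_vocab() scans the corpus, then (internally, via reset_weights())
        # allocates and seeds the big weight arrays before training
        model = Doc2Vec(dm=0, hs=0, negative=neg, min_count=1, workers=2)
        model.build_vocab(corpus)
        model.train(corpus)  # newer gensim versions also require total_examples/epochs here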

`reset_from()` is also usable as an advanced shortcut that lets a model re-use some of the structures from an already-initialized model, and thus save a bit of time and memory, if the vocabulary internals were going to be identical copies of each other anyway. But again, it's only safe if you're sure those structures would be equivalent, which requires understanding the code internals for each of the models' options. For example, a model that only uses negative-sampling wouldn't be a suitable `reset_from()` source for another model that only uses hierarchical-softmax, because it wouldn't have initialized the same structures. So until you've read & understood the internals, the safe/supported thing is not to attempt such shared-structure optimizations.
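
Continuing with the toy corpus from the sketch above, the shortcut looks roughly like this; it's only reasonable here because both models use negative sampling with the same vocabulary-related settings, as discussed:

    from gensim.models.doc2vec import Doc2Vec

    # advanced shortcut: reuse the vocabulary structures of an already-built model
    base = Doc2Vec(dm=0, hs=0, negative=5, min_count=1, workers=2)
    base.build_vocab(corpus)           # the expensive vocabulary scan happens once
    base.train(corpus)

    variant = Doc2Vec(dm=1, window=5, hs=0, negative=5, min_count=1, workers=2)
    variant.reset_from(base)           # copy base's vocab structures instead of rescanning
    variant.train(corpus)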

Note that `hs` stands for `hierarchical SOFTMAX`, not `hierarchical SAMPLING` – though it's an easy mistake to make and one I've also made multiple times. Hierarchical softmax and negative sampling are two alternative options for how to calculate the *predictions* of the neural network, and thus also the *errors* to be back-propagated. They are relevant options for both Word2Vec (whether CBOW or Skip-Gram) and Doc2Vec (whether DBOW or DM). Neither has any effect on whether `window` is consulted: `negative`-vs-`hs` are ways to construct/interpret the NN *output*, while `window` can affect the construction of the NN *input*.

If you enable both, in a sense you have two separate NNs, but they each share the same input (aka 'projection') layer, before implementing separate outputs. And each training-example is run through both NNs before moving to the next training-example – an interleaved, dual-objective training.
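
To make that concrete, the three possibilities are just flag combinations at model creation time; a rough sketch, not a recommendation of any particular setting:

    from gensim.models.doc2vec import Doc2Vec

    # hierarchical softmax only: Huffman-tree output layer, no sampled 'noise' words
    m_hs   = Doc2Vec(dm=0, hs=1, negative=0, min_count=1)

    # negative sampling only: here 5 noise words are drawn per positive example
    m_neg  = Doc2Vec(dm=0, hs=0, negative=5, min_count=1)

    # both enabled: two output layers sharing one input/projection layer, with
    # each training example updating both (the interleaved dual objective above)
    m_both = Doc2Vec(dm=0, hs=1, negative=5, min_count=1)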

I personally doubt the benefits of using both, and most published results seem to pick one or the other, but using both was a possibility in the original Google word2vec.c code, so the option was retained in gensim. (Using both may look impressive if you're only counting training epochs – it seems to do more in one epoch, especially the first. But when you then adjust for the added time, or run for a more typical number of epochs, that fast-out-of-the-gate benefit seems to disappear.) I'm not sure whether HS is ever faster, or more at risk of interference between similarly-frequent words: you'd have to test what tradeoffs apply with your corpus and settings.

I haven't seen any rules-of-thumb that choose the number of negative examples (value of `negative` parameter) based on document-sizes. One of the original Google Word2Vec papers (https://arxiv.org/abs/1310.4546) says: "Our experiments indicate that values of [negative-sample count] k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5." So if anything it's more a function of corpus size than average document/example size, and when you meta-optimize for your corpus you may be surprised to find very-small values work best. 
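
If you do want to meta-optimize it for your corpus, a rough sketch of such a sweep follows, reusing the toy corpus from above; evaluate_model() is a hypothetical stand-in for whatever downstream evaluation you actually care about:

    from gensim.models.doc2vec import Doc2Vec

    # hypothetical sweep over negative-sample counts;
    # evaluate_model() is a placeholder you'd define for your own task
    results = {}
    for k in (2, 5, 10, 20):
        m = Doc2Vec(dm=0, hs=0, negative=k, min_count=1, workers=2)
        m.build_vocab(corpus)
        m.train(corpus)
        results[k] = evaluate_model(m)   # e.g. downstream classification accuracy

    best_k = max(results, key=results.get)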

In pure DBOW, `window` is irrelevant because the doc-vectors are (by themselves) trained equally to predict every word across the whole document. (There's no sliding-context-window of mixing-with-neighboring words.) But people often want word-vectors along with the DBOW vectors, which means you'd add `dbow_words=1` to the Doc2Vec initialization, and then `window` is relevant again. Values of 2-10 seem common, and again, sometimes smaller values perform best on the downstream evaluations of word-vectors. 
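
In code terms, roughly (again just a sketch, with parameter names as in current gensim):

    from gensim.models.doc2vec import Doc2Vec

    # pure DBOW: only doc-vectors are trained, so `window` has no effect
    pure_dbow = Doc2Vec(dm=0, dbow_words=0, min_count=1)

    # DBOW plus interleaved skip-gram word training: now `window` matters again
    dbow_plus_words = Doc2Vec(dm=0, dbow_words=1, window=5, min_count=1)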

You'd have to run your own tests as to whether different ways of breaking documents into smaller chunks affect your end evaluations. (My impression is that if you have enough data for useful vectors to be learned, the learning is *not* very sensitive to different choices of sentence/paragraph/example boundaries. But maybe some datasets/goals are sensitive to that.) The one gensim limitation to keep in mind is that text examples longer than 10,000 words will be truncated in the cython-optimized code to 10,000 words (with any overflow per example ignored) – so only then would you likely need to break up longer examples. 
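
If you do have examples longer than that limit, a small hypothetical helper along these lines can split them before training (the tag scheme is just an illustration, not part of gensim):

    from gensim.models.doc2vec import TaggedDocument

    MAX_WORDS = 10000  # gensim's optimized code silently truncates longer examples

    def chunk_document(words, tag_prefix, max_words=MAX_WORDS):
        # split one long token list into several TaggedDocuments of <= max_words each;
        # a hypothetical helper, not part of gensim itself
        for i, start in enumerate(range(0, len(words), max_words)):
            yield TaggedDocument(words=words[start:start + max_words],
                                 tags=['%s_%d' % (tag_prefix, i)])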

- Gordon

On Tuesday, May 10, 2016 at 5:09:55 AM UTC-7, Izzet Pembeci wrote:

Gordon, these are excellent explanations. Thank you so much. I read the papers and spent time with the gensim API docs and whatever tutorials I could find, but without your messages (this thread and previous long ones) a deep understanding of gensim's power and its many options wouldn't have been possible for me.

Just to make this thread a reference one for future gensim word2vec/doc2vec users I have some other questions about details:

  • What is the use case of reset_weights? Radim's doc2vec notebook shows a nice trick: build the vocabulary for one model first (Doc2Vec.build_vocab) and then copy it to other models with Doc2Vec.reset_from. If you want to test models with different parameters, this helps reduce model-building time (the separate training runs are unavoidable, but at least you skip the vocab-building step). I suspect reset_weights can be used in a similar manner, but I'm not sure how, or whether it offers something beyond reset_from.
  • I am also having trouble fully understanding the hs (hierarchical sampling) and negative (negative sampling) parameters. From the paper Radim linked, what I understand is that these are in play only in PV-DBOW (skip-gram/distributed bag of words) type models (dm=0). When one of them is turned on, the window parameter is ignored. From what you wrote below, it doesn't make sense to use both of them at the same time. I am guessing that using hs=1 will result in a bigger model footprint (the output layer of the NN is replaced by a binary tree which still contains all the words as leaves) but dramatically reduce training time. Is that true? Do we lose some information when we use hs? For instance, will some words be regarded as equal (affecting the similarity predictions in the same way) just because their frequencies are similar? When we use negative sampling, should we try to set its value relative to the document size? For instance, if your documents have 200 tokens on average then maybe the suggested 5-20 values would be appropriate, but if your docs are longer (2000 tokens on average) then a higher value for negative would make more sense?
  • Maybe related to the last one: if we are using DBOW and window is ignored, then does it make any difference to conduct the training sentence by sentence as in word2vec, or can we just feed the whole doc and the results will not be much different? Should I try intermediate approaches (i.e., feed the doc as chunks of sentences) and test the models, or should I not waste my time with these tests since algorithmically they won't have any effect on the model produced? If we use the whole-doc approach (and hopefully decrease training time), maybe we should increase the iterations in compensation, to allow the NN to converge.

Thanks again for all the insights and the detailed explanations.


iZzeT
