Does word2vec hierarchical softmax train on every single possible context?

usedforpr...@gmail.com

May 14, 2015, 5:29:13 PM
to word2vec...@googlegroups.com
Just wanted to see if someone could check my understanding:

Example:
The cat ate the mouse.

Assuming a four-word context window:
The: <pad>, <pad>, cat, ate
cat: <pad>, The, ate, the
ate: The, cat, the, mouse
the: cat, ate, mouse, <pad>
mouse: ate, the, <pad>, <pad>


usedforpr...@gmail.com

May 14, 2015, 5:33:14 PM
to word2vec...@googlegroups.com, usedforpr...@gmail.com
Also, what is the difference between negative sampling and subsampling as described in the paper?

Are the same methods used in Paragraph Vector?

Gojo Git

May 15, 2015, 12:43:34 PM
to word2vec...@googlegroups.com, usedforpr...@gmail.com
Here's my understanding, as applied to your questions:

Regarding your subject line question, "hierarchical softmax" describes one way of composing the desired *output* of the neural network, and acquiring the prediction errors to backpropagate. It doesn't describe how to generate the training examples (or input to the network). 

Your list of contexts is roughly correct as to the 5 training examples that your sentence would provide – specifically, using the "CBOW" model with a 'window' of 2. (2 words before, 2 words after.) The candidate vectors for those 4 context words would be summed (or averaged) to get the input to the neural network. The network forward-calculations are run, and then a small relevant subset of the network's output is checked for accuracy. (In "hierarchical softmax", a subset of nodes that binary-encode the desired target word are checked, so some of these nodes have a target value of 1.0 and others 0.0. In "negative sampling", a node representing the target word is checked for closeness to 1.0 and N nodes representing other random words are checked for closeness to 0.0.) The errors are backpropagated, nudging the network's hidden weights and the candidate vector representations to do a little better on the current training example.
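
To make the negative-sampling variant of that concrete, here's a rough numpy sketch of a single CBOW training step (not the actual word2vec.c code, just its shape; the syn0/syn1neg names echo that code's arrays, and I'm ignoring details like learning-rate decay and gradient clipping):

import numpy as np

# syn0: input word vectors, syn1neg: output-side weights, both shaped (vocab_size, dim)
def cbow_neg_step(syn0, syn1neg, context_ids, target_id, negative_ids, alpha=0.025):
    h = syn0[context_ids].mean(axis=0)         # average the context words' candidate vectors
    grad_h = np.zeros_like(h)
    # the true target should score near 1.0, each randomly-drawn negative near 0.0
    for word_id, label in [(target_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        score = 1.0 / (1.0 + np.exp(-np.dot(h, syn1neg[word_id])))  # sigmoid of the output node's activation
        err = (label - score) * alpha
        grad_h += err * syn1neg[word_id]       # collect error to send back to the inputs
        syn1neg[word_id] += err * h            # nudge the output-side weights
    syn0[context_ids] += grad_h                # nudge every context word's vector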

I said 'roughly correct' above because in the CBOW code I've seen, there's no padding-word added/averaged in. The context just shrinks to the actual boundaries of the source text. (So your first training-context would be: [The: cat, ate].)

However, it seems typical to preserve punctuation – the '.' would be a context token with its own vector – and to create context windows from multi-sentence text. To the extent a <pad> token might help, punctuation tokens may too. (So, your last two contexts might be instead: [mouse: ate, the, '.'] and ['.': the, mouse].)

Also, the actual window used around any target word is randomly chosen from 1 up to the 'window' parameter, serving to give the closer words more weight. (So of your listed contexts, in an actual run about half would be smaller than you've shown, using an effective window of ±1 instead of ±2.)
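
In code, gathering the contexts that way might look roughly like this (a sketch, not the C code; gather_contexts is just a name I made up here):

import random

def gather_contexts(tokens, window=2):
    for pos, target in enumerate(tokens):
        b = random.randint(1, window)          # effective window: anywhere from 1 up to 'window'
        left = tokens[max(0, pos - b):pos]     # simply shrinks at the text boundaries, no <pad>
        right = tokens[pos + 1:pos + 1 + b]
        yield target, left + right

# e.g. list(gather_contexts("The cat ate the mouse .".split()))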

In the "skip-gram" mode alternative to "CBOW", rather than averaging the context words, each is used as a pairwise training example. That is, in place of one CBOW example such as [predict 'ate' from average('The', 'cat', 'the', 'mouse')], the network is presented with four skip-gram examples [predict 'ate' from 'The'], [predict 'ate' from 'cat'], [predict 'ate' from 'the'], [predict 'ate' from 'mouse']. (The same random window-reduction occurs, so half the time that would just be two examples, of the nearest words.) 

Subsampling is a choice independent of any of the above: it will randomly drop some of the most-frequent words from the source text, before preparing any contexts. The training can then be a bit faster, since a lot of (plausibly-redundant) contexts will be skipped. Also, since the unsampled words disappear entirely, it effectively makes the context windows of remaining nearby words a little larger. (In your sentence, if the 2nd 'the' was randomly dropped, then 'cat' and 'mouse' would be in each other's context windows.)
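
Concretely, the per-occurrence keep-probability in word2vec.c works out to roughly the following (a sketch as I read the C code; freqs is an assumed dict mapping each word to the fraction of the corpus it makes up):

import math, random

def subsample(tokens, freqs, sample=1e-3):
    kept = []
    for w in tokens:
        keep_prob = (math.sqrt(freqs[w] / sample) + 1) * (sample / freqs[w])
        if random.random() < keep_prob:        # rare words get keep_prob >= 1.0, so they always survive
            kept.append(w)                     # dropped occurrences vanish from every nearby context
    return kept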

All of these choices also apply to paragraph-vectors. There, the "DBOW" method is most analogous to skip-grams: its training examples are pairs like [predict 'cat' from 'DOC_1']. (In this mode, you don't necessarily even have to do any word-to-word prediction, so you might not use or create any word-vectors.)

The "DM" method is more analogous to CBOW: one of its training examples might be [predict 'cat' from average('DOC_1', 'The', 'ate, 'the')]. This will inherently train word and document vectors simultaneously. The paragraph-vectors paper also suggests concatenation as another option (rather than sum or average) for combining the context vectors for presentation to the prediction neural network. (In such a case, you'd need <pad> words and might not do any dynamic window-reduction.) 
 
- Gordon 

usedforpr...@gmail.com

May 19, 2015, 4:43:24 PM
to word2vec...@googlegroups.com, usedforpr...@gmail.com
Thank you, very insightful. Just a few more questions...

Regarding negative sampling:

So before training, a set of (word, context) pairs (w, c) is prepared, each with a random window size up to the selected maximum.

1) In "word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method" on pg. 2, it's mentioned that the context for dog and the word vector for dog are different. Is this only relevant to contexts with one word?

Then, for each (w,c) I would select k negative examples.

2) Are the negative contexts generated word by word, as in I create brand-new contexts by randomly selecting each word from its unigram distribution? Or do I just select random contexts that I initially created from the dataset? It seems like entire contexts are drawn, but aren't whole contexts very rare?

3) Should I check whether a negative example I selected is already a positive example for that word?

Gojo Git

May 19, 2015, 7:03:42 PM
to word2vec...@googlegroups.com, usedforpr...@gmail.com


On Tuesday, May 19, 2015 at 1:43:24 PM UTC-7, usedforpr...@gmail.com wrote:
Thank you, very insightful. Just a few more questions...

Regarding negative sampling:

So before training, a set of (word, context) pairs (w, c) is prepared, each with a random window size up to the selected maximum.

1) In "word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method" on pg. 2, it's mentioned that the context for dog and the word vector for dog are different. Is this only relevant to contexts with one word?

I can't say I fully understand that footnote. It is true that the representations of word-predictions (the NN outputs, to either the hierarchical-softmax coding nodes or the neg-sampling terminal nodes, one per word) are distinct from the representations of word-inputs (the continuous N-dimensional vectors that are being learned). So while the N-dimensional vector-representation of 'dog' will vary over training, the word 'dog' as an output/prediction will have exactly one discrete and unchanging target representation: a binary code over several nodes (HS), or a single activated terminal node (negative sampling). I think that's their point.
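
A toy illustration of the distinction, with a made-up vocabulary and a made-up Huffman code, just to show the shape of each thing:

import numpy as np

vocab_index = {"dog": 0, "cat": 1}             # toy vocabulary, for illustration only
syn0 = np.random.rand(len(vocab_index), 100)   # learned input vectors, nudged on every relevant example

dog_input_vector = syn0[vocab_index["dog"]]    # dense and continuously changing during training
dog_hs_code = [1, 0, 1, 1]                     # hypothetical Huffman code: fixed 1.0/0.0 targets, one per inner node
dog_neg_node = vocab_index["dog"]              # negative sampling: one fixed output node per word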

One extra note if comparing the implementations with the original paper: the paper describes predicting surrounding words ('contexts') from a central input word. In practice it seems implementations cycle through the surrounding words ('contexts'), to predict a central target word. (Comments outside the paper have suggested this reaches functionally similar end results, but is more efficient.) 

Then, for each (w,c) I would select k negative examples.

2) Are the negative contexts generated word by word, as in I create brand-new contexts by randomly selecting each word from its unigram distribution? Or do I just select random contexts that I initially created from the dataset? It seems like entire contexts are drawn, but aren't whole contexts very rare?

The negative samples are alternate predicted-words, chosen randomly. They might not be the worst possible errors for the NN to make, but on average over many examples, the randomly-chosen words will be predictions-that-shouldn't-be-made. 

3) Should I check whether a negative example I selected is already a positive example for that word?

The implementations I've seen check whether a neg-word matches the *current* pos-word and skip the neg-word if by chance they match. See for example: <https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c#460>. But I doubt that's strictly necessary. After all, it's even possible one of the other randomly-chosen negative samples is the *best* possible prediction, elsewhere across the whole corpus, for the current inputs. But it's not the right target for this one training-example, and tallied over all the training the 'good' predictions will still be most-reinforced and 'bad' ones most-suppressed.
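
In sketch form (unigram_table is an assumed pre-built list of word ids in which each word appears in proportion to its count raised to the 0.75 power, like the C code's table):

import random

def draw_negatives(unigram_table, positive_id, k=5):
    negs = []
    for _ in range(k):
        cand = random.choice(unigram_table)
        if cand == positive_id:                # the C code just skips such a draw rather than redrawing...
            continue                           # ...so a given example can occasionally get fewer than k negatives
        negs.append(cand)
    return negs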

- Gordon