What is the interpretation/utility of negative sampling in doc2vec?


Deepak George

Jul 25, 2016, 5:23:31 AM
to gensim
Hi

Can someone explain the negative sampling parameter in doc2vec? I went through a few papers but couldn't fully understand it.
 model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)

Questions
1) What is the layman's interpretation of negative sampling?
2) Why is negative sampling required?
3) What is the math behind negative sampling?
4) Is there a recommended value of negative sampling for document clustering? Since this is an unsupervised task, I cannot use cross-validation.
 


Gordon Mohr

Jul 25, 2016, 1:17:32 PM
to gensim
Doc2Vec is a small variant on word2vec, and the formalities are discussed in section 2.2, "Negative Sampling", of one of the original Word2Vec papers, "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013).


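For reference, the per-example objective from that section replaces the full softmax: with input vector $v_{w_I}$, output vectors $v'_w$, and $k$ negatives drawn from a noise distribution $P_n(w)$, each (context, target) pair maximizes

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

i.e. one boosted true-target term plus $k$ suppressed noise terms.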
Essentially, when you know locally that a certain stimulus (context) to the neural-network-in-training should predict word W1, you can trace those connections and nudge the weights to boost that one corresponding output node. But making the W1 prediction also implies other words shouldn't be predicted. (And maybe there are other words W2, W3, etc. that are also good predictions for the same context.) You don't have full global knowledge of all the words to predict or not, and you don't want to iterate over *all* the network's output nodes on every training example.

So with negative sampling, you randomly pick N other words and just nudge the network to suppress those N corresponding output nodes. By bad luck, one of your N negative words might in fact be W2, a word the same stimulus *should* predict, given another text example elsewhere in the corpus! But ultimately that doesn't matter: the N words are overwhelmingly likely to be genuine "shouldn't-predict-here" words, and W2 etc. will get their chances to be boosted when they come up in the training data later, so it all works out.
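That boost-one/suppress-N update can be sketched in a few lines of NumPy. Everything here is a hypothetical toy (vocabulary size, dimensions, learning rate, uniform negative draws); the real word2vec.c draws negatives from a smoothed unigram distribution, not uniformly, and Doc2Vec adds document vectors alongside word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, num_negative, lr = 1000, 100, 5, 0.025

# Two weight matrices, as in word2vec: input (context) and output vectors.
W_in = rng.normal(scale=0.01, size=(vocab_size, dim))
W_out = rng.normal(scale=0.01, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(context_idx, target_idx):
    """One negative-sampling step: boost the true target's output node,
    suppress num_negative randomly drawn output nodes."""
    h = W_in[context_idx].copy()
    # Label 1 for the true target, 0 for each randomly drawn negative.
    pairs = [(target_idx, 1.0)]
    pairs += [(int(n), 0.0) for n in rng.integers(0, vocab_size, size=num_negative)]
    grad_h = np.zeros(dim)
    for idx, label in pairs:
        score = sigmoid(h @ W_out[idx])
        g = lr * (label - score)          # push score toward its label
        grad_h += g * W_out[idx]          # accumulate gradient for the input vector
        W_out[idx] += g * h               # nudge this output node up or down
    W_in[context_idx] += grad_h           # apply accumulated update once, as word2vec.c does

# Repeatedly showing the network one (context, target) pair raises its score.
before = sigmoid(W_in[3] @ W_out[42])
for _ in range(500):
    train_pair(3, 42)
after = sigmoid(W_in[3] @ W_out[42])
```

Note the cost per example: 1 + N dot products instead of one per vocabulary word, which is the whole point of the trick.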

I've seen negative-sampling counts from 2 to 30 mentioned in different papers. The default in the word2vec.c released by Google was 5. There's no always-best value; it depends on your corpus and task. With larger datasets, smaller values seem to perform as well as or better than larger ones.

- Gordon