Doc2Vec deterministic results - hs vs. ns, sampling (and alpha)

Shani Shalgi

Nov 30, 2016, 3:32:41 AM
to gensim
Hi, 
  First, thank you so much for your quick and helpful responses. A few questions regarding gensim's implementation of Doc2Vec (most of which also apply to Word2Vec).

  1. It is unclear from the documentation what happens when hs=0 and negative=0. What method is used? Also, what if hs=1 and negative>0? I'm asking because we are running a hyper-parameter search and this has come up. 
 
  2. I found that when sample=0 and hs=1 I get deterministic results when inferring the same document twice. When hs=0, negative>5, sample=0, infer_vector is not deterministic. However, I was surprised to find that the opposite is true when I use sample=1e-05: hs=1 is not deterministic, while hs=0 with negative=5 is deterministic. Could you please help me understand what is happening?

  3. A related question on the sample parameter - I found that it degrades results when the vocabulary is relatively small. Is there a recommended vocabulary size above which it makes sense to use sample>0?

  4. Last but not least: I have been training with iter=1 and externally iterating over the data with randomization, setting min_alpha=alpha and then decreasing alpha linearly each epoch (I think this was suggested by Radim in a blog post I read once but cannot find anymore). I use this method because, if I set iter>1, there is no randomization of document order. I find it gets good results, but I would like to know how the word2vec/doc2vec training algorithm changes alpha when alpha != min_alpha and iter>1.
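
Roughly, my loop looks like this (a sketch only; parameter values are illustrative, and the exact train() arguments depend on the gensim version):

```python
import random
from gensim.models.doc2vec import Doc2Vec

alpha, min_alpha, passes = 0.025, 0.001, 20
alpha_delta = (alpha - min_alpha) / passes

model = Doc2Vec(size=100, min_count=2, iter=1)  # iter=1: one internal pass per train() call
model.build_vocab(documents)                    # `documents` is my list of TaggedDocument objects

for epoch in range(passes):
    random.shuffle(documents)                # external randomization of document order
    model.alpha = model.min_alpha = alpha    # hold alpha fixed within this pass
    model.train(documents)
    alpha -= alpha_delta                     # decrease alpha linearly each epoch
```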

Thank you,
Shani 

Gordon Mohr

Nov 30, 2016, 7:25:15 PM
to gensim
(1) 

`hs=0, negative=0` would literally mean no output layer(s) to generate training errors for backpropagation, so the behavior is undefined. (If it doesn't error, any results are likely junk.)

(2)

Any of the modes which include random-choices during training/inference – that includes negative-sampling (`negative` > 0) or frequent-word downsampling (`sample` > 0) or the varying window-sizes in "DM" mode (`dm=1`) – shouldn't be giving deterministic results for repeated `infer_vector()` invocations. 

If you seemed to get deterministic repeated results with `sample=1e-05, negative=5`, are you sure you didn't do something else in that case to make that so, like starting fresh with an identically-loaded model for each attempt?

There's more discussion of what could achieve deterministic inference results in <https://github.com/RaRe-Technologies/gensim/issues/447>. Using more iterations (the `steps` parameter of `infer_vector()`) should help improve the quality of inferred vectors, including making vectors from subsequent runs on the same text more similar to each other. 
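
As an illustrative check (not from the issue above; `model` here stands for any already-trained Doc2Vec model):

```python
import numpy as np

tokens = "some example document text".lower().split()
v1 = model.infer_vector(tokens, alpha=0.025, steps=200)
v2 = model.infer_vector(tokens, alpha=0.025, steps=200)

# cosine similarity of two runs on the same text: tends toward 1.0 as
# `steps` grows, but won't be exactly 1.0 in the randomized modes
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)
```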

(3) 

`sample` often helps to both speed training and improve vector quality for some downstream tasks (like the common analogies evaluation) – but there are no firm rules of thumb... it depends on your goals & data, so you need to explore different values.
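
For example, a small sweep (sketch only; `evaluate_on_my_task()` is a stand-in for whatever downstream evaluation you actually trust):

```python
from gensim.models.doc2vec import Doc2Vec

for sample in (0, 1e-6, 1e-5, 1e-4, 1e-3):
    m = Doc2Vec(documents, size=100, sample=sample, iter=20)  # `documents`: your TaggedDocument corpus
    print(sample, evaluate_on_my_task(m))                     # hypothetical evaluation helper
```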

(4) 

When the starting `alpha` and `min_alpha` differ, each batch-of-text-examples (of up to `batch_words` total words) is sent to a training worker thread with an updated effective alpha. That value is interpolated linearly between the starting `alpha` and `min_alpha`, based on the proportion of training words previously dispatched to worker threads. (You can see the exact code at <https://github.com/RaRe-Technologies/gensim/blob/54871ba162edb4726b9a2b35b10f947c0dfdda1f/gensim/models/word2vec.py#L827>.)
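
In simplified form, the per-job calculation is roughly this (not the exact gensim code; see the link for the real thing):

```python
def effective_alpha(alpha, min_alpha, words_dispatched, total_words):
    # fraction of all expected training words already sent to worker threads
    progress = words_dispatched / total_words
    return max(min_alpha, alpha - (alpha - min_alpha) * progress)
```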

Randomization of document order before each pass isn't strictly necessary; you may want to check if it really helps your vectors. (It may be worth one initial randomization, if there's a risk your original document order has grouped all occurrences of certain words/patterns-of-use/document-sizes together.)

- Gordon

Arjun Seshadri

Apr 3, 2018, 7:56:58 PM
to gensim
Hi Gordon,

Thank you for the helpful response. Regarding your answer to (1):
Suppose I wanted to simply optimize CBOW over the full softmax: how would I do this? (My min_count is set such that my vocabulary size ends up around 800, so running a full softmax shouldn't be too much of a problem.) The API docs suggest that setting negative=0 turns off negative sampling, and hs=0 turns off hierarchical softmax, which seems like it should be what I want. This doesn't error out, but, as you've suggested across several posts, it does not actually train the vectors and generates junk results.

Gordon Mohr

Apr 3, 2018, 9:51:30 PM
to gensim
There's no code in gensim for a softmax over all output nodes – just the two alternative 'sparse' optimizations (negative-sampling and hierarchical-softmax) that are necessary for practical results with large vocabularies. If they're both off, there are no outputs being calculated or corrected via backpropagation at all.

You'd have to write such code yourself. But, what benefit would you be expecting? (Other ways of spending more computation, such as more `negative` examples or more training-passes, might approximate the same result without any new coding.)
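
For illustration only, the core update isn't much code in numpy – but note this is not gensim code, and it's only plausible because your vocabulary is tiny:

```python
import numpy as np

def cbow_full_softmax_step(W_in, W_out, context_ids, center_id, lr=0.025):
    """One CBOW step with a full softmax. W_in/W_out: (vocab_size, dim) arrays, updated in place."""
    h = W_in[context_ids].mean(axis=0)       # hidden layer = mean of context word vectors
    scores = W_out @ h                       # one logit per vocabulary word
    e = np.exp(scores - scores.max())
    probs = e / e.sum()                      # softmax over the *entire* vocabulary
    grad = probs.copy()
    grad[center_id] -= 1.0                   # cross-entropy gradient w.r.t. the logits
    dh = W_out.T @ grad                      # backprop into the hidden layer
    W_out -= lr * np.outer(grad, h)          # every output vector gets updated, every step
    W_in[context_ids] -= lr * dh / len(context_ids)
```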

- Gordon

Arjun Seshadri

Apr 3, 2018, 11:09:54 PM
to gensim
Thanks for the quick reply! Would setting negative sampling to the vocabulary size do the trick? Or does it sample with replacement? I am currently trying to isolate the effects of certain hyperparameter variations in word2vec, and turning off negative sampling would help a great deal with that.

Gordon Mohr

Apr 4, 2018, 1:52:19 PM
to gensim
Word2Vec does negative-sampling with replacement, and it draws from a frequency-weighted distribution that somewhat oversamples rarer words relative to their raw frequency. So while a large (up to vocabulary-size) value for 'negative' in some ways works a little more like a full softmax, it's not really the same.
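
Conceptually the noise distribution looks like this (a sketch; gensim's actual implementation uses a precomputed table, and the 0.75 exponent is the standard word2vec choice):

```python
import numpy as np

counts = np.array([1000, 100, 10, 1], dtype=float)  # illustrative per-word counts
noise = counts ** 0.75                              # raw counts raised to the 0.75 power
noise /= noise.sum()                                # rarer words get more weight than raw frequency gives
negatives = np.random.choice(len(counts), size=5, p=noise)  # drawn with replacement
```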

(Further, I've noticed that the greater the value of 'negative', compared to the one positive example per training-example, the more the average of all vectors in the space gets shifted in one direction away from the origin point. And this "All-but-the-top" paper <https://arxiv.org/abs/1702.01417v2> suggests a post-training adjustment that re-centers the vectors, improving their quality.)
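
The centering part of that adjustment is simple to sketch (illustrative only; the word-vector array attribute name varies across gensim versions, and the paper also removes a few top principal components):

```python
import numpy as np

vecs = model.wv.syn0                 # the raw word-vector array (attribute name varies by gensim version)
centered = vecs - vecs.mean(axis=0)  # subtract the common mean direction
```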

Negative-sampling is the justifiable default in most word2vec implementations, and especially preferred with larger corpora and larger vocabularies (because its training performance doesn't decline with vocabulary size). So hyperparameter effects measured without negative-sampling may not carry much meaning.

Further, a hypothetical form of word2vec with a full softmax isn't commonly used because of its much higher training costs, and it's unclear it'd do better on real tasks than the negative-sampling/hierarchical-softmax approximations, even if it were a practical option. So knowing the effects of hyperparameters in that hypothetical other word2vec might not offer useful insights elsewhere.

- Gordon

Arjun Seshadri

May 13, 2018, 1:28:01 PM
to gensim
Thank you for the very thoughtful response, and the pointer to the paper. I really appreciate the help!