doc2vec-lee.ipynb results ... not even close


John Cleveland

Jan 12, 2017, 8:47:13 PM1/12/17
to gensim

For this github tutorial: gensim/docs/notebooks/doc2vec-lee.ipynb
I have copied the code verbatim, and I have been unable to reproduce anything near the 96% accuracy rate.


I am using gensim 0.13.4 in a Jupyter 4.3.1 notebook, via Anaconda Navigator.




In the tutorial, for the assessment of the model:

In [12]: collections.Counter(ranks)

Out[12]: Counter({0: 292, 1: 8})   <-- what the tutorial got


I am getting:

Counter({0: 31,
         1: 24,
         2: 16,
         3: 19,
         4: 16,
         5: 8,
         6: 8,
         7: 10,
         8: 7,
         9: 10,
         10: 12,
         11: 12,
         12: 5,
         13: 9,
         ...
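
(For context, the `ranks` being counted come from the notebook's self-assessment cell, which does roughly the following; this is a sketch assuming the notebook's `model` and `train_corpus` objects are in scope:)

```python
import collections

ranks = []
for doc_id in range(len(train_corpus)):
    # Re-infer a vector for a document the model was trained on...
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # ...then see where that same document ranks among all trained doc-vectors.
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

collections.Counter(ranks)  # rank 0 means a document is most similar to itself
```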


What could possibly be the problem? I am just copy-pasting.
Thanks

Lev Konstantinovskiy

Jan 15, 2017, 12:09:11 PM1/15/17
to gensim
Hi John,

Thanks for spotting it. The accuracy and the similar documents vary a lot on such a tiny corpus due to random initialisation and different OS numerical libraries. I removed the reference to accuracy in the tutorial.

One needs a large corpus and tens of hours of training to get reproducible doc2vec results.

Regards
Lev

Gordon Mohr

Jan 16, 2017, 2:22:40 PM1/16/17
to gensim
Lev is right to point out that there is some initialization randomness, and possibly other small differences in timing or precision across platforms or repeated runs, such that results from sequential runs of Word2Vec/Doc2Vec need not closely match each other. The effect is larger on small/toy-sized datasets like the 'Lee' corpora used in this tutorial. It is also the case that `infer_vector()` in most modes shows jitter from run to run, though using a larger `steps` can help (and the default should probably be larger than the current `5`).
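
To make that jitter concrete, here is a hedged sketch, assuming the notebook's `model` and `train_corpus` are in scope; the `steps` values are just for illustration:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two inferred vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

words = train_corpus[0].words  # any tokenized document will do

# Two inferences with the small default number of steps rarely match exactly...
v1 = model.infer_vector(words, steps=5)
v2 = model.infer_vector(words, steps=5)
print(cosine(v1, v2))

# ...while a larger `steps` value usually makes repeated inferences more self-similar.
v3 = model.infer_vector(words, steps=50)
v4 = model.infer_vector(words, steps=50)
print(cosine(v3, v4))
```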

However, I don't think you should be seeing that much difference - if using the same dataset and training parameters, the `ranks` should be in the same ballpark. 

From the excerpt of `ranks` you're showing, it looks like there's been *some* progress towards a bulk-trained document vector ranking closer to the top than others. (The counts for ranks 1-4 are higher than those for later ranks.) But it seems as if less effective training/inference has occurred.

While it's *possible* that some inadvertent recent change in gensim (such as an alteration of defaults or new bug) could have caused such a discrepancy, I just ran the `doc2vec-lee.ipynb` notebook, exactly as it exists in the Github gensim `master` branch (and thus also latest 0.13.4.1 release), using Continuum's Python 3.5.1 (the one installed by miniconda). My `ranks` results were the same as in the published notebook output cells – and even altering initialization slightly (with a different `seed` value) gives either identical or very-similar results. 

You mention "copy pasting"; are you sure none of the parameters or corpus prep has changed in your code? Is there a reason you're not also running the exact notebook as it is included in the gensim distribution - with no "copy pasting"? (That is, open exactly that file from the Jupyter Notebook interface, and clear/re-execute its cells?)

Can you verify that during your tests, you're running the version of gensim that you intend to (and not perhaps some older one)?
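
For example, from within the same notebook kernel (a quick check, nothing assumed beyond the import):

```python
import gensim

print(gensim.__version__)  # should report the release you expect, e.g. 0.13.4.1
print(gensim.__file__)     # shows which installed copy the kernel is actually importing
```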

- Gordon

Lev Konstantinovskiy

Jan 16, 2017, 2:56:13 PM1/16/17
to gensim
Hi Gordon,

This is interesting. On my box the results match John's output. (Linux, Python 2.7, gensim 0.13.4.1)

Gordon Mohr

Jan 16, 2017, 4:11:25 PM1/16/17
to gensim
Aha! I see that I was inadvertently running from an older gensim, circa 0.12.4 - and when I update to 0.13.4.1, I get the same problem. 

It seems to have been the change to ensure Doc2Vec respects the Word2Vec defaults – specifically:


Mainly, the change from `hs=1, negative=0` to (effectively, inherited) `hs=0, negative=5` seems to account for the bulk of the difference: if I add `hs=1, negative=0` to the notebook, the results are very similar to the older results. (There are some other small default changes as well, but they don't seem to make as much of a difference.)

My sense has been that hierarchical softmax (`hs=1`) tends to take more time per epoch, especially with larger vocabularies, but starts showing results after fewer epochs. So, its performance can look really good on tiny datasets and short runs. However, with larger vocabularies and more passes, negative sampling tends to get better overall results, often in less training time (with quicker individual epochs). So I believe respecting the Word2Vec defaults is still the proper course, but this tutorial (and perhaps other examples) should be tuned to use modes that give more predictable results.

Here, it seems leaving the new mode defaults in place, but increasing the `iter` to 50, allows negative-sampling to give similar results in the inference-testing. 
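
To make the comparison concrete, here is a rough sketch of the two configurations, using the notebook's `train_corpus`; the `size`/`min_count`/`iter` values are illustrative rather than the notebook's exact settings:

```python
from gensim.models.doc2vec import Doc2Vec

# Older notebook behavior: hierarchical softmax, no negative sampling.
model_hs = Doc2Vec(train_corpus, size=50, min_count=2, iter=10, hs=1, negative=0)

# Current defaults (hs=0, negative=5), compensated with more training passes.
model_neg = Doc2Vec(train_corpus, size=50, min_count=2, iter=50)
```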

- Gordon

John Cleveland

Jan 17, 2017, 2:44:39 AM1/17/17
to gensim
Colleagues, 
Thank you for your prompt responses.

`model = gensim.models.doc2vec.Doc2Vec(size=55, min_count=2, iter=60, hs=1, negative=0)` produced:

Wall time: 12.5 s
Counter({0: 292, 1: 8})
Wall time: 12 s
Counter({0: 291, 1: 9})
Wall time: 16.4 s
Counter({0: 290, 1: 10})
Wall time: 20.6 s
Counter({0: 295, 1: 5})
Wall time: 21.3 s
Counter({0: 292, 1: 8})
Wall time: 20.6 s
Counter({0: 292, 1: 8})
Wall time: 16.7 s
Counter({0: 296, 1: 4})
Wall time: 15.4 s
Counter({0: 292, 1: 8})
Wall time: 15.3 s
Counter({0: 295, 1: 5})
Wall time: 14.8 s
Counter({0: 292, 1: 8})

1. Do you have any other suggestions for additional parameters, and how to tune them, that could yield better performance?
2. Given the current state of doc2vec, is a perfect score too much to hope for, i.e. that the most similar document is the document itself 100 percent of the time (with this Lee dataset)?

Your work is much appreciated,

John 

Gordon Mohr

Jan 17, 2017, 4:08:02 AM1/17/17
to gensim
I believe just switching to HS (`hs=1, negative=0`), *or* increasing the passes (`iter=50`), should closely match the prior notebook behavior. 

The best parameters vary by goals, and toy-sized datasets like this aren't a good indicator of what works for larger datasets. (They're prone to overfitting, and the interesting, continuous arrangements of word/doc vectors in these dense models arise from the tug-of-war between many diverse examples.)

The defaults are a reasonable start. Plain DBOW (`dm=0`) works faster/better than many might expect, and DBOW with simultaneous word-training (`dm=0, dbow_words=1`) is also worth trying. These message archives contain other discussions of specific scenarios. 

More tuning of the training parameters (and especially more iterations) may yield a better model, and more `steps` (an optional parameter for `infer_vector()` whose default is a very-small 5) can help improve inferred vectors, or make them more self-similar from run to run. But on a tiny dataset like this (which might even contain de facto duplicates), even reaching that goal might just be a peculiar/overfit endpoint unrepresentative of other uses. 
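
For anyone who wants to try those modes, a rough sketch (assuming the notebook's `train_corpus`; the non-mode parameter values are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec

# Plain DBOW: often faster and better than expected.
model_dbow = Doc2Vec(train_corpus, dm=0, size=100, min_count=2, iter=20)

# DBOW with simultaneous skip-gram word training.
model_dbow_w = Doc2Vec(train_corpus, dm=0, dbow_words=1, size=100, min_count=2, iter=20)

# Inferring with more steps than the default 5 tends to give more stable vectors.
vec = model_dbow.infer_vector(train_corpus[0].words, steps=50)
```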

- Gordon