Do I have to call .train() to initialize training on a doc2vec model? And is it possible to see the convergence values?

tha...@gmail.com

Feb 6, 2017, 6:02:32 PM
to gensim
When I have code that looks like this:

model = Doc2Vec(Docs, min_count=20, size=300, iter=20, negative=5, workers=6, sample=1e-5, alpha=0.01, window=15, min_alpha=0.0001)

does this automatically start the training for 20 epochs or do I still have to do model.train() to initialize it?

Also, is it possible to get a readout after each epoch of how much the model has converged?

Thanks!

Gordon Mohr

Feb 6, 2017, 6:42:55 PM
to gensim
If you supply your corpus (`Docs` in your snippet) in the constructor invocation, both the vocabulary-survey and training will happen automatically. If you've enabled INFO-level logging, you'll see lots of output describing the steps being taken. 

(If you leave the corpus out of the object-construction call, then the model is just awaiting a `build_vocab()` then `train()` call.)
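For concreteness, here's a minimal sketch of both patterns, using the `Docs` name from your snippet. Treat it as illustrative only; the exact `train()` arguments required vary across gensim versions.

import logging
from gensim.models.doc2vec import Doc2Vec

# INFO-level logging shows the vocabulary survey and per-pass training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Pattern 1: corpus passed to the constructor -- vocab-survey and training run automatically
model = Doc2Vec(Docs, min_count=20, size=300, iter=20, negative=5,
                workers=6, sample=1e-5, window=15)

# Pattern 2 (use one or the other): no corpus in the constructor -- you drive the steps yourself
model = Doc2Vec(min_count=20, size=300, iter=20, negative=5,
                workers=6, sample=1e-5, window=15)
model.build_vocab(Docs)
model.train(Docs)  # more recent gensim versions also require total_examples= and epochs=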

There's no running tally or output of the model's internal predictive loss, though that's a wishlist feature for the future. 

FYI, that large of a `window` will slow training, and (depending on your goals) may not help results. Also, that's an atypically-small starting `alpha` value. Do you have specific reasons for varying these parameters? 

- Gordon

tha...@gmail.com

Feb 6, 2017, 7:50:46 PM
to gensim
I'm aiming to recreate the model in this paper: https://arxiv.org/abs/1607.05368 . In the paper those parameters seem to deliver relatively good similarity scores when the model is trained on an external corpus (trained on Corpus A and trying to find similarities between documents in Corpus B). I forgot to mention that in the example above, but I also train in dbow mode.

Gordon Mohr

Feb 6, 2017, 9:09:55 PM
to gensim
That's a really interesting paper, but I found it somewhat unclear how many other parameter choices they'd tested against, and how small of a subset of their 'development' corpuses were used to pick final parameters then used throughout. (For example, in their "Forum Question Duplication" task description, it's mentioned that the hyper-parameters were optimized on just the `tex` subforum, then re-used elsewhere.)

I'm also wary of taking those recommendations completely literally because technically, in pure PV-DBOW as described in the original paper (Doc2Vec `dm=0` mode), `window` is a non-operative parameter. It would only come into play if you also enable concurrent skip-gram word-vector training (with Doc2Vec non-default `dbow_words=1` option). Set it at 1, set it at 100 – if you're only in `dm=0` mode, any differences in results will just be the random jitter between runs. 

Also, where I see them mention an `alpha` of 0.01, they seem to be referring to *inference* after initial model training, along with hundreds of inference epochs. That is, I believe they mean as optional parameters for Doc2Vec `infer_vector()` – `alpha=0.01, steps=500`. The earlier mention of learning-rates suggests they left more typical starting defaults (like 0.025) in-place.
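(For concreteness, my reading of those numbers as inference-time parameters would look something like the sketch below; `model` and the token list are placeholders, and the values are my interpretation of the paper rather than anything confirmed with the authors:)

# inference on an already-trained model, with a small alpha and many steps
tokens = 'some new document already tokenized'.split()
vec = model.infer_vector(tokens, alpha=0.01, min_alpha=0.0001, steps=500)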

- Gordon

tha...@gmail.com

Feb 7, 2017, 4:00:41 AM
to gensim
Hey, thanks, I really appreciate your advice and I will look into it! Coming as an outsider to the field, these kinds of nuances were lost on me.

bigdeepe...@gmail.com

Feb 7, 2017, 9:44:57 PM
to gensim
Hm. This paper seems to aim to answer some of the questions I've struggled with. Interesting that it suggests that word vectors are important for doc vectors to cluster around.

bigdeepe...@gmail.com

Feb 9, 2017, 12:09:25 PM
to gensim
Gordon, can you comment on the paper's assertion that pre-training the word2vec helps with the Doc2Vec training? Specifically, I am interested to know if I have to separate the word2vec training from the Doc2Vec training. In one of your answers you mentioned that with the "dbow" model I can turn on word2vec training, but it was not clear to me whether that would qualify as the "pre-training" the paper talks about, or if it is just "parallel" training that will not affect the Doc2Vec training.

Thanks.



Gordon Mohr

Feb 9, 2017, 2:36:01 PM
to gensim
I believe section 5 of that paper is somewhat confused in its explanations. 

Pure PV-DBOW (`dm=0, dbow_words=0`) mode is fast and often a great performer on downstream tasks. It doesn't consult or create the traditional (input/'projection-layer') word-vectors at all. Whether they are zeroed-out, random, or pre-loaded from word-vectors created earlier won't make any difference. 

PV-DBOW with concurrent skip-gram training (`dm=0, dbow_words=1`) will interleave wordvec-training with docvec-training. It can start with random word-vectors, just like plain word-vector training, and learn all the word-vectors/doc-vectors together, based on the current training corpus. The word-vectors and doc-vectors will influence each other, for better or worse, via the shared hidden-to-output-layer weights. (The training is slower, and the doc-vectors essentially have to 'share the coordinate space' with the word-vectors, and with typical `window` values the word vectors are in-aggregate getting far more training cycles.)

PV-DM (`dm=1`) inherently mixes word- and doc-vectors during every training example, but also like PV-DBOW+SG, can start with random word-vectors and learn all that's needed, from the current corpus, concurrently during training.  
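(To make the mode distinctions above concrete, a minimal constructor-only sketch of the three configurations; the corpus and other parameters are omitted here:)

from gensim.models.doc2vec import Doc2Vec

# pure PV-DBOW: input word-vectors are neither consulted nor trained; `window` is ignored
pure_dbow = Doc2Vec(dm=0, dbow_words=0)

# PV-DBOW with interleaved skip-gram word-vector training: slower; word-vectors and
# doc-vectors share the model and influence each other
dbow_plus_words = Doc2Vec(dm=0, dbow_words=1, window=5)

# PV-DM: word-vectors and doc-vectors are mixed in every training example
pv_dm = Doc2Vec(dm=1, window=5)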

In either PV-DBOW+SG or PV-DM, you could try to re-use word-vectors from an earlier session. I'd expect that starting the model like this, with some of its weights already in a somewhat-meaningful configuration, could give the model somewhat of a 'head-start' on achieving a useful doc-vector configuration. There's no separate phase where word-vectors are learned "first", so N iterations of Doc2Vec training will still take the same amount of time, but *maybe* the model would make a little more progress in the same number of iterations (or make do with fewer iterations).

However there's also some chance you'd be impairing the doc-vectors for some purposes, especially if the word-vectors come from a different corpus, by having brought in state from a prior word2vec training session which had different predictive objectives. You'd also want to consider any time/overhead for creating/optimizing the word-vectors. 

I'd suspect such pre-seeding to be most helpful with smaller datasets, where you're seeding using word-vectors left over from a much larger (but still believed usefully 'compatible') dataset. 

I doubt you'd want to train-up word-vectors as a separate optimized step of a now multi-step process. For example, given one large corpus, I'd expect 20 iterations of Doc2Vec, starting from random initialization, to give better results than 10 iterations of Word2Vec from random-initialization, then 10 iterations of Doc2Vec from reused-word-vector-initialization. The one combined training lets everything co-improve together from the very beginning, and gives the doc-vectors relatively more attention.

Considering another possible scenario: let's say you already have a well-performing Doc2Vec model, on an older/smaller dataset. Then you want to train a new Doc2Vec model, with similar parameters, on larger/newer data from a similar domain. The case for re-using the older model's state as a starting point seems stronger to me here (compared to just importing other word-vectors). It's the same domain, and same training-objective, and maybe even mostly-the-same documents. (But an issue I'd see would be that the accumulated 'weight' of all the older/repeated documents might make the model less-influenced by any novel documents, compared to the alternative of retraining-from-scratch – and especially so if you're leveraging the 'head-start' to skimp on additional full training epochs with the current dataset.)

- Gordon

bigdeepe...@gmail.com

Feb 13, 2017, 1:54:02 PM
to gensim
Gordon, following the settings in the paper with one change (I used min_count=3), I am now getting pretty decent results with some out-of-sample test texts (just a couple). I am still using just a single unique tag. I wanted to clarify what happens if I add a tag that is shared between multiple examples. It seems I get an additional docvec for every additional tag, but what effect will it have more generally?

Gordon Mohr

Feb 13, 2017, 3:06:40 PM
to gensim
Good! 

Since that paper describes a bunch of alternate settings, which specifically did you go with? (Was it parameters from their Table 4 on the smaller corpuses, or footnote 9 on the larger corpuses?)

I would still be careful about drawing lessons from that paper about optimal parameters for other datasets. As mentioned in my earlier message, some aspects of their optimization choices are narrowly-focused on a tiny subset, unclear, or (with regard to PV-DBOW) muddled.

In particular, I'll again suggest that with a dataset of your size, you should be trying higher, not lower `min_count` values. It may be a double win in better doc-vectors and lower training-time. 

What's the highest `min_count` value you've tried? 

Allowing multiple tags per document, and having some of those tags repeat across documents is a natural extension of the PV-Doc2Vec methods (and thus available in gensim) but I haven't seen much data about its general effects. Resulting tag-vectors might move to useful places, modeling the same aspects of the docs that caused them to be assigned those tags. When docs have known single labels, I've at times seen a benefit in adding those known-labels as tags during training. (The resulting model's vectors, even later-inferred for new texts, were somewhat more useful for classification tasks.) The effect will likely vary based on DM-vs-DBOW mode or other settings. As previously mentioned, trying to train up many more tags, from the same amount of source data, may dilute the power of the resulting tags. 
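(As a concrete illustration of multiple tags, a sketch with made-up tag names; the second, shared tag plays the role of a known label:)

from gensim.models.doc2vec import TaggedDocument

# one unique per-document tag plus one label tag shared across many documents
doc = TaggedDocument(words=['some', 'tokenized', 'text'],
                     tags=['DOC_0017', 'label_oncology'])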

Like with many potential variants of Doc2Vec, you'd have to try it and test the results against other options. Please share any good tricks or rules-of-thumb you discover!

- Gordon

bigdeepe...@gmail.com

Feb 15, 2017, 7:45:47 PM
to gensim
I am using the first line of the table settings, which appears to be the same as the footnote: size=300, window=15, negative=5, sample=1e-5, iter=20, alpha=0.025.
For infer_vector: alpha=0.01, min_alpha=0.0001.

iter=20, window=15 ran for 33 hours, so trying options with my dataset and iter=400 seems impractical.

I am still testing and doing some experiments.

I wrote a little code to find similar sentences within two similar documents: (1) raw input, (2) within the corpus.

x = v1 - v2
x.dot(x) worked better to identify similar sentences (as inferred from the model) than

x1 = unitvec(v1)
x1.dot(unitvec(v2))
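
(For reference, a hedged sketch of that comparison; `model`, `sent_a`, and `sent_b` are placeholders for an already-trained model and two tokenized sentences:)

from gensim import matutils

v1 = model.infer_vector(sent_a)
v2 = model.infer_vector(sent_b)

# squared Euclidean distance between raw inferred vectors (smaller = more similar)
x = v1 - v2
sq_dist = x.dot(x)

# cosine similarity between length-normalized vectors (larger = more similar)
cos_sim = matutils.unitvec(v1).dot(matutils.unitvec(v2))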

bigdeepe...@gmail.com

Feb 19, 2017, 12:50:44 PM
to gensim
Gordon, you made the suggestion a couple of times that min_count should be much higher, and I'd like to explore this a bit further. My rationale for keeping it low is that the corpus I have is specific to a distinct area of knowledge with very unusual jargon. I am concerned about missing rare terms which may not appear often in the corpus. That said, I wanted to get rid of misspelled terms. I thought that setting min_count=3 would get rid of misspellings but keep all of the specialized jargon, no matter how rare.

Could you share your thoughts, in more depth, on why you think results (training time aside) would be better with a higher min_count, and what value you would suggest from either your experience or the literature you have reviewed?

I am training again now, after restructuring the tags (a lot fewer of them; I merged all the unique tags and publication tags into a single tag), and it will probably take close to 48 hours to finish. So I don't have the luxury of doing a lot of experiments, because the run time is so long.

Thanks.

Gordon Mohr

Feb 19, 2017, 3:15:42 PM
to gensim
Good word-vector representations are balanced and arranged usefully against all other words. That requires a variety of usage examples, and the continuing interleaved tug-of-war with other words during training. 

If a word only appears a few times, those few examples are unlikely to be fully representative of the word's full real meaning. And, that word is involved in relatively few training-cycles, compared to other more-frequent words. So rarer words are unlikely to acquire very meaningful/reliable vectors. 

Yet they're still taking up memory and training time. (Without them, you could give more dimensions to other words, or do more training cycles.) And during the time that their lower-quality interim representations are mixed in with other training – either via PV-DM/CBOW averaging or just interleaved shared hidden-weight-updates – they're to some extent noise/interference, compared to the higher quality influences of other more-frequent words. And in window-based modes, these rarer/lower-quality words are taking up window slots that, if the rare words were skipped instead, could let other more significant words more influence each other. 

So dropping more rare words will often improve the quality of the surviving words or doc-vectors. The optimal value will vary by corpus and goals, but it's almost certainly higher than you would guess, if your mental model is by default "more info must be better". 

Perhaps try `min_count=50` and see if either ad-hoc or other quality-evaluations improve compared to lower values?

Since run-time is a concern, make sure you've eliminated all text-preprocessing/regexing from the corpus iterator feeding Doc2Vec. In DM or DBOW-plus-words modes, a large `window` significantly adds to runtime, and 15 is an atypically large value (if you're still following the parameters from the Lau/Baldwin paper upthread). Did you ever check the running time and results from using the 'quick minimal trial' parameters suggested at: https://groups.google.com/d/msg/gensim/N5SCiq1F45w/MppHO0jgCwAJ ?
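(As an example of what I mean by a lean iterator, here's a sketch that tokenizes once up-front into a space-delimited file, so each training pass only does a whitespace split; the file name and tagging scheme are made up:)

from gensim.models.doc2vec import TaggedDocument

class LineDocs(object):
    """One pre-tokenized, space-delimited document per line; per-pass work is just a split()."""
    def __init__(self, path):
        self.path = path  # e.g. 'corpus_pretokenized.txt' (hypothetical)
    def __iter__(self):
        with open(self.path) as f:
            for i, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=['DOC_%d' % i])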

An ad hoc search for unquantified "seems a little better" improvements through parameter tweaking, when settings are tried one-at-a-time, and each trial takes 1-2 days, will be quite limiting. If you manage to devise an automated scoring method, and can temporarily rent more machines, you could potentially try dozens or hundreds of configurations in parallel in the same amount of calendar-time. 

- Gordon 

bigdeepe...@gmail.com

Feb 19, 2017, 3:48:07 PM
to gensim
Yes, I ran the small models and I think it took around 13 hours or so to train. I remember, however, being underwhelmed by my eye-balling tests. I have not deleted those models yet. I can load those smaller models alongside the new model when it's done and try a side-by-side comparison.

I am only doing unavoidable processing, like splitting text and tags into tokens.

window=15 is right from the article. It specifically states that dbow benefits from a larger window size.

Gordon Mohr

Feb 19, 2017, 5:47:58 PM
to gensim
OK, but if you're doing anything more in the iteration than something like `s.split()` (on whitespace), it may be a bottleneck.

As pointed out previously, the Lau/Baldwin's discussion of PV-DBOW is suspect, because true PV-DBOW does not consult a `window` value at all. You could make it 0, you could make it 1000, but it has no effect on either runtime or calculations during PV-DBOW training (unless you also enable the non-default `dbow_words=1` setting). So if you are now using pure DBOW – which wasn't clear given the limited details you'd included – `window` can be left out of the parameters considered for optimization.  

- Gordon

bigdeepe...@gmail.com

Feb 19, 2017, 7:02:54 PM
to gensim
I do enable dbow_words=1, based on their statement that word vectors tend to become anchors for the document vectors. Obviously it's not clear from the paper what they mean by pre-training. On one hand the paper implies that simply setting dbow_words=1 is enough, but the pre-training language is confusing.

The last model I trained took about 33 hours with the first-line parameters from the paper (except min_count=3); that's about 11 hours longer than when I trained with the default dm=1 and window=9. They clearly state that dbow is helped by a larger window size. Are they confused? Did they modify gensim? (I thought they used gensim.)

Gordon Mohr

Feb 19, 2017, 8:35:35 PM
to gensim
I find the paper unclear about all the variants they tried, so you'd have to ask the authors. 

I would expect using `dm=0, dbow_words=1, window=15` would, in the common case where each document has only one tag, result in training that takes about 15X longer. It has to train the tag, and the ~15 skip-gram predictions, for each target word – instead of just the tag. (This also means the model is spending 15X as much effort improving the word-vectors as the doc-vectors, though there clearly should be some cross-benefit. Still, if representativeness of the doc-vectors is the goal, I'd expect processes that spend the most computational effort tuning them could be best.)

While Lau/Baldwin report "performance degrades severely" without `dbow_words=1`, they don't provide any specific measurements, and given that 15X more training-examples are used with that option on, taking 15X as long, a fairer comparison against `dbow_words=0` might let the `dbow_words=0` session use 15X more iterations. (That is, are *those* doc-vectors better or worse than the doc-vectors co-trained with word-vectors?)

Some code they used appears to be at https://github.com/jhlau/doc2vec - which also references a gensim fork made to load word-vectors. (Though, I think they could have gotten the same effect with vanilla gensim with careful use of the `intersect_word2vec_format()` method.)

I see in that repository that they also used `dm_concat=1` in their PV-DM tests. That's a non-default option that makes models much bigger and slower – especially with large `window` values. I implemented this mode in gensim in an attempt to reproduce Le/Mikolov's results in the original Paragraph Vectors paper, which claims to use such a method. It didn't help much – certainly not enough to reproduce the Le/Mikolov results (which were almost certainly misreported in the original paper). I've not yet seen a dataset where `dm_concat=1` is worth the overhead – maybe there's some sweet spot in much-larger datasets and many-more iterations. So I would also consider the PV-DM results in Lau/Baldwin (described with the label `dmpv`) as not generalizable to the more commonly-used `dm=1, dm_concat=0` default mode. 

- Gordon

bigdeepe...@gmail.com

Feb 20, 2017, 2:49:07 PM
to gensim
I guess one thing is clarified: they did modify gensim for the "pre-training". What's still not clear to me is whether directly training the model with dbow_words=1 yields equivalent results. The question in my mind is: what number of training epochs is needed to get vectors of similar "quality"?

If I were to train word2vec first for 20 epochs, and then train doc2vec with the pre-trained word2vec vectors, do I get a better result than if I were to simply train both word and doc vectors at the same time for 20 epochs? Is it faster to train only word2vec first and then doc2vec with fewer epochs, leading to the same quality of results?

The last model I trained took 58 hours in the end. The results, from eye-balling similar examples within the corpus against out-of-corpus input, seem reasonable on one random input example (I pulled a page off the net about glioblastoma symptoms and treatment).
 
Putting in additional tags resulted in smaller (magnitude) similarity values than when I had a single unique tag.

Can you clarify how "intersect_word2vec_format()" can be used instead of their customized version?

Gordon Mohr

Feb 20, 2017, 4:40:13 PM
to gensim
I gave some opinions on whether a words-then-docs process might be helpful in the earlier message: https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/lBKGf7WNDwAJ

Seeding a model with pre-existing word vectors might give it a head-start, in some modes, for some corpuses, for some end-applications, if the word-vectors are from a similar domain. But it might also impair the model, for other purposes, or wind up being a time-consuming no-op, if doing enough Doc2Vec training that the influence of a non-random starting-state is essentially 'diluted' to nothing. 

The only way to know the answers to most of these questions is to test & evaluate alternatives in a quantitative, repeatable way. "Eye-balling" and "seem reasonable on one random input example" are unlikely to point directly, or rapidly, towards better process or meta-parameter choices. I would only improvise new ad-hoc extensions to Doc2Vec if I had already confirmed its usefulness using some more standard options, and had a quantitative evaluation in place that could tell whether different parameter tweaks are helping or hurting. 

`intersect_word2vec_format()` has a descriptive doc-comment; in brief:

This method assumes your model has already discovered its own vocabulary (via `build_vocab()` over a corpus), but then scans a Google word2vec.c-format file, and for the words that already exist in the local model (the intersection of the two word sets), it loads the vector from the file instead. Further, it by default freezes those vectors against further training (using an experimental feature of gensim Word2Vec in the `syn0_lockf` array). You could instead pass the option `lockf=1.0` to allow the imported-vectors to be trained as normal. 
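
(In case it's useful, a sketch of how that vanilla-gensim path might look; the file name, the `corpus` variable, and the choice of `lockf=1.0` are illustrative assumptions rather than the paper's exact procedure:)

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(dm=0, dbow_words=1, size=300, negative=5, sample=1e-5,
                min_count=20, workers=6, iter=20)
model.build_vocab(corpus)  # model discovers its own vocabulary first

# for words in both the model's vocab and the word2vec.c-format file, replace the
# randomly-initialized vectors with the pre-trained ones; lockf=1.0 lets them keep training
model.intersect_word2vec_format('pretrained_vectors.bin', binary=True, lockf=1.0)

model.train(corpus)  # more recent gensim versions also require total_examples= and epochs=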

- Gordon