Learning Doc2Vec from a pre-trained Word2Vec


EHSANEDDIN ASGARI

Nov 19, 2015, 2:45:35 PM
to gensim
Hi,

I want to train doc2vec from my pre-trained word2vec vectors. Can someone help with that?

I tried something like this:

model = Doc2Vec.load_word2vec_format(wordvec_path, binary=False)
model.build_vocab(documents)
for epoch in range(10):
    print("epoch " + str(epoch))
    model.train(documents)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

But the results don't seem right to me.

Thank you,
Ehsan

Gordon Mohr

Nov 19, 2015, 7:58:18 PM
to gensim
Doc2Vec (aka the 'Paragraph Vectors' method of Le/Mikolov) doesn't require pre-trained word-vectors as an input. 

One of the modes ("Distributed Bag Of Words" or DBOW) doesn't necessarily involve word-vectors at all. (Though, DBOW can be usefully interleaved with the word2vec skip-gram training that it closely resembles.) 

The other mode ("Distributed Memory" or DM) creates word-vectors simultaneously with the doc-vector training. 

So, it's not typical to start Doc2Vec training with pre-existing word vectors. 

In fact, when you execute 'Doc2Vec.load_word2vec_format()', you're actually invoking an inherited method from class Word2Vec, and getting back a Word2Vec model. (So, your code couldn't get doc-vectors from any further operations on that Word2Vec model object.) Further, when you `build_vocab()` on a model, you replace its existing state with new initial state, calculated based on the corpus provided as an argument. (So, your code is discarding the prior vectors just after they're loaded.)

With sufficient familiarity with the code, you could trick/patch a Doc2Vec model into re-using prior word-vectors. But, such an approach should be considered experimental/unsupported, and I'd recommend only trying that after using and really understanding/optimizing the usual approach. 

- Gordon

Ehsan Asgari

Nov 19, 2015, 10:35:06 PM
to gen...@googlegroups.com
Thanks. OK, now I've tried the normal version without pre-trained word-vectors. I used the following code, but the quality of the document vectors is not comparable with a simple mean/summation over the word-vectors of each document. Is this training very tricky? Do you have any suggestions for improving the quality of doc2vec? 

model = Doc2Vec(size=200, window=10, min_count=0, workers=11, alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(documents)
for epoch in range(10):
    print("epoch " + str(epoch))
    model.train(documents)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  

Thanks


Gordon Mohr

Nov 20, 2015, 2:37:34 AM
to gensim
It all depends. 

Is it a lot of data or a little? Are they long texts or short texts? What's the method of assessing quality? Are you assessing doc-vectors from the bulk-training, or from later inference? Have you tried other parameter values?

In the demo notebook that tries to replicate one of the experiments from the Paragraph Vectors paper (https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb), you can see that for the purpose of coarse-grained (positive/negative) sentiment-prediction on that relatively-small dataset, DBOW vectors are better at every iteration (1-20) than the two DM approaches, mean or concatenation, that are also compared. (The DM approach you're using, sum-of-vectors, isn't tried there... if I recall correctly it was left out because it did comparably but slightly worse than DM/mean.)

A few things I've noticed that may apply to your usage – though they could easily be different for different data sources and end-tasks – are:

* using 'sample' frequent-word downsampling often improves both training speed *and* word-vector quality (on the analogies evaluation); its effect on doc-vector quality has been mixed

* leaving in very-infrequent words seems to work like noise that dilutes what can be learned. Consider the extreme case of every document (or context-window) having some unique word that only appears once. Training devotes effort to making that word predictive of its context, the same as it's trying to make the doc-vector predictive of the same context. So "some" of the power/wisdom of the training process flows into that nearly-useless single word, rather than the doc-vec that you'd rather be representative.

* larger windows aren't necessarily better; in particular in DM mode (or in DBOW with simultaneous skip-gram word training, as enabled by the `dbow_words=1` parameter) larger windows mean proportionately more of the model's net-effort is being spent on updating words rather than doc-vectors. It might even be appropriate to think of the doc-vector, in those modes that also train word-vectors, as indicating some 'remainder-of-meaning' that *isn't* reflected by the words... so the most-powerful post-training representation of some text might be a mix of its word-vectors *and* its doc-vector. 

* sometimes, with the right inference parameters, re-inferring vectors for texts at the end can result in "better" vectors than the in-model vectors left over from the bulk training. Why might that be? Consider your 10 training passes: every document in the training set is the result of those 10 passes, but the early passes were on an almost-random, barely-trained model that was still changing quite a bit. If you instead infer a new vector at the end for the same text using 10 steps, all 10 steps happen on the frozen, final, "best" model. 

- Gordon

mao hbao

Dec 17, 2019, 4:36:06 AM
to Gensim
Use this forked gensim to train your doc2vec from the pretrained word2vec model; it supports gensim 3.8.



On Friday, November 20, 2015 at 3:45:35 AM UTC+8, EHSAN ASGARI wrote: