gensim doc2vec intersect_word2vec_format

Peter carey

Sep 2, 2017, 6:22:40 AM
to gensim
Hi 

I am a newbie currently learning word/doc vectors. 

Just reading through the doc2vec commands on the gensim page. 

I am curious about the command `intersect_word2vec_format`.

My understanding of this command is that it lets me inject vector values from a pretrained word2vec model into my doc2vec model, and then train my doc2vec model using the pretrained word2vec values rather than generating the word-vector values from my document corpus. The result is that I get a more accurate doc2vec model, because I am using pretrained w2v values that were generated from a much larger corpus of data than my relatively small document corpus.

Is my understanding of this command correct or not even close?  ;-) 


Gordon Mohr

Sep 3, 2017, 12:12:11 PM
to gensim
Yes, `intersect_word2vec_format()` lets you bring vectors from an external file into a model whose own vocabulary has already been initialized (as if by `build_vocab()`). That is, it will only load vectors for words that already exist in the local vocabulary.

Additionally, it will by default *lock* those loaded vectors against any further adjustment during subsequent training, though other words in the pre-existing vocabulary may continue to update. (You can change this behavior by supplying a `lockf=1.0` value instead of the default 0.0.)
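To make that concrete, here is a minimal sketch of the usual call sequence. It assumes the gensim API of this era (`size`/`iter` parameters), a tiny illustrative corpus, and a hypothetical pretrained file named `pretrained_vectors.bin`:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus for illustration; use your real documents here.
    documents = [TaggedDocument(words=["human", "interface", "computer"], tags=[0]),
                 TaggedDocument(words=["graph", "trees", "interface"], tags=[1])]

    model = Doc2Vec(dm=1, size=300, window=5, min_count=1, iter=20)
    model.build_vocab(documents)  # the local vocabulary must exist first

    # Overwrite the randomly-initialized vectors for any word that appears
    # both in the local vocabulary and in the external file. lockf=0.0
    # (the default) freezes the loaded vectors during later training;
    # lockf=1.0 lets them continue to update.
    model.intersect_word2vec_format('pretrained_vectors.bin', binary=True, lockf=0.0)

    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)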

However, this is best considered an experimental function, and what benefits, if any, it might offer will depend on many things specific to your setup.

The PV-DBOW Doc2Vec mode, corresponding to the `dm=0` parameter, is often a top performer in speed and doc-vector quality, and by default doesn't use or train word-vectors at all – so any pre-loading of vectors won't have any effect. (The optional `dbow_words=1` setting adds interleaved skip-gram word training, in which case loaded word-vectors would matter again.)

The PV-DM mode, enabled by the default `dm=1` setting, trains any word-vectors it needs simultaneously with doc-vector training. (That is, there's no separate phase where word-vectors are created first, and thus for the same number of `iter` passes, PV-DM training takes the same amount of time whether word-vectors start with default random values or are pre-loaded from elsewhere.) Pre-seeding the model with word-vectors from elsewhere might help or hurt final quality – it's likely to depend on the specifics of your corpus, meta-parameters, and goals, and on whether those external vectors represent word-meanings in sync with the current corpus/goal.
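As a rough illustration of the difference between the two modes (same era-API assumptions as the sketch above):

    from gensim.models.doc2vec import Doc2Vec

    # PV-DBOW: only doc-vectors are trained by default, so
    # intersect_word2vec_format() has no effect on the resulting
    # doc-vectors (unless dbow_words=1 adds skip-gram word training).
    dbow_model = Doc2Vec(dm=0, size=300, iter=20)

    # PV-DM: word-vectors and doc-vectors train together, so pre-loaded
    # word-vectors can influence the doc-vectors, for better or worse.
    dm_model = Doc2Vec(dm=1, size=300, window=5, iter=20)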

- Gordon

Peter carey

Sep 4, 2017, 10:26:14 AM
to gensim
Many thanks for your detailed answer. It has cleared up a lot of things for me. I can now begin training my first d2v model using PV-DM :) :) 