Doc2Vec - How to get similarity between word and doc vectors?

Ockert Janse van Rensburg

Jul 7, 2015, 11:09:37 AM
to gen...@googlegroups.com
Hi there,

I would like to thank the contributors for the Gensim package. Your efforts are much appreciated.

I was wondering, especially now with the 0.12 release, what is the best way to compute similarity between the document and word vectors? It seems model.most_similar() and model.docvecs.most_similar() only work within their respective domains.

Any help would be greatly appreciated.

Gordon Mohr

Jul 7, 2015, 8:03:43 PM
to gen...@googlegroups.com
Each of the `most_similar()` methods (and some of the others) take raw vectors as well as (string/int) lookup keys. 

So, just as you can do:

    d2v_model.most_similar('longevity') 

...an equivalent call (still just looking among words) is...

    d2v_model.most_similar( [ d2v_model['longevity'] ] )  # currently must be a list, assumed 'positive' examples

This points the way to using a word-vector to find similar doc-vectors:

    d2v_model.docvecs.most_similar( [ d2v_model['longevity'] ] )

...or using a doc-vector to find similar word-vectors:

    d2v_model.most_similar( [ d2v_model.docvecs['DNA repair'] ] ) 

If you want merged results, you'll currently have to combine and re-sort them yourself. Here's some example code for finding the combined top-20, which also labels results as 'word' or 'tag' (because string keys *can* repeat between the two vector sources, and *won't* refer to the same vector):

    origin = wiki_model.docvecs['DNA repair']  # or… wiki_model['senescence'] for a word
    word_sims = [('word', word, score) for word, score in wiki_model.most_similar([origin],topn=20)]
    tag_sims = [('tag', tag, score) for tag, score in wiki_model.docvecs.most_similar([origin],topn=20)]
    results = sorted((tag_sims + word_sims),key=lambda tup: -tup[2])
    results[:20]

Whether it's meaningful to compare word and doc vectors will depend on the training method and data. (They're definitely *not* meaningfully comparable in the new `dm_concat` mode, but are likely to be comparable if doing DBOW w/ simultaneous skip-gram words or the DM/mean or DM/sum modes.) In my very-limited review of combined results on a Wikipedia-derived dataset, the closest results for words are mostly other words, and for tags are mostly other tags. 

There's a lot of room for API improvements in convenience and functionality here. Some unprioritized thoughts:

- add the combining option above
- use some shorthand convention for indicating which vector is being requested – the "Document Embedding with Paragraph Vectors" paper uses the notation "pv('Lady Gaga')" and "wv('Japanese')", which could be matched with properties on the Doc2Vec/Word2Vec models (a rough sketch of such wrappers appears after this list)
- the duplication of similarity-code between Word2Vec and DocvecsArray could be factored out to some common utility/mix-in
- can indexing based on slices or lists (like 'advanced indexing' in numpy, e.g. d2v_model['red','green','blue'], etc.) be given a useful interpretation?
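
As a very rough illustration of the shorthand idea above (the `wv`/`pv` names here are purely hypothetical, not existing gensim attributes), thin wrapper functions could mimic the paper's notation:

    # Hypothetical wrappers mimicking the pv()/wv() notation from the
    # "Document Embedding with Paragraph Vectors" paper; `model` is assumed
    # to be an already-trained gensim Doc2Vec instance.
    def wv(model, word):
        """Return the word-vector for `word`."""
        return model[word]

    def pv(model, tag):
        """Return the doc-vector for document tag `tag`."""
        return model.docvecs[tag]

    # For example, doc-tags nearest to the word-vector for 'longevity':
    # d2v_model.docvecs.most_similar([wv(d2v_model, 'longevity')])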

Other ideas or implementation help is welcome!

- Gordon

Ockert Janse van Rensburg

Jul 8, 2015, 3:21:01 AM
to gen...@googlegroups.com
Hi Gordon,

Thank you so much for your prompt response. This is exactly what I was uncertain about and it will prove very useful. Agreed, the API isn't as intuitive as it could be and can be improved, especially for this task. But for now this workaround is neat and very helpful. Much appreciated.

Ockert

Mojtaba Zahedi

Sep 18, 2016, 5:47:33 AM
to gensim
Hi Gordon, I have a question: I have two different models, one is my user-tweet model and the other is a news model; both models are Doc2Vec. Is there any chance to find the similarity between these two models? Thanks

Lev Konstantinovskiy

Sep 18, 2016, 9:16:59 AM
to gensim
Hi Mojtaba,

May I ask for the business motivation behind this question?

One thing you can try is to Procrustes-align the two models using the approach in http://nlp.stanford.edu/projects/histwords/
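
A minimal sketch of that alignment step (illustrative only; `model_a` and `model_b` are assumed to be the two trained Doc2Vec models with the same vector size, `shared_words` a list of words present in both vocabularies, and 'tweet_tag_0' a made-up doc-tag):

    import numpy as np

    # Stack the anchor word-vectors shared by both models.
    A = np.vstack([model_a[w] for w in shared_words])
    B = np.vstack([model_b[w] for w in shared_words])

    # Orthogonal Procrustes: find the rotation R minimizing ||A.dot(R) - B||,
    # via the SVD of A^T B (as in the histwords alignment approach).
    u, _, vt = np.linalg.svd(A.T.dot(B))
    R = u.dot(vt)

    # Map a doc-vector from model A's space into model B's space, then
    # look up its nearest neighbours there.
    aligned_vec = model_a.docvecs['tweet_tag_0'].dot(R)
    model_b.docvecs.most_similar([aligned_vec])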

Regards
Lev

Sriram Gopalakrishnan

Apr 17, 2017, 4:01:10 PM
to gensim
@Gordon : Awesome explanation and workaround.
 
Are the word and doc vectors actually embedded in two different spaces (of the same dimensions)? If so, why was this done? I ask so I don't assume any relationship between the word embeddings and doc embeddings that isn't true.

My current assumption is that the word vectors and doc vectors have random start locations in two different spaces. The updates of both words and labels occur as though they were in the same space. The separation is for speed and searching.
Is that right? Or is there something else I should understand / "read this link, noob :-)"?

- Ram
 

Gordon Mohr

Apr 17, 2017, 5:31:08 PM
to gensim
In the PV-DBOW with skip-gram word-training ("dm=0, dbow_words=1") mode, and the PV-DM modes without concatenation ("dm=1, dm_concat=0"), the word and doc vectors are essentially trained into "the same space", because they are supplied to the predictive neural network in the same place, or in a combined (summed/averaged) fashion. That is, magnitudes in any particular dimension, in word or doc vectors, have the same forward-propagation effect on the trained predictions, and get the same back-propagation corrections.
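
A small sketch of those two mode settings (illustrative only; `corpus` is assumed to be an iterable of TaggedDocument objects, and the parameter names follow the gensim Doc2Vec API of that era):

    from gensim.models.doc2vec import Doc2Vec

    # PV-DBOW with simultaneous skip-gram word training:
    dbow_model = Doc2Vec(corpus, dm=0, dbow_words=1, size=100, min_count=5)

    # PV-DM, combining context by sum/mean (i.e. without concatenation):
    dm_model = Doc2Vec(corpus, dm=1, dm_concat=0, size=100, min_count=5)

    # In either of these modes, word- and doc-vectors land in a comparable space:
    dbow_model.docvecs.most_similar([dbow_model['longevity']])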

They're in separate arrays to allow for a potentially much-larger set of doc-vectors than word-vectors. 

A word vocabulary of tens of thousands to hundreds of thousands of words can be plenty; vocabularies of a million to a few million are extensive. But still, such vocabularies are trained from many examples of each word, and the words are accessed in a highly random fashion (strongly benefiting from having the entire vocabulary's vectors in RAM at the same time). Words are always keyed by strings.

Meanwhile, it's possible to train on tens of millions, hundreds of millions, or even billions of separate documents. In the classic case of each document getting a single tag/vector, and training cycling through the documents in order, it is thus thinkable for the doc-vector set to be larger than RAM. The option of using plain ints as doc-tags, rather than full strings, also avoids creating a giant string->array-slot dictionary in memory. Also, after training, you may not necessarily need all the resulting doc-vectors in memory – inference on new or the same docs just needs the vocabulary & model out-weights. And, if they are of interest, some might prefer putting those precalculated doc-vectors in some other lookup structure – such as an approximate-nearest-neighbors index, or another database, etc.
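
A small illustrative sketch of the plain-int-tag option (not from the original message; `texts` is an assumed list of token lists, and the `doctag_syn0` attribute name follows gensim's internal layout around the time of this thread):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Plain-int tags avoid building a big string->slot dictionary in memory.
    corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]
    model = Doc2Vec(corpus, dm=0, dbow_words=1, size=100, min_count=5)

    # After training, the raw doc-vector array can be exported to some other
    # lookup structure, such as an approximate-nearest-neighbors index:
    doc_vectors = model.docvecs.doctag_syn0  # shape: (number_of_docs, vector_size)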

So, even though it's interesting in some cases to mix nearest-word and nearest-doc similarity results, they're stored separately and the extra steps up-thread are required to create merged results. 

(There's a pending option arriving soon – https://github.com/RaRe-Technologies/gensim/pull/1256 – that will allow saving doc-vectors in the word2vec.c-vectors-only format via `save_word2vec_format()` – essentially concatenating both sets of vectors, if you save them both to the same file, after which they're only distinguishable by whatever special string-name-prefix is specified for the doc-vectors.)

- Gordon