Each of the `most_similar()` methods (and some of the others) accepts raw vectors as well as (string/int) lookup keys.
So, just as you can do:
```python
d2v_model.most_similar('longevity')
```
...an equivalent call (still just looking among words) is...
```python
d2v_model.most_similar([d2v_model['longevity']])  # currently must be a list, assumed 'positive' examples
```
This points the way to using a word-vector to find similar doc-vectors:
```python
d2v_model.docvecs.most_similar([d2v_model['longevity']])
```
...or using a doc-vector to find similar word-vectors:
```python
d2v_model.most_similar([d2v_model.docvecs['DNA repair']])
```
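Under the hood, both call forms reduce to ranking stored vectors by cosine similarity against the query vector, which is why passing `model[key]` as a raw vector behaves like passing the key itself. A minimal pure-Python sketch of that equivalence, using toy vectors rather than a trained gensim model (the `cosine`/`most_similar` helpers here are illustrative, not gensim's actual internals):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(vectors, query_vec, topn=3, exclude=()):
    """Rank stored vectors by cosine similarity to a raw query vector.

    `vectors` maps key -> vector; `exclude` skips keys (gensim similarly
    omits the query key itself when you pass a key rather than a vector).
    """
    sims = [(key, cosine(vec, query_vec))
            for key, vec in vectors.items() if key not in exclude]
    return sorted(sims, key=lambda kv: -kv[1])[:topn]

# Toy word vectors standing in for a model's stored vectors.
word_vecs = {
    'longevity': [0.9, 0.1, 0.0],
    'lifespan':  [0.8, 0.2, 0.1],
    'banana':    [0.0, 1.0, 0.3],
}

# Looking up the key's vector first, then querying with the raw vector,
# gives the same ranking a key-based query would:
by_vector = most_similar(word_vecs, word_vecs['longevity'], exclude=('longevity',))
print(by_vector[0][0])  # 'lifespan'
```

Because the ranking only needs a raw vector, the same machinery works no matter which vector store the query vector originally came from, which is what makes the cross-lookups above possible.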
If you want merged results, you'll currently have to combine and re-sort them yourself. Here's some example code for finding the combined top 20, which also labels each result as 'word' or 'tag' (because string keys *can* repeat between the two vector sources, and *won't* refer to the same vector):
```python
origin = wiki_model.docvecs['DNA repair']  # or… wiki_model['senescence'] for a word
word_sims = [('word', word, score)
             for word, score in wiki_model.most_similar([origin], topn=20)]
tag_sims = [('tag', tag, score)
            for tag, score in wiki_model.docvecs.most_similar([origin], topn=20)]
results = sorted(tag_sims + word_sims, key=lambda tup: -tup[2])
results[:20]
```
Whether it's meaningful to compare word and doc vectors will depend on the training method and data. (They're definitely *not* meaningfully comparable in the new `dm_concat` mode, but are likely to be comparable if doing DBOW with simultaneous skip-gram words, or the DM/mean or DM/sum modes.) In my very limited review of combined results on a Wikipedia-derived dataset, the closest results for words are mostly other words, and for tags are mostly other tags.
There's a lot of room for API improvements in convenience and functionality here. Some unprioritized thoughts:
- add the combining option above
- use some shorthand convention for indicating which vector is being requested – the "Document Embedding with Paragraph Vectors" paper uses a notation "pv('Lady Gaga')" and "wv('Japanese')", which could be matched with properties on the Doc2Vec/Word2Vec models
- the duplication of similarity-code between Word2Vec and DocvecsArray could be factored out to some common utility/mix-in
- can indexing based on slices or lists (like 'advanced indexing' in numpy, eg d2v_model['red','green','blue'] etc) be given a useful interpretation?
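The combining option from the first bullet could be sketched as a small helper that works against anything exposing a gensim-style `most_similar([vector], topn=...)`. The function name `combined_most_similar` and the `FakeIndex` stand-ins below are hypothetical, used only to keep the sketch self-contained:

```python
def combined_most_similar(word_index, doc_index, origin, topn=20):
    """Hypothetical merged query: label word results 'word' and doc-tag
    results 'tag', then take the combined top-N by similarity score.

    `word_index` and `doc_index` are any objects exposing a gensim-style
    most_similar([vector], topn=...) -> [(key, score)] method.
    """
    word_sims = [('word', key, score)
                 for key, score in word_index.most_similar([origin], topn=topn)]
    tag_sims = [('tag', key, score)
                for key, score in doc_index.most_similar([origin], topn=topn)]
    return sorted(word_sims + tag_sims, key=lambda tup: -tup[2])[:topn]

# Minimal stand-ins for the two similarity indexes, for demonstration only:
class FakeIndex:
    def __init__(self, sims):
        self.sims = sims
    def most_similar(self, positive, topn=10):
        return sorted(self.sims, key=lambda kv: -kv[1])[:topn]

words = FakeIndex([('senescence', 0.81), ('telomere', 0.74)])
docs = FakeIndex([('DNA repair', 0.78), ('Ageing', 0.69)])
merged = combined_most_similar(words, docs, origin=None, topn=3)
# merged -> [('word', 'senescence', 0.81), ('tag', 'DNA repair', 0.78),
#            ('word', 'telomere', 0.74)]
```

Keeping the labels in the result tuples preserves the word/tag distinction even when a string key appears in both vector sources.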
Other ideas or implementation help is welcome!
- Gordon