Doc2Vec Comparison using Cosine Similarity

Chun Siong

Mar 20, 2017, 5:19:31 PM
to gensim
Hi

I have trained my model using 10,000 sentences. My objective is, given a search phrase, to determine the cosine similarity against one of the documents.
My question is how I should compare against the document, as I am getting mixed results between two approaches.
Approach 1: Infer the search phrase vector, infer the target document vector, then compute cosine similarity.
Approach 2: Infer the search phrase vector, retrieve the target document vector using docvecs['docid'], then compute cosine similarity.

The results between the two approaches are puzzling, as approach 2 gets a very bad cosine similarity score even when the search phrase is exactly the same as the doc.
I am wondering which is the right way?

Thanks in advance for any response.


    from gensim.models import Doc2Vec
    from gensim.models.doc2vec import TaggedDocument
    from sklearn.metrics.pairwise import cosine_similarity

    documents = []
    documents.append(TaggedDocument(['i', 'am', 'a', 'cat'], ['SENT_1']))
    documents.append(TaggedDocument(['watching', 'a', 'movie'], ['SENT_2']))
    documents.append(TaggedDocument(['doc2vec', 'rocks'], ['SENT_3']))

    model = Doc2Vec(size=10, window=8, min_count=0, workers=4)

    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

    search_phrase = ['i', 'am', 'a', 'cat']

    s1 = model.infer_vector(search_phrase, alpha=0.025, min_alpha=0.025, steps=20)

    # sklearn's cosine_similarity expects 2-D inputs, so wrap each vector in a list
    print(cosine_similarity([s1], [model.docvecs['SENT_1']]))  # ~0.00795774

    s2 = model.infer_vector(['i', 'am', 'a', 'cat'], alpha=0.025, min_alpha=0.025, steps=20)

    print(cosine_similarity([s1], [s2]))  # ~0.9999882

Gordon Mohr

Mar 20, 2017, 6:24:58 PM
to gensim
Note that you can't count on good/representative results from toy-sized examples: the beneficial qualities of dense-embedded vectors depend on a tug-of-war between many diverse examples, and forced tradeoffs in the embedding-space. 

For example, even this tiny 10-dimensional model has so many free parameters it can likely become arbitrarily good (overfit) at its training word-prediction goals for these 3 examples, in many different equally-good ways. So rather than being nudged-by-training/inference to one meaningful/generalizing place, `SENT_1` can land equally well in many gigantic ranges-of-space – which may account for wild differences from run-to-run, or via slightly-differently-initialized methods. (Even fewer dimensions or many-more-examples might address this. Also more training 'iter' may help tiny examples behave more consistently.)

Also, regarding `infer_vector()` parameters, I don't think you'd ever want `min_alpha` to be such a large value, close/equivalent to starting `alpha`. The gist of gradient descent is to start this learning-rate factor large, end small. Using a larger `steps` is good – and with very-short examples, even far more steps may help. 
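To illustrate the shape of that schedule, here is a standalone pure-Python sketch of a linear learning-rate decay (a hypothetical helper for illustration, not gensim's actual internal code): starting at a comparatively large `alpha` and ending at a small `min_alpha` floor, rather than holding both at 0.025.

```python
def alpha_schedule(alpha=0.025, min_alpha=0.0001, steps=20):
    """Linearly decay the learning rate from alpha down to min_alpha
    over the given number of inference steps."""
    if steps <= 1:
        return [alpha]
    step_size = (alpha - min_alpha) / (steps - 1)
    return [alpha - i * step_size for i in range(steps)]

rates = alpha_schedule()
print(rates[0])   # 0.025 -- large first adjustments
print(rates[-1])  # ~0.0001 -- tiny final adjustments
```

With `min_alpha=0.025` as in the code above, every one of the 20 steps makes equally large adjustments, so the vector never settles.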

Otherwise, your approach seems valid. With a sufficiently large corpus and a well-parameterized model/`infer_vector()` call, I would expect the inferred vectors for texts, and the vectors left over from bulk training for the same texts, to be similar. (And, increasingly similar with larger training `iter` and/or `steps`.) In some of the project's tests/demos, such closeness is used as a sanity-check on whether some meaningful training/generalization is occurring. 
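For reference, the closeness measure in both approaches is plain cosine similarity, the dot product of the vectors divided by the product of their lengths. A minimal numpy version (a hypothetical helper name, not gensim or sklearn code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: 1.0 means same direction,
    0.0 means orthogonal (unrelated in the embedding space)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ~1.0, identical direction
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))            # 0.0, orthogonal
```

A near-zero score like the ~0.008 in the original post therefore means the two vectors are essentially orthogonal, i.e. the model sees no relationship at all.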

- Gordon

Chun Siong

Mar 20, 2017, 7:06:16 PM
to gensim
Hi Gordon

In my actual model with 10k sentences, I do notice the cosine similarities are closer between approaches 1 and 2.
In your opinion, which approach is more sound?
Approach 1: Infer the search phrase vector, infer the target document vector, then compute cosine similarity.
Approach 2: Infer the search phrase vector, retrieve the target document vector using docvecs['docid'], then compute cosine similarity.

Gordon Mohr

Mar 21, 2017, 1:36:40 AM
to gensim
Each is plausibly productive – only testing with your corpus/goals/parameter-choices could tap one or the other as better. 

I've seen cases where re-inferring vectors for the training docs, with a very high `steps` value, can lead to more consistent/representative vectors – perhaps because after that effort, they're all on equal footing, created with a lot of effort against the same frozen model. But that's time-consuming, and doing more iterations during initial training would likely have a similar effect. 

- Gordon