Doc2Vec Comparison using Cosine Similarity

Chun Siong

Mar 20, 2017, 5:19:31 PM
to gensim
Hi

I have trained my model using 10,000 sentences. My objective is, given a search phrase, to determine the cosine similarity against one of the documents.
My question is how I should compare against the document, as I am getting mixed results between two approaches.
Approach 1: Infer the search phrase vector, infer the target document vector, then compute cosine similarity.
Approach 2: Infer the search phrase vector, retrieve the target document vector using docvecs['docid'], then compute cosine similarity.

The results between the two approaches are puzzling, as approach 2 gets a very bad cosine similarity score even when the search phrase is exactly the same as the doc.
I am wondering which is the right way?

Thanks in advance for any response.


    from gensim.models import Doc2Vec
    from gensim.models.doc2vec import TaggedDocument
    from sklearn.metrics.pairwise import cosine_similarity

    documents = []
    documents.append(TaggedDocument(['i', 'am', 'a', 'cat'], ['SENT_1']))
    documents.append(TaggedDocument(['watching', 'a', 'movie'], ['SENT_2']))
    documents.append(TaggedDocument(['doc2vec', 'rocks'], ['SENT_3']))

    model = Doc2Vec(size=10, window=8, min_count=0, workers=4)

    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.iter)

    search_phrase = ['i', 'am', 'a', 'cat']

    s1 = model.infer_vector(search_phrase, alpha=0.025, min_alpha=0.025, steps=20)

    # sklearn's cosine_similarity expects 2-D inputs, so wrap each vector in a list
    print(cosine_similarity([s1], [model.docvecs['SENT_1']]))  # ~0.00795774

    s2 = model.infer_vector(['i', 'am', 'a', 'cat'], alpha=0.025, min_alpha=0.025, steps=20)

    print(cosine_similarity([s1], [s2]))  # ~0.9999882

Gordon Mohr

Mar 20, 2017, 6:24:58 PM
to gensim
Note that you can't count on good/representative results from toy-sized examples: the beneficial qualities of dense-embedded vectors depend on a tug-of-war between many diverse examples, and forced tradeoffs in the embedding-space. 

For example, even this tiny 10-dimensional model has so many free parameters it can likely become arbitrarily good (overfit) at its training word-prediction goals for these 3 examples, in many different equally-good ways. So rather than being nudged-by-training/inference to one meaningful/generalizing place, `SENT_1` can land equally well in many gigantic ranges-of-space – which may account for wild differences from run-to-run, or via slightly-differently-initialized methods. (Even fewer dimensions or many-more-examples might address this. Also more training 'iter' may help tiny examples behave more consistently.)

Also, regarding `infer_vector()` parameters, I don't think you'd ever want `min_alpha` to be such a large value, close/equivalent to starting `alpha`. The gist of gradient descent is to start this learning-rate factor large, end small. Using a larger `steps` is good – and with very-short examples, even far more steps may help. 
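To illustrate the shape of that schedule, here is a standalone pure-Python sketch of a linear learning-rate decay (a hypothetical helper for illustration, not gensim's actual internal code): starting at a comparatively large `alpha` and ending at a small `min_alpha` floor, rather than holding both at 0.025.

```python
def alpha_schedule(alpha=0.025, min_alpha=0.0001, steps=20):
    """Linearly decay the learning rate from alpha down to min_alpha
    over the given number of inference steps."""
    if steps <= 1:
        return [alpha]
    step_size = (alpha - min_alpha) / (steps - 1)
    return [alpha - i * step_size for i in range(steps)]

rates = alpha_schedule()
print(rates[0])   # 0.025 -- large first adjustments
print(rates[-1])  # ~0.0001 -- tiny final adjustments
```

With `min_alpha=0.025` as in the code above, every one of the 20 steps makes equally large adjustments, so the vector never settles.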

Otherwise, your approach seems valid. With a sufficiently large corpus and a well-parameterized model/`infer_vector()` call, I would expect the inferred vectors for texts, and the vectors left over from bulk training for the same texts, to be similar. (And, increasingly similar with larger training `iter` and/or `steps`.) In some of the project's tests/demos, such closeness is used as a sanity-check on whether some meaningful training/generalization is occurring. 
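For reference, the closeness measure in both approaches is plain cosine similarity, the dot product of the vectors divided by the product of their lengths. A minimal numpy version (a hypothetical helper name, not gensim or sklearn code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors: 1.0 means same direction,
    0.0 means orthogonal (unrelated in the embedding space)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ~1.0, identical direction
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))            # 0.0, orthogonal
```

A near-zero score like the ~0.008 in the original post therefore means the two vectors are essentially orthogonal, i.e. the model sees no relationship at all.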

- Gordon

Chun Siong

Mar 20, 2017, 7:06:16 PM
to gensim
Hi Gordon

In my actual model with 10k sentences, I do notice the cosine similarities are closer between approaches 1 and 2.
In your opinion, which approach is more sound?
Approach 1: Infer the search phrase vector, infer the target document vector, then compute cosine similarity.
Approach 2: Infer the search phrase vector, retrieve the target document vector using docvecs['docid'], then compute cosine similarity.

Gordon Mohr

Mar 21, 2017, 1:36:40 AM
to gensim
Each is plausibly productive – only testing with your corpus/goals/parameter-choices could tap one or the other as better. 

I've seen cases where re-inferring vectors for the training docs, with a very high `steps` value, can lead to more consistent/representative vectors – perhaps because after that effort, they're all on equal footing, created with a lot of effort against the same frozen model. But that's time-consuming, and doing more iterations during initial training would likely have a similar effect. 

- Gordon