How to calculate the cosine similarity between inferred_vector and a doc in the training set


Satya Gunnam

Mar 30, 2017, 9:21:42 PM
to gensim
I am having trouble debugging the test results of my doc2vec model.
I have trained on my corpus and created doc2vec models (both DM and DBOW).
I am trying to find the nearest documents for a new doc by using infer_vector.

The problem is that I sometimes get results I do not expect:

# The document I am expecting is not in the top 10 closest docs.
# A document I am not at all expecting is coming up in the top 10 closest docs.

I am trying to debug why this is happening.

I am trying to find the cosine similarity between the inferred_vector (new document) and the doc
in the corpus which I would expect to come up in the nearest docs (top 10). I was hoping to at least
get some clue by looking at this.

docvec = model_dbow.docvecs['label of training doc expected']

print(model_dbow.docvecs.similarity(docvec, infered_vector_dbow))

I get errors on this call. Is this the right method, or is there another method I am missing?

Are there any better ways of debugging these problems?



Gordon Mohr

Mar 30, 2017, 11:21:38 PM
to gensim
The `similarity()` method doesn't take a raw vector – just tags to specify known vectors – which likely accounts for the error you got. You could perform the similarity calculation yourself in the same way the source code does, though:
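For instance, a minimal sketch of that calculation with NumPy (reusing the names from your snippet; `model_dbow` and the inferred vector are assumed to exist as in your code):

```python
import numpy as np

def cosine_similarity(v1, v2):
    # dot product of the two vectors after unit-norming each --
    # the same formula similarity() applies to its stored vectors
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# hypothetical usage, matching your snippet:
# docvec = model_dbow.docvecs['label of training doc expected']
# print(cosine_similarity(docvec, infered_vector_dbow))
```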


The defaults for the optional parameters to `infer_vector()` (including `steps=5` and `alpha=0.1`) are some crude minimal guesses for quick operation. Many more steps (10-100 or more), and an alternate starting `alpha` (perhaps closer to the training default of `0.025`) often give better inferred vectors. This may be especially important for small documents. 

You should be sure that the argument you supply to `infer_vector()` is a list of text tokens, preprocessed/tokenized the same as the `words` that were provided as part of training examples. (If you supply a string instead, it will treat each character of the string as a word – inferring a nonsense vector.)
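A toy sketch of the difference (lowercase-and-split here is just a stand-in – substitute whatever tokenization you actually used during training):

```python
text = "Error E1234 when starting the service"

# WRONG: a raw string gets iterated character-by-character,
# so inference would see 'E', 'r', 'r', 'o', ... as "words"
# model.infer_vector(text)

# RIGHT: pass a list of tokens, preprocessed like the training corpus
tokens = text.lower().split()
# model.infer_vector(tokens)
print(tokens)
```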

A rough sanity check as to whether the model's parameters (including vector size, vocabulary choices & training-iterations) have been sufficient to create a meaningful model, and if the `infer_vector()` parameters are sufficient, is to try re-inferring a vector for some text that was in the training set, then checking that a `most_similar()` using that inferred vector returns results where the text's tag is one of (if not *the*) top hit. For example:

    corpus = …  # assume same as corpus used as documents during training
    test_doc = next(iter(corpus))  # the 1st doc
    inferred_dv = d2v_model.infer_vector(test_doc.words, steps=100, alpha=0.025)
    similars = d2v_model.docvecs.most_similar(positive=[inferred_dv])  # explicit `positive` avoids misinterpretation of the raw vector
    print(similars)

Ideally the tag(s) associated with those words during training (eg, `test_doc.tags[0]`) will be among the top results. If not, there may be something amiss with corpus preparation, the choice of parameters controlling vocabulary/training, or the specific test example. 

- Gordon

Satya Gunnam

Apr 2, 2017, 12:44:56 AM
to gensim
Thanks Gordon.
I have followed all the suggestions and inputs you provided:
# Configured infer_vector with more steps and a different alpha.
# Cross-checked the model by re-inferring one of the texts from the training set and made sure its tag is one of the top hits.

And I also calculated the cosine similarity myself using the raw vectors as suggested.

I can see that by removing some text from the input the cosine similarity changes, and I do
see the expected results. But this still does not solve the problem of understanding which words in the text
are causing the problem, even though I do see the required keywords.

Is there a method/program which gives me more insight into which words in the text in question are weighing the vector
one way or the other? Or, going the other way, can I make some words in a text get more weight?

For example, I am testing doc2vec on a corpus of technical-support cases which contain error codes, and I want the error codes
to have more weight or pull when doing similarities.

BTW..I run a technical support org and you guys do a terrific job with the way you answer the queries in this mailing list.

Gordon Mohr

Apr 2, 2017, 4:36:08 PM
to gensim
Inference approximates what that same text would have received, as a vector, if it had been in bulk training. There's no specific way to weight the inference (other than leaving words out), nor side-reporting of what words most-affected the inference. 

But some thoughts:

If you do have external signals that some words are less relevant, and your experiments suggest leaving those out improves your results, that's a reasonable approach (and it'd be interesting to hear what such signals help). 

Especially in the case of PV-DBOW mode (`dm=0`), word neighbors don't matter, so you could potentially also *repeat* tokens that you know are important, to effectively give them more weight (in either training or inference). (In PV-DM modes, since neighboring words do matter, inserted extra terms might also be worth trying, but exactly where you insert them would then be more likely to have other mixed effects.)
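For instance, a hypothetical helper along those lines – the boost factor and the set of important words are entirely yours to choose:

```python
def boost_tokens(tokens, important, factor=3):
    # repeat each known-important token `factor` times total; in PV-DBOW,
    # token order doesn't matter, so appending the repeats is fine
    boosted = list(tokens)
    for tok in tokens:
        if tok in important:
            boosted.extend([tok] * (factor - 1))
    return boosted

# hypothetical usage before inference:
# model.infer_vector(boost_tokens(tokens, {"e1234"}))
```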

In the canonical/original definition of the Doc2Vec algorithm ("Paragraph Vector"), each text just has a single unique ID, and those IDs each receive a trained vector. However, it's possible to give texts multiple tags, some of which repeat between text examples, and thus learn doc-vectors for those other tags as well. If there are certain distinguished keywords/categories in your data, like error-codes or product-names, it *might* make sense to coerce those to a controlled-vocabulary and add them as tags in the training data. Each such tag would then get its own vector, and the closeness of inferred-text vectors to those vectors might be useful. (In a way, doc-vectors are like "super-words", that range over the entire text example – so promoting known-salient terms from a small subset to be doctags rather than words may make sense.)
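A sketch of building such multi-tag training data, assuming a hypothetical error-code format like `E1234` (adapt the regex to your own data; the resulting list would be passed as the `tags` of a gensim `TaggedDocument`):

```python
import re

ERROR_CODE = re.compile(r"\bE\d{4}\b")  # hypothetical error-code pattern

def make_tags(doc_id, text):
    # the doc's own unique ID plus any distinguished error codes found
    # in it; use as TaggedDocument(words=tokens, tags=make_tags(...))
    codes = sorted(set(ERROR_CODE.findall(text)))
    return [doc_id] + codes
```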
 
It's possible that other meta-parameter choices could tend to make the model better at your domain, or at inference. For example, sometimes fewer dimensions achieve better generalization (in addition to training/inferring faster), especially with limited data. (That might be one way to get a model that's better at ignoring 'noise' words, even without your own external word-inclusion tweaking.) So if you have (or can develop) some quantitative, repeatable way to score one model as better than another, a broad-search of potential training parameters could help. 

In Word2Vec, some have observed that the magnitude of raw word-vectors (before the unit-norming applied for similarity-comparisons) can be an indication of the strength/unity of a word's meaning. That is, words that mean something very specific have longer vectors, while more generic words (or words with many senses) have shorter ones. That *might* be a useful signal, alone or in combination with overall corpus/document frequencies, for treating some words specially – but this is speculation for potential experimentation, I don't know any rules-of-thumb. The same might be the case, in negative-sampling models, of the output weights that exist per-predicted word in `syn1neg` – though I wouldn't even have a guess as to whether larger or smaller magnitudes are more meaningful.
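If you want to poke at that magnitude signal, here's a sketch of ranking words by raw-vector L2 norm – synthetic vectors below; in gensim you'd substitute each word's raw (un-normed) vector from the trained model:

```python
import numpy as np

def rank_by_magnitude(word_vectors):
    # sort words by raw-vector L2 norm, longest first; per the speculation
    # above, longer may hint at a more specific/unified meaning
    return sorted(word_vectors,
                  key=lambda w: np.linalg.norm(word_vectors[w]),
                  reverse=True)

# toy illustration with made-up vectors:
vecs = {
    "e1234": np.array([3.0, 4.0]),   # norm 5.0
    "the":   np.array([0.3, 0.4]),   # norm 0.5
}
print(rank_by_magnitude(vecs))
```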

(Note that in pure PV-DBOW – `dm=0` – traditional input word-vectors are not trained at all, so they will appear random in such a model. Only by adding word-training to DBOW with `dbow_words=1`, or switching to a PV-DM mode with `dm=1`, will a Doc2Vec model's word-vectors be meaningful.)

I know that's a big dump of half-baked ideas, but hope it helps.

- Gordon