Validating Doc2Vec and/or Word2Vec results

2,712 views

utherpen...@yahoo.com

Jun 27, 2016, 4:33:42 PM
to gensim

Hi there,

I am working with the Doc2Vec and Word2Vec deep learning algorithms. Currently I am interested in using `model.n_similarity(wordSet1, wordSet2)`.


I am interested in any way of validating the model's performance — not just the `n_similarity()` function, but overall how accurate or realistic the model's results are. Since it performs deep learning, I do not know whether there is any way of telling how well it performs. Are there techniques I should look into, or a data set with known results that I should compare against?


Any suggestions are much appreciated. Thank you

Lev Konstantinovskiy

Jun 27, 2016, 6:25:24 PM
to gensim
Hi,

Word2vec and doc2vec are unsupervised, so model performance is usually measured on a downstream supervised task, like document classification. See https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
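A minimal sketch of that idea, with hand-made toy "document vectors" standing in for real Doc2Vec output (in practice you would infer these from a trained model): classify each held-out document by the label of its most-similar labeled document, and report accuracy. Better vectors should yield better downstream accuracy.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbor_accuracy(train_vecs, train_labels, test_vecs, test_labels):
    """Label each test vector with its most-similar training vector's label."""
    correct = 0
    for vec, true_label in zip(test_vecs, test_labels):
        best = max(range(len(train_vecs)), key=lambda i: cosine(vec, train_vecs[i]))
        if train_labels[best] == true_label:
            correct += 1
    return correct / len(test_vecs)

# Toy 2-d vectors: the first two cluster together, as do the last two.
train_vecs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
train_labels = ['sports', 'sports', 'politics', 'politics']
test_vecs = [(0.95, 0.15), (0.15, 0.95)]
test_labels = ['sports', 'politics']

print(nearest_neighbor_accuracy(train_vecs, train_labels, test_vecs, test_labels))  # 1.0
```

The IMDB notebook linked above does the same thing at scale, with a real classifier on top of inferred paragraph vectors.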

Regards
Lev

Gordon Mohr

Jun 27, 2016, 8:04:40 PM
to gensim
For word vectors, a common evaluation (following the original word2vec paper) is to measure how well the resulting vectors solve analogy problems. The `accuracy()` method of the gensim Word2Vec class will check a model against a list of questions in the same format as the original researchers used (and you can grab their questions file from the original word2vec.c distribution). But note: top scores on those questions (or analogies in general) might not correlate with the best word-vectors for other purposes – so it's best to devise your own project/goal-specific evaluation methods.
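Under the hood, analogy evaluation is just vector arithmetic plus a nearest-neighbor search: for "a is to b as c is to ?", find the word whose vector is closest to vec(b) - vec(a) + vec(c). A minimal pure-Python sketch, using tiny hand-made vectors rather than a real trained model (the word list and values here are purely illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy 2-d vectors arranged so the classic analogy holds.
vectors = {
    'man':   [1.0, 0.0],
    'woman': [0.0, 1.0],
    'king':  [1.0, 0.5],
    'queen': [0.0, 1.5],
    'apple': [0.7, 0.7],
}
print(solve_analogy(vectors, 'man', 'woman', 'king'))  # 'queen'
```

Gensim's `accuracy()` does essentially this over the whole questions file, bucketed by section, and reports per-section correct/incorrect counts.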

Sometimes, for word- or doc-similarity tasks, this means creating your own sets of three items A-B-C, where your assumption/goal is that the similarity between A and B should be larger than the similarity between A and C. For example, A & B might be known to be related via some prior method, and C is randomly chosen (and thus overwhelmingly likely to be less-related).
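That triplet check is easy to script once you have vectors: count the fraction of (A, B, C) triplets where sim(A, B) > sim(A, C). A sketch with toy vectors in place of a real model's output (the words and triplets here are made up for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_accuracy(triplets, vectors):
    """Fraction of (A, B, C) triplets where A is closer to B than to C."""
    hits = sum(
        cosine(vectors[a], vectors[b]) > cosine(vectors[a], vectors[c])
        for a, b, c in triplets
    )
    return hits / len(triplets)

# Toy vectors: related pairs point in similar directions.
vectors = {
    'dog': [1.0, 0.2], 'puppy': [0.9, 0.3], 'economy': [0.1, 1.0],
    'cat': [0.8, 0.4], 'kitten': [0.85, 0.35], 'senate': [0.2, 0.9],
}
triplets = [('dog', 'puppy', 'economy'), ('cat', 'kitten', 'senate')]
print(triplet_accuracy(triplets, vectors))  # 1.0
```

With real vectors you would expect well under 100%, and the score lets you compare training runs (different epochs, vector sizes, etc.) on your own data.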

In the original 'Paragraph Vectors' paper (https://arxiv.org/abs/1405.4053) section 3.3, they bootstrap such an evaluation set from an existing search engine, and want their vectors on search-result-snippets to be closer-to-each-other for search-results that co-appear from the existing system (versus to other random documents). In the "Document Embedding with Paragraph Vectors" paper (http://arxiv.org/abs/1507.07998), the existing 'category' system of Wikipedia or 'subject' labeling of Arxiv articles are used to hint which pairs-of-documents should be 'closer' (versus other randomly-selected documents). 

- Gordon