Q1
Of course, experiments often start with some eyeballed comparisons of similarity-results against what "seems reasonable" for a few hand-picked (but perhaps top-of-mind) target documents. But working from a small set of hand-chosen tests, and collecting just a few datapoints of "this seems somewhat better than the last attempt" or "somewhat worse", is a very slow, ad-hoc, and barely-reliable way of guiding improvement.
The best practice is to create some automatable quantitative evaluation that is representative of your true end goal – and use scores on that to decide whether the doc-vectors are getting better, or better than some other baseline/pre-existing method.
The evaluations in some of the original 'Paragraph Vectors' papers may be adaptable, or give ideas for similar approaches.
In two evaluations (sections 3.1 and 3.2), doc-vectors are used as training data for logistic sentiment-analysis classifiers. Since the desired sentiment answers are already known for tens-of-thousands of documents, there is suitable data to both train and test a downstream classifier, and the quality of the doc-vectors (against other methods or alternate parameter choices) is judged as better if the doc-vectors better drive sentiment-classification. If your ultimate goal is classification, and you have or can create known label-values for many documents, you can use a similar process to evaluate Doc2Vec models.
In the 3rd evaluation (section 3.3), the results of an existing system – apparently Google's well-evolved, giant-black-box (to us) search result 'snippets' generator – are used to evaluate and tune the doc-vectors. Specifically, each 'snippet' from a top-10-search-result from a top-million most-popular query is used as a document. Then, test triplets are created which each contain 2 snippets from the same query, and one snippet from some other random query. The motivating idea is that the existing 'black box' is a good judge of document relatedness, and thus pairs of snippets from the same top-10 results should get doc-vectors closer to each other than random other snippets from other results-sets. Any model can then be scored as the percentage of times that, given a snippet A and candidate snippets B and C, it properly indicates by doc-vector-similarity which snippet (of B or C) originated from the same top-10 results.
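That triplet scoring can be sketched model-agnostically – here with a plain-Python cosine helper, and a `vec_for` lookup standing in for however you map a snippet id to its doc-vector (both names are illustrative, not from any library):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def triplet_accuracy(triplets, vec_for):
    """Each triplet is (anchor, same_query, other_query) ids; the score is the
    fraction where the anchor's vector is closer to its same-query snippet."""
    correct = sum(
        1 for a, b, c in triplets
        if cosine(vec_for(a), vec_for(b)) > cosine(vec_for(a), vec_for(c))
    )
    return correct / len(triplets)

# toy demo: 'a' and 'b' point roughly the same way, 'c' is orthogonal
vecs = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0]}
acc = triplet_accuracy([('a', 'b', 'c')], vecs.__getitem__)  # → 1.0
```

With a real model, `vec_for` might be something like `lambda tag: model.docvecs[tag]`.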
The followup paper 'Document Embedding with Paragraph Vectors' <https://arxiv.org/abs/1507.07998> uses the pre-existing community-maintained categories of Wikipedia or Arxiv in a similar way: it evaluates a model by how often its vector-similarities indicate two docs from the same category are closer to each other than to a third random document.
None of these are perfect, but they allow the generation of largish test-sets from data that may already be available, and may, at least directionally, test for the same sorts of similarity most other info-retrieval or predictive-modeling downstream tasks want. If in fact your form-field texts, from the same section, in some way "should" be more similar to each other than texts from other fields, then those fields may be 'categories' usable in the same way. Maybe there are other indicators in some of your data – demographic, etc. – that are strong hints some texts should be closer than others. There's a risk these proxy measures drive the doc-vectors towards only doing well on the proxy, rather than your true end-goals – so if it's at all possible to evaluate the doc-vectors in your real end-application, by all means do so. But these can be a good start.
One other note: extremely short texts of just a few words may not get very good representations from Doc2Vec – a 30-word text compared to a 3-word text is getting 10x more effort during bulk-training or later inference. It *might* be beneficial, if performance on shorter documents is important, to figure some way to overweight them – for example, by repeating a document that's 1/Nth the average size N times randomly throughout the training set, or using N times more `steps` during inference. But you'd want to test that using a rigorous evaluation on your goals.
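One way to sketch that repetition-based overweighting, assuming an in-memory list of documents each with a `.words` list (the helper name is hypothetical, not a gensim feature):

```python
import random

def overweight_short_docs(tagged_docs, avg_len):
    """Repeat each doc roughly (avg_len / its length) times, so a doc that's
    1/Nth the average size appears about N times, scattered through the corpus."""
    out = []
    for doc in tagged_docs:
        repeats = max(1, round(avg_len / max(1, len(doc.words))))
        out.extend([doc] * repeats)
    random.shuffle(out)  # spread the repeats randomly throughout the training set
    return out
```

As noted above, whether this actually helps should be checked against a rigorous evaluation, not assumed.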
Q2
The same ideas from above can evaluate inference. An added option is to re-infer vectors for documents that were in the training set, then check if a top result from `model.docvecs.most_similar(positive=[reinferred_vector])` is the same document in the original training set. If training & inference are having the desired modeling effects, it usually should be in the 1st few results. If not, there may be data, training, or inference problems.
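A sketch of that sanity-check as a helper, using the `infer_vector` and `docvecs.most_similar` calls mentioned above (the function name and its duck-typed `model`/`tagged_docs` arguments are illustrative):

```python
def reinfer_self_check(model, tagged_docs, topn=3, steps=20):
    """Return the fraction of training docs whose re-inferred vector ranks
    their own original doc-vector among the top-`topn` nearest neighbors."""
    hits = 0
    for doc in tagged_docs:
        vec = model.infer_vector(doc.words, steps=steps)
        neighbors = [tag for tag, _sim in
                     model.docvecs.most_similar(positive=[vec], topn=topn)]
        if doc.tags[0] in neighbors:
            hits += 1
    return hits / len(tagged_docs)
```

A healthy model/corpus usually scores near 1.0 here; much lower values suggest the data, training, or inference problems mentioned above.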
Many have also reported better inference results with a larger optional `steps` value (to 20 or far more), or a different `alpha` value (such as the usual training default of 0.025 rather than 0.1).
Q3
There's a big space of possible parameter tweaks to search – but with a quantitative evaluation as above, you can automate a grid-search over many parameter values.
An important thing to realize is that "bigger isn't necessarily better" – values that retain more info (and create a larger model or slower training) don't necessarily improve the model's downstream value. Especially with larger datasets, values like 'negative' and 'window' can become smaller (to do less work and still get better results). Throwing out more words (with a larger 'min_count' or smaller 'sample') often improves model quality by spending less memory/effort on words that are either too infrequent to contribute learnable meaning or so frequent they're over-influential before downsampling. A larger vector 'size' requires more data/time to train, and is prone to overfitting on small/demo datasets.
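A minimal grid-search driver along those lines – `score_fn` stands in for whatever trains one Doc2Vec model with a given parameter combination and returns its evaluation score (all names here are illustrative):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """param_grid: dict of parameter-name -> list of candidate values.
    Calls score_fn(**params) once per combination; returns
    (best_params, best_score), where higher scores are better."""
    names = sorted(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, `score_fn` would build and train a model with those parameters, then return something like the triplet-accuracy or classification score from the evaluations discussed in Q1.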
Q4
You can try to import such vectors before training begins, to perhaps give the model a 'head-start', but in some modes (`dm=0`) they'll be ignored. They're never strictly-necessary: the modes that use word-vectors also train them concurrently with doc-vectors. The larger your corpus, and the more distinctive the language of your domain, the less likely such external other-domain vectors are to help. (They could hurt, or just waste time and complicate evaluation.) In my opinion, re-using word-vectors from elsewhere should only be tried after achieving some success in tuning/evaluation without that step, and formulating a theory from experience why their influence might be necessary.
Q5
Much published Word2Vec/Doc2Vec work retains punctuation as word-tokens. Much also does not seem to stem or remove stop words, though doing so might be helpful in some cases, or with very small corpora.
There's no need to strip non-ASCII characters, and in some domains they might be very important. (Accent-flattening might still be helpful.)
18,000,000 is a great document count – but as above, tiny docs may not get great representations, and if they're important, you may want to experiment with repeating them to give them more weight compared to longer-docs.
A unique tag per document is the classic approach, but it is also OK to repeat tags, if the docs are (even if non-contiguous in the corpus) essentially representatives of the same larger 'virtual document'. You should try to make sure the texts aren't sorted/grouped such that all similar docs are consecutive-to-each-other. (So if they come from their original source like that, it's best to perform a single shuffle at the beginning. A re-shuffle for each training pass is usually overkill.)
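If the corpus fits in memory, that single up-front shuffle is just a couple of lines (the `(words, tags)` pairs below are illustrative stand-ins for TaggedDocument objects):

```python
import random

# illustrative corpus, initially grouped by source section
corpus = [(['alpha', 'text'], [0]), (['alpha', 'more'], [1]),
          (['beta', 'text'], [2]), (['beta', 'more'], [3])]

random.shuffle(corpus)  # one shuffle before training breaks the source grouping
```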
You can save a bunch of model memory if the tags are just plain Python ints, consecutive starting from 0, so that the big dict of (string-tag)->(array-slot) need not be maintained. But if RAM isn't an issue, or you'd have to remember that string->int mapping elsewhere, don't worry about it.
There's no need to seek perfect reproducibility in inference, unless you're maintaining an automated test suite with cached 'correct' answers. The algorithm involves intentional randomness, and a 'gradually-better-until-good-enough' optimization process – so most evaluations/applications should be tolerant of slight jitter from run to run.
300 is a reasonable (and common) vector `size`; with enough data & memory it seems some published work has seen benefits up to 1000 dimensions. (With smaller datasets, 100 or fewer dimensions may be appropriate.)
5 is a reasonable starting `min_count`, but with larger datasets even larger values may be appropriate. Model size is highly influenced by surviving vocabulary, and for many projects only the first few tens-of-thousands or hundreds-of-thousands of words are significant. So if you have an effective vocabulary after `min_count` of over a million, and you're not sure those long-tail tokens are important, be sure to try larger cutoffs.
`workers` shouldn't be more than the number of CPU cores, but even if you have 8 or more cores, gensim's Python bottlenecks mean some count in the 3-8 range is usually best for throughput.
An `iter` of at least 10 seems most common in published work with large datasets – so 10 may be sufficient for initial explorations, especially with larger datasets. Eventually, or with smaller datasets, it may be worth evaluating 20 or more to see if that provides benefits.
If you're really specifying a starting `alpha=0.0001`, that's WAY outside the norm - 1/250th the usual starting default of 0.0250. It might require 100s of times more iterations to match the training that occurs with a larger starting `alpha`.
Pure PV-DBOW mode `dm=0` is fast and often the best-performing mode, within a fixed time/memory budget. But if you also need to create word-vectors, you'll either want to add `dbow_words=1` as an option to `dm=0`, or switch to a `dm=1` model.
- Gordon