Q1
Of course, experiments often start with some eyeballed comparisons of similarity-results against what "seems reasonable" for a few hand-picked (but perhaps top-of-mind) target documents. But working from a small set of hand-chosen tests, and collecting just a few datapoints of "this seems somewhat better than the last attempt" or "somewhat worse", is a very slow, ad-hoc, and barely-reliable way of guiding improvement.
The best practice is to create some automatable quantitative evaluation that is representative of your true end goal – and use scores on that to decide whether the doc-vectors are getting better, or better than some other baseline/pre-existing method.
The evaluations in some of the original 'Paragraph Vectors' papers may be adaptable, or give ideas for similar approaches.
In two evaluations (sections 3.1 and 3.2), doc-vectors are used as training data for logistic sentiment-analysis classifiers. Since the desired sentiment answers are already known for tens-of-thousands of documents, there is suitable data to both train and test a downstream classifier, and the quality of the doc-vectors (against other methods or alternate parameter choices) is judged as better if the doc-vectors better drive sentiment-classification. If your ultimate goal is classification, and you have or can create known label-values for many documents, you can use a similar process to evaluate Doc2Vec models.
In the 3rd evaluation (section 3.3), the results of an existing system – apparently Google's well-evolved, giant-black-box (to us) search result 'snippets' generator – are used to evaluate and tune the doc-vectors. Specifically, each 'snippet' from a top-10-search-result from a top-million most-popular query is used as a document. Then, test triplets are created which each contain 2 snippets from the same query, and one snippet from some other random query. The motivating idea is that the existing 'black box' is a good judge of document relatedness, and thus pairs of snippets from the same top-10 results should get doc-vectors closer to each other than random other snippets from other results-sets. Any model can then be scored as the percentage of times that, given a snippet A and candidate snippets B and C, it properly indicates by doc-vector-similarity which snippet (of B or C) originated from the same top-10 results.
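That triplet scoring can be sketched model-agnostically – here with a plain-Python cosine helper, and a `vec_for` lookup standing in for however you map a snippet id to its doc-vector (both names are illustrative, not from any library):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def triplet_accuracy(triplets, vec_for):
    """Each triplet is (anchor, same_query, other_query) ids; the score is the
    fraction where the anchor's vector is closer to its same-query snippet."""
    correct = sum(
        1 for a, b, c in triplets
        if cosine(vec_for(a), vec_for(b)) > cosine(vec_for(a), vec_for(c))
    )
    return correct / len(triplets)

# toy demo: 'a' and 'b' point roughly the same way, 'c' is orthogonal
vecs = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0]}
acc = triplet_accuracy([('a', 'b', 'c')], vecs.__getitem__)  # → 1.0
```

With a real model, `vec_for` might be something like `lambda tag: model.docvecs[tag]`.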
The followup paper 'Document Embedding with Paragraph Vectors' <https://arxiv.org/abs/1507.07998> uses the pre-existing community-maintained categories of Wikipedia or Arxiv in a similar way: it evaluates a model by how often its vector-similarities indicate two docs from the same category are closer to each other than to a third random document.
None of these are perfect, but they allow the generation of largish test-sets from data that may already be available, and may, at least directionally, test for the same sorts of similarity most other info-retrieval or predictive-modeling downstream tasks want. If in fact your form-field texts, from the same section, in some way "should" be more similar to each other than texts from other fields, then those fields may be 'categories' usable in the same way. Maybe there are other indicators in some of your data – demographic, etc. – that are strong hints some texts should be closer than others. There's a risk these proxy measures drive the doc-vectors towards only doing well on the proxy, rather than your true end-goals – so if it's at all possible to evaluate the doc-vectors in your real end-application, by all means do so. But these can be a good start.
One other note: extremely short texts of just a few words may not get very good representations from Doc2Vec – a 30-word text compared to a 3-word text is getting 10x more effort during bulk-training or later inference. It *might* be beneficial, if performance on shorter documents is important, to figure some way to overweight them – for example, by repeating a document that's 1/Nth the average size N times randomly throughout the training set, or using N times more `steps` during inference. But you'd want to test that using a rigorous evaluation on your goals.
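One way to sketch that repetition-based overweighting, assuming an in-memory list of documents each with a `.words` list (the helper name is hypothetical, not a gensim feature):

```python
import random

def overweight_short_docs(tagged_docs, avg_len):
    """Repeat each doc roughly (avg_len / its length) times, so a doc that's
    1/Nth the average size appears about N times, scattered through the corpus."""
    out = []
    for doc in tagged_docs:
        repeats = max(1, round(avg_len / max(1, len(doc.words))))
        out.extend([doc] * repeats)
    random.shuffle(out)  # spread the repeats randomly throughout the training set
    return out
```

As noted above, whether this actually helps should be checked against a rigorous evaluation, not assumed.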
Q2
The same ideas from above can evaluate inference. An added option is to re-infer vectors for documents that were in the training set, then check if a top result from `model.docvecs.most_similar(positive=[reinferred_vector])` is the same document in the original training set. If training & inference are having the desired modeling effects, it usually should be in the 1st few results. If not, there may be data, training, or inference problems.
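A sketch of that sanity-check as a helper, using the `infer_vector` and `docvecs.most_similar` calls mentioned above (the function name and its duck-typed `model`/`tagged_docs` arguments are illustrative):

```python
def reinfer_self_check(model, tagged_docs, topn=3, steps=20):
    """Return the fraction of training docs whose re-inferred vector ranks
    their own original doc-vector among the top-`topn` nearest neighbors."""
    hits = 0
    for doc in tagged_docs:
        vec = model.infer_vector(doc.words, steps=steps)
        neighbors = [tag for tag, _sim in
                     model.docvecs.most_similar(positive=[vec], topn=topn)]
        if doc.tags[0] in neighbors:
            hits += 1
    return hits / len(tagged_docs)
```

A healthy model/corpus usually scores near 1.0 here; much lower values suggest the data, training, or inference problems mentioned above.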
Many have also reported better inference results with a larger optional `steps` value (to 20 or far more), or a different `alpha` value (such as the usual training default of 0.025 rather than 0.1).
Q3
There's a big space of possible parameter tweaks to search – but with a quantitative evaluation as above, you can automate a grid-search over many parameter values.
An important thing to realize is that "bigger isn't necessarily better" – values that retain more info (and create a larger model or slower training) don't necessarily improve the model's downstream value. Especially with larger datasets, values like 'negative' and 'window' can become smaller (to do less work and still get better results). Throwing out more words (with a larger 'min_count' or smaller 'sample') often improves model quality by spending less memory/effort on words that are either too infrequent to contribute learnable meaning or so frequent they're over-influential before downsampling. A larger vector 'size' requires more data/time to train, and is prone to overfitting on small/demo datasets.
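A minimal grid-search driver along those lines – `score_fn` stands in for whatever trains one Doc2Vec model with a given parameter combination and returns its evaluation score (all names here are illustrative):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """param_grid: dict of parameter-name -> list of candidate values.
    Calls score_fn(**params) once per combination; returns
    (best_params, best_score), where higher scores are better."""
    names = sorted(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, `score_fn` would build and train a model with those parameters, then return something like the triplet-accuracy or classification score from the evaluations discussed in Q1.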
Q4
You can try to import such vectors before training begins, to perhaps give the model a 'head-start', but in some modes (`dm=0`) they'll be ignored. They're never strictly-necessary: the modes that use word-vectors also train them concurrently with doc-vectors. The larger your corpus, and the more distinctive the language of your domain, the less likely such external other-domain vectors are to help. (They could hurt, or just waste time and complicate evaluation.) In my opinion, re-using word-vectors from elsewhere should only be tried after achieving some success in tuning/evaluation without that step, and formulating a theory from experience why their influence might be necessary.
Q5
Much published Word2Vec/Doc2Vec work retains punctuation as word-tokens. Much also does not seem to stem or remove stop words, though doing so might be helpful in some cases, or with very small corpora.
There's no need to strip non-ASCII characters, and in some domains they might be very important. (Accent-flattening might still be helpful.)
18,000,000 is a great document count – but as above, tiny docs may not get great representations, and if they're important, you may want to experiment with repeating them to give them more weight compared to longer-docs.
A unique tag per document is the classic approach, but it is also OK to repeat tags, if the docs are (even if non-contiguous in the corpus) essentially representatives of the same larger 'virtual document'. You should try to make sure the texts aren't sorted/grouped such that all similar docs are consecutive-to-each-other. (So if they come from their original source like that, it's best to perform a single shuffle at the beginning. A re-shuffle for each training pass is usually overkill.)
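If the corpus fits in memory, that single up-front shuffle is just a couple of lines (the `(words, tags)` pairs below are illustrative stand-ins for TaggedDocument objects):

```python
import random

# illustrative corpus, initially grouped by source section
corpus = [(['alpha', 'text'], [0]), (['alpha', 'more'], [1]),
          (['beta', 'text'], [2]), (['beta', 'more'], [3])]

random.shuffle(corpus)  # one shuffle before training breaks the source grouping
```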
You can save a bunch of model memory if the tags are just plain Python ints, consecutive starting from 0, so that the big dict of (string-tag)->(array-slot) need not be maintained. But if RAM isn't an issue, or you'd have to remember that string->int mapping elsewhere, don't worry about it.
There's no need to seek perfect reproducibility in inference, unless you're maintaining an automated test suite with cached 'correct' answers. The algorithm involves intentional randomness, and a 'gradually-better-until-good-enough' optimization process – so most evaluations/applications should be tolerant of slight jitter from run to run.
300 is a reasonable (and common) vector `size`; with enough data & memory it seems some published work has seen benefits up to 1000 dimensions. (With smaller datasets, 100 or fewer dimensions may be appropriate.)
5 is a reasonable starting `min_count`, but with larger datasets even larger values may be appropriate. Model size is highly influenced by surviving vocabulary, and for many projects only the first few tens-of-thousands or hundreds-of-thousands of words are significant. So if you have an effective vocabulary after `min_count` of over a million, and you're not sure those long-tail tokens are important, be sure to try larger cutoffs.
`workers` shouldn't be more than the number of CPU cores, but even if you have 8 or more cores, gensim's Python bottlenecks mean some count in the 3-8 range is usually best for throughput.
An `iter` of at least 10 seems most common in published work with large datasets – so 10 may be sufficient for initial explorations, especially with larger datasets. Eventually, or with smaller datasets, it may be worth evaluating 20 or more to see if that provides benefits.
If you're really specifying a starting `alpha=0.0001`, that's WAY outside the norm - 1/250th the usual starting default of 0.0250. It might require 100s of times more iterations to match the training that occurs with a larger starting `alpha`.
Pure PV-DBOW mode `dm=0` is fast and often the best-performing mode, within a fixed time/memory budget. But if you also need to create word-vectors, you'll either want to add `dbow_words=1` as an option to `dm=0`, or switch to a `dm=1` model.
- Gordon