As you note, you might repeat a tag, and then you could consider the many texts with the same tag as part of one virtual superset document.
Similarly, there may be namable qualities of certain documents that (1) repeat across different documents; (2) sometimes co-occur with other such qualities, and sometimes not; and (3) range over the full text – much like a full-document ID tag – rather than being associated with a specific location (like a single word or synthesized word). In such a case, it seems natural (and is a small extension to the code) to train tags for these qualities like any other doc-vec – and thus have them find positions with regard to each other, much like words that co-occur at different rates do.
I'm not sure of any formally published write-ups of this technique, or where it's been rigorously evaluated against other options.
An example of "full doc tags that aren't unique IDs" could be authors: perhaps every document has its unique ID, but then one or more authors, who may repeat across documents. By throwing repeating author tags in, as extra doc-tags, it's a bit like having – in interleaved training with the classic one-ID-per-doc training – a synthesized super-document per-author in the training set. That is:
[doctags] | [word-tokens]
---------------------------------
doc1 author1 | A B C D E
doc2 author1 | D E F
doc3 author1 author2 | A B F G H
...is treated a bit like a synthesized alternate training set...
doc1 | A B C D E
doc2 | D E F
doc3 | A B F G H
author1 | A B C D E D E F A B F G H
author2 | A B F G H
(In fact, in PV-DBOW, the inner loop approximates almost exactly this: each document is trained with each of its tags individually. So a document with 2 doc-tags takes twice as long to train – it's almost like two documents. In PV-DM, the multiple doc-vecs are instead averaged together along with the context-words – though perhaps there should be an option to do it more like the PV-DBOW case.)
In the end, if everything works as hoped, you can wind up with vectors for each author in the "same space" as your document-IDs – and thus perhaps be able to do other interesting author-author or author-doc comparisons.
It may also be appropriate as a way of doing semi-supervised Doc2Vec training, where you plan to do downstream classification, and know classification-labels for some but not all of your training texts. In such a case, you'd give the unlabeled texts unique-IDs, but the labeled texts would get a tag equal to their label (and perhaps a unique ID, too). In experiments, this seems to nudge the vector-space to be more sensitive to whatever contrast in the corpus helps distinguish between the known labels – while still also learning about the text space from unlabeled examples. (However, part of this effect, especially in PV-DBOW, may be from the fact that the documents tagged N times get trained N times more, so more training is occurring for otherwise the same model meta-parameters.)
Some possible gotchas with this technique include making the model so large – with so many tags per document – that it's more prone to overfitting, or needs far more data to say anything useful. (Intuitively, I'm not sure that N total documents, no matter how many tags are available per document, can meaningfully train K * N unique doctags, with K >= 2.) Having many tags that appear on only one or a few documents may mainly preserve their initialization noise – making each less interpretable.
With that background, regarding your inference question:
Inference currently takes only a list-of-word-tokens, and calculates a single (unnamed) vector that (given the rest of the frozen model) is well-optimized for predicting those word-tokens. It doesn't take a TaggedDocument; it doesn't consider the possibility of modeling the text with multiple doc-tags (perhaps some known/fixed, and another to be inferred).
A github issue capturing wishlist features for `infer_vector()` improvements –
https://github.com/RaRe-Technologies/gensim/issues/515 – includes the idea of supplying known doctags for inference. (I suspect in such a case what the user would want is to infer the vector that best 'corrects' the token-predictions of the other fixed vectors – and it might require some careful design to achieve that. It might even require re-evaluating how multiple-tag bulk training happens in the first place, to be more sophisticated.)
- Gordon