doc2vec infer one vector for multiple documents


Kevin Yang

May 30, 2017, 11:11:30 PM
to gensim
I know that when training I can assign more than one document the same tag, and they'll all be assigned to the same document vector. Does this work in infer_vector too, or does that require that the inputs be lists of lists of words, and not TaggedDocuments?

Out of curiosity, what's the use case for giving a document more than one tag?

Gordon Mohr

May 31, 2017, 1:06:17 PM
to gensim
As you note, you might repeat a tag, and then you could consider the many texts with the same tag as part of one virtual superset document.

Similarly, there may be namable qualities of certain documents that (1) repeat across different documents; (2) sometimes co-occur with other such qualities, and sometimes not; and (3) range over the full text, much like a full-document ID tag, rather than being associated with a specific location (like a single word or synthesized word). In such a case, it seems natural (and is a small extension to the code) to train tags for these qualities like any other doc-vec – and thus have them find positions with regard to each other, much like words that co-occur at different rates do. 

I'm not sure of any formally published write-ups of this technique, or where it's been rigorously evaluated against other options. 

An example of "full doc tags that aren't unique IDs" could be authors: perhaps every document has its unique ID, but then one or more authors, who may repeat across documents. By throwing repeating author tags in, as extra doc-tags, it's a bit like having – in interleaved training with the classic one-ID-per-doc training – a synthesized super-document per-author in the training set. That is:

 [doctags] | [word-tokens]
 ---------------------------------
 doc1 author1 | A B C D E
 doc2 author1 | D E F
 doc3 author1 author2 | A B F G H

...is treated a bit like a synthesized alternate training set...

 doc1 | A B C D E
 doc2 | D E F
 doc3 | A B F G H
 author1 | A B C D E D E F A B F G H
 author2 | A B F G H

(In fact, in PV-DBOW, the inner loop does almost exactly this: it trains each document with each of its tags individually. So a document with 2 doc-tags takes twice as long to train – it's almost like two documents. In PV-DM, the multiple doc-vecs are averaged together along with context-words – though perhaps there should be an option to do it more like the PV-DBOW case.)

In the end, if everything worked as hoped, you can wind up with vectors for each author in the "same space" as your document-IDs – and thus perhaps be able to do other interesting author-author or author-doc comparisons. 

It may also be appropriate as a way of doing semi-supervised Doc2Vec training, where you plan to do downstream classification, and know classification-labels for some but not all of your training texts. In such a case, you'd give the unlabeled texts unique-IDs, but the labeled texts would get a tag equal to their label (and perhaps a unique ID, too). In experiments, this seems to nudge the vector-space to be more sensitive to whatever contrast in the corpus helps distinguish between the known labels – while still also learning about the text space from unlabeled examples. (However, part of this effect, especially in PV-DBOW, may be from the fact that the documents tagged N times get trained N times more, so more training is occurring for otherwise the same model meta-parameters.)

Some possible gotchas with this technique include making the model so large – with so many tags per document – that it's more prone to overfitting or needs far more data to say anything useful. (Intuitively, I'm not sure that N total documents, no matter how many tags are available per document, can meaningfully train K * N unique doctags, with K >= 2.) Having many tags that only appear on 1 or a few documents may mainly preserve initialization noise – making each less interpretable. 

With that background, regarding your inference question:

Inference currently takes only a list-of-word-tokens, and calculates a single (unnamed) vector that (given the rest of the frozen model) is well-optimized for predicting those word-tokens. It doesn't take a TaggedDocument; it doesn't consider the possibility of modeling the text with multiple doc-tags (perhaps some known/fixed, and another to be inferred). 

A github issue capturing wishlist features for `infer_vector()` improvements – https://github.com/RaRe-Technologies/gensim/issues/515 – includes the idea of supplying known doctags for inference. (I suspect in such a case what the user would want is to infer the vector that best 'corrects' the token-predictions of the other fixed vector – and it might require some careful design to achieve that. It might even require re-evaluating how multiple-tag bulk training happens in the first place, to be more sophisticated.)

- Gordon

Kevin Yang

Jun 13, 2017, 4:29:59 PM
to gensim
I'm loving these detailed responses. 

So let's say that my "documents" are actually strings of characters with no spaces. To break them up into words, I chop them into chunks (k-mers) of length k. But there are k possible starting points, so now I have k times as many mini-documents as before. What I've been doing is sending all of these in as separate documents during training; then during inference, I infer vectors for the k mini-documents of each new document and average them. If I could pass tags to infer_vector, then I could give the k mini-documents produced by each document in training the same tag, and do the same during inference, right? Which should speed things up, because it requires storing fewer document vectors. 
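For concreteness, the chopping scheme described above might look like this (a sketch in plain Python; `kmer_docs` is a hypothetical helper name):

```python
def kmer_docs(seq, k):
    """Split a spaceless string into k 'mini-documents' of
    non-overlapping k-mers, one per starting offset."""
    docs = []
    for offset in range(k):
        tokens = [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        docs.append(tokens)
    return docs

print(kmer_docs("ABCDEFG", 3))
# [['ABC', 'DEF'], ['BCD', 'EFG'], ['CDE']]
```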

The other thing I've noticed is that infer_vector is very sensitive to the order in which you pass it documents to be inferred and to the number of steps. I'm currently getting around this by passing the documents I want to infer to it 100 times, in a different (random) order each time, and with one step each time. 

Gordon Mohr

Jun 14, 2017, 1:37:09 AM
to gensim
You may want to try pure PV-DBOW mode (`dm=0`) – it's fast and often a top performer. In that mode, individual word order (and the nearness of word-tokens) isn't significant, because words *aren't* used to predict nearby words. (Only the full-doc vector is used to predict all present words.) In such a case, you could mix the k-mers from all the slightly-different-aligned variants of the same document into one combined document – both during training and inference. (You could even shuffle their order after the concatenation, which might help prevent any one k-variant from having a different influence by being early or late in all examples.)
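The concatenate-and-shuffle step is simple to sketch in plain Python (`combined_doc` and the seed argument are illustrative, not part of gensim):

```python
import random

def combined_doc(variants, seed=None):
    """Concatenate the k alternately-aligned k-mer lists into one
    combined document, then shuffle, since token order carries no
    signal in pure PV-DBOW (dm=0)."""
    tokens = [t for variant in variants for t in variant]
    random.Random(seed).shuffle(tokens)
    return tokens

variants = [['ABC', 'DEF'], ['BCD', 'EFG'], ['CDE']]
doc = combined_doc(variants, seed=0)
print(sorted(doc))  # ['ABC', 'BCD', 'CDE', 'DEF', 'EFG']
```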

If you expect the k-mer to k-mer nearness contexts to be important, you might prefer PV-DM mode, or PV-DBOW with the optional `dbow_words=1` interleaved skip-gram training. In such a case, you might provide the k alternately-aligned variants of the same document with the same doc-tag, or even still concatenate them (and figure that the artifact overlapping contexts at the 'seams' won't hurt much if at all). 

I'd somewhat expect concatenations of all the variants, into a document that is inferred once, to perform better than performing k separate inferences and averaging the results – because it lets the inference process find the overall most-predictive vector across all alternate k-mer sets. But it'd be something worth testing for your purposes. 

I don't completely understand your "if I could pass tags to infer_vector" hypothetical – given that it refers to a not-implemented (and thus not even fully defined) feature, and it's unclear to me what this would let you do different than what I describe above. (You can already give the k variant docs the same tag.)

`infer_vector()` results can vary, on the exact same input tokens, because of inherent randomness in negative-sampling, frequent-word downsampling, and (in some modes) window-trimming. But a better way to reduce the 'jitter' from run-to-run is to *increase* the `steps` optional parameter. One-step inferences will leave each inferred-vector almost at its randomly-initialized position; many steps allow the optimization more chance to find a much-better position (and thus also more-similar-to-each-other near-best positions on subsequent runs). 

- Gordon