Doc2vec infer article (opposite of infer_vector)

E G

Aug 10, 2018, 6:49:30 PM
to Gensim
Is there a way to infer an 'article/paragraph' from a point in a doc2vec vector space created from a corpus, basically the reverse of `infer_vector`? If not, what would the code look like to modify gensim to add this? I imagine I would provide a vector and an article word length, and then get back an 'article.'

I know the original doc2vec paper speaks to this under section 2.3, "Paragraph Vector without word ordering: Distributed bag of words."

This is a bit weird, but I want to experiment with this for an odd use case.

Gordon Mohr

Aug 10, 2018, 9:21:58 PM
to Gensim
Do you mean, generate a text (series of word-tokens) from a vector? 

I know some deep/recurrent language models can do this from their own native summary vectors, and may even come up with vaguely grammatical texts. But `Doc2Vec` is pretty shallow/simple-minded. I believe the best you could hope for would be some indication of which words are most indicated by a vector, and even that would be meaningful and interpretable only for some model types.

For `Word2Vec`, there's an experimental `predict_output_word()` method that, when given one or more context words, runs the same model forward-propagation as is used during training to report the words most predicted by the model. It only works for negative-sampling models, and doesn't apply quite the same context-weighting as is enforced during training, but it may be of interest as a possible approach, since `Doc2Vec` works very similarly to `Word2Vec`. You can view its source in `gensim/models/word2vec.py`.
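
For instance, with a throwaway toy corpus (gensim 4.x names assumed, e.g. `vector_size` rather than the older `size`):

```python
from gensim.models import Word2Vec

# toy corpus just to have something runnable; any negative-sampling model works
sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "dog", "sleeps"]] * 50
model = Word2Vec(sentences, vector_size=50, window=2, negative=5,
                 min_count=1, seed=1, epochs=20)

# given context word(s), report the words the model most predicts
print(model.predict_output_word(["quick", "brown"], topn=5))
# -> list of (word, probability) pairs
```
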
The top-N most-activated word-output-nodes *might* make a reasonable, but non-grammatical, synthetic text for a given `Doc2Vec` doc-vector.
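
As a rough sketch of that idea, assuming a gensim 4.x negative-sampling model (where the output-layer weights are exposed as `syn1neg`), and with `d2v_model` standing in for whatever trained model you have:

```python
import numpy as np

def most_activated_words(model, doc_vector, topn=10):
    # Same forward-propagation as predict_output_word(), but fed a
    # doc-vector instead of averaged context-word vectors.
    # `syn1neg` (negative-sampling output weights) is an internal detail;
    # this won't work for hierarchical-softmax models.
    scores = model.syn1neg @ doc_vector      # one activation per vocab word
    probs = np.exp(scores - scores.max())    # softmax, for readable numbers
    probs /= probs.sum()
    top = np.argsort(-probs)[:topn]
    return [(model.wv.index_to_key[i], float(probs[i])) for i in top]

# e.g. with an inferred vector:
# vec = d2v_model.infer_vector("some new document".split())
# print(most_activated_words(d2v_model, vec))
```
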

(Of course, if the model was trained with many unique-ID doctags for known texts, then the usual `most_similar()` operation would suggest which known texts are similar to a given new query vector. You could also consider somehow mixing those top-N known texts together to synthesize a plausible text for the new vector, perhaps using repeated words, or words that are themselves 'close to' other words in the superset of words from all known nearby texts.)
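
For instance (a sketch against the current gensim 4.x API, where doc-vectors live under `model.dv`):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy model with unique-ID doctags, just for illustration
docs = [TaggedDocument(["the", "quick", "brown", "fox"], [0]),
        TaggedDocument(["the", "lazy", "dog", "sleeps"], [1])]
model = Doc2Vec(docs, vector_size=50, negative=5, min_count=1, epochs=40)

# which known texts sit nearest a new query vector?
vec = model.infer_vector(["a", "quick", "fox"])
print(model.dv.most_similar([vec], topn=2))   # -> [(doctag, similarity), ...]
```
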

- Gordon

Pete Bleackley

Aug 11, 2018, 4:22:47 AM
to gen...@googlegroups.com
Funny thing is, I was thinking of doing this myself recently. Here's what I came up with (caveat: I haven't actually tried it yet).

Start with a sequence of random word vectors. For each position in the sequence, predict a new word vector, given the document vector and the surrounding word vectors. Iterate until the word vectors converge. Then find the closest word for each word vector.
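
In (equally untested) code, that might look something like the following, assuming a gensim 4.x PV-DM model trained with negative sampling (so `syn1neg` holds the output weights); the function name and parameters are made up for illustration:

```python
import numpy as np

def synthesize_text(model, doc_vector, length=12, window=2, max_iter=50):
    rng = np.random.default_rng(0)
    # start with a sequence of random word vectors
    vecs = rng.normal(scale=0.1,
                      size=(length, model.vector_size)).astype(np.float32)
    for _ in range(max_iter):
        new_vecs = vecs.copy()
        for pos in range(length):
            # context = document vector plus surrounding word vectors,
            # averaged PV-DM-style
            lo, hi = max(0, pos - window), min(length, pos + window + 1)
            ctx = [vecs[i] for i in range(lo, hi) if i != pos]
            mean = (doc_vector + np.sum(ctx, axis=0)) / (len(ctx) + 1)
            # forward-propagate; use the most-probable word's own vector
            # as the 'predicted word vector' for this position
            best = int(np.argmax(model.syn1neg @ mean))
            new_vecs[pos] = model.wv.vectors[best]
        if np.allclose(new_vecs, vecs, atol=1e-4):   # converged
            break
        vecs = new_vecs
    # finally, map each word vector to its closest vocabulary word
    return [model.wv.similar_by_vector(v, topn=1)[0][0] for v in vecs]
```
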
