different lengths for the document vectors and word vectors in doc2vec

Kevin Yang

May 26, 2017, 9:16:48 PM
to gensim
How difficult would it be to add the ability to have different lengths for the document and word vectors in doc2vec? If it's not too difficult, I'd like to take a stab at it, but if it is, I might be better off implementing it from scratch in something like tensorflow. 

iv...@radimrehurek.com

May 27, 2017, 2:37:32 AM
to gensim
Hello Kevin,

Can you describe what you mean more concretely?
Doc2Vec already accepts input texts of different lengths.
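For example, a minimal sketch (parameter and attribute names like `vector_size` and `model.dv` assume a recent gensim release):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Input documents can have very different lengths; each just needs tokens and a tag.
docs = [
    TaggedDocument(words=["short", "text"], tags=["doc_0"]),
    TaggedDocument(words=["a", "much", "longer", "text", "with", "many", "more", "tokens"],
                   tags=["doc_1"]),
]

model = Doc2Vec(docs, vector_size=64, min_count=1, epochs=20)
print(model.dv["doc_0"].shape)  # (64,) -- all doc-vectors share one dimensionality
```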

Kevin Yang

May 27, 2017, 2:44:19 AM
to gensim
I mean the embedding vectors learned by doc2vec. For example, if I wanted it to learn 8-dimensional word embeddings and 64-dimensional document embeddings. 

Gordon Mohr

May 29, 2017, 1:17:28 AM
to gensim
Because of the way typical training modes work, there might not be much benefit to such an option – at least not compared to just training word-vectors and doc-vectors in totally separate models. Consider:

In pure PV-DBOW, word-vectors aren't trained. 

In PV-DBOW interleaved with skip-gram word-training, the usual benefits sought are:

(1) word-vectors & doc-vectors that are in the 'same space' - same dimensionality, proximities mean similarities; and
(2) all the sliding word-context windows serve as sort-of micro-documents, perhaps working as a kind of corpus-extension to make the doc-vector space more expressive

Both of these would be lost with different-sized vectors, and any alternating training of doc-vectors (of one size) and word-vectors (of another size) would be essentially like training two wholly separate models – which is already easy enough to do as separate steps. 
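If different sizes are really wanted, that separate-steps route is just two independent trainings. A minimal sketch (assuming recent gensim parameter names like `vector_size`):

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized = [["some", "tokens", "here"], ["more", "tokens", "in", "another", "text"]]
tagged = [TaggedDocument(words, tags=[i]) for i, words in enumerate(tokenized)]

# Small word-vectors from a plain skip-gram Word2Vec model...
word_model = Word2Vec(tokenized, vector_size=8, sg=1, min_count=1, epochs=20)

# ...and larger doc-vectors from pure PV-DBOW (dm=0), which leaves word-vectors untrained.
doc_model = Doc2Vec(tagged, vector_size=64, dm=0, min_count=1, epochs=20)
```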

In PV-DM with either summing or averaging of context-vectors, the doc-vectors and word-vectors must be summable/average-able, so must be the same size. 

Only in PV-DM with a concatenative input layer (`dm=1, dm_concat=1`) would combined training of different-sized word-vectors and doc-vectors *possibly* make sense. However, this mode is still best considered experimental. Despite the claims of the original 'Paragraph Vectors' paper, it doesn't seem to offer a noticeable win over other modes. (The results reported there have never to my knowledge been reproduced.) It creates a giant, slow-to-train model. Perhaps, on giant datasets, or with far more training iterations, or with other modifications not detailed in published work, this mode is worthwhile. But for now it's of dubious value.
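For reference, a minimal sketch of how that mode is configured (parameter names assume a recent gensim release):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(["some", "example", "tokens"], tags=[0]),
          TaggedDocument(["another", "short", "example"], tags=[1])]

# Concatenative PV-DM: the input layer concatenates the doc-vector with all 2*window
# word-vectors, so it is (2*window + 1) * vector_size wide -- hence the giant, slow model.
model = Doc2Vec(tagged, vector_size=100, dm=1, dm_concat=1, window=5, min_count=1, epochs=20)
```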

So, before adding new tunable options (like mixed-size word-and-doc-vectors) to this experimental mode, it'd be good to find some conditions (such as a particular dataset/task) where this mode is valuable, and can be realistically evaluated. (I suppose also there is some chance there's a bug with gensim's implementation of this mode, which I wrote when trying to reproduce the paper's results. But I've reviewed the code quite closely several times, and it does seem to behave in the general ways one would expect. Also, I haven't yet come across other implementations of this mode, in Python or other languages, against which we could double-check its results. If an alternate implementation can be found, that comparison could be interesting.)

- Gordon

Kevin Yang

May 30, 2017, 8:50:41 PM
to gensim
First, thank you for the very thorough response. 

My particular use case has very few words (20 to a few thousand) in the vocabulary, so I was wondering whether I might be able to concatenate smaller word vectors (d = 8 or so) with a slightly larger document vector (d = 64 or so).
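Roughly the shape of what I'm imagining, if done as two separate models combined downstream (just an illustrative sketch with gensim names as in recent releases; averaging the small word-vectors per document is only one possible way to combine them):

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tokenized = [["tiny", "vocab", "example"], ["another", "tiny", "example", "text"]]
tagged = [TaggedDocument(words, tags=[i]) for i, words in enumerate(tokenized)]

word_model = Word2Vec(tokenized, vector_size=8, sg=1, min_count=1, epochs=50)  # small word-vectors
doc_model = Doc2Vec(tagged, vector_size=64, dm=0, min_count=1, epochs=50)      # larger doc-vectors

def combined_vector(i, words):
    # Average the 8-d word-vectors for this document's words, then append its 64-d doc-vector.
    wv_mean = np.mean([word_model.wv[w] for w in words if w in word_model.wv], axis=0)
    return np.concatenate([wv_mean, doc_model.dv[i]])  # 8 + 64 = 72 dimensions

features = [combined_vector(i, words) for i, words in enumerate(tokenized)]
```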