Paragraph Matrix D in doc2vec

Tirthankar Ghosal

Aug 6, 2016, 12:17:22 AM
to gensim
Dear all

I have been trying my hand at gensim and going through the original Paragraph Vector paper by Le and Mikolov. I have some queries:

As specified in the original Paragraph Vector paper:

1. Just like word vectors, the paragraph matrix is said to be randomly initialized. Each column in the paragraph matrix represents a paragraph, and each column in the word matrix represents a word. What do the rows in the paragraph matrix represent?

Suppose we have 3 documents :

para1: Two for tea and tea for two. You for me and me for you.
para2: Tea for me and tea for you.
para3: You for me and me for you. Two for tea and tea for two.

with vocabulary V = [two,tea,me,you]

What would the corresponding paragraph matrix and word matrix be? Are they simply the term-document and one-hot representations?

2. How is word ordering preserved in the case of paragraph vectors?

3. Suppose for a certain set of documents we have all the words in the vocabulary,

e.g. d1: Chris is a good boy.
     d2: Paris is a beautiful city.
      V: [Chris, is, a, good, boy, Paris, beautiful, city]

What would the document representations look like? How would they differ vectorially?

4. How is the update to the paragraph matrix D made in the next epoch? Is it done "to get paragraph vectors D for new paragraphs (never seen before) by adding more columns in D and gradient descending on D while holding W, U, b fixed", as in the original paper?

Gordon Mohr

Aug 8, 2016, 6:47:01 PM
to gensim
(1) The paragraph representations (columns) are, like word-vectors, also dense representations. So the rows are just arbitrary dimensions: values to be learned that, at the end of the training process, have proved useful for making predictions. (They're not sparse term-occurrence vectors, and individual rows don't mean specific things, just as each dimension of a word2vec vector doesn't mean a specific thing.)
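For example, a minimal sketch (assuming a Doc2Vec `model` already trained on the toy corpus above, so 'para1' is a document tag and 'tea' is in the vocabulary; attribute names follow recent gensim releases and may differ in older versions):

```python
# Assumes `model` is an already-trained gensim Doc2Vec instance whose
# training corpus included a document tagged 'para1'.
doc_vec = model.dv['para1']    # one column of the paragraph matrix D
word_vec = model.wv['tea']     # one column of the word matrix W

print(doc_vec.shape)   # (vector_size,) -- dense, not vocabulary-sized
print(doc_vec[:5])     # arbitrary learned floats, not term counts
print(word_vec.shape)  # same dimensionality (with default settings)
```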

(2) Order isn't strictly preserved. However, the "PV-DM" mode, by using an input-context that's a combination of nearby words and the paragraph-vector, means neighboring words influence each other's representations, and more so the more often and the closer together they appear. In this way, PV-DM is most like word2vec's CBOW mode. If you already understand word2vec, it may be easiest to think of PV as word2vec with special pseudowords, one per paragraph, which are considered 'near' (and thus participating in) every context.
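In gensim terms (a rough sketch; parameter names follow recent gensim releases), that mode choice is just the `dm` flag, and each document's tag plays the role of the paragraph pseudoword:

```python
from gensim.models.doc2vec import Doc2Vec

# dm=1 selects the PV-DM mode described above (closest in spirit to
# word2vec CBOW); dm=0 selects PV-DBOW (closer in spirit to skip-gram).
# The per-document tag acts as a pseudoword that is treated as present
# in every context window of its document.
pv_dm = Doc2Vec(dm=1, vector_size=100, window=5, min_count=2)
pv_dbow = Doc2Vec(dm=0, vector_size=100, min_count=2)
```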

(3) You could run that with many iterations (& a *very* small dimensionality) to see exactly what happens. But note that toy-sized corpora, with only a handful of documents or tokens, often don't illustrate the sort of continuous, balance-of-many-influences relationships that are the usual goal of dense representations.
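A hedged sketch of that kind of toy run, using the three paragraphs from the original question (again with recent-gensim parameter names):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# The three toy paragraphs from the question, lowercased and tokenized.
texts = [
    "two for tea and tea for two you for me and me for you",
    "tea for me and tea for you",
    "you for me and me for you two for tea and tea for two",
]
docs = [TaggedDocument(words=t.split(), tags=['para%d' % (i + 1)])
        for i, t in enumerate(texts)]

# Very small dimensionality and many epochs so the numbers are easy to
# eyeball; results will still vary from run to run.
model = Doc2Vec(docs, dm=1, vector_size=2, window=2, min_count=1, epochs=500)

for tag in ('para1', 'para2', 'para3'):
    print(tag, model.dv[tag])
print(model.dv.similarity('para1', 'para3'))  # para1 and para3 share all words
```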

(4) Each incremental update to the doc-vectors that are in training happens after a specific attempted word-prediction task, for each word in each text example. If the paper isn't clear, you may want to go to the source code to trace what's happening in full detail. The DBOW mode is simpler than the DM mode(s), and the pure-Python code is a bit easier to follow than the cython-optimized code, so I'd suggest starting there:


You can also see how `infer_vector()` is really just additional training - but with options set to ensure the prior model is totally frozen against changes, so that only the candidate vector for the new text is adjusted with each training pass:


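A hedged illustration of what that looks like from the caller's side (attribute and parameter names follow recent gensim releases):

```python
# Assumes `model` is an already-trained gensim Doc2Vec instance.
# infer_vector() runs additional training passes, but only the returned
# candidate vector is adjusted; the model's word-vectors, doc-vectors and
# internal weights stay frozen.
new_text = ['tea', 'for', 'me', 'and', 'tea', 'for', 'you']
vec = model.infer_vector(new_text, epochs=100)
print(model.dv.most_similar([vec], topn=3))  # nearest trained doc-tags
```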
- Gordon

Tirthankar Ghosal

Aug 11, 2016, 11:05:45 PM
to gensim
Thank you, I have understood a bit. But the original paper says that "word ordering is preserved in PV", which is presented as an advantage over bag-of-words or n-gram models. Could you please comment on that?
Also, could you please shed some light on the initialization of the paragraph matrix D and the word matrix W?

Gordon Mohr

Aug 13, 2016, 4:48:47 PM
to gensim
I don't think the PV paper's word-choice is precise when talking about "word ordering". The windowing really just gives some influence to "word proximity". In PV-DM (or word2vec skip-gram or word2vec CBOW), with a 'window' of 1, and with regard to the target word 'hat', both of these contexts are identical:

'black hat color'
'color hat black'

In PV-DM (and skip-gram and CBOW), the ordering doesn't matter, just that 'hat' is often next to (before or after) 'black' and 'color'. So it's not strictly 'order', but it perhaps gains some of the same advantages as n-gram methods that do bind words together with their neighbors.
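A small sketch (plain Python, not gensim's actual windowing code) of why the two orderings are indistinguishable with a symmetric window of 1:

```python
# For each target word, collect its neighbors within the window as an
# unordered set -- which is all a CBOW/PV-DM-style context preserves.
def target_contexts(tokens, window=1):
    pairs = set()
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = frozenset(tokens[lo:i] + tokens[i + 1:hi])
        pairs.add((target, context))
    return pairs

a = target_contexts("black hat color".split())
b = target_contexts("color hat black".split())
print(a == b)  # True: 'hat' sees {'black', 'color'} either way
```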

The doc-vector matrix ('D') and word-vector matrix ('W') are just initialized with random, low-magnitude vectors. The exact initialization, copied in gensim from the original word2vec.c code, can be seen at:


I haven't seen a motivation for this exact approach, nor especially for why the random values are divided by the number of dimensions (meaning lower-magnitude initial vectors when working with more dimensions)... but I suspect it's just something that's worked well in practice.
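For reference, a hedged sketch of the style of initialization being described (small uniform-random values scaled down by the dimensionality; not gensim's exact code):

```python
import numpy as np

def init_vector(vector_size, rng):
    # Uniform in [-0.5, 0.5), then scaled by the dimensionality, so
    # higher-dimensional models start with lower-magnitude vectors.
    return (rng.random(vector_size) - 0.5) / vector_size

rng = np.random.default_rng(1)
vec = init_vector(100, rng)
print(vec.min(), vec.max())  # everything stays within +/- 0.5 / vector_size
```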

- Gordon

Tuyen Hoang Dinh

Sep 2, 2016, 9:45:28 AM
to gensim
Dear Gordon,

The paragraph vector is created by averaging the word vectors.
Please let me know: what is the difference between the Doc2Vec model and the averaging method, and how can word vectors be combined to create a doc-vector?

Thank you so much.





At 13:48:47 UTC-7 on Saturday, August 13, 2016, Gordon Mohr wrote:

Gordon Mohr

Sep 7, 2016, 2:10:48 PM
to gensim
An average (or weighted average) of all the word-vectors in a text is one simple way to get one vector for the full text, and works well as a baseline for certain tasks. That's what the Kaggle post you link to tries, and it's also sometimes called a "doc-2-vec" process, because it does fit the description. 

But, that's not the algorithm used in gensim's Doc2Vec, which matches the Mikolov/Le "Paragraph Vectors" paper (https://arxiv.org/abs/1405.4053). In this approach, a word2vec-like process is used to learn doc-vectors which, rather than being directly composed from word-vectors, are iteratively learned like word-vectors themselves, based on how well they can predict other words in the same text. 
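To make the contrast concrete, here's a rough sketch of the word-averaging baseline (the helper name and weighting scheme are illustrative, not a gensim API):

```python
import numpy as np

def average_doc_vector(tokens, wv, weights=None):
    # `wv` is assumed to be a gensim KeyedVectors instance (e.g. the .wv
    # attribute of a trained Word2Vec model). The doc-vector is simply the
    # (optionally weighted) mean of the word-vectors of the in-vocabulary
    # tokens -- unlike Doc2Vec, nothing document-specific is learned.
    vecs, wts = [], []
    for tok in tokens:
        if tok in wv:
            vecs.append(wv[tok])
            wts.append(1.0 if weights is None else weights.get(tok, 1.0))
    if not vecs:
        return np.zeros(wv.vector_size, dtype=np.float32)
    return np.average(np.array(vecs), axis=0, weights=wts)
```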

Where it's important to distinguish algorithms, I've started to refer to the Paragraph Vectors approach as "PV-Doc2Vec", and the weighted-word-averaging approach as "WW-Doc2Vec". (Other ways of obtaining a vector from longer texts could be similarly abbreviated – for example the "Skip-Thoughts" RNN approach as "ST-Doc2Vec", etc.)

- Gordon