Doc2Vec - documents with sentences


pca...@gmail.com

Nov 3, 2016, 8:21:36 AM
to gensim
I am interested in using Doc2Vec with multiple documents, each consisting of multiple sentences. (Doc_n = "Sent1, ..., Sentn")
How do I handle punctuation in a TaggedDocument, which takes a list of string tokens as input?

Should I just remove punctuation and represent each document as a list of word tokens?
Should I create multiple TaggedDocuments per document, all sharing the same document tag?

Thanks in advance.

Side question: can I use pairwise_distances from the sklearn package on the doc2vec model's `docvecs`?

pca...@gmail.com

Nov 3, 2016, 8:36:36 AM
to gensim
Or a third option: make punctuation marks into tokens for each document?

pca...@gmail.com

Nov 3, 2016, 1:39:13 PM
to gensim
How about stemming? Would that typically hurt the accuracy of a Doc2Vec model?

Gordon Mohr

Nov 3, 2016, 2:56:52 PM
to gensim
Tokenization is up to you. You can certainly include multiple sentences in one text example. (The paper on which gensim Doc2Vec is based called the process 'Paragraph Vectors', and the technique can be applied to documents of arbitrary length.) 

Some people strip punctuation, others turn punctuation marks into tokens just like words. (The original word2vec and Paragraph Vectors papers both mentioned retaining punctuation as tokens.)
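Retaining punctuation as tokens can be done with a simple regex tokenizer. A minimal sketch (the function name and regex are mine, not from gensim):

```python
import re

def tokenize_keep_punct(text):
    # Lowercase, then split into runs of word characters OR single
    # punctuation marks, so '.' and ',' become vocabulary items too.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize_keep_punct("Hello, world. This is sentence two!")
# tokens: ['hello', ',', 'world', '.', 'this', 'is', 'sentence', 'two', '!']
```

The resulting token list can be passed directly as the `words` argument of a TaggedDocument.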

You can break a large single document (that would otherwise be presented once with one tag) into multiple documents (each with that same tag repeated), and the effect during training will be very similar. (There will be some lost word-to-word influence, in some training modes, between words that are no longer within each other's context windows.) But the only time this would be a big benefit is if your documents are over 10,000 tokens in length. That's an implementation limit in the gensim optimized code. (Larger documents are truncated, so the words beyond 10,000 have no influence – unless you pre-split them like this.)
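A minimal sketch of that pre-splitting, assuming a plain list of tokens; the helper name is hypothetical:

```python
def split_for_training(tokens, tag, max_len=10000):
    """Split one long token list into chunks that all reuse the same tag.

    gensim's optimized training code truncates each text at 10,000
    tokens, so words beyond that limit have no influence unless the
    document is pre-split like this.  Each (chunk, [tag]) pair can then
    be wrapped as TaggedDocument(words=chunk, tags=[tag]).
    """
    return [(tokens[i:i + max_len], [tag])
            for i in range(0, len(tokens), max_len)]
```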

The `docvecs` property of a Doc2Vec model is not itself a numpy array and so would not be suitable as input to sklearn's `pairwise_distances`. But, its internal properties `doctag_syn0` or `doctag_syn0norm` (unit-normalized vectors) would be. Beware the size of the full distance matrix with typical document counts. (The pairwise float32 cosine distances between just 100,000 documents would take 40GB of RAM.)
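To illustrate with plain numpy (a small random-free stand-in array plays the role of `doctag_syn0`; the sklearn call in the comment is the equivalent one-liner):

```python
import numpy as np

# Stand-in for model.docvecs.doctag_syn0: one float32 row per document tag.
doc_vectors = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]], dtype=np.float32)

# Unit-normalize the rows (this is what doctag_syn0norm holds); cosine
# distance is then 1 minus the dot product of the normalized vectors.
norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
unit = doc_vectors / norms
cosine_dist = 1.0 - unit @ unit.T
# sklearn equivalent: pairwise_distances(doc_vectors, metric='cosine')
```

Note the full matrix is n_docs × n_docs floats, which is the source of the 40GB figure above (100,000² × 4 bytes).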

Some projects stem words before Word2Vec/Doc2Vec training and others don't. The primary papers don't seem to, as they often work on sufficiently copious data to learn good models of alternate word forms independently. What's best for your project probably depends on your data and ultimate goals.

- Gordon

pca...@gmail.com

Nov 6, 2016, 10:44:58 AM
to gensim
Thanks, that cleared things up!

I can see the memory problem. Sorry for all these questions.

Would it be possible to use sklearn's K-means implementation on trained doc2vec vectors accessed via syn0 or syn0norm?
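A minimal sketch of the clustering being asked about, assuming sklearn is installed and using a small hand-made array as a stand-in for the trained document vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for model.docvecs.doctag_syn0 (one float32 row per document).
doc_vectors = np.array([[0.0, 0.0],
                        [0.1, 0.0],
                        [5.0, 5.0],
                        [5.1, 4.9]], dtype=np.float32)

# Cluster the document vectors directly; n_init and random_state are set
# explicitly for reproducibility across sklearn versions.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_vectors)
labels = km.labels_
```

Unlike the full pairwise-distance matrix, K-means only needs the n_docs × dims vector array in memory, so it scales to far larger document counts.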

pca...@gmail.com

Nov 7, 2016, 5:14:50 AM
to gensim
Why are the syn0 and syn0norm arrays larger than the number of documents? I'm using ints as tags, one tag per document, starting from zero and counting up.
len(model.docvecs) returns the correct number of vectors (one for each document), but syn0 returns a significantly larger number.


pca...@gmail.com

Nov 7, 2016, 5:42:59 AM
to gensim
Oh, I assume it must be model.docvecs.doctag_syn0 and model.docvecs.doctag_syn0norm that I have to use :-)