Need Doc2Vec Example


Varman

Oct 23, 2015, 2:32:15 PM
to gensim
Hi Guys,

I am new to Doc2Vec. Can anyone recommend a good post or example to help me get started? I found a few examples on Google, but those articles are outdated.


Thanks,
Varman

Gordon Mohr

Oct 26, 2015, 1:36:36 PM
to gensim
There's an IPython Notebook in gensim that steps through one of the sentiment experiments from the original "Paragraph Vectors" paper. It's a bit advanced in its Python usage, but serves as a working example of gensim's Doc2Vec class and options. 

In the gensim install directory, look in `docs/notebooks/doc2vec-IMDB.ipynb`, or you can view it on GitHub at:

Kevin L

Oct 27, 2015, 6:35:46 AM
to gensim
Well, here is a short script I used, if that helps...

from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
import numpy
import random

sources = {'training.txt': 'TRAINING'}
sentences = word2vec.LabeledLineSentence(sources)

model = Doc2Vec(min_count=5, window=8, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(sentences.to_array())

sentences_list = sentences.to_array()
Idx = list(range(len(sentences_list)))  # a list, so random.shuffle can reorder it in place

for epoch in range(20):
    # Train on a fresh random permutation of the sentences each epoch
    random.shuffle(Idx)
    perm_sentences = [sentences_list[i] for i in Idx]
    model.train(perm_sentences)
    print(epoch)

model.save('example.model')
model = Doc2Vec.load('example.model')

This way, you first specify the parameters your model should have, and then you feed it the training data over a number of epochs (here 20), using a new random permutation of your sentences each time. This assumes your training data has one sentence/paragraph/document per line.

Kevin L

Oct 27, 2015, 6:41:21 AM
to gensim
The "LabeledLineSentence" class I took from http://rare-technologies.com/doc2vec-tutorial/ I think...
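For readers who cannot find that class: it is not part of gensim itself. Below is a minimal sketch of what such a helper might look like, not the tutorial's exact code. It assumes a dict that maps a file path to a label prefix, as in the script above, and that each line of the file is one document. The `TaggedDocument` namedtuple here is a stand-in with the same `words` and `tags` fields as gensim's own `TaggedDocument` (the later name for `LabeledSentence`), so the sketch runs without gensim installed; with gensim available you would import the real one instead.

```python
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (same field names).
# Assumption: swap this for the gensim import when training a real model.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


class LabeledLineSentence:
    """Hypothetical helper: one tagged document per line of each source file."""

    def __init__(self, sources):
        # sources maps a file path to a label prefix,
        # e.g. {"training.txt": "TRAINING"}
        self.sources = sources

    def __iter__(self):
        for path, prefix in self.sources.items():
            with open(path, encoding="utf-8") as handle:
                for line_no, line in enumerate(handle):
                    # Whitespace tokenization; each line gets a unique tag
                    # such as "TRAINING_0", "TRAINING_1", ...
                    yield TaggedDocument(line.split(), ["%s_%d" % (prefix, line_no)])

    def to_array(self):
        # Materialize all documents so they can be shuffled between epochs
        return list(self)
```

With a class like this defined in your own script, the call would be `LabeledLineSentence(sources)` rather than `word2vec.LabeledLineSentence(sources)`.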

Varman

Oct 27, 2015, 5:33:29 PM
to gensim
Thanks Kevin. That helped me get started. But now I have a question.
I trained the model with 100 documents, but I want to find the similarity of a document that is separate from the documents that were used to train the model. How do I do that?
This is what I am doing. Am I doing it the correct way? Any advice would be helpful.

import gensim

train_model = gensim.models.Doc2Vec(size=300, window=10, min_count=1, workers=11, alpha=0.025, min_alpha=0.025)  # use fixed learning rate
train_model.build_vocab(train_sentences)
for epoch in range(10):
    train_model.train(train_sentences)
    train_model.alpha -= 0.002  # decrease the learning rate
    train_model.min_alpha = train_model.alpha  # fix the learning rate, no decay
    train_model.train(train_sentences)

test_model = gensim.models.Doc2Vec(test_sentences, size=300, window=10, min_count=1, workers=11, alpha=0.025, min_alpha=0.025)
print(train_model.docvecs.most_similar([test_model.docvecs[0]]))
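
To make the manual learning-rate schedule in the loop above concrete, here is a small sketch that reproduces only the arithmetic, without gensim. The function name `alpha_schedule` is made up for illustration; the start value 0.025, the step 0.002, and the 10 iterations are taken from the snippet.

```python
def alpha_schedule(start=0.025, step=0.002, passes=10):
    """Return the learning rate in effect at the start of each loop iteration."""
    alphas = []
    alpha = start
    for _ in range(passes):
        alphas.append(alpha)
        alpha -= step  # same decrement as `train_model.alpha -= 0.002`
    return alphas


schedule = alpha_schedule()
# Rounded for display, since floating-point subtraction leaves tiny errors
print([round(a, 3) for a in schedule])
```

The rate starts at 0.025 and has dropped to 0.007 by the start of the last iteration, so it stays positive throughout the 10 iterations.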


Thanks,
Varman


Kevin L

Oct 28, 2015, 7:13:07 AM
to gensim
Sorry, I'm also only a beginner in NLP and neural networks...

I've never used the alpha value.
When I want to calculate a similarity (with a word2vec or doc2vec model), I use the n_similarity(ws1, ws2) function like this:
>>> trained_model.n_similarity(['sushi', 'shop'], ['japanese', 'restaurant'])
0.61540466561049689
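
As a rough illustration of what that call computes: as far as I understand, `n_similarity` takes the mean of the word vectors in each list and returns the cosine similarity between the two means. The sketch below does the same thing in plain Python. The function names and the 2-d toy vectors are made up for illustration and do not come from any trained model.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def n_similarity_sketch(word_vectors, ws1, ws2):
    """Cosine similarity between the mean vectors of two word lists."""
    mean1 = mean_vector([word_vectors[w] for w in ws1])
    mean2 = mean_vector([word_vectors[w] for w in ws2])
    return cosine(mean1, mean2)


# Toy vectors, invented for this example only
toy_vectors = {
    "sushi": [1.0, 0.0],
    "shop": [0.0, 1.0],
    "japanese": [1.0, 1.0],
    "restaurant": [1.0, 1.0],
}
score = n_similarity_sketch(toy_vectors, ["sushi", "shop"], ["japanese", "restaurant"])
print(score)
```

Note that this only works for words that have vectors in the model, so it compares word sets rather than stored document vectors.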
But I don't know how to get the newly trained document/paragraph vectors from a doc2vec model, or whether it's even possible to get them or reasonable to use them...
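
On the open question of documents that were not in the training set: gensim's Doc2Vec has an `infer_vector` method that computes a vector for a new list of words using the already trained model, so a second model is not needed. The sketch below is one way this might be wired up. The helper name `similar_training_docs` is made up for illustration. It assumes the stored document vectors live in `model.docvecs` (the attribute used earlier in this thread) or in `model.dv`, which I believe is the name in gensim 4.x.

```python
def similar_training_docs(model, new_doc_words, topn=5):
    """Return the training documents most similar to an unseen document.

    model         -- a trained gensim Doc2Vec model
    new_doc_words -- list of tokens, prepared the same way as the training data
    """
    # Infer a vector for the unseen document from the trained model
    vector = model.infer_vector(new_doc_words)

    # Newer gensim exposes document vectors as `dv`; older versions use `docvecs`
    doc_vectors = getattr(model, "dv", None)
    if doc_vectors is None:
        doc_vectors = model.docvecs

    # Rank the stored training-document vectors against the inferred vector
    return doc_vectors.most_similar([vector], topn=topn)
```

A call might look like `similar_training_docs(train_model, "some new text".split())`. Inference is randomized, so repeated calls on the same text can return slightly different vectors.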

Aries Fitriawan

Sep 22, 2016, 6:05:27 AM
to gensim
Thank you for your script. It was clear and easy to understand. Unfortunately, I get an error:

AttributeError: module 'gensim.models.word2vec' has no attribute 'LabeledLineSentence'

Any update for this code?

Lev Konstantinovskiy

Sep 22, 2016, 9:39:43 AM
to gensim
Hi Aries, 

Thanks for reporting it - there is an easier intro to doc2vec on our tutorials page

See Doc2vec Quick Start on Lee Corpus - it has a smaller dataset than IMDB so it will give you results even on a laptop.

Regards
Lev

Veronica Cheng

Dec 12, 2016, 10:12:46 AM
to gensim
Hi Aries, he mentioned that: The "LabeledLineSentence" class I took from http://rare-technologies.com/doc2vec-tutorial/ I think...