How to get the Document Vector from Doc2Vec in gensim 0.11.1?

Amar Budhiraja

Jun 11, 2016, 8:47:38 AM

to gensim

Is there a way to get the document vectors of unseen and seen documents from Doc2Vec in the gensim 0.11.1 version?

For example, suppose I trained the model on 1000 documents:

Can I get the doc vector for those 1000 docs?
Is there a way to get document vectors of unseen documents composed from the same vocabulary?

Thanks!

Gordon Mohr

Jun 11, 2016, 1:17:17 PM

to gensim

You probably want to be on a more current version; 0.11.1 is over a year old, and Doc2Vec got significant improvements and interface changes starting in 0.12.0. (The current version, 0.12.4, has been available since January.)

In 0.12+, you access doc-vectors for those documents (actually, 'doctag' keys associated with documents) via code like:

dv = model.docvecs['my_doc_0001']

Also, in 0.12+, you can use the `infer_vector()` method to infer (train-with-model-frozen) vectors for new documents:

newvec = model.infer_vector(['the', 'cat', 'jumped'])

(I can't recall if the 1st version of `infer_vector()` was available in 0.11.1 – if it was, it wasn't yet optimized and implemented for all modes.)

Note that you should pass tokens (not a full string) into `infer_vector()`, and you may want to play with different values of its default parameters ('steps', 'alpha') to see what works best with your model/texts.
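As a minimal sketch of the "tokens, not a full string" advice: the helper names `simple_tokens` and `infer_from_text` below are hypothetical, and the regex tokenizer is only a stand-in. In practice, new text should be tokenized the same way as the training corpus. The `model` argument is assumed to be an already trained Doc2Vec model; the sketch relies only on its `infer_vector()` method and passes any extra keyword arguments (such as `alpha`) straight through. Note that gensim 4.x spells the `steps` parameter as `epochs` and exposes trained doc-vectors as `model.dv` rather than `model.docvecs`.

```python
import re


def simple_tokens(text):
    """Lowercase a raw string and split it into word tokens.

    A deliberately crude tokenizer, used here only for illustration.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def infer_from_text(model, text, **infer_kwargs):
    """Tokenize `text`, then ask a trained Doc2Vec-like `model` for a vector.

    `model` is only assumed to expose infer_vector(tokens, ...). Extra
    keyword arguments (for example alpha) are passed through unchanged.
    """
    tokens = simple_tokens(text)
    return model.infer_vector(tokens, **infer_kwargs)
```

With a trained model this would be called as `infer_from_text(model, "The cat jumped", alpha=0.025)`. Trying several values of the inference parameters and comparing the results on known documents is one way to follow the tuning advice above.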

- Gordon

AVHIRUP CHAKRABORTY

Feb 16, 2017, 12:28:30 AM

to gensim

I get the following error when running this code:

Code:

import gensim

model = gensim.models.doc2vec.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin', binary=True)

vector=model.infer_vector(['the', 'cat', 'jumped'])

Error:

C:\Users\zc440z0ac\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "doc2vectest.py", line 10, in <module>
    vector=model.infer_vector(['the', 'cat', 'jumped'])
  File "C:\Users\zc440z0ac\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\doc2vec.py", line 741, in infer_vector
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_inner.pyx", line 414, in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:5255)
    word_locks = model.syn0_lockf
AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf'

Gordon Mohr

Feb 16, 2017, 1:56:31 AM

to gensim

Loading a word2vec vector set, as with the file you reference, does not create a valid, trained, capable-of-inference Doc2Vec model.

(Doc2Vec only implements `load_word2vec_format()` because it inherits from Word2Vec to share some common functionality – but using that method results in only a read-only set of word vectors in memory, same as in Word2Vec.)

- Gordon

Lev Konstantinovskiy

Feb 17, 2017, 9:26:59 AM

to gensim

Hi,
Actually, to avoid these questions in the future, in the upcoming 1.0.0 release the load_word2vec_format function has been removed from Word2Vec and Doc2Vec and moved to the KeyedVectors class. That class doesn't have an infer_vector method.
See https://github.com/RaRe-Technologies/gensim/pull/1147

Regards
Lev

eman kaziom

Nov 28, 2017, 8:39:09 AM

to gensim

I have the same problem with infer_vector, but I do not use load_word2vec_format():

model = Doc2Vec()

inferred_vector = model.infer_vector([test_corpus[doc_id]], steps=20, alpha=0.025)
print (model.most_similar([inferred_vector], topn=len(model.docvecs)))

  File "C:/Users/iman/PycharmProjects/untitled/b.py", line 53, in <module>
    inferred_vector = model.infer_vector([test_corpus[doc_id]], steps=4, alpha=0.025)
  File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 758, in infer_vector
    doctag_vectors[0] = self.seeded_vector(' '.join(doc_words))
TypeError: sequence item 0: expected string, list found

Ivan Menshikh

Nov 28, 2017, 11:17:04 PM

to gensim

Hi Eman,

the first argument of infer_vector should be a list of tokens (you probably don't need the additional [] here, i.e. this should be test_corpus[doc_id] instead of [test_corpus[doc_id]]).
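To make the nesting problem concrete, here is a small sketch of a check that could run before calling infer_vector. The helper name `as_token_list` is hypothetical and not part of gensim. It is written for Python 3, whereas the traceback above comes from Python 2.7, where text may also be of type `unicode`.

```python
def as_token_list(doc):
    """Return `doc` as a flat list of string tokens, or raise a descriptive error."""
    if isinstance(doc, str):
        # A bare string would otherwise be iterated character by character.
        raise TypeError("expected a list of tokens, got a single string; split it first")
    tokens = list(doc)
    for tok in tokens:
        if not isinstance(tok, str):
            # This is the situation behind "expected string, list found".
            raise TypeError(
                "expected string tokens, found %s; "
                "the document may be wrapped in an extra list" % type(tok).__name__
            )
    return tokens
```

Here `as_token_list(['the', 'cat', 'jumped'])` returns the list unchanged, while `as_token_list([['the', 'cat', 'jumped']])` raises a TypeError that points at the extra brackets.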

eman kaziom

Nov 29, 2017, 7:59:30 AM

to gensim

Thank you very much, Ivan Menshikh, for responding. I tried your suggestion several times but still have the same problem:

inferred_vector = model.infer_vector(test_corpus[doc_id], steps=20, alpha=0.025)

inferred_vector = model.infer_vector(test_corpus[doc_id], steps=20, alpha=0.025)

  File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 780, in infer_vector
    learn_words=False, learn_hidden=False, doctag_vectors=doctag_vectors, doctag_locks=doctag_locks
  File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 146, in train_document_dm
    word_locks = model.syn0_lockf
AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf'

Ivan Menshikh

Nov 30, 2017, 6:00:25 AM

to gensim

Which gensim version are you using? Can you show your full code?

Gordon Mohr

Nov 30, 2017, 10:20:24 AM

to gensim

Also, the code excerpt earlier showed only instantiating a fresh model (`model = Doc2Vec()`), with no further vocabulary-initialization or training.

It's only after vocabulary-initialization that the model is fully allocated – until that step you'll get errors like "has no attribute" if attempting other operations.
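For illustration, the order of operations described here might look like the sketch below: instantiate, build the vocabulary, train, and only then infer. It assumes gensim 4.x naming (`vector_size`, `epochs`, `corpus_count`); some parameter names differed in the older releases discussed in this thread. The helper names `tag_corpus` and `train_and_infer`, the `doc_N` tag scheme, and the tiny parameter values are hypothetical choices for a toy corpus, not recommendations.

```python
def tag_corpus(token_lists):
    """Pair each token list with a one-element tag list such as ['doc_0']."""
    return [(tokens, ["doc_%d" % i]) for i, tokens in enumerate(token_lists)]


def train_and_infer(token_lists, new_tokens, vector_size=16, epochs=20):
    """Train a small Doc2Vec model, then infer a vector for unseen tokens."""
    # Imported here so tag_corpus() stays usable without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=w, tags=t) for w, t in tag_corpus(token_lists)]

    # min_count=1 keeps every word, which a toy corpus needs.
    model = Doc2Vec(vector_size=vector_size, min_count=1, epochs=epochs)
    model.build_vocab(corpus)  # vocabulary initialization: the model is allocated here
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # Inference is attempted only after the two steps above.
    return model, model.infer_vector(new_tokens)
```

Calling `infer_vector()` on a freshly constructed `Doc2Vec()` skips both `build_vocab()` and `train()`, which matches the situation in the excerpt above.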