How to get the Document Vector from Doc2Vec in gensim 0.11.1?

Amar Budhiraja

Jun 11, 2016, 8:47:38 AM
to gensim

Is there a way to get the document vectors of unseen and seen documents from Doc2Vec in the gensim 0.11.1 version?

  • For example, suppose I trained the model on 1000 documents:

  1. Can I get the doc vector for those 1000 docs?
  2. Is there a way to get document vectors of unseen documents composed from the same vocabulary?
Thanks!

Gordon Mohr

Jun 11, 2016, 1:17:17 PM
to gensim
You probably want to be on a more current version; 0.11.1 is over a year old, and Doc2Vec got significant improvements and interface changes starting in 0.12.0. (The current version, 0.12.4, has been available since January.)

In 0.12+, you access doc-vectors for those documents (actually, 'doctag' keys associated with documents) via code like:

    dv = model.docvecs['my_doc_0001']

Also, in 0.12+, you can use the `infer_vector()` method to infer (train-with-model-frozen) vectors for new documents:

    newvec = model.infer_vector(['the', 'cat', 'jumped'])

(I can't recall if the 1st version of `infer_vector()` was available in 0.11.1 – if it was, it wasn't yet optimized and implemented for all modes.)

Note that you should pass tokens (not a full string) into `infer_vector()`, and you may want to play with different values of its default parameters ('steps', 'alpha') to see what works best with your model/texts.
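
As a minimal sketch of the "tokens, not a full string" point: the helper name `simple_tokenize` is hypothetical, and lowercasing plus whitespace splitting is only an assumption. In practice you would use whatever tokenization was applied to the training texts.

```python
def simple_tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


raw = "The cat jumped"

# A list of word strings: the shape infer_vector() expects.
tokens = simple_tokenize(raw)

# Iterating over the raw string yields single characters, not words,
# so a full string is not a usable substitute for a token list.
chars = list(raw)

print(tokens)
print(chars[:3])
```

The resulting `tokens` list is what would be passed as the first argument, as in `model.infer_vector(tokens)`.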

- Gordon

AVHIRUP CHAKRABORTY

Feb 16, 2017, 12:28:30 AM
to gensim
I get the following error when running this code.
Code:
import gensim
model = gensim.models.doc2vec.Doc2Vec.load_word2vec_format('GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin', binary=True) 

vector=model.infer_vector(['the', 'cat', 'jumped'])
Error:
C:\Users\zc440z0ac\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\utils.py:855: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "doc2vectest.py", line 10, in <module>
    vector=model.infer_vector(['the', 'cat', 'jumped'])
  File "C:\Users\zc440z0ac\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\doc2vec.py", line 741, in infer_vector
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_inner.pyx", line 414, in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:5255)
    word_locks = model.syn0_lockf
AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf'

Gordon Mohr

Feb 16, 2017, 1:56:31 AM
to gensim
Loading a word2vec vector set, as with the file you reference, does not create a valid, trained, capable-of-inference Doc2Vec model. 

(Doc2Vec only implements `load_word2vec_format()` because it inherits from Word2Vec to share some common functionality. Using that method results in only a read-only set of word vectors in memory, same as in Word2Vec.)

- Gordon

Lev Konstantinovskiy

Feb 17, 2017, 9:26:59 AM
to gensim
Hi,
Actually, to avoid these questions in the future, the upcoming 1.0.0 release removes the `load_word2vec_format` function from Word2Vec and Doc2Vec and moves it to the `KeyedVectors` class. That class doesn't have an `infer_vector` method.
See https://github.com/RaRe-Technologies/gensim/pull/1147

Regards
Lev

eman kaziom

Nov 28, 2017, 8:39:09 AM
to gensim
I have the same problem with `infer_vector()`, but I do not use `load_word2vec_format()`:

model = Doc2Vec()
inferred_vector = model.infer_vector([test_corpus[doc_id]], steps=20, alpha=0.025)
print (model.most_similar([inferred_vector], topn=len(model.docvecs)))

 File "C:/Users/iman/PycharmProjects/untitled/b.py", line 53, in <module>
    inferred_vector = model.infer_vector([test_corpus[doc_id]], steps=4, alpha=0.025)
  File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 758, in infer_vector
    doctag_vectors[0] = self.seeded_vector(' '.join(doc_words))
TypeError: sequence item 0: expected string, list found

Ivan Menshikh

Nov 28, 2017, 11:17:04 PM
to gensim
Hi Eman,

The first argument of `infer_vector` should be a list of tokens. You probably don't need the additional [] here, i.e. this should be test_corpus[doc_id] instead of [test_corpus[doc_id]].
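
A small sketch of why the extra brackets fail, assuming `test_corpus[doc_id]` is already a list of token strings (the `doc` list below is made-up sample data). The traceback shows `infer_vector` calling `' '.join(doc_words)`, and joining fails as soon as an item is itself a list:

```python
doc = ["the", "cat", "jumped"]  # one tokenized document (sample data)

# A flat list of strings joins without trouble.
joined = " ".join(doc)
print(joined)

# Wrapping the document in another list makes item 0 a list, not a string,
# so the same join raises TypeError.
try:
    " ".join([doc])
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```

The exact wording of the message differs between Python 2 and Python 3, but in both cases it points at sequence item 0.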

eman kaziom

Nov 29, 2017, 7:59:30 AM
to gensim
Thank you very much for responding, Ivan. I tried your suggestion, and tried it several more times, but I still get an error:

inferred_vector = model.infer_vector(test_corpus[doc_id], steps=20, alpha=0.025)
 inferred_vector = model.infer_vector(test_corpus[doc_id], steps=20, alpha=0.025)
  File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 780, in infer_vector
    learn_words=False, learn_hidden=False, doctag_vectors=doctag_vectors, doctag_locks=doctag_locks
  File "C:\Python27\lib\site-packages\gensim\models\doc2vec.py", line 146, in train_document_dm
    word_locks = model.syn0_lockf
AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf'

Ivan Menshikh

Nov 30, 2017, 6:00:25 AM
to gensim
Which gensim version are you using? Can you show your full code?

Gordon Mohr

Nov 30, 2017, 10:20:24 AM
to gensim
Also, the code excerpt earlier showed only instantiating a fresh model (`model = Doc2Vec()`), with no further vocabulary-initialization or training.

It's only after vocabulary-initialization that the model is fully allocated – until that step you'll get errors like "has no attribute" if attempting other operations. 

And it's only after sufficient training that any results will be non-random.
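
The sequence described above (instantiate, initialize the vocabulary, train, then infer) might look roughly like the following sketch. The function name `train_and_infer`, the tag format, and the parameter values are illustrative assumptions. Parameter and attribute names follow gensim 4.x, whereas the releases discussed in this thread use `size` rather than `vector_size` and expose trained doc-vectors as `model.docvecs`.

```python
def train_and_infer(tokenized_docs, new_tokens, vector_size=20, epochs=40):
    """Train a small Doc2Vec model; return (vector of first doc, inferred vector)."""
    # Imported inside the function so the sketch can be read without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each training document is a token list plus at least one tag (its 'doctag' key).
    corpus = [
        TaggedDocument(words=tokens, tags=["doc_%04d" % i])
        for i, tokens in enumerate(tokenized_docs)
    ]

    # Instantiating alone is not enough: the vocabulary step allocates the model.
    model = Doc2Vec(vector_size=vector_size, min_count=1, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

    # gensim 4.x names the trained doc-vectors `dv`; older releases use `docvecs`.
    doc_vectors = model.dv if hasattr(model, "dv") else model.docvecs
    return doc_vectors["doc_0000"], model.infer_vector(new_tokens)
```

Calling it with a handful of tokenized documents, for example `train_and_infer([["the", "cat", "jumped"], ["a", "dog", "barked"]], ["the", "cat"])`, should return two vectors of length `vector_size`. With so little data the values are not meaningful, which matches the point that results are only non-random after sufficient training.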

- Gordon