loading a seed file of doc2vec in the latest gensim version


Ikuo Keshi

Feb 26, 2017, 9:26:11 AM
to gensim
I updated gensim to v1.0, and I encountered the following error when I loaded a seed vector file. In the previous version, it worked. How would you explain what happened in the latest version?

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/ikuokeshi/.pyenv/versions/anaconda2-4.2.0/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/Users/ikuokeshi/.pyenv/versions/anaconda2-4.2.0/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Users/ikuokeshi/.pyenv/versions/anaconda2-4.2.0/lib/python2.7/site-packages/gensim/models/word2vec.py", line 822, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/Users/ikuokeshi/.pyenv/versions/anaconda2-4.2.0/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 711, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_inner.pyx", line 289, in gensim.models.doc2vec_inner.train_document_dbow (./gensim/models/doc2vec_inner.c:3747)
    cum_table_len = len(model.cum_table)
TypeError: object of type 'NoneType' has no len()

Gordon Mohr

Feb 26, 2017, 6:01:17 PM
to gensim
"Loading a seed vector file" isn't a usual way of using Doc2Vec, so it's not clear what exactly you mean.

What's an example of code that used-to-work, and now doesn't? 

- Gordon

Ikuo Keshi

Feb 26, 2017, 8:20:27 PM
to gensim
Gordon-san

Thanks for your message. I used the following steps and code.
Step 1. I made a vocabulary file for the corpus:

sources = {'tweet.txt':'OTHERS'}
sentences = LabeledLineSentence(sources)
model_dbow = Doc2Vec(dm_concat=0, dm_mean=0, dm=0, hs=0, dbow_words=1, size=266, min_count=5, window=5, sample=1e-05, negative=15, workers=8)
model_dbow.build_vocab(sentences.to_array())
model_dbow.save('./dbow_vocab.d2v')


Step 2. I made a seed vector for each word (not for each document).
Step 3. I set the seed vector for each vocabulary word as follows:

model_dbow = Doc2Vec.load('./dbow_vocab.d2v')
for i, word in enumerate(model_dbow.wv.index2word):
    model_dbow.wv.syn0[i] = wordVecs3[word]
model_dbow.save('./seed_dbow_vocab.d2v')
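(For anyone following along: the Step 3 loop is just row-by-row assignment into the word-vector matrix. Here's a standalone sketch with toy data, using a plain numpy array in place of the model's `syn0` and a hypothetical `wordVecs`-style dict:)

```python
import numpy as np

# Toy stand-ins (hypothetical): index2word in the model's vocabulary order,
# and a dict of externally computed seed vectors keyed by word.
index2word = ["the", "cat", "sat"]
wordVecs = {w: np.full(4, float(i)) for i, w in enumerate(index2word)}

# Stand-in for the model's syn0: one row per vocabulary word.
syn0 = np.zeros((len(index2word), 4))

# Same pattern as the loop above: copy each word's seed vector into
# the row that the vocabulary assigned to that word.
for i, word in enumerate(index2word):
    syn0[i] = wordVecs[word]
```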

Step 4. Training on the corpus:

sources = {'tweet.txt':'OTHERS'}
sentences = LabeledLineSentence(sources)
model_dbow = Doc2Vec.load("seed_dbow_vocab.d2v")
model_dbow.size, model_dbow.min_count, model_dbow.window, model_dbow.sample, model_dbow.negative, model_dbow.workers = 266, 5, 5, 1e-05, 15, 8
sentences.to_array()
for epoch in range(passes):
    model_dbow.alpha, model_dbow.min_alpha = alpha, alpha
    model_dbow.train(sentences.sentences_perm())
    alpha -= alpha_delta
model_dbow.save('./dbow_tweets.d2v')


I encountered the error in Step 4 using the latest version of gensim. Until now I had been using a relatively old version of gensim, from Oct. 2015.

Best Regards
Ikuo 

On Monday, February 27, 2017 at 8:01:17 AM UTC+9, Gordon Mohr wrote:

Gordon Mohr

Feb 27, 2017, 1:14:43 AM
to gensim
Thanks for the details. I'd expect this sequence of actions to work. 

This looks like a regression in recent versions of gensim. The `load()` should be rebuilding the necessary `cum_table` (used in negative sampling) at load time. But since `index2word` no longer exists on the Word2Vec/Doc2Vec model (having moved to KeyedVectors), that necessary step is skipped.

Until fixed, you can probably work around the issue by explicitly calling `make_cum_table()` yourself, just after `load()` and before using the model.
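For context, `make_cum_table()` just rebuilds a cumulative table over word counts raised to a smoothing power (0.75 is the usual default), which negative sampling then searches with a uniform random draw. A rough pure-Python sketch of the idea (an illustration with made-up counts, not gensim's actual code):

```python
# Sketch of a negative-sampling cumulative table (illustrative only).
# Counts are raised to a smoothing power so very frequent words are
# sampled less aggressively, then accumulated and scaled to an int domain.
counts = [5, 3, 1]            # hypothetical per-word corpus counts, in vocab order
power, domain = 0.75, 2**31 - 1

total = sum(c ** power for c in counts)
cum_table, running = [], 0.0
for c in counts:
    running += c ** power
    cum_table.append(int(round(running / total * domain)))

# A uniform random draw in [0, domain) mapped through this table picks a
# word index with probability proportional to count**power.
```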

Separately: 

You don't need to, and probably don't want to, manage `alpha` yourself with the `for epoch in range(passes):` loop. Just set the model's `iter` to the desired number of passes, and the `alpha` to the desired starting-alpha, and the `min_alpha` to the desired ending-alpha. Then one call to `train()` will do the right number of passes, and smoothly decrease `alpha` the whole way.
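To illustrate, here's a small sketch of the linear decay that a single `train()` call manages internally (illustrative numbers; gensim actually interpolates continuously over all training examples, not just at pass boundaries):

```python
# Linear learning-rate schedule from alpha down toward min_alpha over
# `passes` passes, as one train() call handles for you.
alpha, min_alpha, passes = 0.025, 0.0001, 5

# Effective alpha at the start of each pass.
schedule = [alpha - (alpha - min_alpha) * p / passes for p in range(passes)]
```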

- Gordon

Ikuo Keshi

Feb 27, 2017, 6:30:25 AM
to gensim
Thanks for providing me with the solution!
Calling `make_cum_table()` just after `load()`, and using `for epoch in range(iter):`, both seem to work.
Concerning my Step 3, I used the following procedure to set the seed vector for each word:

for i, word in enumerate(model_dbow.wv.index2word):
    model_dbow.wv.syn0[i] = wordVecs[word]

This seems strange in the latest version. Would you tell me how I can move to KeyedVectors?

Best Regards


Ikuo


On Monday, February 27, 2017 at 3:14:43 PM UTC+9, Gordon Mohr wrote:

Gordon Mohr

Feb 27, 2017, 12:20:06 PM
to gensim
To be clear, you only need to call `make_cum_table()` once, after `load()`. (My recommendation is that you don't do the `for epoch in range(iter):` loop at all.)

I don't know exactly what your Step 2 does. You could perhaps use a KeyedVectors instance, instead of whatever type your `wordVecs` is. 

Or maybe whatever process you are using to fill `wordVecs` can directly assign into the `model_dbow.wv` instance of KeyedVectors (never creating `wordVecs`). 

But I don't think there's necessarily anything wrong with your current process – build in `wordVecs`, then copy into `model_dbow.wv` one-by-one – as long as you have enough memory. 

(I'm not sure why KeyedVectors doesn't yet appear in the gensim API docs at <https://radimrehurek.com/gensim/apiref.html>. Until it does, you can review the doc-comments alongside the source at: <https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py>.)

- Gordon

Ikuo Keshi

Feb 28, 2017, 6:03:06 AM
to gensim
Thanks for your explanation. I got it.
Also, copying into `model_dbow.wv` one-by-one works.
It seemed strange to me because the order of the words differs from the 2015 version of gensim: in the newer version, words are arranged in descending order of corpus frequency. Is my understanding correct?
Best
Ikuo

On Tuesday, February 28, 2017 at 2:20:06 AM UTC+9, Gordon Mohr wrote:

Gordon Mohr

Feb 28, 2017, 12:54:00 PM
to gensim
Yes, the default in recent gensim Word2Vec/Doc2Vec is to sort the words so that the most frequent appear first in `syn0`. This seems to offer a slight training speedup (via CPU cache efficiency), and it simplifies some downstream operations (which may be clipped to the more frequent words, and thus to the more reliable word-vectors). But you can pass `sorted_vocab=False` into your model initialization to skip sorting during vocabulary setup.
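A toy sketch of that ordering, with hypothetical counts:

```python
# Hypothetical word counts; recent gensim orders the vocabulary (and thus
# the rows of syn0) by descending frequency unless sorted_vocab=False.
counts = {"cat": 3, "the": 10, "sat": 1}

index2word = sorted(counts, key=counts.get, reverse=True)
print(index2word)  # → ['the', 'cat', 'sat']
```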

- Gordon

Ikuo Keshi

Feb 28, 2017, 7:26:53 PM
to gensim
Thanks for the great information!
Ikuo

On Wednesday, March 1, 2017 at 2:54:00 AM UTC+9, Gordon Mohr wrote:

Lev Konstantinovskiy

Mar 3, 2017, 7:31:53 PM
to gensim
Fixed in the 1.0.1 release.