word2vec.train errors: 'Word2Vec' object has no attribute 1) 'corpus_count' 2) 'syn0_lockf'


Brad Bolender

Jul 18, 2015, 8:58:06 PM
to gen...@googlegroups.com

Hello, I loaded a word2vec model and attempted to continue training, but I'm getting errors. Any guidance would be much appreciated. Thank you!



from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

mod1 = Word2Vec.load('path_to_model')
bigrams = Phrases.load('path_to_bigrams')
trigrams = Phrases.load('path_to_trigrams')
mod1.train(trigrams[bigrams[sentences]], total_words=676857300)




2015-07-18 19:41:11,887 : INFO : training model with 4 workers on 222306 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 and negative=10


Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 705, in train
pushed_words += round((chunksize/self.corpus_count)/total_words)
AttributeError: 'Word2Vec' object has no attribute 'corpus_count'

>>> Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 675, in worker_loop
if not worker_one_job(job, init):
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 666, in worker_one_job
job_words = self._do_train_job(items, alpha, inits)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 623, in _do_train_job
tally += train_sentence_sg(self, sentence, alpha, work)
File "gensim/models/word2vec_inner.pyx", line 259, in gensim.models.word2vec_inner.train_sentence_sg (./gensim/models/word2vec_inner.c:3156)
cdef REAL_t *word_locks = <REAL_t *>(np.PyArray_DATA(model.syn0_lockf))
AttributeError: 'Word2Vec' object has no attribute 'syn0_lockf'

Gordon Mohr

Jul 18, 2015, 10:31:34 PM
to gen...@googlegroups.com
That's not good! Looks like some of my recent refactoring around the training-loop/alpha-decay doesn't properly handle reloaded models from earlier versions. (Looking closely, it also probably doesn't do the right thing when the text provided to `train()` isn't the same text as was originally supplied to `build_vocab()`.) 

I'll look into the right fixes so this will work as expected, but in the meantime you should be able to patch your model to get around at least these two errors if you manually:

(1) Create the expected `syn0_lockf` array:

    mod1.syn0_lockf = numpy.ones(len(mod1.syn0), dtype=numpy.float32)

(2) Before calling `train()`, set `corpus_count` to the count of sentences about to be presented:

    mod1.corpus_count = len(sentences)
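
Put together, resuming your session might look like this (a sketch, assuming `sentences` is an in-memory list so `len()` works):

    import numpy

    mod1.syn0_lockf = numpy.ones(len(mod1.syn0), dtype=numpy.float32)  # 1.0 = every vector fully trainable
    mod1.corpus_count = len(sentences)                                 # sentences about to be presented
    mod1.train(trigrams[bigrams[sentences]], total_words=676857300)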

If you hit any other errors that block training after these two steps, let me know.

- Gordon

Gordon Mohr

Jul 19, 2015, 6:16:00 PM
to gen...@googlegroups.com
Turns out that a few more patch-ups, beyond the two I mentioned, are needed to make a Word2Vec model saved from an earlier version work for further training when reloaded into 0.12.0/current code. 

You can see the extra steps in some updates to the `Word2Vec.load()` method that are now on github:


There are other changes in `train()` to more properly respect the passed-in `total_words`, for cases where the followup training data isn't the same size as the original corpus used for vocabulary-building. (Additionally, if you don't have a word count, but do have a count of the examples/sentences that will be provided, that count can be passed as `total_examples` instead of a `total_words` word-count, and will be used to estimate progress and alpha-decay.)
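
For example, with the fixed code, followup training on a differently-sized corpus might look like this (a sketch; `more_sentences` is a stand-in for your new data):

    mod1.train(more_sentences, total_examples=len(more_sentences))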

These changes will be in the next release, coming very soon. 

- Gordon

Brad Bolender

Jul 19, 2015, 11:02:41 PM
to gen...@googlegroups.com
Fantastic, Gordon. Thanks for the update!

Salman Mahmood

Oct 16, 2015, 12:37:07 AM
to gensim
I am having the same problem, even though I installed the latest version of gensim. Were you able to fix it?

Gordon Mohr

Oct 16, 2015, 2:07:26 AM
to gensim
What specific error are you getting?  

What exactly triggered it: what kind of load, of a model from which prior version, followed by what operation?

Version 0.12.2, as released 2015-09-19, has code which *should* prevent the exact "has no attribute" error, at least for models that are loaded via `load()`. (It might still have other issues after loading older models, but at least shouldn't trigger *that* same error.)

- Gordon

Salman Mahmood

Oct 16, 2015, 3:39:22 PM
to gensim
I am using the `load_word2vec_format()` function; my code is as follows:


from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
model.train(sents)      # sents is a list of sentences

I am getting the following error:

 /usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.pyc in train(self, sentences, total_words, word_count, chunksize, total_examples, queue_factor, report_delay)
    683 
    684         if total_words is None and total_examples is None:
--> 685             if self.corpus_count:
    686                 total_examples = self.corpus_count
    687                 logger.info("expecting %i examples, matching count from corpus used for vocabulary survey", total_examples)

Gordon Mohr

Oct 16, 2015, 4:09:34 PM
to gensim
The `load_word2vec_format()` function works with the vectors-only format of the original word2vec.c implementation. That's not enough to continue training; a model so loaded is only good for comparisons of the existing vectors. 

If you're trying to mix those GoogleNews vectors into training on your own data, you may want to look at the instance-method `intersect_word2vec_format()`. After creating your own model, *and* having that model scan your corpus to establish and randomly-initialize its known-word vocabulary, then you can use that method to overwrite those words in your local vocabulary with those from the external word2vec.c-format file, *and* lock their values against training. (Words not already in the local vocabulary are ignored.) Then you could train on your corpus, and only the unique words in your corpus will get trained, essentially forced into the space already defined by the imported words. 
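
In code, that sequence might look like this (a rough sketch; `my_sentences` is a stand-in for your own tokenized corpus, and `size` must match the file you're importing):

    from gensim.models import Word2Vec

    model = Word2Vec(size=300, sg=1)
    model.build_vocab(my_sentences)      # scan corpus: discover vocabulary, allocate & randomly initialize vectors
    model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)  # overwrite & lock shared words
    model.train(my_sentences)            # only words absent from the imported file get trained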

However, while many people seem to want to try variants of this, I've seen no writeups describing any tangible benefits achieved. 

I'm guessing the motivation is to be able to borrow word vectors from the GoogleNews set, when your local corpus doesn't contain them? If so, another approach that might be better than this mixed-training might be the one described in section 2.2 ("Vocabulary expansion") of the "Skip-Thought Vectors" paper. There, they train word vectors on their corpus, but then learn a (language-translation-like) linear mapping of the needed words *from* the larger external set, *into* their own space. (That is, rather than constraining their new words to be trained-up alongside the imported set of frozen-coordinates common words.) It's a wishlist item for gensim to have some utility code for learning such a translation-mapping: https://github.com/piskvorky/gensim/wiki/Word2Vec-&-Doc2Vec-Wishlist#implement-translation-matrix-of-exploiting-similarities-among-languages-for-machine-translation
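
Absent such utility code, a bare-bones version of that mapping could be sketched with ordinary least squares (this is not a gensim API; `my_model`, `big_model`, and the example word are placeholders):

    import numpy as np

    # fit a linear map W from the external space into your corpus's space,
    # using the words the two models share
    shared = [w for w in my_model.vocab if w in big_model.vocab]
    X = np.vstack([big_model[w] for w in shared])   # source vectors
    Y = np.vstack([my_model[w] for w in shared])    # target vectors
    W = np.linalg.lstsq(X, Y)[0]                    # least-squares solution: Y ≈ X.dot(W)

    # project a word your corpus lacks into your corpus's space
    borrowed_vector = big_model['hobbit'].dot(W)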

- Gordon

Salman Mahmood

Oct 16, 2015, 6:10:34 PM
to gensim
Thanks, Gordon, for the detailed answer; it was very helpful.

So this is what I want to do: I want to see how a text (e.g. Lord of the Rings by Tolkien) will affect the embeddings, and whether that can give us some insight into the story/writing style. So basically I want to compare the embeddings of the words inside the text pre- and post-training.

Since the text is going to be relatively small, I am not sure what would be the best way to go about this; I am hoping you can help me.

Gordon Mohr

Oct 17, 2015, 7:21:53 PM
to gensim
An online source pegs the word-count of LotR at 481K. While that's small compared to the usual corpuses used to understand a language, since the point of your exercise is to tease out what may be unique/imbalanced about specific works, it might be fine.

I would try simply training models on different texts. Then for the different models, note which pairs/triplets/etc of words have very different distances, or which models succeed/fail on tests like the analogies-task often used to evaluate word2vec models.

(That is: I doubt you necessarily need to do any merged/translated models or training, and trying to do so might dilute the insights you'd like to be vivid.)

*If* you felt you needed to test how training-on-the-text changes pre-existing words, you could try a process like the following (a sketch in code follows these steps):

(1) create the merged vocabulary with `intersect_word2vec_format()`, with the GoogleNews vectors locked in place, and train enough for the words unique to the text to settle in their optimal positions
(2) remember that model state
(3) unlock the GoogleNews vectors (set all `model.syn0_lockf` values back to 1.0 – the `intersect_word2vec_format()` will have set some to 0.0)
(4) continue training until all words 'settle'
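
In code, steps (2)-(4) might look like this (a sketch; `text_sentences` is a stand-in for the text's sentences, with `model` already trained per step (1)):

    import numpy as np

    snapshot = model.syn0.copy()            # (2) remember the vector state
    model.syn0_lockf[:] = 1.0               # (3) unlock everything (0.0 = frozen, 1.0 = trainable)
    model.train(text_sentences, total_examples=len(text_sentences))  # (4) continue training

    movement = np.linalg.norm(model.syn0 - snapshot, axis=1)  # per-word drift since step (2)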

*Maybe*, how much the shared words move between step (2) and the end would reflect how different the meanings are in the two corpuses.

Keep in mind that the GoogleNews vectors are still based on a peculiar subset of language: news stories.

Also remember: the training process itself is both seeded with randomness, and affected by runtime randomness. So even two runs with the exact same text can lead to different results. (Only if you deterministically seed the initial pre-training values, *and* run the algorithm in a single thread to eliminate ordering-jitter from multithreading, will the results be identically reproducible.) The results of different runs on the same data, if given enough training to settle into a 'best' configuration, should be very similar in their internal interrelations, but I believe might be reflected/rotated/scaled very differently. And across independent training runs on different corpuses, the exact coordinates of words are somewhat arbitrary.

That is, `distance(corpus1['king'], corpus2['king'])` is not likely to be meaningful: it might be large or small depending on random factors.

But *perhaps* the fact that:

distance(corpus1['king'], corpus1['queen']) >>> distance(corpus2['king'], corpus2['queen'])

...*might* start to be meaningful. Even so, it may make more sense to only make relative comparisons to other pairs in the same corpus:

distance(corpus1['king'], corpus1['queen']) / distance(corpus1['prince'], corpus1['princess']) >>> distance(corpus2['king'], corpus2['queen']) / distance(corpus2['prince'], corpus2['princess'])

...That might plausibly mean, "The concepts of 'king' and 'queen' are relatively more distinctive (predictive of different words) in corpus 1 than the concepts of 'prince' and 'princess', compared to corpus2." And perhaps that would match (or suggest) some other insights about what stories or attitudes the source texts describe.
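
With gensim, those within-corpus ratios could be computed like this (a sketch; `corpus1_model`/`corpus2_model` are two separately-trained models, and distance is taken as 1.0 minus gensim's cosine similarity):

    def dist(model, w1, w2):
        return 1.0 - model.similarity(w1, w2)

    ratio1 = dist(corpus1_model, 'king', 'queen') / dist(corpus1_model, 'prince', 'princess')
    ratio2 = dist(corpus2_model, 'king', 'queen') / dist(corpus2_model, 'prince', 'princess')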

Good luck!

- Gordon

Salman Mahmood

Oct 20, 2015, 3:19:48 AM
to gensim
Thanks for the great insight.

Salman

Stefan Falk

Apr 24, 2016, 8:22:05 AM
to gensim
I don't really understand. I want to load an existing word vector model. Is there a way to do this and then "continue" training on document vectors?

d2vModel = Doc2Vec()
d2vModel.intersect_word2vec_format("models/" + model_dir + "/data/part-00000", binary=False)

print "Training .."
for epoch in range(10):
    d2vModel.train(labeled_line_sentence)

print "done."

but I am getting

  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 1098, in intersect_word2vec_format
    logger.info("merged %d vectors into %s matrix from %s" % (overlap_count, self.syn0.shape, fname))
AttributeError: 'Doc2Vec' object has no attribute 'syn0'

What steps am I missing to train doc2vec from my existing word2vec model?

Gordon Mohr

Apr 24, 2016, 2:32:36 PM
to gensim
Both `intersect_word2vec_format()` and `train()` require a model to already have its vocabulary discovered and internal state-arrays allocated. That's achieved by doing a `build_vocab()` over a training corpus. 
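
For your snippet, the required order might look like this (a sketch, assuming `labeled_line_sentence` is a re-iterable sequence of `TaggedDocument` objects, and `size` matches your word-vector file):

    d2vModel = Doc2Vec(size=300)
    d2vModel.build_vocab(labeled_line_sentence)   # discover vocabulary, allocate syn0 and related arrays
    d2vModel.intersect_word2vec_format("models/" + model_dir + "/data/part-00000", binary=False)
    d2vModel.train(labeled_line_sentence)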

- Gordon

Gordon Mohr

Apr 24, 2016, 3:11:51 PM
to gensim
Also note, though: "to train doc2vec from my existing word2vec model" is *not* the usual or recommended way to use gensim Doc2Vec. 

Re-using word-vectors from elsewhere is *not* a process described in the original 'Paragraph Vectors' paper. Trained word-vectors are *not* a required input to the Doc2Vec process, nor is there any "1st step" of Doc2Vec that supplying word-vectors can eliminate or speed up. 

Not all modes of Doc2Vec even use word2vec-style vectors internally. In those modes that do, I haven't seen any published results quantifying benefits from seeding the process with pre-trained vectors. 

So while you may be able to cobble together steps that inject prior vectors into a new training session, that should be considered a speculative, advanced technique. 

I suggest doing things in a more standard way first, to gain expertise in the usual tradeoffs and optimizations possible. Develop some task-specific way to evaluate the models/vectors you're creating. Only then will you know if experimenting with word-vector re-use is helping or hurting your end-goals. 

- Gordon

Tân Trần

Dec 8, 2016, 3:39:47 AM
to gensim

Hi everyone,

I have installed the latest version of gensim, 0.12.4, via conda.
But I get this error; could you help me fix this issue? "AttributeError: 'Word2Vec' object has no attribute 'syn1neg'"

import numpy as np
import pandas as pd
from gensim.models import word2vec

data_list = ['data/prep/prep_data_train_twitter140.csv']
word2vec_path="model/GoogleNews-vectors-negative300.bin"
data_path = data_list[0]

word2vec_model = word2vec.Word2Vec.load_word2vec_format(word2vec_path, binary=True)
coreprepAPI_instance = CorePreprocessingAPI()
word2vec_model.syn0_lockf = np.ones(len(word2vec_model.syn0), dtype=np.float32)          

df = pd.read_csv(data_path,sep=',')
print "Before dropna",len(df)
df = df[(df['content'] != '')]
print "After dropna",len(df)
print df.head()
            
df['tokens'] = df['content'].apply(
                lambda line : coreprepAPI_instance.text_to_tokens(line))
# drop null tokens list
# df = df[df['tokens'] != []] # bug here
print "After dropna tokens",len(df)
if word2vec_model:
   # updating word2vec model
   tokens = []
   print "Parsing tokens from updating set"
   for tk in df['tokens']:
       for e in tk:
           tokens.append(e)
   print "Number of tokens rows: ", len(tokens)

   word2vec_model.corpus_count = len(tokens)
   #word2vec_model.build_vocab(tokens)
   word2vec_model.train(tokens)

Output:

2016-12-08 15:36:50,036 : INFO : loading projection weights from model/GoogleNews-vectors-negative300.bin
2016-12-08 15:37:24,962 : INFO : loaded (3000000, 300) matrix from model/GoogleNews-vectors-negative300.bin
Before dropna 1048576
After dropna 1048576
   class                                            content
0      0   aww that's a bummer . you shoulda got david c...
1      0   is upset that he can't update his facebook by...
2      0   i dived many time for the ball . managed to s...
3      0    my whole_body feel itchy and like it on fire . 
4      0   no it's not behaving at all . i'm mad . why a...
After dropna tokens 1048576
Parsing tokens from updating set
Number of tokens rows:  1916276
2016-12-08 15:38:09,099 : INFO : training model with 3 workers on 3000000 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5
2016-12-08 15:38:09,099 : INFO : expecting 1916276 sentences, matching count from corpus used for vocabulary survey
Exception in thread Thread-164:
Traceback (most recent call last):
  File "/opt/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/opt/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/miniconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 735, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/opt/miniconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 671, in _do_train_job
    tally += train_batch_cbow(self, sentences, alpha, work, neu1)
  File "gensim/models/word2vec_inner.pyx", line 398, in gensim.models.word2vec_inner.train_batch_cbow (./gensim/models/word2vec_inner.c:4671)
    syn1neg = <REAL_t *>(np.PyArray_DATA(model.syn1neg))
AttributeError: 'Word2Vec' object has no attribute 'syn1neg'
(The identical traceback repeats for Thread-163 and Thread-162.)



Gordon Mohr

Dec 12, 2016, 2:14:42 PM
to gensim
Gensim Word2Vec doesn't support continued training after loading Google-format word-vectors with `load_word2vec_format()`. That format doesn't include all the model internal weights, training options, or vocabulary statistics that are prerequisites for meaningful training. 

(You could plausibly manually initialize the missing parts of the model to make it training-ready – you'd have to study the source code, and use trial-and-error, to get that working.)
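
A speculative starting point, based only on the missing attributes in your traceback (other pieces, such as the negative-sampling cumulative table, may also be required, so expect trial-and-error; `model` and `sentences` are stand-ins for your loaded model and training data):

    import numpy as np

    model.syn1neg = np.zeros(model.syn0.shape, dtype=np.float32)   # output-layer weights for negative sampling
    model.syn0_lockf = np.ones(len(model.syn0), dtype=np.float32)  # 1.0 = trainable
    model.corpus_count = len(sentences)                            # count of sentences about to be presented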

- Gordon

Radim Řehůřek

Dec 13, 2016, 7:27:10 AM
to gensim
Also, the latest version of gensim is NOT 0.12.4, but rather 0.13.3.

0.12.4 is almost one year old.

-rr