word2vec train() called with an empty iterator when using Phraser


Jan Pesl

unread,
Aug 15, 2019, 6:03:38 AM8/15/19
to Gensim
Hello,

I have managed to build the vocabulary, but not train the actual model.

This is the code:
phrases = Phrases(lemmas, min_count=30, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[lemmas]

w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=6)
t = time()
w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))
t = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

Here is the log output:
INFO - 18:30:03: training model with 6 workers on 162443 vocabulary and 300 features, using sg=0 hs=0 sample=6e-05 negative=20 window=2
WARNING - 18:30:03: train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).
INFO - 18:30:03: worker thread finished; awaiting finish of 5 more threads
INFO - 18:30:03: worker thread finished; awaiting finish of 4 more threads
INFO - 18:30:03: worker thread finished; awaiting finish of 3 more threads
INFO - 18:30:03: worker thread finished; awaiting finish of 2 more threads
INFO - 18:30:03: worker thread finished; awaiting finish of 1 more threads
INFO - 18:30:03: worker thread finished; awaiting finish of 0 more threads
INFO - 18:30:03: training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s
WARNING - 18:30:03: under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
WARNING - 18:30:03: supplied example count (0) did not equal expected count (22259850)
INFO - 18:30:03: saving Word2Vec object under word2vec.model, separately None
INFO - 18:30:03: storing np array 'syn0' to word2vec.model.wv.syn0.npy
INFO - 18:30:08: not storing attribute syn0norm
INFO - 18:30:08: storing np array 'syn1neg' to word2vec.model.syn1neg.npy
INFO - 18:30:13: not storing attribute cum_table
/usit/abel/u1/janpesl/.local/lib/python3.5/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL

Could anyone please explain why the iterator that worked just fine for building the vocabulary is suddenly empty when I try to train the model?
Thanks in advance!

Gordon Mohr

unread,
Aug 15, 2019, 8:22:26 AM8/15/19
to Gensim
Gensim 2.0.0 is over 2 years old, and wouldn't be installed by any of the recommended ways of installing gensim. This may not be a factor in your problem, but I'd recommend against using such an old version, and if you're having any problems at all, the first thing I'd try is the latest release.

I believe what you're trying should work, but:

* it'd help to see the log output of the phrases steps, and the build_vocab step, to be sure everything beforehand proceeded as it should

* it'd help to see what was in `lemmas` to know how the later steps will/should proceed

* the use of `[]`-indexing (as in `bigram[lemmas]`) to convert a full re-iterable sequence into another re-iterable sequence has always struck me as a bit weird. It involves some convoluted switching-based-on-detected-parameter-type inside the implementing `__getitem__()`, and I wouldn't be surprised if it's fragile, with unexpected failure modes.

* but, if it's working for one iteration (as it apparently is, because after your `build_vocab()` the model has a non-zero vocabulary), I'd suggest: (a) if your corpus is small, use that single iteration to create the full corpus as a list in memory: `sentences = list(bigram[lemmas])`; (b) if your corpus is large, use that one iteration to write the phrase-ified corpus to a sentence-per-line file on disk, so that future runs on the same corpus can just stream it from that file, without repeating the phrase-analysis or phrase-promotion steps.
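[A minimal sketch of option (b), using toy data and plain Python so it runs without gensim. The `write_corpus` helper and `FileCorpus` class are hypothetical names for illustration; in practice gensim's `LineSentence` plays the role of `FileCorpus`, streaming a sentence-per-line file restartably.]

```python
import os
import tempfile

def write_corpus(sentences, path):
    # One-time pass: write each tokenized sentence as one space-joined line.
    with open(path, "w", encoding="utf-8") as fout:
        for tokens in sentences:
            fout.write(" ".join(tokens) + "\n")

class FileCorpus:
    """Restartable iterable: each __iter__() call reopens the file from the start."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as fin:
            for line in fin:
                yield line.split()

path = os.path.join(tempfile.gettempdir(), "corpus_demo.txt")
write_corpus([["new", "york", "city"], ["machine", "learning"]], path)

corpus = FileCorpus(path)
assert list(corpus) == [["new", "york", "city"], ["machine", "learning"]]
assert list(corpus) == list(corpus)  # multiple full passes yield the same data
```

Because `FileCorpus` restarts on every `__iter__()`, both `build_vocab()` and `train()` can each consume it in full.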

- Gordon

Radim Řehůřek

unread,
Aug 16, 2019, 6:06:48 AM8/16/19
to Gensim
Hi Jan,

My guess is your `lemmas` is a generator. You exhaust it during `build_vocab()`, and then it's empty by the time you call `train()`.
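[A toy illustration of the difference, in plain Python with no gensim required: a generator is exhausted after one pass, while an object whose `__iter__()` returns a fresh generator can be iterated repeatedly.]

```python
# A generator is consumed by its first full pass; a second pass yields nothing.
def make_sentences():
    for line in ["the quick fox", "jumps over the dog"]:
        yield line.split()

gen = make_sentences()
assert len(list(gen)) == 2  # first pass (what build_vocab() gets)
assert list(gen) == []      # second pass (what train() gets) is empty!

# A restartable iterable returns a fresh generator from each __iter__() call.
class Sentences:
    def __init__(self, lines):
        self.lines = lines
    def __iter__(self):
        for line in self.lines:
            yield line.split()

sentences = Sentences(["the quick fox", "jumps over the dog"])
assert len(list(sentences)) == 2  # first pass works
assert len(list(sentences)) == 2  # ...and so does the second
```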

If you plan to train and re-train word2vec over the same corpus multiple times, I'd also suggest serializing your `sentences` into a text file, as Gordon says. It'll make data inspection easier and (repeat) training faster.

HTH,
Radim

Radim Řehůřek

unread,
Aug 16, 2019, 6:08:41 AM8/16/19
to Gensim
(btw, you can gzip that serialized text file into .txt.gz if it's very large – Gensim can work with such compressed files natively, decompressing them transparently on-the-fly)
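[The plain-Python equivalent of what gensim does under the hood: `gzip.open` in text mode reads and writes the compressed file transparently, line by line. The file path here is a temporary file chosen for illustration.]

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "corpus_demo.txt.gz")

# Write a compressed corpus, one tokenized sentence per line.
with gzip.open(path, "wt", encoding="utf-8") as fout:
    fout.write("new york city\n")
    fout.write("machine learning\n")

# Read it back; gzip decompresses on the fly, so the file never
# needs to be fully unpacked on disk.
with gzip.open(path, "rt", encoding="utf-8") as fin:
    sentences = [line.split() for line in fin]

assert sentences == [["new", "york", "city"], ["machine", "learning"]]
```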

Jan Pesl

unread,
Aug 16, 2019, 6:53:07 AM8/16/19
to Gensim
Thank you both for your input!
This indeed makes a lot of sense!