Error during FastText Training

Jamie Brandon

Sep 13, 2019, 8:38:30 PM
to Gensim
Hi there,

I'm running Python 3.7.3 and gensim 3.8.0. I've run the same code successfully on Python 3.6 with gensim 3.2.0 with no errors.

Can you help me understand the attached error message? The error is identical for each thread. I am using a generator for my sentences, and I have used gensim's Phraser to phrase them beforehand. Please let me know if I am doing something incorrectly.

I'm training the model using the following code.

model = FastText(window=embedding_window,
                 min_count=min_count_embeddings,
                 workers=4,
                 sg=1)

model.build_vocab(sentences=sentences_for_vocab)

model.train(sentences=sentences_for_training,
            total_examples=model.corpus_count,
            epochs=NUM_EPOCHS)



Thanks in advance,
Jamie
Screen Shot 2019-09-13 at 8.15.05 PM.png

Radim Řehůřek

Sep 14, 2019, 9:33:28 AM
to Gensim
Hi Jamie,

that screenshot indicates you might be using a slow (unoptimized) version of Gensim. Even if it didn't throw an error, training would be dog slow.

What does

from gensim.models import word2vec
print("FAST_VERSION", word2vec.FAST_VERSION)

output for you?

How did you install Gensim? I'm thinking it may be related to this issue:

Cheers,
Radim

Gordon Mohr

Sep 14, 2019, 8:53:16 PM
to Gensim
While it's probably not related to your exception, it's really only appropriate to use `model.corpus_count` as the argument for `total_examples` if you're training on the exact same corpus as was fed to `build_vocab()`. If not, the cached `model.corpus_count` (from the `build_vocab()` step) may not match the size of the corpus you're passing to `train()`, and then both progress-reporting and learning-rate-decay may be calculated wrong. 

That wouldn't cause an exception, but could sabotage training in other ways. The argument `total_examples` should reflect the size of the corpus passed-in, here `sentences_for_training`.
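
To make the distinction concrete, here is a minimal sketch (the corpus contents are illustrative, not Jamie's data): when the training corpus differs from the one given to `build_vocab()`, count it yourself rather than reusing the cached `model.corpus_count`.

```python
# Hypothetical corpus standing in for the real sentences_for_training:
sentences_for_training = [
    ["hello", "world"],
    ["fast", "text", "training"],
]

# total_examples should reflect the size of *this* corpus, not the value
# cached by build_vocab() from a possibly different corpus:
total_examples = sum(1 for _ in sentences_for_training)

# model.train(sentences=sentences_for_training,
#             total_examples=total_examples,
#             epochs=NUM_EPOCHS)
```

(If you train on exactly the same corpus you passed to `build_vocab()`, `model.corpus_count` is fine; the explicit count only matters when the two differ.)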

As Radim notes, your exception stack implies you're running the slower, pure-Python, less-tested code path – and fixing that in your local install should be the top priority. (The current mixed word2vec-and-fasttext, switched-on-an-`if_ft`-flag pure-Python codepath through `word2vec.py` is fairly convoluted, so if there *is* a bug there, I'm not sure it'd even get fixed, as opposed to finally dropping support for the pure-Python path entirely to reduce maintenance/testing costs, as has been discussed for a while.)

- Gordon

Jamie Brandon

Sep 16, 2019, 8:22:07 AM
to Gensim
Thanks so much for the quick replies. 

Radim, I've attached another screenshot of the output you requested, but in short, I'm getting -1. I've seen the issue you linked there, and was thinking they might be related too. I downloaded the most recent version of Anaconda, created a virtual environment, and installed gensim with `pip install gensim`. Do let me know if there's something else I can do with regard to installation.

Gordon, yes, thanks for the good eye. Since I'm using larger datasets, I have a generator for sentences whose input gets consumed. I should've included one more line in my code snippet to show that they are tee'd from the same generator using `itertools.tee(sentences)`. I've updated the code snippet here for full disclosure. I don't think this will cause the issue you brought up, but please correct me if I'm wrong.


sentences_for_vocab, sentences_for_training = itertools.tee(sentences)

model = FastText(window=embedding_window,
                 min_count=min_count_embeddings,
                 workers=4,
                 sg=1)

model.build_vocab(sentences=sentences_for_vocab)

model.train(sentences=sentences_for_training,
            total_examples=model.corpus_count,
            epochs=NUM_EPOCHS)

model.save(embedding_file_path)


From what I've found so far, it looks like gensim version 3.8.0 is missing the binary wheels for faster training in C. I understand (and support) the decision not to devote time to the slower, pure-Python alternative. Do you have an estimate of when the binary wheels would be released for version 3.8.0? Would you suggest that I go back to the previous version of gensim for now? 

Really appreciate the help,
Jamie
Screen Shot 2019-09-16 at 8.11.04 AM.png

Gordon Mohr

Sep 16, 2019, 2:14:09 PM
to Gensim
I don't think `tee` will do what you need, because the `train()` step will need to re-iterate over the data `epochs` times. And if your `sentences` object is itself re-iterable (because it's either an in-memory object, or an iterable-interface object that can start a fresh iteration when requested), then you wouldn't need `tee` at all: you can just use `sentences` itself for every step.
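
A common pattern for this (a sketch, not Jamie's actual pipeline; the class name and file path are illustrative) is a small class whose `__iter__` restarts the underlying source on every call, so `build_vocab()` and each training epoch each get a complete pass:

```python
class RestartableSentences:
    """Re-iterable corpus: each __iter__ call restarts from the source file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Open a fresh file handle per iteration, so the corpus can be
        # consumed any number of times (build_vocab + every training epoch).
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# sentences = RestartableSentences("corpus.txt")
# model.build_vocab(sentences)
# model.train(sentences, total_examples=model.corpus_count, epochs=NUM_EPOCHS)
```

Unlike a generator (or a `tee` branch), this object is never exhausted: every `for sentence in sentences:` loop starts over from the beginning of the file.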

(If you run with logging enabled at the `INFO` level, the output from the `train()` step might show clearly that all intended epochs of training aren't happening with your current `tee`-based approach.)

(Separately: others will have to answer the recommended steps for getting optimized-training working under Windows with recent gensim releases.)

- Gordon

Jamie Brandon

Sep 16, 2019, 3:01:52 PM
to Gensim
Thank you! You are right! I'll change my structure to ensure I can iterate through the sentences more than once.

Looking forward to the other responses regarding running on Windows.

Thanks again,
Jamie