Cumulative loss gets stuck.

Valentino Maiorca

May 15, 2019, 8:13:33 AM
to Gensim
Hi, I'm not sure if it's a bug or I'm doing something wrong, so I'm posting here instead of on GitHub.
I'm experiencing a weird issue with gensim.
It's related to the loss reported by `model.get_latest_training_loss()` during training (checked after each epoch).
No matter what parameters I give to the `word2vec.Word2Vec` constructor (hyper-parameters, seed, etc.), at a certain point during training the loss goes to 0 (I know it's a cumulative loss; what I mean is that the value stops changing, so I interpret that as the loss going to 0, right?). That always happens after it reaches the value 134217728.
The corpus is a simple .txt file of about 300 MB in the LineSentence format.
I've tried on a different machine and the same thing happens at the exact same value.
I've also tried changing the corpus (I'm loading it with the `corpus_file` parameter, but I've tried with `sentences` too) by truncating the original one; that just delays the issue to a later epoch.

Any idea of what the problem could be? Thank you.

Gordon Mohr

May 15, 2019, 6:31:15 PM
to Gensim
What model parameters are you using, and how are you calling `train()`?

How are you monitoring the running-loss-tally? 

Are you running with logging at the INFO level and watching logged messages to confirm otherwise-expected training progress?

It is natural and expected for the *per-epoch* loss to decrease for a while, but then stop decreasing. It'd be strange for the full-epoch loss to ever be zero; that may indicate some other problem in your corpus/code.

- Gordon

Valentino Maiorca

May 15, 2019, 6:41:08 PM
to Gensim

model = word2vec.Word2Vec(corpus_file=input_file, sg=w2v_model == W2V_Model.SKIPGRAM,
                              iter=100, min_count=1, size=300, workers=4, compute_loss=True, sample=0.5e4,
                              negative=10, callbacks=[loss_callback])


These are the parameters I'm using right now, but as I said, the issue is always present even if I change the epochs, the sampling, the negative samples, the size, the workers, everything.

This is (part of) the callback I'm using to monitor the loss:
    def on_epoch_end(self, model):
        cumulative_loss = model.get_latest_training_loss()
        with open(self.loss_file_path, 'a') as f:
            f.write(f'{self.epoch}\t{cumulative_loss}\n')


I'm running with logging at the INFO level and I don't see anything weird; what should I be looking for?

I know the loss should decrease, but it seems pretty much impossible for it to actually reach 0. That's why I'm asking for help: I'm out of ideas, and the code I'm running is pretty simple.
Thanks for your help.

Gordon Mohr

May 15, 2019, 7:10:54 PM
to Gensim
You're not calling `train()` any further, beyond the model-instantiation?

What's shown in your `loss_file_path` file, at the end?

Is your data natural-language, or something else with a far-different sort of word-frequency distribution? (What does the logging show as the final surviving vocabulary size?)

Other observations about your parameters, which may not be relevant to the loss issue:

* 100 iterations is very high, and very atypical unless you're trying to squeeze results from too-little data. 

* `min_count=1` is rarely a good idea, as all those tokens that only appear one or a few times just serve as noise preventing the more-frequent tokens from getting their best-possible vectors.

* `sample=0.5e4` is a very atypical setting; classic settings would be tiny values from 0 (no downsampling) to 1e-5 (0.00001, aggressive downsampling). You're specifying 0.5e4, aka 5e3, aka 5000. It just so happens gensim `Word2Vec` supports these > 1.0 values as an obscure way to indicate: "downsample all words with a count over this number" - so your choice means "fully sample all words with <5000 occurrences, downsample the others depending on how many more times they appear". But that interpretation is atypical enough that I wonder whether it's your true intent. (A quick sketch contrasting the two forms follows below.)
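As a quick illustrative sketch of the two interpretations (the toy corpus and variable names here are hypothetical, and the 3.x-era `size`/`iter` parameter names match the version used earlier in this thread):

    from gensim.models import Word2Vec

    # hypothetical tiny corpus, just to make the calls runnable
    toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]] * 1000

    # Classic fractional form: words whose share of the corpus exceeds ~1e-5
    # get randomly downsampled during training.
    m_fraction = Word2Vec(toy_corpus, sample=1e-5, min_count=1, size=50, iter=1)

    # Count-threshold form, as with sample=0.5e4 (== 5000): per the reading above,
    # only words appearing more than about 5000 times get downsampled at all.
    m_threshold = Word2Vec(toy_corpus, sample=5000, min_count=1, size=50, iter=1)

Both calls train immediately on the toy data; the only difference is how `sample` is interpreted.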

As far as reviewing the logs: make sure they show actual training progress taking time (and not finishing instantly). Make sure reported numbers of words trained, vocabulary words surviving, etc. are all sensible and in line with your expectations.

- Gordon

Valentino Maiorca

May 15, 2019, 7:37:26 PM
to Gensim


This is part of the file tracking the losses:

32 126265624.0
33 128704760.0
34 131158848.0
35 133551176.0
36 134217728.0
37 134217728.0
38 134217728.0
39 134217728.0
40 134217728.0
41 134217728.0
...
And it stays at 134217728.0 until the end.

I'm only training via the model construction; I'm not calling `train()` elsewhere.

The data is in natural language and this is what the log says after building the vocabulary:
2019-05-16 01:19:54,305 : INFO : collected 190104 word types from a corpus of 23928055 raw words and 1903865 sentences
2019-05-16 01:19:54,305 : INFO : Loading a fresh vocabulary
2019-05-16 01:19:54,747 : INFO : effective_min_count=1 retains 190104 unique words (100% of original 190104, drops 0)
2019-05-16 01:19:54,748 : INFO : effective_min_count=1 leaves 23928055 word corpus (100% of original 23928055, drops 0)
2019-05-16 01:19:55,435 : INFO : deleting the raw counts dictionary of 190104 items
2019-05-16 01:19:55,438 : INFO : sample=5000 downsamples 50 most-common words
2019-05-16 01:19:55,438 : INFO : downsampling leaves estimated 22549173 word corpus (94.2% of prior 23928055)

I'm subsampling with that value because of this:
[attached image: words_dist.png, a plot of the word-frequency distribution]

That's the word-frequency distribution plotted; the words with more than 500,000 counts are the 50 most common ones.

Each epoch takes roughly the same amount of time, even after the weird loss value is reached.

About the iterations: I didn't know 100 is considered a high value. Why is that? Iterating over the corpus 100 times didn't seem so strange to me as a way to improve convergence.

About min_count: I'd like to evaluate the embeddings against a specific dataset which, unfortunately, requires words that occur only once in my corpus.

Thank you again!

Gordon Mohr

May 16, 2019, 12:24:40 AM
to Gensim
Not sure what's happening, but further thoughts:

I'd try logging the loss alongside the other logging output, to more easily see any correlations between loss changes and other progress indicators (and be sure of viewing the current loss in context of current corpus/parameter settings). 

Are the word-vectors at the end useful for the intended purpose? You might also try logging a few dimensions of some word-vector that you know ends up useful, at each epoch's end, to be sure it's continuing to update (e.g. `logger.info(str(model[probe_word][:5]))`). If it also remains unchanged in later epochs, then for some reason additional training isn't having any effect – perhaps a larger problem than the observed loss-reporting issue.
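As a rough combined sketch of both of those checks (the `EpochMonitor` class and `probe_word` are hypothetical; it assumes gensim's `CallbackAny2Vec` callback base class and the same `get_latest_training_loss()` tally discussed above):

    import logging
    from gensim.models.callbacks import CallbackAny2Vec

    logger = logging.getLogger(__name__)

    class EpochMonitor(CallbackAny2Vec):
        """Log per-epoch loss delta and a few probe-vector dimensions at INFO level."""
        def __init__(self, probe_word):
            self.probe_word = probe_word      # hypothetical: any word known to survive min_count
            self.epoch = 0
            self.previous_cumulative = 0.0

        def on_epoch_end(self, model):
            cumulative = model.get_latest_training_loss()
            delta = cumulative - self.previous_cumulative   # the tally is cumulative across epochs
            self.previous_cumulative = cumulative
            logger.info("epoch %d: cumulative loss %.1f (delta %.1f)", self.epoch, cumulative, delta)
            logger.info("probe word %r, first dims: %s", self.probe_word, model.wv[self.probe_word][:5])
            self.epoch += 1

Passing something like `callbacks=[EpochMonitor("someword")]` to the constructor (or to `train()`) would then interleave these lines with gensim's own INFO output.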

That's still a strange subsampling parameter - the default might be better, or even something more aggressive via smaller values (like `1e-05`). Discarding only 5.8% of all the words via downsampling, per your current `sample=5000` parameter, is very conservative in a large corpus. (More aggressively discarding frequent words, just like discarding very-low-frequency words, tends to improve the quality of the remaining words.)

The default for iterations is 5, following the original word2vec.c code released by Google. Smaller corpuses might benefit from more, but a point of diminishing or negligible returns from extra iterations will be reached. I think I read somewhere that the famous "GoogleNews" pretrained vectors were based on only 3 iterations – but on a very large & varied corpus (which thus would have examples of all words, in all sorts of varied contexts, all throughout).

Are you sure that the mere 353 unique words in the "WordSimilarity-353" dataset require keeping all the tens-of-thousands of count<5 words in your corpus? Even if discarding a word in that test set means any evaluation based on that word fails completely, the improvement that discarding all such rare words brings to the quality of the remaining word-vectors might still raise the overall evaluation score.

If you absolutely need a vector for some set of words that only appear once in your corpus, and you can't acquire more organic examples of those words to extend the corpus, it might be a better strategy to synthesize extra examples of the exact words of interest, by repeating the extant examples, rather than retaining *all* low-frequency words, given the cost that incurs in extra training time and worsened quality on other vectors.
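A rough sketch of that last idea (the `oversample` helper, `rare_words`, and `repeat_factor` are hypothetical names, and it assumes the corpus fits in memory as a list of token lists):

    def oversample(sentences, rare_words, repeat_factor=5):
        """Return the corpus plus extra copies of any sentence containing a word of interest."""
        boosted = []
        for sentence in sentences:            # each sentence is a list of tokens
            boosted.append(sentence)
            if rare_words.intersection(sentence):
                boosted.extend([sentence] * (repeat_factor - 1))
        return boosted

    # e.g.: boosted_corpus = oversample(corpus, {"someword", "otherword"})

Training on the boosted corpus with a `min_count` higher than 1 would then keep the words of interest while still pruning the rest of the singletons.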

- Gordon