Word2Vec training loss


Heta Saraiya

Nov 20, 2018, 2:16:59 PM
to Gensim
Hi,

I am training Word2Vec on a dataset and saving the training loss after each epoch. After some epochs, the training loss stops decreasing and instead increases. Can you give me any idea of why this happens?

Thanks

Gordon Mohr

Nov 21, 2018, 12:39:33 PM
to Gensim
Up through at least gensim 3.6.0, the reported loss value may not be very sensible: the tally only resets on each call to train(), rather than after each internal epoch. There are some potential fixes forthcoming in this pending pull-request:


In the meantime, the losses after each separate call to `train()` using the same `epochs` should be comparable, but:

* in general, most users *shouldn't* be calling `train()` more than once – it's very easy and common to mismanage parameters like the `alpha`/`min_alpha` – and perhaps that's happening for you (a minimal sketch of the usual single-call pattern appears after this list)

* there's no guarantee the loss will always decrease; in general when things are working the loss will head lower over time but eventually reach some point where the model is doing as well as it can given the inherent limitations of its technique and complexity – it has "converged" on its optimal state. After that, it can only get better at some training examples by worsening performance on others, and thus loss will jitter a little up-and-down over time, but no longer trend downward. You may have reached that point. 

* a lower loss isn't always better for generalized performance on downstream applications. Making the model larger – such as by using a larger vector dimensionality – will usually make the attainable level of loss lower, because there's then extra room in the model to essentially memorize idiosyncratic cases in the training data. But such deep adaptation to just the training data is "overfitting", tending to make the model less useful on any out-of-training-set data, because rather than learning rough-but-reusable general patterns, it's just learned mechanistic rules from the limited training data. 

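For reference, a minimal sketch of the usual single-call pattern (the corpus name `my_corpus` and the parameter values here are only placeholders, not recommendations):

from gensim.models import Word2Vec

# In gensim 3.x, a single constructor call builds the vocabulary and runs all
# `iter` training passes internally, managing the alpha decay for you.
model = Word2Vec(sentences=my_corpus, size=100, window=5, min_count=5,
                 iter=5, compute_loss=True)
print(model.get_latest_training_loss())  # running tally over that whole training run
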
So, if you see the loss going wildly wrong, you might be mishandling calls to `train()`. But if it's just jittering around a best-achievable value after sufficient training, that's the normal state of a 'converged' model that's been trained as much as necessary. And while loss is definitionally the thing that the single Word2Vec model is locally optimizing, it's not the thing to optimize in the whole system of model-plus-downstream-uses. That should be some quantitative measurement of model quality specific to your downstream tasks, and the smallest-loss Word2Vec model is unlikely to be the best-general-performance model for downstream tasks.

- Gordon

Heta Saraiya

Nov 26, 2018, 1:19:37 PM
to Gensim
Hi,

I am using callbacks to get the loss after each epoch. But what I do not understand is why the loss is oscillating (it sometimes increases, then decreases, then increases again). Can that be due to some hyperparameter value? I am new to using word2vec and would like suggestions on how I can check whether my model is performing better.

Thanks

Gordon Mohr

Nov 26, 2018, 2:46:15 PM
to Gensim
You'll have to show more details of your code, metaparameters, data type/quality/size, output values, and gensim version for me to give any more specific answer than previously. 

(Are you possibly affected by the bug I linked? Are you calling train more than once? Does the loss decrease for a while and then jitter up and down around some level, which as I've explained is normal behavior – because loss can't go down forever unless the model is big enough to 'memorize' the data, which would result in an unhelpful 'overfit' model? Is there anything atypical about your data or parameter choices? Etc.)

- Gordon

Heta Saraiya

Nov 26, 2018, 3:50:47 PM
to Gensim
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):
    def __init__(self, path_prefix):
        self.epoch = 0

    def on_epoch_end(self, model):
        print("Epoch_" + str(self.epoch))
        print("Training loss: " + str(model.get_latest_training_loss()))
        self.epoch += 1

sentences = GetSentences('/filer/corpus')
epoch_saver = EpochSaver("model_some")
model = Word2Vec(sentences, min_count=5, sg=1, size=45, window=8, iter=10,
                 workers=4, compute_loss=True, callbacks=[epoch_saver])

This is how I run the training and get the loss. `sentences` is an iterable over my data.
The data size is approximately 1 GB. The gensim version is 3.6.0.
The loss does not decrease and then steady at some value; it keeps changing but never decreases into any particular range.

Gordon Mohr

Nov 26, 2018, 7:43:49 PM
to Gensim
What loss values are printed?

Is the progression any different if using more iterations? (Say, `iter=20`.)

- Gordon

Heta Saraiya

Nov 27, 2018, 1:44:10 AM
to Gensim
The training loss values printed are huge numbers, and they do not change even when I increase the number of iterations. They just decrease and increase by small amounts from the start.

-Heta

Gordon Mohr

Nov 27, 2018, 1:58:10 AM
to Gensim
Vague descriptions like "huge numbers… [that] do not change" might be useful in a high-level summary from someone who knows what's going on, because they've applied their expertise and understanding to compress lots of details down to the essentials.

But for someone else to figure out an issue that's stumping you, you should provide the exact details they request. It shouldn't be hard to cut & paste the actual numerical output you're seeing. Otherwise it's nearly impossible to help.

- Gordon

Heta Saraiya

Nov 27, 2018, 12:10:27 PM
to Gensim
Sorry for the vague description. I was not sure. I have attached a file with the loss values after each iteration.
loss1.txt

Gordon Mohr

Nov 27, 2018, 2:10:11 PM
to Gensim
The printed loss value never goes down. That's the issue mentioned atop my first reply: that the reported number is just a running tally from the start of the `train()` call. 

So the real numbers to care about are the differences between each printed number. This would be the more interesting number to print (a sketch of a callback that prints it directly follows the table below); here I've calculated the differences via a spreadsheet:

Printed Loss    Last-Epoch Loss
71964 71964
109095 37131
167496 58401
234446 66950
300772 66327
367244 66472
433101 65857
488015 54914
553470 65455
621009 67540
686966 65957
744763 57797
811454 66691
875197 63743
940066 64869
1004814 64748
1068990 64176
1133891 64900
1196838 62948
1262514 65676
1328977 66463

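For reference, here's a minimal sketch of a callback that would print that difference directly, assuming the running-tally behavior described above (the class name is just illustrative):

from gensim.models.callbacks import CallbackAny2Vec

class EpochLossPrinter(CallbackAny2Vec):
    """Print per-epoch loss by differencing gensim's running tally."""
    def __init__(self):
        self.previous_total = 0.0

    def on_epoch_end(self, model):
        total = model.get_latest_training_loss()  # cumulative since train() began
        print("epoch loss: %s" % (total - self.previous_total))
        self.previous_total = total

Passing `callbacks=[EpochLossPrinter()]` along with `compute_loss=True` would then report the per-epoch differences rather than the running tally.
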
Those are quite strange, in that rather than improving for a while, they really only improve on the 2nd epoch, before jittering around within a tight range. That's more typical near the end of training, and indicates the model has learned as much as it can. 

Are you sure this output is from the metaparameters and training code you showed earlier?

Are you sure your `GetSentences()` code is working properly, providing text that can be learned from? 

For example, what does the following print after your `sentences` is defined:

    print(sum(1 for _ in sentences))  # total count of training examples
    first = iter(sentences).next()  # get 1st item
    print(len(first))  # 1st item's length in words
    print(first[0:3])  # 1st item's 1st 3 words

As a separate note, none of the total or per-epoch loss numbers actually seem 'huge' to me, as it's a tally over all examples in a large dataset. And if you train a model on a larger dataset, this kind of summed loss value will go even higher (and when it reaches its best value, still be higher) than with a smaller dataset – even if the model is better at the end, simply because more examples have been tallied together to get the number.

- Gordon 

Heta Saraiya

Nov 27, 2018, 10:01:42 PM
to Gensim
Okay, thank you so much for the help. I only have one more question. If I change the parameters and train again, can I compare the loss values to the current values to see which model performs better?

Thanks

Gordon Mohr

Nov 28, 2018, 3:54:30 AM
to Gensim
On Tuesday, November 27, 2018 at 7:01:42 PM UTC-8, Heta Saraiya wrote:
Okay, thank you so much for the help. I only have one more question. If I change the parameters and train again, can I compare the loss values to the current values to see which model performs better?

No, as mentioned previously, the loss is not a reliable indicator of overall model quality. The model with the lowest loss could perform worse on real tasks – as in the given example of an overfit model. It's just an indicator of training progress, and when loss stops improving it's a hint that further training can't help. 

Further, many of the parameters change the type/amount of training that happens. For example, a different 'negative' value means more negative-examples are trained. A different 'window' means more (context->target) examples are constructed. A different `sample` value drops a different proportion of words. A different 'min_count' drops different low-frequency words. The loss values are at best just comparable within a single model, over the course of its training. 

Is there a reason you can't share the `sentences` output I suggested to debug your problem? Did you try that at all, and did it lead you to discover an error you were making that explained the prior atypical loss behavior?

- Gordon 

Heta Saraiya

Nov 29, 2018, 1:04:41 AM
to Gensim
The output of sentences you shared gave me:
print(sum(1 for _ in sentences))  # total count of training examples  1565475
    first = iter(sentences).next()  # get 1st item
   print(len(first))  # 1st item's length in words  91

I have also attached the new training loss from after I ran it again.

If I cannot compare two training-loss values from different models, then how can I know which parameters are better suited to my data?

Thanks
loss_some1.txt

Gordon Mohr

Nov 29, 2018, 7:41:53 AM
to Gensim


On Wednesday, November 28, 2018 at 10:04:41 PM UTC-8, Heta Saraiya wrote:
The output of sentences you shared gave me:
print(sum(1 for _ in sentences))  # total count of training examples  1565475
    first = iter(sentences).next()  # get 1st item
   print(len(first))  # 1st item's length in words  91


And what about the output of the 3rd print statement, "print(first[0:3])  # 1st item's 1st 3 words"?

Also, it would be better to simply run all four suggested lines as given, after `sentences` was created, then copy & paste the exact 3 lines of output, rather than pasting results at the end of each line. Now, I'm less sure that all lines were run together, in order. (Doing that would have also checked for another common error in people's corpus-iterable-objects. If you've collected the results for different lines in different runs, the output isn't as useful. If you got any errors trying to run the 4 suggested lines, that'd be useful info.)

I have also attached new training loss after I ran it again.


Those are very odd results, in that the difference-in-loss becomes 0 after 10 iterations. 

I suspect some or all of: 

(1) An error in your difference calculation/display; 
(2) A problem with your training corpus; running all 4 requested lines together would help identify or rule out some of these potential problems.
(3) You've been changing other things about your parameters/code at the same time as you're following my suggestions, introducing new problems. For example, your previous strange output was for 20 iterations, and showed essentially no decrease-in-epoch-loss over 20 passes. This new output shows 25 iterations, and a decrease-in-epoch-loss for the 1st 10 passes, then the odd stabilization at per-epoch loss of 0. So it looks like you're trying several things at the same time, without sharing all the details of what you've changed, making it very hard to guess what could be causing that output. 
 
If I cannot compare two training-loss values from different models, then how can I know which parameters are better suited to my data?


As mentioned in my 1st response on this thread:

"And while loss is definitionally the thing that the single Word2Vec model is locally optimizing, it's not the thing to optimize in the whole system of model-plus-downstream-uses. That should be some quantitative measurement of model quality specific to your downstream tasks, and the smallest-loss Word2Vec model is unlikely to be the best-general-performance model for downstream tasks."

That means: you have to test the resulting model/word-vectors on some version of the real task(s) where you want to use word-vectors. That's the only real measure of whether you've chosen good parameters. 

If you don't have a way to run such a test, you could look at other more generic measures - there's a method `evaluate_word_analogies()` on the word-vectors object (`model.wv`) that can be fed a series of word-analogy problems from the original Google word2vec.c release, and return a score on that task. But of course that may not test your corpus's most important words, and further, word-vectors that do best on analogies may not do best for classification problems, or info-retrieval, or other tasks. To know which parameters are best for your project, you need to check them against some version of that task. 
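
For example, a rough sketch of that kind of generic check (the analogy file is the one bundled with gensim's test data, and the probe word 'mov' is only a placeholder for whatever token matters in your corpus):

from gensim.test.utils import datapath

# Overall accuracy on the Google analogy questions shipped with gensim's test data.
score, sections = model.wv.evaluate_word_analogies(datapath('questions-words.txt'))
print("analogy accuracy:", score)

# Spot-check nearest neighbors of a token that matters in your domain.
print(model.wv.most_similar('mov', topn=5))  # 'mov' is only a placeholder token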
 
- Gordon

Heta Saraiya

Nov 29, 2018, 12:46:19 PM
to Gensim


On Thursday, November 29, 2018 at 7:41:53 AM UTC-5, Gordon Mohr wrote:

I ran all 4 lines together before running the training. I just copied the output to the end of each line to make it easier to understand. Also, for the 3rd line, print(first[0:3]), I got words from my sentences.
My sentences are not English sentences; we are training on assembly-language instructions.



On Wednesday, November 28, 2018 at 10:04:41 PM UTC-8, Heta Saraiya wrote:
The output of sentences you shared gave me:
print(sum(1 for _ in sentences))  # total count of training examples  1565475
    first = iter(sentences).next()  # get 1st item
   print(len(first))  # 1st item's length in words  91

 
And what about the output of the 3rd print statement, "print(first[0:3])  # 1st item's 1st 3 words"?

Also, it would be better to simply run all four suggested lines as given, after `sentences` was created, then copy & paste the exact 3 lines of output, rather than pasting results at the end of each line. Now, I'm less sure that all lines were run together, in order. (Doing that would have also checked for another common error in people's corpus-iterable-objects. If you've collected the results for different lines in different runs, the output isn't as useful. If you got any errors trying to run the 4 suggested lines, that'd be useful info.)

I have also attached new training loss after I ran it again.



I just calculated the loss by taking the difference between 2 epochs, i.e. current minus previous. I have also printed the original values, to show the value before subtracting. Also, I have not changed any parameter other than the number of iterations. The results from before were not for the whole dataset, as it didn't take the whole dataset; this time I made sure to use the whole dataset. I am not sure what 0 means for the training loss. Does it mean that the loss has stabilized and there will be no more change in it, or is it an error?

Gordon Mohr

Nov 29, 2018, 4:29:39 PM
to Gensim
On Thursday, November 29, 2018 at 9:46:19 AM UTC-8, Heta Saraiya wrote:


On Thursday, November 29, 2018 at 7:41:53 AM UTC-5, Gordon Mohr wrote:

I ran all 4 lines together before running the training. I just copied the output to the end of each line to make it easier to understand. Also, for the 3rd line, print(first[0:3]), I got words from my sentences.
My sentences are not English sentences; we are training on assembly-language instructions.

That's useful information. The two common errors I was hoping to rule out with the complete output were:

(1) corpus iterables that can't restart for a 2nd iteration (which would trigger an error)
(2) providing strings, rather than lists-of-words, as examples (which would show up as the 1st three words being just letters)

(1) might still be an issue, if there's a problem with your GetSentences.

If you saw multi-character tokens as the `first[0:3]` printed output, then you don't literally have (2). But it's possible there's so little semantic relatedness, within context windows of your domain of assembly-language instructions, that Word2Vec can't learn much. (That would be a bit analogous to trying to train word2vec on individual characters instead of words.) If so, treating n-grams of multiple instructions as single tokens might be more word-like, and thus more of a fit for Word2Vec, but that's just speculation.
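
For reference, a restartable corpus class usually looks something like this (a minimal sketch only; the file path and whitespace tokenization are placeholders for whatever your GetSentences really does):

class RestartableSentences(object):
    """Re-opens the file on every pass, so repeated iteration (vocab scan plus each epoch) works."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.split()  # each example is a list of tokens, not a raw string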


On Wednesday, November 28, 2018 at 10:04:41 PM UTC-8, Heta Saraiya wrote:
The output of sentences you shared gave me:
print(sum(1 for _ in sentences))  # total count of training examples  1565475
    first = iter(sentences).next()  # get 1st item
   print(len(first))  # 1st item's length in words  91

 
And what about the output of the 3rd print statement, "print(first[0:3])  # 1st item's 1st 3 words"?

Also, it would be better to simply run all four suggested lines as given, after `sentences` was created, then copy & paste the exact 3 lines of output, rather than pasting results at the end of each line. Now, I'm less sure that all lines were run together, in order. (Doing that would have also checked for another common error in people's corpus-iterable-objects. If you've collected the results for different lines in different runs, the output isn't as useful. If you got any errors trying to run the 4 suggested lines, that'd be useful info.)

I have also attached new training loss after I ran it again.



I just calculated the loss by taking the difference between 2 epochs, i.e. current minus previous. I have also printed the original values, to show the value before subtracting. Also, I have not changed any parameter other than the number of iterations. The results from before were not for the whole dataset, as it didn't take the whole dataset; this time I made sure to use the whole dataset. I am not sure what 0 means for the training loss. Does it mean that the loss has stabilized and there will be no more change in it, or is it an error?

I can't yet imagine any mechanism whereby changing just the iterations, from 20 to 25, would change the pattern from the 1st output you showed – essentially no change in epoch-loss over 20 passes – to the pattern in the 2nd output you showed – epoch-loss starting large, plummeting to a tight range, then to 0 long before all iterations are done. Specifically:

72899680
8575336
8470784
8353016
8254568
8131600
8021816
7928136
3582792
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0

This is beyond bizarre, and is likely indicative of multiple irregularities in your code and corpus. It indicates no further model adjustment is happening – normal stabilization ('convergence') of a useful model will be at some non-zero loss level. You should revert to 20 iterations and see if you can get the old behavior. You should enable INFO logging and watch the output for any other suspicious timings/progress-reports (like later epochs completing instantly compared to the time taken on earlier epochs). 
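
Enabling that logging is just the standard Python logging setup, for example:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

With that in place, gensim's per-epoch progress reports and any warnings appear on the console, which can make this kind of problem easier to spot.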

If you can make a small, self-contained example using a shareable portion of your data, or similar public data, that can reproduce either of these epoch-loss behaviors, you could share it completely, and it'd probably be obvious what's going wrong. But without that, I have no further guesses.

- Gordon
 