Model trained with Skip-Gram seems to be missing semantic information


emmanuel chappat

Nov 3, 2019, 10:46:04 AM
to Gensim
Hi, 

I've tried training a FastText model using the following parameters:

alpha = 0.03
compute_loss = True
dim = 200
epochs = 80
hs = 0
min_alpha = 0.000075
min_count = 10
negative = 0
sample = 0.0005
sg = 1
window = 5
workers = 16
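
For context, this is roughly the call I'm making (a sketch using current gensim 4.x parameter names; older versions use `size`/`iter` instead of `vector_size`/`epochs`, and `corpus` stands in for my iterable of tokenized sentences):

    from gensim.models import FastText

    # `corpus` is a placeholder for an iterable of tokenized sentences
    # (lists of word strings) from my own preprocessing.
    model = FastText(
        sentences=corpus,
        vector_size=200,     # `size` in gensim 3.x
        epochs=80,           # `iter` in gensim 3.x
        sg=1,                # skip-gram
        hs=0,                # no hierarchical softmax
        negative=0,          # no negative sampling
        window=5,
        min_count=10,
        sample=0.0005,
        alpha=0.03,
        min_alpha=0.000075,
        workers=16,
    )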

I've read here: https://ruder.io/word-embeddings-softmax/index.html (see table below) that SG without NS or HS would work best on a small dataset (mine is about 60M words).

[Attachment: Screenshot 2019-11-03 at 16.36.28.png — table comparing softmax-approximation strategies]



To evaluate the model I have a test set relevant to the domain of the data, using analogies and vector proximities. Using these tests, the resulting model seems to capture almost NONE of the semantic relationships between words, while the CBOW model with NS does.
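
Concretely, the checks look roughly like this (the words here are placeholders for my actual domain terms), using the model trained above:

    # Analogy check: "a" is to "b" as "c" is to ...?
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))

    # Proximity check: cosine similarity between two terms expected to be close.
    print(model.wv.similarity("disease", "illness"))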

Could something be wrong with my training parameters?


Emmanuel 

Gordon Mohr

Nov 3, 2019, 12:55:08 PM
to Gensim
Yes, if you have both `hs=0`, and `negative=0`, then neither of the two potential methods for reading predictions from the neural network are active, and there's never any backpropagated prediction-errors to adjust the word-vectors. They just stay at their initial (random) values through all training passes. 

One or the other of these options should be non-zero in order for any training to happen. (You probably don't want both active: while it will "work", the time and memory required will be the sum of that required for both options, somewhat like running both, interleaved. But for any particular dataset/parameters/goal, it's likely only one particular option would work better, and if you were willing to spend the same incremental extra time/memory, some other parameter adjustment (like more epochs, larger vectors, or a larger vocabulary) would offer more return than a second prediction/error cycle.)
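
In other words, pick one of the two. A minimal sketch (`corpus` again standing in for your tokenized sentences, other parameters omitted):

    from gensim.models import FastText

    # Option 1: skip-gram with negative sampling (the common default).
    model_ns = FastText(sentences=corpus, sg=1, hs=0, negative=5, vector_size=200)

    # Option 2: skip-gram with hierarchical softmax.
    model_hs = FastText(sentences=corpus, sg=1, hs=1, negative=0, vector_size=200)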

Besides the lack of any usefulness in the resulting vectors, training with such a problem setting should appear suspiciously "instant" if reviewed in the logs. There's also a pending issue, <https://github.com/RaRe-Technologies/gensim/issues/1983>, to add a more explicit warning. 
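
Turning on INFO-level logging before training is the easiest way to see that; gensim logs vocabulary-building and per-epoch progress, so a run whose word-vectors are never adjusted will finish its "training" passes suspiciously fast:

    import logging

    # Standard recipe for watching gensim's training progress.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )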

- Gordon

emmanuel chappat

Nov 3, 2019, 1:05:38 PM
to Gensim
I see. Thanks Gordon.

The idea was that setting both `hs=0` and `negative=0` would just fall back to a plain softmax (what I took from the table attached in my first message was that plain softmax worked best for smaller datasets).

As for more epochs, do you have a rough heuristic on what might be a good target? (I could not figure out how to print the loss using FastText, so it's hard to tell when the model stops improving.)

Thanks again for your help.

Gordon Mohr

Nov 3, 2019, 1:51:19 PM
to Gensim
Note that the chart you've excerpted reports plain softmax as being 25x to 100x slower than either HS or negative-sampling. It'd have to offer some massive benefits elsewhere to justify that slowness – and if those faster modes buy enough of a speedup to adjust *other* parameters in the direction of more-time/more-quality (such as 'size', 'window', 'epochs', etc), any supposed advantage of full softmax (on a naive ceteris-paribus or runtime-oblivious basis) could evaporate.

Neither the original Google-released `word2vec.c` code nor Facebook's original FastText code implemented a plain softmax mode, probably because of its non-competitiveness in runtime performance. Gensim's implementations are modeled after those, and behave the same way with `hs=0, negative=0`. 

Also, that table's row on hierarchical softmax doesn't match my understanding, at least in comparison to negative-sampling. HS is most likely to be competitive (against the default of negative-sampling) on smaller vocabularies. On larger vocabularies, its training time grows with the size of the vocabulary (and thus the average encoding-length of individual words). For negative-sampling, training time is sensitive only to the size of the corpus (& other parameters), and is unaffected by the size of the vocabulary (because the same single positive example and N negative examples are trained for each target word, no matter how large the vocabulary size V is).
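
A back-of-the-envelope comparison of the output-layer work per target word (numbers are purely illustrative):

    import math

    vocab_size = 200_000   # V, illustrative
    negative = 5           # negative samples per positive example

    # Negative sampling: one true word plus `negative` sampled words per target,
    # regardless of how large the vocabulary is.
    ns_updates = 1 + negative                      # -> 6

    # Hierarchical softmax: roughly one inner node per bit of a word's code,
    # i.e. on the order of log2(V) updates; frequent words get shorter Huffman
    # codes, so this is closer to an upper bound.
    hs_updates = math.ceil(math.log2(vocab_size))  # -> 18

    print(ns_updates, hs_updates)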

Published work uses a wide range of training-epochs, very often 5-20. But theoretically, a large-enough corpus where the same words appear in equally-good and equally-varied usage contexts throughout could generate fine vectors in a single epoch. (I believe it's been reported that the `GoogleNews` vectors were the result of 3 passes over 100 billion words of text.) And a small corpus might benefit from more epochs.

Unfortunately the 'loss' values which could offer an objective indicator of model convergence aren't well-reported by the `gensim` code – see pending issue <https://github.com/RaRe-Technologies/gensim/issues/2617> to do some necessary cleanup/completion. Until then, I think the existing `loss` values can be re-interpreted to help determine what's happening in `Word2Vec`, but maybe not `FastText`, and thus evaluating the end results of different trial values is the only sure tactic.
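
For what it's worth, with `Word2Vec` the (imperfect) running loss can be requested like this (gensim 4.x names; `corpus` is a placeholder for your tokenized sentences), but per the issue above the same numbers aren't meaningful for `FastText`:

    from gensim.models import Word2Vec

    model = Word2Vec(vector_size=200, sg=1, negative=5, min_count=10, workers=4)
    model.build_vocab(corpus)
    model.train(
        corpus,
        total_examples=model.corpus_count,
        epochs=5,
        compute_loss=True,
    )
    # Cumulative tally for this train() call, with the caveats discussed in the issue.
    print(model.get_latest_training_loss())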

- Gordon

emmanuel chappat

Nov 3, 2019, 3:04:49 PM
to Gensim
My understanding of that chart was that for smaller corpora plain softmax overall yields the best results. That said, it's indeed something I cannot find any references to anywhere else.

As for negative sampling, to your understanding, is there any drawback to increasing it (besides of course computational cost)? 

Similarly, for vector dimensions, all else being equal, is higher necessarily better?

Thanks a lot for your help so far Gordon.

Gordon Mohr

Nov 3, 2019, 8:00:23 PM
to Gensim
My sense is that softmax is the thing that yields the most theoretically-elegant neural network for the training task of predicting words from surrounding words. But it's overkill, and not very practical, compared to the other "sparser" training methods of hierarchical-softmax or negative-sampling, which much more quickly approximate the same results. The author of that chart leaves a lot of cells empty, notes it's based on conflicting sources, and urges it to be taken with plentiful "grains of salt".

The default is "negative=5, hs=0" in a lot of implementations so I'd only stray from that if I had theories of why my data/goals justify other parameters, or a parameter-search methodology with automated evaluation that guided me to other choices. 
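
If you do go searching, I mean something like the following sketch: train a few variants and score each against your own analogy file (both `corpus` and the file path are placeholders for your domain data):

    from gensim.models import FastText

    results = {}
    for neg in (5, 10, 15):
        m = FastText(sentences=corpus, vector_size=200, sg=1, hs=0,
                     negative=neg, min_count=10, epochs=20, workers=16)
        # evaluate_word_analogies returns (overall accuracy, per-section details)
        score, _ = m.wv.evaluate_word_analogies("my_domain_analogies.txt")
        results[neg] = score

    print(results)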

Larger vectors aren't necessarily better: they'll require more memory, and more training time, without necessarily offering better performance on downstream tasks. And if vectors are too large, on data that is comparatively too small/uniform, the model can suffer overfitting: it tends to memorize idiosyncrasies of the training set, driving its predictive loss arbitrarily low, but beyond the actual generalizable meanings of the words. I'd stick with common defaults & published-result values in the 100-400 range for word-vectors, unless the data is very small (perhaps justifying even smaller vectors) or very large (perhaps justifying larger vectors), and some evaluation proves the alternate choices work better for a target task.

(Similarly, keeping more words with a lower `min_count` isn't necessarily better. Low-frequency words don't have enough varied examples to learn their "real" natural-domain meanings, but do tend to soak up training time and result in essentially "noisy" interference with other words' improvements. So throwing out more training data, with a *higher* `min_count`, often notably improves the quality of the remaining vectors on downstream evaluations.)

- Gordon

emmanuel chappat

Nov 4, 2019, 7:20:00 AM
to Gensim
Got it. 

Thanks for sharing all those tips, Gordon.