Training issue in Word2Vec for CBOW model


Hugo W

Nov 17, 2015, 9:24:08 AM
to gensim
Hi all,

I wrote this comment first on Radim's Word2Vec/gensim tutorial blog. He then suggested that I post the issue here.

So when I trained a word2vec model with the default parameters (i.e. the skip-gram model), the results were consistent with what is reported (in the blog and in the papers).
When I loaded the pre-trained “vectors.bin” model from Tomas Mikolov's C version of Word2Vec into gensim, everything seemed fine as well (note that the C version's default mode is CBOW).
Then I trained the gensim Word2Vec with the default parameters of the C version (namely: size=200, workers=8, window=8, hs=0, sample=1e-4, sg=0 (i.e. CBOW), negative=25 and iter=15) and I got a strangely “squeezed” or shrunken vector representation, where most of the “most_similar” results share a similarity of roughly 0.97! For the classic “king”, “man”, “woman” query the most similar word is “and” at 0.98, and “queen” does not even appear in the top 10. Everything was trained on the SAME text8 dataset.
So I wonder whether you have seen such “wrong” training before, with these atypical characteristics (all words pointing in roughly the same direction in vector space), and whether you know where the issue might be.
I am trying different parameter settings to hopefully figure out what is wrong (workers>1? iter?).


These are the parameters used for training as entered in python:

model = word2vec.Word2Vec(sentences, size=200, workers=8,
                          iter=15, sample=1e-4, hs=0, negative=25,
                          min_count=1, sg=0, window=8)

This is quite similar to how you would call the C version of Word2Vec as in:

./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 4 -binary 1 -iter 15

Here is the end of the training output and the query for "king", "man", "woman":

training on 255078105 raw words took 260.5s, 558320 trained words/s

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

INFO:gensim.models.word2vec:precomputing L2-norms of word weight vectors
precomputing L2-norms of word weight vectors
Out[153]:
[(u'nine', 0.9694292545318604),
 (u'v', 0.9688974022865295),
 (u'it', 0.9687643051147461),
 (u'zero', 0.9683082699775696),
 (u'five', 0.9682567119598389),
 (u'and', 0.9681676030158997),
 (u'p', 0.9680780172348022),
 (u'm', 0.9679656028747559),
 (u'eight', 0.9679427146911621),
 (u'them', 0.9679186344146729)]
Attached you will find a .txt file with some other results (from other settings I tried). Strangely enough, when I used only one iteration, "queen" (along with other expected results like "prince", "iv", etc.) appeared again in the most_similar results, even though the vectors were still extremely close (0.99 similarity!).
Also, cbow_mean=1 improved the issue a little, but it is nowhere near the result obtained when the model is loaded from the C-format vectors.bin (as seen in the attached file).
As a reminder, here is the result from that model (computed with the exact command line given earlier in this message).

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

Out[166]:
[(u'queen', 0.5550357103347778),
 (u'betrothed', 0.4963855743408203),
 (u'urraca', 0.4869607090950012),
 (u'marries', 0.48425954580307007),
 (u'vii', 0.4788791239261627),
 (u'isabella', 0.4788578748703003),
 (u'throne', 0.4734063744544983),
 (u'daughter', 0.4699792265892029),
 (u'abdicates', 0.46685048937797546),
 (u'infanta', 0.46183738112449646)]
 
So it seems that the L2-normalization of the word weight vectors is not being applied correctly here... It is quite strange; any suggestions?

Some extra info:
Python version 2.7.10
numpy 1.10.1
gensim 0.12.2
other_results.txt

Andrey Kutuzov

Nov 17, 2015, 12:47:42 PM
to gen...@googlegroups.com
Hello,

In my experience with the gensim implementation of word2vec, CBOW mode returns very bad models when used with negative sampling, but quite OK ones when used with hierarchical softmax. With skip-gram the situation is the reverse.
I am not sure whether this is true for Mikolov's implementation as well.


--
Solve et coagula!
Andrey

Hugo W

Nov 17, 2015, 2:03:33 PM
to gensim
Hi, thanks for your reply.

Indeed, this is the conclusion I have come to as well. But I am still confused by the discrepancy with Mikolov's implementation.
Doesn't it reveal an implementation issue in gensim's CBOW?
Namely, a CBOW model trained with negative sampling (and no hierarchical softmax) in Mikolov's Word2Vec has rather good accuracy (as computed on question-words.txt, about 30,000 entries: more than 50% in total).
And the same model (same hyper-parameters) in gensim shows a significant drop in accuracy, down to around 2% (total accuracy).

Also, in Mikolov's Word2Vec, even CBOW with negative sampling (at 20) gives me better accuracy than with hierarchical softmax (all other parameters kept the same; probably some optimization could be done there).
I guess I will simply stick to the skip-gram model, but I was quite happy with the fast training of CBOW... (and obviously I can still use the C version, but gensim's Python interface is really useful too).

Maybe it is a problem with how negative samples are drawn in the CBOW model? I have not looked at the code, but the negative sampling should differ depending on whether it is used in CBOW or SG...

Cheers,
Hugo

Gordon Mohr

Nov 17, 2015, 4:34:22 PM
to gensim
That's odd and suggests a bug: the results should be similar for similar training parameters.

A few notes about the dataset and the respective defaults (in the latest word2vec.c), to be sure the comparison is valid:

(1) text8 puts all the words on one line; if fed to gensim word2vec as one sentence, only the first 10k words will be considered. (The LineSentence iterable in word2vec.py will auto-split the long line, if you're using it.)

(2) gensim's word2vec doesn't automatically bump the `alpha` default to 0.05 for CBOW mode, while word2vec.c does.

(3) word2vec.c's default min_count is 5 if otherwise unspecified.

(4) word2vec.c always does a form of what gensim enables with `cbow_mean=1`, so any gensim runs used for comparison should set that option (see the sketch after this list).

(5) other word2vec.c defaults if unspecified are window=5, negative=5, iter=5, and sample=1e-3 (but I see those explicitly specified in your example word2vec.c invocation, so as long as they match in both runs all should remain comparable).
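
For reference, here is a minimal sketch of a gensim run that mirrors the word2vec.c invocation above while applying notes (2)-(4). It assumes the gensim 0.12.x API and an unzipped text8 file in the working directory; the worker count is just a placeholder:

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Text8Corpus chunks text8's single long line into shorter sentences,
# so the 10k-words-per-sentence truncation from note (1) is avoided.
sentences = word2vec.Text8Corpus('text8')

# Mirror the C call (-cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -iter 15)
# plus the CBOW-specific behaviour word2vec.c applies implicitly:
# alpha=0.05 (note 2), min_count=5 (note 3), mean of context vectors (note 4).
model = word2vec.Word2Vec(sentences, sg=0, cbow_mean=1, alpha=0.05,
                          size=200, window=8, negative=25, hs=0,
                          sample=1e-4, iter=15, min_count=5, workers=4)

print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10))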

If, after controlling for these differences, the results are still very different, one area I've been suspicious of is that while the cython-optimized path takes care to sense & select whether to use a BLAS sdot or dsdot function, there's no similar selection between sscal and dscal (functions used in the CBOW/cbow_mean paths).

So if (and it's still a very speculative if) this is the source of some result discrepancy, it might only affect certain configurations, specifically the difference triggered by gensim.models.word2vec.FAST_VERSION being 1 or 2. What does your gensim installation report for this value?
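
For example, a quick way to check (a minimal snippet, nothing beyond reading the module attribute itself):

import gensim.models.word2vec as w2v

# -1 means the slow pure-Python fallback is in use; a non-negative value (0, 1 or 2)
# indicates which compiled cython/BLAS code path was selected at build time.
print(w2v.FAST_VERSION)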

- Gordon

Hugo W

Nov 17, 2015, 10:04:40 PM
to gensim
Hi Gordon,

Thanks for this detailed answer. I did not know about the cbow_mean parameter, indeed. I tried both settings just in case, keeping the other parameters the same, and the result was less "shocking" with cbow_mean=1. But still, the drop in accuracy remains (and it is too big to be caused only by, for instance, some difference in initialization).
About the dataset: I am using the Text8Corpus class (gensim.models.word2vec.Text8Corpus) on the unzipped version of text8 (the one downloaded from mattmahoney.net/dc/text8.zip). And I assume it iterates correctly, since I do not have this problem with the skip-gram model (and also because training reports the exact same number of raw words and sentences, and I get the same vocabulary size as in C).

And finally:
Your suggestion about the alpha bump (which I had not noticed happens in word2vec.c...) solved the problem! So in other words, I was training with half the intended learning rate, too slowly to really learn these vectors. But then shouldn't I have seen some improvement with more iterations?
I managed to get similar accuracy from the two versions (I gave it one quick try; I will confirm in the coming days whether that was indeed the issue/solution).

And for the sake of completeness:
gensim.models.word2vec.FAST_VERSION returned "1" on my installation.

Thanks again for your help!
Hugo

Radim Řehůřek

Nov 18, 2015, 12:37:17 AM
to gensim
Perhaps we should bump up the defaults in gensim as well, to match the C version. They matched the C version a long time ago, but the C version's defaults have changed since.

I'm a little concerned about backward compatibility, though. But if the "new defaults" return consistently better models, and people expect gensim and the C word2vec to behave the same, it may be worth it.

What do you think? Ideas?

-rr

Oliver Adams

Nov 18, 2015, 9:31:26 AM
to gensim
I'm perhaps not qualified to offer an opinion here, but I thought I might as well contribute.

If the new defaults do return consistently better models, then I feel changing them is fine. People writing papers, or using the software in any serious way where it matters, should be keeping tabs on all the model hyperparameters anyway, though it's true many may not be doing an adequate job of this. It may cause a few people confusion at some point, but you could argue that they shouldn't be updating gensim willy-nilly. I think giving newcomers to the software defaults that represent what tends to work best matters more.

Hugo W

Nov 18, 2015, 12:46:55 PM
to gensim
And I would add that anyone (including those writing papers) who wants to use such a model, especially when coming from another field, will first "give it a try": that is, they will run some toy example, or try to reproduce at least a previously published result for a given model (e.g. from a paper they have read). At that point, if you get reasonably good results with the default parameters, you will probably keep working with the package and then tune the parameters for your own experiment/task (keeping tabs on those parameters!).
In my case, chance (or Google) led me to the gensim package, and I was glad to see that it integrates into Python as smoothly as any other package, so I did not need to wrap the C code. Since I had used word2vec.c before, the first thing I did was to compare results from both, using what I thought were the default parameters. My bad that I did not notice the different alpha used for CBOW in word2vec.c (I was actually checking the value in gensim to see whether it was 0.025...).
Anyway, I think the defaults should be changed accordingly, so that newcomers can also "easily" try out both CBOW and SG without such surprisingly bad CBOW training when they follow Radim's blog tutorial.

Gordon Mohr

Nov 19, 2015, 1:17:22 AM
to gensim
Yes, many people will either try both, or compare the results of each, so matching the word2vec.c defaults exactly will best avoid confusion. 

There's a risk that people who have come to rely on the defaults will encounter an unexpected change in behavior (for better or worse), but I think that (1) serious/production use is more likely to explicitly specify options that were consciously chosen/optimized; and (2) a prominent mention in the release note accompanying the release is sufficient warning. 

A further step would be to match the option names exactly: word2vec.c expresses the skip-gram/CBOW toggle as a 'cbow' parameter being 0 or 1, while gensim uses 'sg' being 1 or 0. It also uses 'threads' where gensim uses 'workers'. 

Re-naming these parameters (without respecting the old names for backward compatibility) would force a hard error on non-adapted code, as those are among the parameters most likely to be explicitly set. That might be a good thing: it'd force people to notice the change in parameters/defaults, and consider them, before running older code with the newer gensim.

Finally, we could even make the word2vec.py `main()` read compatible command-line switches and run the equivalent training (including `-output` and '-binary' for triggering a `save_word2vec_format()`)... so that word2vec.c behavior could be duplicated by just replacing `word2vec` with `word2vec.py`.
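
A very rough sketch of what such a compatibility main() might look like (hypothetical, not actual gensim code: it uses argparse, covers only a handful of the word2vec.c switches, and the defaults loosely follow note (5) from my earlier message, with the rest as placeholders):

import argparse
from gensim.models import word2vec

def main():
    # Accept a subset of word2vec.c's single-dash switches (hypothetical mapping).
    parser = argparse.ArgumentParser()
    parser.add_argument('-train', required=True)
    parser.add_argument('-output', required=True)
    parser.add_argument('-cbow', type=int, default=1)
    parser.add_argument('-size', type=int, default=100)
    parser.add_argument('-window', type=int, default=5)
    parser.add_argument('-negative', type=int, default=5)
    parser.add_argument('-hs', type=int, default=0)
    parser.add_argument('-sample', type=float, default=1e-3)
    parser.add_argument('-threads', type=int, default=4)
    parser.add_argument('-iter', type=int, default=5)
    parser.add_argument('-binary', type=int, default=1)
    args = parser.parse_args()

    # LineSentence splits the training file into sentences, one per line.
    sentences = word2vec.LineSentence(args.train)
    model = word2vec.Word2Vec(
        sentences,
        sg=0 if args.cbow else 1,
        cbow_mean=1,                         # match word2vec.c's context averaging
        alpha=0.05 if args.cbow else 0.025,  # word2vec.c bumps alpha for CBOW
        size=args.size, window=args.window, negative=args.negative,
        hs=args.hs, sample=args.sample, workers=args.threads, iter=args.iter)
    # -output / -binary map onto save_word2vec_format(), as suggested above.
    model.save_word2vec_format(args.output, binary=bool(args.binary))

if __name__ == '__main__':
    main()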

I made an issue to further consider/track such changes: https://github.com/piskvorky/gensim/issues/534

- Gordon

Andrey Kutuzov

Nov 20, 2015, 10:38:33 AM
to gen...@googlegroups.com
Hi all,

I ran a bunch of experiments with various settings for CBOW mode in
Gensim, see the results here:
https://docs.google.com/spreadsheets/d/1dgr513AePh4EjCUQxQyeT9i6Xnig3SOtjLJwiIVmuu4

The models were trained on one and the same corpus (British National
Corpus, lemmatized, stop words and single-word sentences removed), 5
iterations, vector size 300, window 2, min count 3.

Then the models were evaluated on Google's analogy data set
(https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt),
using only semantic sections (purely grammatical sections do not make
sense for a lemmatized training corpus).
The performance was calculated using Gensim's `accuracy` function; note that I set restrict_vocab to a very high number, so that all the pairs were used for evaluation.
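
A minimal sketch of such an evaluation (assuming gensim 0.12.x's accuracy() method; the model file name is just a placeholder):

import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Placeholder name for a model trained as described above (lemmatized BNC, 5 iterations, ...).
model = word2vec.Word2Vec.load('bnc_cbow.model')

# restrict_vocab defaults to 30000; setting it very high ensures no analogy question
# is skipped because one of its words falls outside the top-N most frequent words.
# Per-section and total accuracies are reported through the INFO logging configured above.
sections = model.accuracy('questions-words.txt', restrict_vocab=1000000)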

In short:
In Mikolov's word2vec, CBOW+negative sampling (NS) is obviously superior
to CBOW+hierarchical softmax (HS). This holds both for alpha=0.05
(default for CBOW) and for alpha=0.025 (default for skip-gram).

However, in Gensim with default settings, CBOW+NS is worse than CBOW+HS,
and MUCH worse than Mikolov's word2vec results. It seems that the reason
is really the default cbow_mean=0. As soon as the vector mean is used
(cbow_mean=1), performance improves substantially for both HS and NS; also,
the NS version is then superior to HS, similar to Mikolov's implementation.
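
To illustrate what the cbow_mean switch changes (a minimal sketch of the idea only, not Gensim's actual cython code): the CBOW input projection is either the sum or the mean of the context word vectors before it is fed to the output layer.

import numpy as np

def cbow_projection(context_vectors, cbow_mean):
    # Combine the context word vectors into the single CBOW input projection.
    context_vectors = np.asarray(context_vectors)
    summed = context_vectors.sum(axis=0)
    if cbow_mean:
        # cbow_mean=1: average the context vectors (the behaviour word2vec.c
        # effectively always uses, as noted earlier in the thread).
        return summed / len(context_vectors)
    # cbow_mean=0: plain sum; the projection's magnitude then grows with the
    # number of context words, which may interact badly with the learning rate.
    return summed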

As for bumping alpha to 0.05, I did not observe much improvement from it:
in fact, both the HS and NS results (with either cbow_mean=0 or 1)
become slightly worse after that change. With Mikolov's word2vec,
this does not seem to have much influence on the accuracy.

I observe similar trends with models for the Russian language, by the way.

Thus, I would certainly suggest making cbow_mean=1 the default for CBOW
mode in Gensim. At the same time, I am not so sure about changing the
initial alpha setting.

Hope this helps,


Gordon Mohr

Nov 20, 2015, 2:47:52 PM
to gensim
Thanks for the detailed analysis! To me the key takeaway (from your bolded/green lines) is that once `cbow_mean=1` is used in gensim, the same parameters (including alpha) yield essentially the same word-vector quality from either Le/Mikolov word2vec.c or gensim, as judged by the analogies evaluation. 

And yes, it thus looks like `cbow_mean=1` would clearly be the better default. 

- Gordon

Radim Řehůřek

Nov 21, 2015, 1:59:05 AM
to gensim
Interesting, thanks Andrey!

Can you create a Github PR that will sync the parameters between Python / C word2vec? https://github.com/piskvorky/gensim/issues/534

That is, we want to be doing the same thing as the C tool, plus put a fat warning into the README for the next release about this change, because relying on default Word2Vec parameters will now result in different behaviour.

A main() section in word2vec.py that uses optparse to simulate the C command line interface would be a nice plus.

Cheers,
Radim