min_count in word2vec is very problematic


tedo.v...@gmail.com

Mar 5, 2018, 4:33:40 PM3/5/18
to gensim
I have found that if we raise `min_count`, even not very high, there comes a point where `wv.vocab` no longer contains all the words it should(!). Therefore, if we look up vectors in such a defective model, we get a `KeyError: word not in vocabulary`.
1. How can I know the right `min_count` limit? I applied the principle that no list should remain empty, and that no list within the list of lists should be left empty. However, the resulting value is too large. How can it be computed?
2. What code does word2vec use to reduce the vocabulary according to `min_count`?
3. It would be good if KeyedVectors handled this situation by passing over the missing word and issuing a warning, rather than raising an error.

Gordon Mohr

Mar 5, 2018, 5:24:22 PM3/5/18
to gensim
The default for `min_count` is 5, following the word2vec.c code on which gensim's implementation was originally modeled. That means that for whatever corpus that you provide for vocabulary-discovery purposes, words that appear fewer than 5 times will be dropped from the vocabulary and ignored during training. 
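As a rough sketch of what that vocabulary-trimming step does (a simplified stand-in for illustration, not gensim's actual implementation), the survey-and-discard logic amounts to:

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    # Count every token across the corpus, then keep only those seen
    # at least min_count times -- roughly what Word2Vec's vocabulary
    # scan does before any training starts.
    counts = Counter(word for sent in sentences for word in sent)
    return {word: n for word, n in counts.items() if n >= min_count}

corpus = [["more", "data", "helps"], ["more", "data"], ["more", "even"]]
vocab = build_vocab(corpus, min_count=2)
# 'even' and 'helps' each appear only once, so they are dropped
```

Any word that doesn't survive this count-based filter simply never enters the model, which is why it later raises a `KeyError` on lookup.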

Discarding low-frequency words is generally a good thing for word2vec model quality, as such infrequent words individually can't get good vector representations from so few examples – but altogether, all the low-frequency words soak up a lot of training effort, and wind up essentially 'interfering' with the quality of surrounding more-frequent words. 

You can set `min_count=1` if you'd like, but that'll create a much larger-in-memory model, with a lot of low-value, low-quality infrequent words, that takes longer to train – a loss on almost every dimension, except the completist desire to have a vector-for-every-seen-word. 

Often just ignoring out-of-vocabulary words is a defensible choice – and most deployed systems need to be tolerant of novel unknown words, anyway. (So they should check for presence, or otherwise tolerate KeyErrors.)

Beyond that, I don't quite understand your questions (1)-(3). If they're not cleared up by the above explanation, can you re-word them for greater clarity about what you expect, why, and what you've tried or seen instead? 

- Gordon

tedo.v...@gmail.com

Mar 5, 2018, 5:34:16 PM3/5/18
to gensim
Yes, my bad English. Sorry.
Changing `min_count` from 1 up toward the expected maximum (where no list is yet empty) leads me to a situation where I get a `KeyError` for a missing word. So I don't think word2vec is reducing its vocabulary the _good_ way.
I agree with every word you wrote and I thank you, but that's not what I meant to ask. ;)

tedo.v...@gmail.com

Mar 5, 2018, 7:47:06 PM3/5/18
to gensim
You can take my 2D list. Load it into word2vec with min_count=10. Then try to get the vector for the word 'even'.

all_words_list.txt

Gordon Mohr

Mar 5, 2018, 8:28:58 PM3/5/18
to gensim
There are only 9 occurrences of 'even'. So, with a `min_count=10`, it should be discarded from the vocabulary. This is the designed behavior: a `min_count=10` means to only keep words with at least 10 training-examples.

For the reasons mentioned previously, it is usually a good idea to ignore such rare words when training (or later using) a Word2Vec model. You could set `min_count` lower, but that will make a larger, slower-to-train model. The vectors for the rare words won't be very good – undertrained based on a few idiosyncratic examples. Using them is often worse than ignoring them entirely, for many applications. And mixing those rare words into the training often serves as 'noise' which makes the vectors for more-frequent words less meaningful. 

So: you can choose to retain them (with something like `min_count=1`) – and get weaker vectors. 

Or, as is often the better policy, discard the rarest words during training, and then also disregard them later when they appear in other texts. Then you either never get `KeyError` because you're not even trying to look up those vectors, or you catch-and-ignore the `KeyError`.
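For illustration, both defensive-lookup patterns look like this (using a plain dict as a toy stand-in for a trained model's `wv`; with gensim, a membership test like `word in model.wv` behaves analogously):

```python
vectors = {"good": [0.5, 0.1]}  # toy stand-in for model.wv
words = ["good", "even"]

# Pattern 1: test membership before looking up
known = [w for w in words if w in vectors]

# Pattern 2: catch and ignore the KeyError
collected = []
for w in words:
    try:
        collected.append(vectors[w])
    except KeyError:
        pass  # silently skip out-of-vocabulary words
```

Either way, a word discarded by `min_count` never causes a crash at lookup time.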

- Gordon

tedo.v...@gmail.com

Mar 6, 2018, 5:28:24 AM3/6/18
to gensim
Thanks! I was sure it was 10. Now, thanks to you, I looked more carefully into the error and realized that the line number where the error occurs is different from (though very similar to) the part of the code I was looking at. Embarrassing.
Debugged, resolved. Thanks again!

Kamal Garg

Apr 20, 2018, 7:41:02 AM4/20/18
to gensim

I have applied word2vec to the whole Wikipedia dump with min_count=5 and window_size=5, and I am getting good results. But when I tried to find the most similar words to 'magnetorheological', I got an error that this word is not in the vocab. There are wiki pages associated with this word, but it's not in the vocabulary. Can someone explain? If I decrease min_count to 2, will it be added to the dictionary? Any help will be greatly appreciated.
    Also, I want to know whether we can speed up the training process by increasing the number of cores and RAM?
    Thanks in advance

Ivan Menshikh

Apr 23, 2018, 3:41:34 AM4/23/18
to gensim
Hello Kamal,

If i decrease the min_count=2, will it be added to dictionary?

Probably yes (though I can't guarantee it); `min_count` takes into account only term frequency, nothing more.

About performance: you can speed up training by using more cores (~10 is enough) and a fast disk (an SSD if possible, especially while building the vocab). Increasing RAM does not help with speed (the exception being swap, if you use it).

Gordon Mohr

Apr 23, 2018, 3:16:36 PM4/23/18
to gensim
If there's a word you're sure is in the corpus that you fed to Word2Vec, but then it's not available after training your model, then yes, it was probably eliminated as not having at least `min_count` occurrences. 

You can lower `min_count`, but words with so few examples will not get good vectors themselves. And there are a lot of them in aggregate, so your model gets noticeably larger (but with poor-quality vectors). And keeping those low-frequency words around during the training of other word-vectors tends to make those other vectors slightly worse, too. So if there are words that are rare in your corpus but that you need vectors for, it's better, if at all possible, to find more training data than to lower `min_count`. (The default `min_count=5` is already quite small – the model won't learn very good vectors for words with only 5 examples.)

More cores can sometimes help, and if you have more than 4 cores, you can tell gensim to use more worker threads via the `workers` model parameter. But the gensim implementation has some multithreading bottlenecks related to the Python "Global Interpreter Lock" and current methods of IO & inter-thread communication, so throughput usually peaks somewhere in the range of 3-16 threads – the exact value may depend on other parameters, especially `window` & `negative`.

RAM can sometimes help – if there's any virtual-memory usage (swapping) at all during model-training, you should eliminate that with either more RAM or parameters that create a smaller model (such as a larger `min_count`). Also, if you can get your whole training corpus into RAM – and thus be sure there's absolutely no IO lags during training – that may help. (But it may not be a big boost over using an SSD volume, at least not after making sure your corpus iterator is fairly efficient – not repeating any complicated preprocessing/tokenization each iteration.)
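On that last point, one way to avoid repeating tokenization on every pass is to tokenize once up front and iterate over plain lists. A minimal sketch (the class name is illustrative, not a gensim API):

```python
class InMemoryCorpus:
    """Tokenize once up front so each training pass re-reads
    already-prepared lists instead of redoing preprocessing."""

    def __init__(self, lines):
        # the (possibly expensive) tokenization happens exactly once
        self._sentences = [line.lower().split() for line in lines]

    def __iter__(self):
        # Word2Vec iterates the corpus many times (vocab scan + epochs)
        return iter(self._sentences)

corpus = InMemoryCorpus(["Fast disk helps", "More RAM helps too"])
passes = [list(corpus), list(corpus)]  # repeated passes are cheap and identical
```

Any restartable iterable of token lists works the same way; the point is simply that the per-epoch cost drops to iterating over ready-made lists.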

- Gordon