Hard limit on vocab size?

Danilo Tomasoni

Aug 28, 2023, 3:06:34 AM
to Gensim
Hello,
I'm trying to get the vocab size of the model I just trained with

```
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(fname, binary=True, unicode_errors='ignore')

# vocabulary size = number of stored keys
print(len(model.index_to_key))
```

This prints out 34,521,720
Then I trained another model on a bigger corpus, with far more words,
but when I try to get its vocab size in the same way I still get 34,521,720.
I was wondering if there is a hard limit on the size of the vocab
or if there is a bug somewhere.

Model 1 takes around 11 GB of disk space
Model 2 takes around 20 GB of disk space

Model 2 has extra vocabulary words as expected.
The vector size is 152 in both models.
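
For reference, this is roughly how the two models could be compared side by side (a minimal sketch; the two file paths and the test word are placeholders for my actual files and vocabulary):

```
from gensim.models import KeyedVectors

model_1 = KeyedVectors.load_word2vec_format('model_1.bin', binary=True, unicode_errors='ignore')
model_2 = KeyedVectors.load_word2vec_format('model_2.bin', binary=True, unicode_errors='ignore')

# vocabulary size and vector dimensionality of each model
print(len(model_1.index_to_key), model_1.vector_size)
print(len(model_2.index_to_key), model_2.vector_size)

# check that a word expected only in the larger corpus made it into model 2
print('some_new_word' in model_2.key_to_index)
```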

Thank you for your help!
Danilo

Danilo Tomasoni

Aug 28, 2023, 6:36:00 AM
to Gensim
Never mind, the gensim behaviour is correct. I was loading the wrong model. Sorry.

Gordon Mohr

Aug 28, 2023, 4:36:48 PM
to Gensim
Glad it's sorted. If you *did* want to cap the number of words loaded, you can supply a `limit` parameter to `.load_word2vec_format()`, which will read exactly that many words, then stop. Otherwise, it'll try to load them all, limited only by addressable memory.
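
For example, a minimal sketch (the file path and the cap of 1,000,000 are placeholders):

```
from gensim.models import KeyedVectors

# Read only the first 1,000,000 vectors from the file, then stop.
vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True, limit=1_000_000)

print(len(vectors.index_to_key))  # at most 1,000,000
```

Since gensim writes vectors in descending frequency order when saving a trained model, a `limit` like this typically keeps the most frequent words.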

- Gordon
