Hard limit on vocab size?

Danilo Tomasoni

Aug 28, 2023, 3:06:34 AM
to Gensim
Hello,
I'm trying to get the vocab size of the model I just trained with

```
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(fname, binary=True, unicode_errors='ignore')

# vocabulary size = number of stored keys
print(len(model.index_to_key))
```

This prints out 34,521,720
Then I trained another model on a bigger corpus, with far more words,
but when I try to get its vocab size in the same way I still get 34,521,720.
I was wondering if there is a hard limit on the size of the vocab
or if there is a bug somewhere.

Model 1 takes around 11 GB of disk space
Model 2 takes around 20 GB of disk space

Model 2 has extra vocabulary words as expected.
The vector size is 152 in both models.
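
For reference, this is roughly how the two models could be compared side by side (a minimal sketch; the two file paths and the test word are placeholders for my actual files and vocabulary):

```
from gensim.models import KeyedVectors

model_1 = KeyedVectors.load_word2vec_format('model_1.bin', binary=True, unicode_errors='ignore')
model_2 = KeyedVectors.load_word2vec_format('model_2.bin', binary=True, unicode_errors='ignore')

# vocabulary size and vector dimensionality of each model
print(len(model_1.index_to_key), model_1.vector_size)
print(len(model_2.index_to_key), model_2.vector_size)

# check that a word expected only in the larger corpus made it into model 2
print('some_new_word' in model_2.key_to_index)
```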

Thank you for your help!
Danilo

Danilo Tomasoni

Aug 28, 2023, 6:36:00 AM
to Gensim
Never mind, the gensim behaviour is correct. I was loading the wrong model. Sorry.

Gordon Mohr

Aug 28, 2023, 4:36:48 PM
to Gensim
Glad it's sorted. If you *did* want to cap the number of words loaded, you can supply a `limit` parameter to `.load_word2vec_format()`, which will read exactly that many words, then stop. Otherwise, it'll try to load them all, limited only by addressable memory.
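
For example, a minimal sketch (the file path and the cap of 1,000,000 are placeholders):

```
from gensim.models import KeyedVectors

# Read only the first 1,000,000 vectors from the file, then stop.
vectors = KeyedVectors.load_word2vec_format('vectors.bin', binary=True, limit=1_000_000)

print(len(vectors.index_to_key))  # at most 1,000,000
```

Since gensim writes vectors in descending frequency order when saving a trained model, a `limit` like this typically keeps the most frequent words.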

- Gordon
