Load of FastText binary format with mmap='r'


Danilo Tomasoni

Jun 29, 2023, 11:16:21 AM
to Gensim
Hello,
I know I can do something like the following to share word vectors among many processes through the mmap functionality.

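First, convert the word2vec format to Gensim's native KeyedVectors format:
```
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(w2v_path, binary=True, unicode_errors='ignore')
model.save(path)  # save to a separate path so the word2vec file isn't overwritten
```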
and then load it with mmap='r':

```
from gensim.models import KeyedVectors

model = KeyedVectors.load(path, mmap='r')
model.fill_norms()
model.most_similar('stuff') 
```
Ref.: https://stackoverflow.com/a/43067907

I can do this even for a FastText model, and it works,
but then I lose the ability to get word embeddings for out-of-vocabulary words.

How can I do the same while still being able to get out-of-vocabulary vectors?
Thank you
Danilo

Gordon Mohr

Jun 29, 2023, 5:35:33 PM
to Gensim
To have the subword functions of a FastText model – which wouldn't even be stored in the sort of plain-text or binary files that `load_word2vec_format()` can handle – you'd need to start from the full binary FastText model.

With that, you can load the entire model using Gensim's `load_facebook_model()` function, which then has, as its `.wv` subcomponent, the special kind of `FastTextKeyedVectors` that can do OOV-word vector-synthesis.

But also, you can use the Gensim `load_facebook_vectors()` function on that same full-model file to skip right to the `FastTextKeyedVectors` object. (If you look at the source code, you'll see it loads the whole model, then just returns the `.wv` part, discarding the rest.)

The `FastTextKeyedVectors` objects include both the full-word vectors that were in the known vocabulary *and* the subword info that can synthesize OOV guess-vectors, so you can get OOV word-vectors via the usual lookups. 

- Gordon


Danilo Tomasoni

Jun 30, 2023, 2:06:05 AM
to Gensim
Thank you very much.
Is it possible to load only the `FastTextKeyedVectors` (thus saving CPU and RAM) while also enabling mmap of the vectors, as with `load(mmap='r')`?
Thanks
Danilo

Gordon Mohr

Jun 30, 2023, 1:59:27 PM
to Gensim
Yes, but not directly from the Facebook-binary full-model format, which doesn't store the large internal arrays as separate files that can be mapped 1:1 into numpy arrays.

You'd want to create the `FastTextKeyedVectors` object in memory once, using one of the techniques mentioned, then use that Gensim object's own `.save()` method to save it as a set of files on disk.

Then, when using `FastTextKeyedVectors.load()` from *that* save-path, you could use the `mmap='r'` option – which should ensure that the giant arrays (which make up most of the model's memory use) would use memory-mapping.

With memory-mapping, they'll only fully-load as they're accessed. (Given the way typical lookup or similarity operations access slots all across their arrays, there often isn't much net savings of memory or load-time via this deferral.)

But if you're reloading the same read-only vectors across multiple processes, memory-mapping lets the OS know that the separate processes can re-use the same mapped RAM – potentially saving a lot of redundant in-RAM duplication, in just that case.  

- Gordon

Danilo Tomasoni

Jul 4, 2023, 4:11:46 AM
to Gensim
Thank you very much! It works!