Loading FastText Model takes very long


Andrej

Jul 8, 2021, 2:22:08 PM
to Gensim

Hi everyone,

We have recently upgraded to Gensim 4.0.1 and trained a new FastText model with this version. However, loading that FastText model takes very long (4 minutes) and uses more RAM than previous versions with the mmap option (~1.2 GB).

If we use a model trained with Gensim 3.8.3 and load it with Gensim 4.0.1, it loads instantly and doesn't use as much RAM (~600 MB).

Here is the snippet we use for training:

import multiprocessing

from gensim.models import FastText

# FastText model
epochs = 10
embedding_size = 400  # Typical range is from 100 - 1000
window_size = 5
minimum_word_frequency = 1  # Low values result in larger models and slow training; Bigger Corpus != Better
cores = multiprocessing.cpu_count()
sample = 1e-5
word_ngrams = 1
minimum_length_ngrams = 2
maximum_length_ngrams = 6
model = FastText(vector_size = embedding_size,
                 min_count = minimum_word_frequency,
                 window = window_size,
                 workers = cores,
                 sample = sample,
                 word_ngrams = word_ngrams,
                 min_n = minimum_length_ngrams,
                 max_n = maximum_length_ngrams)

# Build vocabulary
model.build_vocab(corpus_iterable = position_texts_tagged)

# Train model
model.train(corpus_iterable = texts_tagged,
            total_examples = model.corpus_count,
            epochs = epochs)
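
For completeness, the saving and loading around this looks roughly like the following (simplified; paths are placeholders):

# Save the model and its word vectors separately (placeholder paths)
model.save("/models/fasttext_gensim4.model")
model.wv.save("/models/fasttext_gensim4.kv")

# Later, in the service, load with memory-mapping
loaded_model = FastText.load("/models/fasttext_gensim4.model", mmap="r")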

We're not sure if we have done something wrong migrating the code. We have uploaded the requirements file for the environment here (req.txt). Thank you very much in advance for any advice.

Andrej
req.txt

Gordon Mohr

Jul 8, 2021, 11:41:03 PM
to Gensim
That's surprising, given that a number of Gensim 4.0 changes to FastText eliminate prior very-wasteful memory practices.

- For identically-trained models in your setup, are the on-disk files larger or smaller when trained in Gensim 3.x vs Gensim 4.0? 
- Are you using the same memmap options on both loads?
- Can you reproduce the difference in a small self-contained example – even one on synthetic data, or one so small that the normal benefits aren't visible but the performance discrepancy is? (See the sketch just below.)
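
A rough sketch of what I mean (synthetic data; the absolute timings don't matter, only whether reloading a 4.0.1-saved model is much slower than reloading a 3.8.3-saved one; parameter names differ a bit in 3.8.3, e.g. `size` instead of `vector_size`):

import time
from gensim.models import FastText

# synthetic corpus with a reasonably large vocabulary (~50k words),
# so the vector arrays are big enough to be saved as separate .npy files
sentences = [["word%d" % i, "word%d" % (i + 1), "word%d" % (i + 2)] for i in range(50000)]

model = FastText(vector_size=100, min_count=1, min_n=2, max_n=6, bucket=100000)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences, total_examples=model.corpus_count, epochs=5)
model.save("repro.model")

start = time.time()
reloaded = FastText.load("repro.model", mmap="r")
print("load took %.1fs" % (time.time() - start))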

I suppose there's a chance something's gone wrong in the Gensim 4.0 memmap support, and if so we'd want to fix that, but if that's truly the case, any apparent advantage of the older load might not mean much in actual use. Perhaps it's just deferring (hiding) some of the cost until the first operations addressing all words, and real workloads (unless confined to tiny ranges of words) would pay the same full-load costs soon enough.

- Gordon

Andrej

Jul 9, 2021, 4:50:27 AM
to Gensim

Hi Gordon,

Thank you for the fast answer.

1. We have taken some screenshots of the file sizes (FastText & FastTextKeyedVectors):

Gensim 4.0.1
gensim4.0.1_FastText_FileSize.png

Gensim 3.8.3
gensim3.8.3_FastText_FileSize.png

Gensim 4.0.1 takes less space on disk.

2. We use the same mmap option for both loads. Some further screenshots, which also show the loading time:

Gensim 4.0.1 - Loading a Gensim 3.8.3 FastText model (~10 seconds)
gensim4.0.1_FastText_LoadTime_Model_Trained_On_3.8.3.png

Gensim 4.0.1 - Loading a Gensim 4.0.1 FastText model (~4 minutes)
gensim4.0.1_FastText_LoadTime_Model_Trained_On_4.0.1.png

3. I haven't had time yet to put together a small example project. Do you have any recommendations for small corpora that could be used? Something you might have used for debugging yourself?

I will add an update when I have finished the third point.

Andrej

Andrej

Jul 9, 2021, 6:57:42 AM
to Gensim

In addition to point 3:

We have just used the bundled corpus (from gensim.test.utils import common_texts) for training. No difference in loading.

There is another thing we noticed while testing with our data. Both versions, Gensim 3.8.3 (~600 MB) and Gensim 4.0.1 (~1.2 GB), seem to load the necessary data into RAM rather quickly. However, after that Gensim 3.8.3 is finished, while Gensim 4.0.1 keeps doing something: there is no noticeable load on the system and RAM usage doesn't change during those 4 minutes of waiting.

Andrej

Gordon Mohr

Jul 9, 2021, 11:56:20 AM
to Gensim
Thanks for the extra details. 

Is your saving code saving *both* the FT model and the model's `.wv` attribute, separately? (If not, it may not be correct that there are two files for each of the `.vectors_vocab.` and `.vectors_ngrams.` parts.)

How are you monitoring RAM usage? 

How much RAM in the test machine?

Note that these symptoms may still be consistent with the memory-mapping just not working as intended. Specifically: perhaps 4.0 is inadvertently loading the full data on `.load()` - paying all the IO cost of the full model up-front, before it returns. That's not the intent of the `mmap` option, but loading around 4GB of data can easily take that long. By contrast, when memory-mapping is used effectively, the load *appears* instant, but no data has actually yet moved over the slow IO channel. It's just been set up so that any attempts by other code to *read* those memory ranges then *later* cause those IO ops. 
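
(A pure-numpy illustration of that deferred-IO behaviour, nothing gensim-specific: opening the memmap is near-instant, and the data only moves when it's actually touched.)

import time
import numpy as np

# write ~200MB of float32 to disk once
data = np.ones(50000000, dtype=np.float32)
data.tofile("big.dat")

start = time.time()
view = np.memmap("big.dat", dtype=np.float32, mode="r")
print("open memmap: %.3fs" % (time.time() - start))   # no data read yet - near-instant

start = time.time()
total = float(view.sum())   # touching the data is what pulls the pages in (from cache or disk)
print("first full read: %.3fs" % (time.time() - start))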

So another worthwhile check: add a single `sims = model.wv.most_similar('apple')` operation right after the load, which will access all the vectors. (And ideally, wrap it with some logging that indicates how long *it* takes.) If it's fast in the 4.0.1 case, but slow in the 3.8.3 case (where the IO is then happening a little later), that's evidence that the 4.0.1 load isn't respecting the mmap request. (We'd want to fix that, and it could make a big difference in memory usage in cases where multiple processes are sharing the same mmapped file - but *wouldn't* be a real performance drag in most real usage scenarios - because the same follow-up word-lookup or training ops would, in all cases, have to page in the same amount of data eventually.)
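
Something like this (illustrative only - the path is a placeholder, and 'apple' is a stand-in for any word you know is in your vocabulary):

import logging
import time
from gensim.models import FastText

logging.basicConfig(level=logging.INFO)

start = time.time()
model = FastText.load("/models/your_fasttext.model", mmap="r")
logging.info("load took %.1fs", time.time() - start)

start = time.time()
sims = model.wv.most_similar("apple")
logging.info("most_similar took %.1fs", time.time() - start)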

- Gordon

Andrej

Jul 12, 2021, 5:35:14 PM
to Gensim
Hi Gordon,

The saving code is separate for both models (FT & '.wv').

We have monitored RAM by watching the Task Manager. However, we initially noticed this problem with Docker stats. Both show similar behaviour.

System has 64 GB of RAM.

We have used the 'most_similar()' method with Gensim 4.0.1 on both models (screenshots below):

Gensim 3.8.3 - Model Loading and Most Similar
gensim4.0.1_FastText_LoadTime_MostSimilar_3.8.3.png

Gensim 4.0.1 - Model Loading and Most Similar
gensim4.0.1_FastText_LoadTime_MostSimilar_4.0.1.png

The 'most_similar()' method takes about 8 seconds after loading the FastText model trained with Gensim 3.8.3. After loading the FastText model trained with 4.0.1, we receive the results nearly instantly. So it sounds like your assumption is right. If there is anything we can do to help further, let us know. Do you know if Gensim 4.0.1 has lower RAM usage than 3.8.* when mmap is working?

Andrej

Gordon Mohr

Jul 13, 2021, 1:40:21 PM
to Gensim
I see the likely issue: in 4.0, we made a change to avoid writing the redundant data in the FastTextKeyedVectors `.vectors` array to disk - since it is fully determined, & re-calculable, from other data already-written.

Unfortunately, that re-calc on re-load can take a noticeable amount of time with real-sized models - and also creates a new in-memory (not memmapped) array for those re-calculated full-word vectors. It was probably better, for most users, to waste a little disk space on the saves for faster loads & the possibility of interprocess RAM sharing via memmapping.
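
If you want to confirm that on your side, a quick check along these lines (path is a placeholder) shows which of the big arrays come back as true memmaps and which were freshly allocated in RAM:

import numpy as np
from gensim.models import FastText

model = FastText.load("/models/your_fasttext.model", mmap="r")
for name in ("vectors", "vectors_vocab", "vectors_ngrams"):
    arr = getattr(model.wv, name, None)
    kind = "memmap" if isinstance(arr, np.memmap) else type(arr).__name__
    print(name, kind)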

To change the behavior back would require a pretty-tiny patch to the saving routines, so I've made an issue about this which recommends that step: https://github.com/RaRe-Technologies/gensim/issues/3192

I haven't thought of a workaround other than changing those routines in a locally-patched version of Gensim. 

- Gordon

Andrej

Jul 13, 2021, 2:08:48 PM
to Gensim
Thank you very much, Gordon.

Andrej

Radim Řehůřek

Jul 14, 2021, 4:56:06 AM
to Gensim
Andrej, would you be able to send a PR with a patch along the lines that Gordon suggests?

Cheers,
Radim