Thanks for the extra details.
Is your saving-code saving *both* the FT model, and the model's `.wv` attribute, separately? (If not, it may not be correct that there are two files for each of the `.vectors_vocab.` and `.vectors_ngrams.` parts.)
How are you monitoring RAM usage?
How much RAM in the test machine?
Note that these symptoms may still be consistent with the memory-mapping just not working as intended. Specifically: perhaps 4.0.1 is inadvertently loading the full data on `.load()` - paying all the IO cost of the full model up-front, before it returns. That's not the intent of the `mmap` option, but loading around 4GB of data can easily take that long. By contrast, when memory-mapping is used effectively, the load *appears* instant, but no data has actually moved over the slow IO channel yet. The mapping has just been set up so that any attempt by other code to *read* those memory ranges *later* triggers the IO ops.
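To illustrate the deferred-IO behavior described above, here's a minimal sketch using only Python's standard-library `mmap` module on a throwaway temp file. It is not Gensim's code path (as far as I know, Gensim's `mmap` option goes through numpy's memory-mapped array loading), and the helper names `map_file` and `touch_all` are made up for this example.

```python
import mmap
import os
import tempfile
import time


def map_file(path):
    """Map a file read-only. Setting up the mapping does not read its contents."""
    with open(path, "rb") as f:
        # The mapping remains valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def touch_all(mapped, chunk=1 << 20):
    """Read every byte of the mapping, forcing all pages in; return bytes read."""
    total = 0
    for offset in range(0, len(mapped), chunk):
        total += len(mapped[offset:offset + chunk])
    return total


# Demo on a small throwaway file (a stand-in for a large vectors file).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x01" * (4 * 1024 * 1024))

t0 = time.perf_counter()
mapped = map_file(demo_path)   # cheap: only sets up the mapping
t1 = time.perf_counter()
n_read = touch_all(mapped)     # this is where the data actually gets paged in
t2 = time.perf_counter()

print(f"map: {t1 - t0:.6f}s, full read: {t2 - t1:.6f}s, bytes: {n_read}")

mapped.close()
os.remove(demo_path)
```

On a tiny, freshly-written file the OS cache will hide most of the difference. With a multi-GB file that isn't already cached, the second timing is where the real IO cost should show up.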
So another worthwhile check: add a single `sims = model.most_similar('apple')` operation right after the load, which will access all the vectors. (And ideally, wrap it with some logging that indicates how long *it* takes.) If it's fast in the 4.0.1 case, but slow in the 3.8.3 case (where the IO is then happening a little later), that's evidence that the 4.0.1 load isn't respecting the mmap request. (We'd want to fix that, and it could make a big difference in memory usage in cases where multiple processes are sharing the same mmapped file. But it *wouldn't* be a real performance drag in most real usage scenarios, because the same followup word-lookup or training ops would, in all cases, have to page in the same amount of data eventually.)
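A rough sketch of that timing check, in case it helps. The `timed` helper is just something made up for this example, and the Gensim usage is left commented out because the path is a placeholder and I'm assuming you load a `FastText` model with `mmap='r'`. Adjust it to match your actual load call.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mmap_check")


def timed(label, fn, *args, **kwargs):
    """Call fn(*args, **kwargs), log the elapsed wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    logger.info("%s took %.3fs", label, elapsed)
    return result, elapsed


# Hypothetical usage (placeholder path; assumes a FastText model saved by Gensim):
#
# from gensim.models import FastText
# model, load_secs = timed("load", FastText.load, "/path/to/your/model", mmap="r")
# # Going through `.wv` should work in both 3.8.3 and 4.0.1.
# sims, query_secs = timed("most_similar", model.wv.most_similar, "apple")
```

Running the same script under both versions and comparing the two logged durations should show whether the IO cost lands in the load or in the first `most_similar`.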
- Gordon