Your text looks quite odd - not readable as natural-language prose at all, as if it's been heavily preprocessed to the point of removing lots of meaningful English words, and possibly mixed with other kinds of non-prose keyword fields.
Note also that `evaluate_word_pairs()` is a generic evaluation of whether the typical words in its test data have the sorts of relative similarities humans expect from normal usage.
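For reference, a minimal sketch of that kind of generic check, using the WordSim-353 pairs file that ships with gensim (the model filename here is just a placeholder for however you saved yours):

```python
from gensim.models import FastText
from gensim.test.utils import datapath

# load one of your trained models (placeholder path)
model = FastText.load("my_domain_model.model")

# returns Pearson & Spearman correlations against human similarity judgments,
# plus the ratio of evaluation pairs skipped as out-of-vocabulary
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
print(pearson, spearman, oov_ratio)
```

That `oov_ratio` alone can be revealing: if a large share of the generic evaluation words never survived your preprocessing, the correlation numbers mean very little.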
To the extent your data, and possibly your goals, are very different:
* your corpus's domain may use words differently
* your domain, or your preprocessing (especially if your rare-word cutoff is proportional rather than an absolute count), may underrepresent (or drop entirely) some of the usual words in this evaluation (crowded out by your domain-specific words)
* any other changes in your preprocessing may have disproportionately affected evaluation words
* from your mention of `min_n` and `max_n`, you seem to be using `FastText`; your domain's arbitrary terms/abbreviations whose character-n-grams *don't* work like word-fragments might be interfering with its usual vector-synthesis
So I'd first take a close look at whether each model has retained actual (not FastText-synthesized) vectors for all the words relevant to the eval.
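A quick way to check that, assuming gensim 4.x (the probe words here are just example stand-ins for whatever your evaluation actually uses):

```python
# words that matter to your evaluation (example probes only)
probe_words = ["king", "queen", "car", "automobile"]

for w in probe_words:
    if w in model.wv.key_to_index:
        # a real, trained full-word vector exists for this word
        print(f"{w}: in vocabulary (count={model.wv.get_vecattr(w, 'count')})")
    else:
        # FastText will still return *some* vector, but only one synthesized
        # from character n-grams - often much weaker for arbitrary tokens
        print(f"{w}: NOT in vocabulary; any vector is n-gram-synthesized")
```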
Because your domain vocabulary may be very different from normal written English, you may need to develop a task-specific quality evaluation, rather than relying on a generic one.
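For example, a rough domain-specific spot-check might look like the following, where the probe pairs and scores are hypothetical stand-ins for judgments you'd collect from people who know your domain:

```python
# hypothetical domain pairs with expert-judged similarity scores (0-10 scale)
domain_pairs = [
    ("termA", "termB", 9.0),
    ("termA", "termC", 2.0),
]

for w1, w2, human_score in domain_pairs:
    model_score = model.wv.similarity(w1, w2)
    print(f"{w1} ~ {w2}: human={human_score}, model={model_score:.3f}")

# or: eyeball the nearest neighbors of key domain terms
for w in ["termA", "termB"]:
    print(w, model.wv.most_similar(w, topn=5))
```

If you write such pairs out as a tab-separated file, you can also feed them straight to `evaluate_word_pairs()` and get the same correlation statistics, but against judgments that actually reflect your data.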
Other notes on your setup, not necessarily related to your problem:
* in my experience, lemmatization isn't necessary with word2vec-style algorithms, and is most often helpful when data is *scarce* - by unifying the alternate forms of rarer words, those words aren't individually pruned, or left to get weak vectors. But when data is plentiful, with sufficient in-context uses of all word forms for them all to receive appropriate & subtly-varied vectors, lemmatization destroys useful context instead of helping.
* a similar concern may apply to whatever aggressive preprocessing is making the texts only vaguely readable
* a large number of epochs (30) is more common when data is thin; with larger corpora, especially where terms appear with equally-useful contexts throughout, fewer epochs can be used. (Google's famous 'GoogleNews' word-vectors were reported trained with just 3 passes.)
* larger corpora also usually allow for larger vector-sizes (300 and up), and can benefit more from smaller `sample` values (assuming usual natural-language frequencies, where it's ok to probabilistically drop more-frequent words – which may not be the case after your unique preprocessing)
* 50 worker threads will often, with the usual corpus-iterable input method, suffer enough Pythonic thread-contention that training may be slower than with fewer threads, even if you have >50 CPU cores. Your relatively-high values of `negative` and `window` may offset this somewhat, by doing larger batches of calculations in the non-contending code blocks - but it is still likely, and only discoverable experimentally, that some lower number of workers would train faster. (If you're able to use the `corpus_file` alternative, more workers should help up to your true number of cores - but it may be hard to have your 40B+ word corpus as a single uncompressed text file, as that mode requires.) A rough sketch pulling these last few points together follows below.
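To make those last few points concrete, here's the kind of alternate configuration worth experimenting with - every value is illustrative rather than prescriptive, and `corpus_file` mode requires your whole corpus as one whitespace-delimited plain-text file:

```python
from gensim.models import FastText

model = FastText(
    corpus_file="corpus.txt",  # single uncompressed text file (placeholder path)
    vector_size=300,           # larger corpora usually support larger vectors
    window=5,
    min_count=50,              # absolute count cutoff, tune for your corpus
    sample=1e-5,               # more-aggressive downsampling of frequent words
    negative=5,
    epochs=3,                  # fewer passes often suffice on huge corpora
    workers=16,                # with corpus_file, scale toward your real core count
    min_n=3, max_n=6,
)
```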
- Gordon