Debug dramatic performance drop

Danilo Tomasoni

Jul 31, 2024, 9:36:25 AM
to Gensim
Hello,
I'm training on a very large dataset of 2,908,252,777 lines and 39,474,945,814 words.

Some examples of the sentences you can find are:
=============
combine overlap input datum supertree method require multiple sequence alignment ml analysis taxon simultaneously
extremely high temperature result destruction crystalline structure cellulose excessive heat energy eventually promote formation unfavorable product levulinic_acid hmf
appropriate guidance need e.g. search strategy reference indicate datum directly derive peer-reviewed_publication monitor program new analysis
understand relationship tumour require careful epidemiological study difficult perform africa present time lack resource particularly good pathology laboratory absence widely accept definition ebl sbl confuse clarify
npas2 p_gene aanat p_gene significant bonferroni correction
=============

Previously I had an older version of the same dataset, preprocessed similarly (removal of infrequent words, lemmatization, removal of meaningless variants [e.g. person, person^4], etc.) with

4,071,374,855 lines and 53,425,429,560 words.

Some examples of the sentences you can find are:
=============
present predictive approach allee_effect polar bear low population density unpredictable habitat harvest-depleted male population result infrequent mating encounter
viral micrornas identify ebv-infected cell initially expression ebv micrornas mir-bhrf1-1 mir-bhrf1-2 mir-bhrf1-3 mir-bart1 mir-bart2 demonstrate clone small rna nt ebv b95-8-infected bl41 cell confirm northern blot analysis
activity total serum lactate dehydrogenase lactate dehydrogenase isoenzyme hepatic type disease_mesh_d006525
gland opening ridge sign show high value ci ci ci respectively
=============

I trained Gensim on both datasets with the same hyperparameters, which I report below:

  epochs: 30
  alpha: 0.025
  min_alpha: 0.0001
  negative: 10
  workers: 50
  hs: 0
  max_n: 6
  min_count: 5
  min_n: 3
  negative: 35
  ns_exponent: 0.75
  sample: 0.001
  sg: 1
  vector_size: 152
  window: 20
  batch_words: 10000

The Pearson correlation with human judgement on different word-pair datasets (model.wv.evaluate_word_pairs()) was good with the first dataset (> 0.65), but on the second the correlation is completely lost (< 0.2).
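
The evaluation is run roughly like this (the pairs file name is a placeholder; it's a tab-separated file of word1, word2, and a human similarity score):

# Each line of the (placeholder) pairs file: word1<TAB>word2<TAB>similarity score
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs("word_pairs.tsv")
print(pearson)    # (Pearson correlation, p-value) against the human scores
print(spearman)   # same, using Spearman rank correlation
print(oov_ratio)  # percentage of pairs skipped because a word is out of vocabulary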

Can you help me spot why?
What would you try first?
Thanks
Danilo

Danilo Tomasoni

Aug 2, 2024, 3:08:53 AM
to Gensim
Any clue on how to recover my performance?
I suspect the issue is in the preprocessing, but the only thing I did was improve the cleaning of the words.

Can duplicated sentences impact performance? How?

I just realized I reported the `negative` parameter twice. The correct value that I'm using is
negative: 10
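
For completeness, the full training call looks roughly like this (the corpus path and saved-model name are just placeholders):

from gensim.models import FastText

# Placeholder path; the corpus_file mode needs a plain, uncompressed text
# file with one whitespace-separated sentence per line.
model = FastText(
    corpus_file="corpus.txt",
    vector_size=152,
    window=20,
    min_count=5,
    sg=1,                 # skip-gram
    hs=0,
    negative=10,
    ns_exponent=0.75,
    sample=0.001,
    alpha=0.025,
    min_alpha=0.0001,
    min_n=3,
    max_n=6,
    epochs=30,
    workers=50,
    batch_words=10000,
)
model.save("fasttext.model")  # placeholder output name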

Thanks!

Gordon Mohr

Aug 3, 2024, 2:44:54 PM
to Gensim
Your text looks quite odd - not readable as natural-language prose at all, as if it's been heavily preprocessed to the point of removing lots of meaningful English words, and possibly mixed with other kinds of non-prose keyword fields.

Note also that `evaluate_word_pairs()` is a generic evaluation of whether the typical words in its test data have the sorts of relative similarities humans expect from normal usage.

To the extent your data, and possibly your goals, are very different:

* your corpus's domain may use words differently
* your domain, or your preprocessing (especially if your infrequent-word cutoff is proportional rather than an absolute count), may underrepresent (or elide completely) some of the usual words in this evaluation (crowded out by your domain-specific words)
* any other changes in your preprocessing may have disproportionately affected evaluation words
* from your mention of `min_n` and `max_n`, you seem to be using `FastText`; your domain's arbitrary terms/abbreviations whose character-n-grams *don't* work like word-fragments might be interfering with its usual vector-synthesis

So I'd first take a close look at whether each model has retained actual (not FastText-synthesized) vectors for all the words relevant to the eval. 
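
A rough way to check that, with placeholder names (`model` for your trained FastText model, `eval_words` for the words appearing in your evaluation pairs):

eval_words = ["cell", "protein", "tumour"]  # placeholder list
for word in eval_words:
    if word in model.wv.key_to_index:
        print(word, "has a real, trained full-word vector")
    else:
        # FastText will still return a vector for this word, but it is
        # synthesized from character n-grams, not learned for the whole word.
        print(word, "is NOT in the vocabulary; lookups are n-gram-synthesized")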

Because your domain vocabulary may be very different from normal written english, you may need to develop a task-specific quality evaluation, rather than relying on a generic one. 

Other notes on your setup, not necessarily related to your problem:

* in my experience, lemmatization isn't necessary with word2vec-style algorithms, and is most often helpful when data is *scarce* - by unifying the alternate forms of rarer words, they aren't individually pruned, or left to get weak vectors. But when data is plentiful, with sufficient in-context uses of all word forms for them all to receive appropriate & subtly-varied vectors, lemmatization destroys useful context instead of helping.
* a similar concern may apply to whatever aggressive preprocessing is making the texts only vaguely readable
* a large number of epochs (30) is more common when data is thin; with larger corpora, especially where terms appear with equally-useful contexts throughout, fewer epochs can be used. (Google's famous 'GoogleNews' word-vectors were reported trained with just 3 passes.)
* larger corpora also usually allow for larger vector-sizes (300 and up), and can benefit more from smaller `sample` values (assuming usual natural-language frequencies, where it's ok to probabilistically drop more-frequent words – which may not be the case after your unique preprocessing)
* 50 worker threads will often, in the usual corpus-iterable input method, suffer enough Pythonic thread-contention that it may train slower than fewer threads, even if you have >50 CPU cores. Your relatively-high values of 'negative' and 'window' may offset this somewhat, by doing larger batches of calculations in the non-contending code blocks - but it is still likely, and only discoverable experimentally, that some lower number of workers would train faster. (If you're able to use the `corpus_file` alternative, more workers should help up to your true number of cores – but it may be hard to have your 40B+ word corpus as a single uncompressed text file, as that mode requires.)

- Gordon

Danilo Tomasoni

Aug 20, 2024, 2:04:33 AM
to Gensim
Thank you Gordon for your insightful suggestions.
I understand that my setup looks quite strange; I thought that preprocessing, removing stopwords, etc. would help learning, not hurt it, and I was experimenting with that idea.
Initially I had good results, but it looks like they are not reproducible.
What looks strange to me is that I had good results before, and they are not reproducible with this new dataset version. I wonder why.

I was using `corpus_file` to exploit the parallelization as much as possible.
The preprocessing also reduces the file size, which would otherwise be enormous and wouldn't fit on disk.
I may try to disable preprocessing, compress the file with bzip2 to save disk space, and decompress it on the fly before passing it to Gensim.
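
Something along these lines is what I had in mind for the on-the-fly decompression (the file name is a placeholder), feeding Gensim a sentence iterable instead of `corpus_file`:

import bz2

class BZ2Corpus:
    """Stream sentences from a bzip2-compressed, one-sentence-per-line file."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with bz2.open(self.path, "rt", encoding="utf-8") as fin:
            for line in fin:
                yield line.split()

# Placeholder path; this goes to the `sentences` argument, not `corpus_file`,
# so it gives up the corpus_file parallelism.
sentences = BZ2Corpus("corpus.txt.bz2")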

With respect to vector size, I would like to test whether I can squeeze the final vector size and save space in the final model that way, without affecting performance too much.

To evaluate, I'm using domain-specific, manually curated word similarity pairs.

Thank you again for your time.
Danilo

Danilo Tomasoni

Aug 20, 2024, 2:22:37 AM
to Gensim
We use FastText to train, but then avoid synthesizing new word vectors; we just use the ones for words existing in the dataset.

Gordon Mohr

Aug 20, 2024, 5:21:32 PM
to Gensim
Especially in FastText, retaining all word forms (i.e. no stemming/lemmatization, in either training or deployment) *might* help the model learn meaningful n-grams - worth a test.

If some tokens are especially over-represented, a very-aggressive (very small) `sample` parameter might offer a big training speedup, by essentially slimming the corpus. 

If final deployed model size is the major concern, retaining fewer full words or using smaller vector sizes are the most direct approaches. The usual approach - discarding rarer words *during* training - is all that Gensim implements, but in some cases I could imagine you'd want more words retained during training – when you can temporarily use more resources, and learn better n-grams from the larger set. The closest you could get to simulating this in Gensim would be to train with a larger vocabulary, then hand-tamper with the model to discard the rarest full words before deployment.
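
One concrete variant of that, if (as you say) you never need OOV synthesis at deployment time, is to export only the surviving full words into a plain `KeyedVectors` - the count threshold and file names below are just illustrative:

from gensim.models import KeyedVectors

MIN_DEPLOY_COUNT = 50  # illustrative threshold, tune to your size budget

# Keep only full words seen at least MIN_DEPLOY_COUNT times during training.
keep = [w for w in model.wv.index_to_key
        if model.wv.get_vecattr(w, "count") >= MIN_DEPLOY_COUNT]

slim = KeyedVectors(vector_size=model.wv.vector_size)
slim.add_vectors(keep, [model.wv[w] for w in keep])
slim.save("slim_vectors.kv")  # plain word->vector lookup, no n-gram buckets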

There's also a trick used in SpaCy (I think), not implemented in Gensim but maybe not too hard to add, where rarer words that are "close enough" to more-frequent words get their full vectors discarded, but their lookup entries in the token->vector dict point back to the surviving more-frequent vector. (That is, multiple tokens map to the exact same vector.) This would be a custom model postprocessing/compression step.
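
An untested sketch of that idea as a postprocessing step - the cutoff and threshold are purely illustrative, and the per-word most_similar() scan would be slow on a big vocabulary:

FREQ_KEEP = 200_000   # illustrative: most-frequent tokens that keep their own vector
SIM_THRESHOLD = 0.7   # illustrative: how close a rare token must be to borrow one

kv = model.wv                              # trained vectors (placeholder name)
frequent = kv.index_to_key[:FREQ_KEEP]     # index_to_key is sorted most- to least-frequent
rare = kv.index_to_key[FREQ_KEEP:]

# token -> vector dict where "close enough" rare tokens reuse a frequent token's vector
lookup = {w: kv[w] for w in frequent}
for w in rare:
    neighbour, sim = kv.most_similar(w, topn=1, restrict_vocab=FREQ_KEEP)[0]
    if sim >= SIM_THRESHOLD:
        lookup[w] = lookup[neighbour]      # alias the same array, no extra vector stored
    # otherwise the rare token is simply dropped from the deployed lookup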

There's also this project for lossy-compressing FastText models: https://github.com/avidale/compress-fasttext – I haven't used it, but it seems theoretically sound (though again giving up some exactness for the large savings).

Unfortunately keeping the corpus compressed rules out the `corpus_file` optimization, which relies on random-access to uncompressed ranges to split the corpus between non-contending threads. 

Good luck!

- Gordon

Danilo Tomasoni

Aug 27, 2024, 2:29:32 AM
to Gensim
Ah, the compression may be an issue. Thank you, I will try to go in your direction, since it looks like my preprocessing gives sub-optimal (or non-reproducible) results.