Adding additional documents to an existing FastText model

Pankaja Shankar

Dec 4, 2021, 12:03:38 AM
to Gensim
Premise: I'm using the cc.nl.300.bin FastText model for Dutch, which has a 200,000-word vocabulary. I need to add about 159 additional documents to the corpus and retrain, so I do the following:
from gensim.models import fasttext

corpus = [...]  # 159 new documents, each tokenized into a list of words
bin_model = fasttext.load_facebook_model("cc.nl.300.bin")
bin_model.build_vocab(corpus, update=True)

# train
bin_model.train(corpus_iterable=corpus, total_examples=len(corpus), epochs=5)

When I then check in the debug window, bin_model still has only 200,000 words.
Not sure what I am doing wrong.

Gordon Mohr

Dec 5, 2021, 5:57:22 PM
to Gensim
A likely proximate reason that you're not seeing a larger vocabulary, after `build_vocab(..., update=True)`, is that your new corpus does not contain enough examples of each of the new words' usage to pass the model's `min_count` requirement. 
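One quick check, sketched below with a made-up token corpus standing in for the real documents and gensim's default `min_count` of 5 (inspect your loaded model's `min_count` attribute for its actual value), is to count how often each candidate new word appears in your new texts:

```python
from collections import Counter

# Hypothetical tokenized corpus standing in for the 159 new documents.
corpus = [
    ["verse", "geitenkaas", "250g"],
    ["geitenkaas", "blok", "biologisch"],
    ["verse", "melk", "halfvol"],
]

min_count = 5  # gensim's default; check bin_model.min_count for your model

freq = Counter(token for doc in corpus for token in doc)
too_rare = sorted(w for w, c in freq.items() if c < min_count)
print(too_rare)  # words build_vocab(update=True) will silently skip
```

If your new words genuinely occur fewer than `min_count` times, you can add more usage examples, or lower `min_count` on the model before the `build_vocab` call - though a word seen only once or twice usually carries too little signal to learn a useful vector anyway.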

More generally, you can't necessarily be confident, with such a tiny increment of training atop an old model, that it will do more good than harm. At the same time as it is training up the new words, it's also continuing to adjust old words that appear in the new texts - pulling them away from their prior representations - perhaps in a way that's not balanced against their prior training examples. (The same also occurs with the character n-grams.) In some cases, the new words might wind up with "good enough" representations with little damage to what was learned from the original training corpus... in others, your new examples might leave parts of the model less useful/comparable with regard to the original data's patterns. You should be sure you have a way to evaluate whether such ad-hoc, incremental model updates are actually helping, especially compared to the baseline alternatives of (a) just relying on FastText's inherent ability to synthesize vectors for OOV words; or (b) retraining the whole model, with old & new data, including new word examples mixed throughout - ensuring an equal treatment of all words.
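As a minimal sketch of option (b)'s "mixed throughout" detail (with placeholder corpora standing in for the real data), the point is to shuffle the new-word examples in among the old texts, rather than appending them at the end where they'd be trained last:

```python
import random

old_corpus = [["oude", "zin", str(i)] for i in range(1000)]      # stand-in for the original texts
new_corpus = [["nieuw", "product", str(i)] for i in range(159)]  # stand-in for the 159 new documents

combined = old_corpus + new_corpus
random.Random(0).shuffle(combined)  # interleave new-word examples throughout

# `combined` would then be used to train a fresh model from scratch,
# rather than incrementally updating the old one.
```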

- Gordon

Pankaja Shankar

Dec 5, 2021, 11:14:44 PM
to Gensim
If I don't go with retraining, then how do I synthesize vectors for OOV words? I am new to this, hence asking what I should be doing.
The premise of my problem: we are determining similarity scores for scraped data on food products for a company in Belgium, and we are using the pretrained Facebook Dutch model, which consists of 200,000 words. But upon querying we found about 159 words coming up as OOV, and this is for just one store (a food client of this company), so we don't know how many such words will occur with other clients. Hence we thought to at least do a POC with this one client and try retraining the model. Another factor: Dutch words and product naming vary from store to store (client to client).

Pankaja Shankar

Dec 6, 2021, 9:40:54 AM
to Gensim
Also, quoting your response: "(b) retraining the whole model, with old & new data, including new word examples mixed throughout - ensuring an equal treatment of all words" - how would I get the original data set the model was pretrained on? I am referring to this link ...


I would appreciate any input/help/directions from you.

Gordon Mohr

Dec 6, 2021, 4:46:37 PM
to Gensim
I'm not sure what you mean by "coming up as OOV words". Full FastText models will return a (synthesized guess) vector even for OOV words, as long as they're longer than the model's minimum character-n-gram length. (If they're shorter, then there's no basis for FastText to synthesize a vector from the fragments.)
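To make the n-gram point concrete, here is a rough pure-Python sketch of the subword decomposition FastText uses (words get '<' and '>' boundary markers before slicing; the n-gram lengths below are illustrative - check `bin_model.wv.min_n` and `max_n` on your loaded model for the real settings):

```python
def char_ngrams(word, min_n=5, max_n=5):
    # FastText wraps each word in boundary markers before slicing.
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("geitenkaas"))  # 8 fragments to synthesize a vector from
print(char_ngrams("ei"))          # [] - too short: no basis for a guess
```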

Note that to get good from-scratch vectors for 159 new words, you'd want a training corpus with many, ideally dozens, of varied, realistic usage examples for each word - so many thousands of words of new training material. (And the balance issues I mentioned earlier would still apply.)

- Gordon

Gordon Mohr

Dec 6, 2021, 4:50:31 PM
to Gensim
I believe the `cc` in your filename indicates the model was trained on 'Common Crawl' data, which is public - though you'd want to check the Facebook site you got the vectors from for full details on which data, of what vintage, with what preprocessing, they used.

If your domain uses a lingo with different words, and different senses for shared words, than a general web crawl, then you might have far better results using domain-specific training data - either exclusively, or, if your domain-specific data is thin, mixed with more generic data that seems compatible in word senses.

- Gordon 
