Continue training existing fasttext model on multicore


Bo

Jul 29, 2021, 1:45:34 PM
to Gensim

Hi, can I please ask: is it possible to specify `workers` when continuing to train an existing fastText model?

import gensim.downloader as api
from gensim.models.fasttext import FastText
model = api.load("fasttext-wiki-news-subwords-300") 
model.build_vocab(sentences, update=True) 
total_words = model.corpus_total_words 
model.train(sentences, total_examples=len(sentences), total_words=total_words, epochs=model.epochs)

It doesn't seem like I can change this parameter on the existing model. Any advice would be much appreciated! Thanks!

Gordon Mohr

Jul 29, 2021, 7:50:32 PM
to Gensim
What happened when you tried to change it, for example by `model.workers = new_value`? Was there an error, or some indication it hadn't taken effect, or something else? 

Separately: even if you follow these steps, and manage to see additional training happening with any specified number of worker threads, I suspect this sort of expand-words-and-keep-training operation is almost always a bad idea, with more ways to go wrong than right. Online examples I've seen encouraging this operation never show a clear understanding of the risks/tradeoffs, demonstrate tangible benefits, or give due attention to evaluating the results or considering alternatives.

- Gordon

Bo

Jul 30, 2021, 1:07:16 AM
to Gensim
Thanks Gordon!

Initially, when I checked out model.workers, I got:

>>> model.workers
AttributeError: 'KeyedVectors' object has no attribute 'workers'

And I thought I couldn't set the parameter, but actually I could, e.g. `model.workers = 4`. I will give it a go and check whether training really uses multiple cores now.

Hmm, do you suggest training the fastText model from scratch then, instead of continuing training? I have a set of multiword expressions (MWEs), ranging from bigram short phrases to 5-gram phrases, that I want to find similar terms for. Currently there is no embedding model (w2v/GloVe/fastText) that is able to embed these MWEs. Therefore the aim of the continued training is to use the Phrases functionality in Gensim to add more MWEs to the model vocab.

Gordon Mohr

Jul 31, 2021, 5:50:11 PM
to Gensim
If the `model` object you're working with is a `KeyedVectors`, you can set a new `.workers` property on it just fine - but a `KeyedVectors` object doesn't support further training, so there's no point. Also, a plain `KeyedVectors` object won't have any FastText-like features, either, like synthesizing vectors for OOV tokens. 

If you can generate a text corpus with examples of your multiword-expressions, as tokens, in realistic contextual usages alongside other words, then you could train a `FastText` or other model from that corpus. The quality of the representations will largely be a factor of the size/variety/representativeness of the corpus. I suspect that collecting more data that's relevant to your domain, and tokenizing that the right way for your domain, is likely to be more fruitful than grafting extra differently-tokenized data onto some model from elsewhere.

- Gordon
