How does online/incremental training work in a Word2Vec model using Gensim?


Mahfuja Nilufar

May 25, 2021, 11:49:57 PM
to Gensim

Using the Gensim library, we can load a saved model and update its vocabulary when new sentences are added. That means that if you save the model, you can continue training it later. I checked this with sample data. Say my vocabulary contains a previously trained word (e.g. "women"). I then take new sentences and call model.build_vocab(new_sentence, update=True) followed by model.train(new_sentence), and the model is updated. The new sentences contain a word that already exists in the previous vocabulary ("women") and a new word ("girl") that does not. After updating the vocabulary, the model contains both the old and the new words. I checked model.wv['women']: the vector changed after the vocabulary update and training on the new sentences. I can also get an embedding vector for the new word, i.e. model.wv['girl']. The vectors of all other previously trained words that do not appear in new_sentence did not change.

Code:
from gensim.models import Word2Vec

model = Word2Vec(old_sentences, vector_size=100, window=5, min_count=1)
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
# total_examples and epochs are arguments of train(), not build_vocab()
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)

However, I don't understand in depth how the online training works. I get the code, but I want to understand how online training works in theory. Does it retrain the model from scratch on the old and new training data?

Thanks!
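
To illustrate the question above, here is a minimal pure-Python sketch of what incremental training generally means for a skip-gram model: new words get fresh rows, and gradient updates run only over the new sentences, starting from the existing weights. It is a toy model, not Gensim's implementation; the helper names add_new_words and train, the uniform negative sampling, and the tiny corpus are all assumptions made for illustration.

```python
import math
import random


def add_new_words(in_vecs, out_vecs, sentences, dim, rng):
    """Toy analogue of a vocabulary update: only unseen words get new rows."""
    for sentence in sentences:
        for word in sentence:
            if word not in in_vecs:
                in_vecs[word] = [(rng.random() - 0.5) / dim for _ in range(dim)]
                out_vecs[word] = [0.0] * dim


def train(in_vecs, out_vecs, sentences, rng, epochs=5, lr=0.025, window=2, negatives=2):
    """Toy skip-gram with negative sampling over the given sentences only."""
    vocab = list(in_vecs)
    for _ in range(epochs):
        for sentence in sentences:
            for i, center in enumerate(sentence):
                lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    positive = sentence[j]
                    # One positive target plus a few uniformly drawn negatives
                    # (a simplification; real implementations sample by frequency).
                    targets = [(positive, 1.0)]
                    targets += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
                    v = in_vecs[center]
                    grad = [0.0] * len(v)
                    for word, label in targets:
                        if label == 0.0 and word == positive:
                            continue  # skip a negative that equals the positive
                        u = out_vecs[word]
                        dot = sum(a * b for a, b in zip(v, u))
                        score = 1.0 / (1.0 + math.exp(-dot))
                        g = lr * (label - score)
                        for k in range(len(v)):
                            grad[k] += g * u[k]
                            u[k] += g * v[k]
                    # Only the input vector of the current center word is touched.
                    for k in range(len(v)):
                        v[k] += grad[k]


rng = random.Random(0)
dim = 8
in_vecs, out_vecs = {}, {}

old_sentences = [["women", "and", "men"], ["king", "and", "queen"]]
new_sentences = [["women", "and", "girl"]]

# Initial training on the old corpus.
add_new_words(in_vecs, out_vecs, old_sentences, dim, rng)
train(in_vecs, out_vecs, old_sentences, rng)
before = {w: list(vec) for w, vec in in_vecs.items()}

# Incremental step: extend the vocabulary, then train on the new sentences only.
add_new_words(in_vecs, out_vecs, new_sentences, dim, rng)
train(in_vecs, out_vecs, new_sentences, rng)

changed = sorted(w for w in before if in_vecs[w] != before[w])
unchanged = sorted(w for w in before if in_vecs[w] == before[w])
added = sorted(w for w in in_vecs if w not in before)
print("changed:", changed)
print("unchanged:", unchanged)
print("added:", added)
```

In this toy run, only the words that occur in new_sentences change, "girl" is added, and the remaining old words keep their input vectors, which matches the behaviour described in the post. The old sentences are never revisited, so this is a continuation from the existing weights rather than a retrain from scratch.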

Mahfuja Nilufar

May 26, 2021, 12:18:33 AM
to Gensim
I followed the tutorial by Rutu Mulkar

Radim Řehůřek

May 26, 2021, 12:56:35 PM
to Gensim
Hi,

I don't think Rutu ever finished that work. And that blog post is from 2015, likely heavily outdated. I think it's fair to say Gensim doesn't support online updates to existing word2vec models.

We generally recommend retraining from scratch, from corpus_new + corpus_old, if possible.

HTH,
Radim

Mahfuja Nilufar

May 26, 2021, 1:45:03 PM
to gen...@googlegroups.com
Hello,

Thanks for your reply. 

I checked the Gensim site; it says "If you save the model you can continue training it later:"
[Attached screenshot: Screen Shot 2021-05-26 at 1.31.37 PM.png]

[Attached screenshot: Screen Shot 2021-05-26 at 1.40.19 PM.png]
What does that mean?

Also, the build_vocab() method has the option update=True, which means "If true, the new words in sentences will be added to model's vocab."

[Attached screenshot: Screen Shot 2021-05-26 at 1.32.36 PM.png]

Given all of the above, what does "If you save the model you can continue training it later" actually mean? Does it mean that when new sentences arrive, we load the model (which basically gives us the whole vocabulary), update the vocabulary with the new words, and then retrain from scratch?

Please let me know. 

Thanks
Mahfuja


Mahfuja Nilufar

May 26, 2021, 3:00:01 PM
to gen...@googlegroups.com
Hi,

I also ran some tests on my dataset. Initially I trained my model on 450k rows (each row is a paragraph). After training, the total vocabulary was 101,949 words and training took 3.11 min. This does not include data cleaning and preprocessing time.

Then I tested with 9,100 new sentences, using the following code.
model = Word2Vec.load("word2vec.model")
# total_examples and epochs are arguments of train(), not build_vocab()
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)

As a result, my total vocabulary is now 102,300 words, and the update took only 0.21 min. So how can we say that after updating the vocabulary it retrains the model from scratch on old_word + new_word?
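
The figures reported above can be put side by side with a few lines of arithmetic. This is only a rough sketch: it treats "450k" as exactly 450,000 rows and assumes rows in both runs are of comparable length, neither of which the post confirms.

```python
# Figures as reported in the post; "450k" is taken as 450,000 (an assumption).
initial_rows, initial_minutes = 450_000, 3.11
update_rows, update_minutes = 9_100, 0.21
vocab_before, vocab_after = 101_949, 102_300

new_words = vocab_after - vocab_before
row_ratio = update_rows / initial_rows
time_ratio = update_minutes / initial_minutes

print(f"new vocabulary entries: {new_words}")              # 351
print(f"update rows / initial rows: {row_ratio:.1%}")      # 2.0%
print(f"update time / initial time: {time_ratio:.1%}")     # 6.8%
```

The update processed about 2% as many rows and took about 7% of the original time. A pass over the old and new data together would be expected to take at least as long as the initial run, so these numbers are more consistent with training on the new sentences only (plus fixed costs such as loading the model and scanning the vocabulary).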

Please help me understand in depth. 

Thanks

xx liu

May 26, 2021, 11:47:58 PM
to Gensim
I have tried this approach too; my post is here: https://groups.google.com/g/gensim/c/aw-D7P5dqEw/m/JzhqJncqAAAJ
As far as I know, you can train on your corpus with update=True, BUT it is NOT a good idea.

In NLP, it is common to fine-tune word embeddings on a downstream task, so I think an "online update" is the same as that.
BUT if you train word embeddings like this, your learning rate, weight decay, or other parameters may not be right.
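
One concrete reason the learning rate deserves attention: Word2Vec training in Gensim decays the learning rate linearly from alpha (default 0.025) to min_alpha (default 0.0001) over a single train() call, so a later train() call starts again from the higher value unless told otherwise. The sketch below only reproduces that schedule; linear_alpha is a hypothetical helper written for illustration, not a Gensim function.

```python
def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate after a given fraction (0..1) of one training call."""
    return max(end_alpha, start_alpha - (start_alpha - end_alpha) * progress)


# Five checkpoints across the initial run and across a later update run.
initial_run = [linear_alpha(step / 4) for step in range(5)]
update_run = [linear_alpha(step / 4) for step in range(5)]

print("initial run:", [round(a, 5) for a in initial_run])
print("update run: ", [round(a, 5) for a in update_run])
```

The update run begins at 0.025 even though the initial run ended near 0.0001, so words in the new sentences are moved with large steps while all other words stay where they were. train() accepts start_alpha and end_alpha arguments, so a smaller starting rate can be passed for the update; whether that gives good vectors still has to be checked on your own data.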
