Hi,
I would like to understand the right way to resume a Word2Vec model and continue the training process. There is something I am still not clear about. Could you please help?
After
==
from gensim.models.word2vec import Word2Vec
all_sentences = [['first', 'sentence'], ['second', 'sentence'], ['third', 'sentence'], ['fourth', 'sentence']]
some_sentences = [['first', 'sentence'], ['second', 'sentence']]
model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.train(some_sentences)
print "similarity_1:"
print model.similarity('first','second')
print "most_similarity_1:"
print model.most_similar(positive=['first', 'sentence'], negative=['second'], topn=1)
model.save('mdlObj')
==
was executed, it returned
==
similarity_1:
-0.0450417522552
most_similarity_1:
[('fourth', -0.09383071959018707)]
==
Next,
==
from gensim.models.word2vec import Word2Vec
all_sentences = [['first', 'sentence'], ['second', 'sentence'], ['third', 'sentence'], ['fourth', 'sentence']]
model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.load('mdlObj')
print "similarity_2:"
print model.similarity('first','second')
print "most_similarity_2:"
print model.most_similar(positive=['first', 'sentence'], negative=['second'], topn=1)
other_sentences = [['third', 'sentence'], ['fourth', 'sentence']]
model.train(other_sentences)
print "similarity_3:"
print model.similarity('first','second')
print "most_similarity_3:"
print model.most_similar(positive=['first', 'sentence'], negative=['second'], topn=1)
==
was executed. It returned
==
similarity_2:
-0.0451681058758
most_similarity_2:
[('fourth', -0.09381453692913055)]
similarity_3:
-0.0451681058758
most_similarity_3:
[('fourth', -0.09404119849205017)]
==
Question 1:
Is this the right way to resume a word2vec model and continue the training? In other words, I built the vocabulary tree from all the sentences, loaded the saved model, and then trained on the other sentences. I expect to have the same word vectors for 'first', 'sentence', 'second', 'third' and 'fourth' after the above 2 executions, just like what we get from
==
from gensim.models.word2vec import Word2Vec
all_sentences = [['first', 'sentence'], ['second', 'sentence'], ['third', 'sentence'], ['fourth', 'sentence']]
model = Word2Vec(min_count=1)
model.build_vocab(all_sentences)
model.train(all_sentences)
==
Question 2:
Should similarity_1 and similarity_2 be identical theoretically?
Should most_similarity_1 and most_similarity_2 be identical theoretically?
Or do they differ because of model loading and/or rebuilding the vocabulary tree?
Question 3:
Should most_similarity_2 and most_similarity_3 be different, because additional sentences were used for training?
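Regarding Question 2, one thing that may matter is how load is called. In gensim, load is a class method that returns the loaded model, so calling it on an existing instance and ignoring the return value would leave that instance as it was. Below is a minimal sketch of that behaviour. It uses a hypothetical stand-in class, TinyModel, with made-up weights and a JSON file; it is not gensim code and only illustrates the pattern.

```python
import json
import os
import tempfile


class TinyModel(object):
    """Hypothetical stand-in for a model with save/load (not gensim)."""

    def __init__(self, weights):
        self.weights = list(weights)

    def save(self, path):
        # Persist only the weights, as JSON.
        with open(path, "w") as handle:
            json.dump(self.weights, handle)

    @classmethod
    def load(cls, path):
        # Builds and returns a NEW object; no existing instance is modified.
        with open(path) as handle:
            return cls(json.load(handle))


path = os.path.join(tempfile.mkdtemp(), "mdlObj")
TinyModel([1.0, 2.0]).save(path)

fresh = TinyModel([9.0, 9.0])
fresh.load(path)                 # return value discarded, fresh is unchanged
restored = TinyModel.load(path)  # the saved state lives in the returned object

print(fresh.weights)     # [9.0, 9.0]
print(restored.weights)  # [1.0, 2.0]
```

If this applies to my second script, then model.load('mdlObj') would have no effect on model, and the line would need to be model = Word2Vec.load('mdlObj'). That might explain why similarity_2 is close to, but not the same as, similarity_1: the model would be freshly initialized rather than restored.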
Thanks for your help.
Best,
Henry