Hi Gordon,
Thanks for the answer.
I agree that the ideal approach would be to retrain the model from
scratch, with the new texts included in the training corpus. But that is
sometimes impossible due to time limits, especially when the initial
corpus is very large. Besides, tracking how distributional models evolve
under additional training is an interesting research problem in itself.
That is why I am trying to update a model with new data rather than
simply retraining from scratch.
Considering your other questions:
1) The original model's hyperparameters precisely mimic those of the new
model, including `sample=0'. I do not use downsampling here, because
stop words were removed from the training corpus beforehand. If I
understand correctly, this means that `sample_int' is the same for all
words in the model.
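For reference, here is a pure-Python sketch of the subsampling threshold word2vec computes per word (an approximation of gensim's formula, with invented word counts); it shows why `sample=0' yields the same `sample_int' for every word:

```python
import math

def sample_int(word_count, total_words, sample, domain=2**32 - 1):
    """Approximate the per-word downsampling threshold.

    With sample == 0, downsampling is disabled: every word keeps
    probability 1, so the threshold is identical for all words.
    """
    if sample <= 0:
        return domain  # no downsampling: keep every occurrence
    frequency = word_count / total_words
    keep_prob = (math.sqrt(frequency / sample) + 1) * (sample / frequency)
    return int(round(min(keep_prob, 1.0) * domain))

# With sample=0, frequent and rare words get the same threshold:
print(sample_int(10_000, 1_000_000, 0) == sample_int(3, 1_000_000, 0))  # True
```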
2) I tried retaining the original syn1neg and simply appending new zero
rows to it to match the new vocabulary size. It didn't change anything.
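For what it's worth, the row-padding I tried looks essentially like this (a minimal numpy sketch; the vocabulary and vector sizes are invented for illustration):

```python
import numpy as np

# Hypothetical sizes: old vocab of 5 words, 3 new words, 10-dim vectors.
old_vocab_size, new_words, vector_size = 5, 3, 10

syn1neg = np.random.rand(old_vocab_size, vector_size).astype(np.float32)

# Append zero rows so syn1neg matches the enlarged vocabulary.
padding = np.zeros((new_words, vector_size), dtype=syn1neg.dtype)
syn1neg = np.vstack([syn1neg, padding])

print(syn1neg.shape)  # (8, 10)
```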
3) The `data' certainly provides examples. It is an instance of
LineSentence over a simple gzipped text file. Logging shows proper
progress, with the right number of words at the end.
And the corpus from which `data' is loaded certainly contains my test
words. When I train a model from scratch on this corpus, these words get
quite good and meaningful vectors. So I expected this to hold when
feeding the new data to the original model: that it would acquire these
new words and learn vectors for them.
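Concretely, the iterator behaves like this stdlib-only sketch of what LineSentence does with a gzipped file (one whitespace-tokenised sentence per line; the file contents here are invented for demonstration):

```python
import gzip
import tempfile

def line_sentences(path):
    """Yield one token list per line, like gensim's LineSentence."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.split()

# Build a tiny gzipped corpus to demonstrate.
with tempfile.NamedTemporaryFile(suffix=".txt.gz", delete=False) as tmp:
    corpus_path = tmp.name
with gzip.open(corpus_path, "wt", encoding="utf-8") as handle:
    handle.write("the new word appears here\nanother test sentence\n")

sentences = list(line_sentences(corpus_path))
print(sentences[0])  # ['the', 'new', 'word', 'appears', 'here']
```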
Why the vectors stay unchanged is a big puzzle for me. Maybe you could
try my code with your own data?