Online training / Continue training of doc2vec model

Satya Gunnam

Apr 10, 2017, 6:03:17 PM
to gensim
I have seen some older posts refer to this, but I am not sure whether the pull request below ever made it into doc2vec.

Is there an update_vocab() method available, as shown in the code snippet from the pull request?

Usage:

model = Word2Vec() # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")

model.update_vocab(new_sentences)
model.train(new_sentences)
model.save("updated_model")

By online training, I mean loading an existing model and adding new documents (and their vocabulary).

Thanks

Lev Konstantinovskiy

Apr 11, 2017, 9:21:42 AM
to gensim
Hi,

Vocabulary expansion for doc2vec is not a supported feature. Some people got it to work with hs=0, while other modes gave errors. See the discussion in this GitHub ticket.

Regards
Lev

Satya Gunnam

Apr 11, 2017, 11:46:28 AM
to gensim
Hi Lev:

Thanks again.

Is this something you are considering adding as a supported feature in future gensim versions?

Thanks

Lev Konstantinovskiy

Apr 11, 2017, 3:50:29 PM
to gensim

If someone implements it, then we will maintain it, but it's not a priority.
The reason is that the value of this feature is not clear. There hasn't been a study of this vocab-expansion technique or of how useful the results are.

Satya Gunnam

Apr 18, 2017, 1:09:32 PM
to gensim
Hi Lev:

The use case for this feature is: we have new documents being added to the corpus all the time, and instead of re-creating the model
every few weeks, one would rather add the new documents to the existing model incrementally. I am probably not understanding the technical
difficulties of implementing this feature.

Anyway, for a use case like ours at this point, is my understanding correct that we re-create the model every few weeks (or whatever time period we decide)
with all the docs in the corpus at that point in time?

Thanks

Lev Konstantinovskiy

Apr 18, 2017, 6:56:21 PM
to gensim
Hi,

If the vocabulary is fixed, then it is possible to update the model as often as needed. So if the new documents contain only words that are already in the vocabulary, then you are fine updating as soon as you receive new docs.
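
Whether a batch of new documents stays inside the existing vocabulary can be checked before deciding between an in-place update and a full retrain. A minimal sketch in plain Python, assuming documents are already tokenized into lists of strings; the helper name out_of_vocabulary and the toy data are made up for illustration and are not part of gensim:

```python
from typing import Iterable, List, Set


def out_of_vocabulary(known_vocab: Set[str], new_docs: Iterable[List[str]]) -> Set[str]:
    """Return the tokens in new_docs that the trained vocabulary does not contain."""
    unseen: Set[str] = set()
    for tokens in new_docs:
        unseen.update(t for t in tokens if t not in known_vocab)
    return unseen


# Hypothetical vocabulary of an already trained model.
known_vocab = {"online", "training", "of", "doc2vec", "model"}

# Hypothetical batch of newly arrived, tokenized documents.
new_docs = [["online", "training"], ["incremental", "doc2vec", "model"]]

missing = out_of_vocabulary(known_vocab, new_docs)

# With no unseen words, the fixed-vocabulary case described above applies.
can_update_in_place = not missing
```

If missing is non-empty, the batch introduces new words, which is the case that needs vocabulary expansion or a retrain.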

Satya Gunnam

Apr 19, 2017, 6:35:44 PM
to gensim
Hi Lev:
No, there could be new words added to the vocab via the new docs.

Thanks

Sumeet Sandhu

Jun 18, 2017, 7:08:48 PM
to gensim
Adding my vote to request support for this feature. It takes a long time to train a large model, and it would be HUGELY useful to update the model periodically with new documents rather than retrain the whole model each time, assuming the update takes less time.

In any application domain, new data will definitely have new words as the domain evolves. 

tonko dvadva

Jun 22, 2017, 4:48:21 AM
to gensim
Looks like I've found a custom solution here: https://github.com/RaRe-Technologies/gensim/issues/1019
I haven't tried it yet.

Xiaowei Liu

Sep 13, 2017, 3:03:10 AM
to gensim
Requesting support for this feature :)

On Monday, June 19, 2017 at 7:08:48 AM UTC+8, Sumeet Sandhu wrote:

Ivan Menshikh

Sep 13, 2017, 4:50:10 AM
to gensim
Hi Xiaowei,
Please vote on the issue. We listen to the opinions of our users.

Sumeet Sandhu

Sep 13, 2017, 11:41:09 AM
to gen...@googlegroups.com
Is there an existing issue/request on this topic?


Ivan Menshikh

Sep 14, 2017, 4:14:42 AM
to gensim
Yep, issue #1019: https://github.com/RaRe-Technologies/gensim/issues/1019

Gordon Mohr

Sep 14, 2017, 12:01:11 PM
to gensim
Issue #1019 looks to me like the report of one specific seg-fault crash when trying to use Word2Vec vocabulary-expansion in Doc2Vec.

Even if that crash is fixed, these users' desire for fully online/incremental training for Doc2Vec (or even Word2Vec) won't necessarily be met. There are really no write-ups on whether or how this might be done effectively – the steps and parameters to use – and the naive things people try may mostly drive a model 'sideways' in its desirable properties. Making it work, even in just a few well-defined situations, would be a research project, and then a documentation effort, beyond any simple crash fixes.

- Gordon

Andrey Kutuzov

Sep 15, 2017, 1:07:55 PM
to gen...@googlegroups.com
Incrementally updating Word2Vec models with simple vocabulary expansion
(as implemented in Gensim) does work, at least for some specific tasks.
See this paper:
http://aclanthology.info/papers/D17-1194/d17-1194

Of course this is kind of a dirty hack, considering all the issues with
the learning rate. But it performed better than training on all the data
in this particular setting. So, this feature is useful.

By the way, there is another paper which suggests a more complicated
approach to incremental skipgram models with online vocabulary
expansion. But it would have to be implemented from scratch, as their
code is proprietary:
http://aclanthology.info/papers/D17-1037/d17-1037


--
Solve et coagula!
Andrey