Online training of Gensim's Doc2vec

Nandan Thakur

Apr 24, 2019, 2:48:22 PM

to Gensim

Hello to whomsoever it may concern,

I recently trained a gensim Doc2Vec model on approximately 7k text documents. I mainly use it to find similar documents in my corpus and flag them. Doc2Vec works well for this, and the results look good.

Now, in my project, anyone can upload a new document, which means I then have a new document to index in my model, so my count should increase to 7,001 documents. In the future I may receive more and more documents, and the Doc2Vec model should check similarity against these new docs as well.

So, I wanted to find out whether there is a way to update the model without retraining on all 7,001 documents, that is, to reuse the 7k-document Doc2Vec model and train on only the new document. Is such a solution possible?

If not, could you suggest an alternative approach I could take?

I have already tried build_vocab(update=True) and other solutions, and I have searched a lot, but nowhere is this described as a working solution. Can it be done?

Cheers,
Nandan

Gordon Mohr

Apr 24, 2019, 3:28:32 PM

to Gensim

That's a very small corpus for a `Doc2Vec` model.

But, once you have a trained model, you can infer-vectors for new documents using the `infer_vector()` method, passing it a version of the document tokenized the same as the training data:

https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec.infer_vector

If the new doc has new vocabulary words not already known by the model, they are ignored.

The inferred vector returned is *not* added to the model's bulk-learned vectors (against which `most_similar()` operations collect their results) – if you want to save it aside, and compare to other future documents, you'll have to do that in your own code.
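
As a rough sketch of what "your own code" could look like: the snippet below keeps inferred vectors in a plain dictionary and ranks them by cosine similarity. The class name `InferredVectorStore` and its methods are hypothetical, not part of gensim. In real use the vectors would come from `model.infer_vector(tokens)`; here they are stubbed with short hand-written lists so the example runs without a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InferredVectorStore:
    """Hypothetical side store for vectors that are not inside the model."""

    def __init__(self):
        self._vectors = {}

    def add(self, doc_id, vector):
        # Copy to a list of floats so numpy arrays and lists both work.
        self._vectors[doc_id] = [float(x) for x in vector]

    def most_similar(self, vector, topn=5):
        # Score every stored doc against the query, highest similarity first.
        query = [float(x) for x in vector]
        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


store = InferredVectorStore()
# Stand-ins for model.infer_vector(tokens) results (assumed, not real output).
store.add("upload-1", [1.0, 0.0])
store.add("upload-2", [0.0, 1.0])

# Rank the stored uploads against a new (also stubbed) inferred vector.
print(store.most_similar([1.0, 0.1], topn=1))
```

To also compare a new vector against the bulk-trained documents, the vector itself can be passed to the model's own `most_similar()`, e.g. `model.docvecs.most_similar([vec])` (`model.dv` in gensim 4.x).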

If the new docs all generally "fit within" the vocabulary/variety of the original training docs, you could do this indefinitely... but if the new docs actually introduce useful new vocabulary/variety, then you'd likely eventually want to re-train a fresh model, from scratch, with as many relevant corpus documents as can be included.
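
One hedged way to notice when new docs stop "fitting within" the original vocabulary is to track what fraction of their tokens the model does not know. The helpers below are hypothetical, and the 0.2 threshold is an arbitrary placeholder rather than a recommended value. `known_vocab` is assumed to be built from the model's vocabulary, for example `set(model.wv.vocab)` in gensim 3.x or `set(model.wv.key_to_index)` in 4.x; a small hand-written set stands in for it here.

```python
def oov_fraction(tokens, known_vocab):
    """Share of tokens absent from known_vocab (0.0 for an empty doc)."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in known_vocab)
    return unknown / len(tokens)


def should_retrain(recent_docs, known_vocab, threshold=0.2):
    """True when the average out-of-vocabulary share of recent docs exceeds threshold."""
    if not recent_docs:
        return False
    average = sum(oov_fraction(doc, known_vocab) for doc in recent_docs) / len(recent_docs)
    return average > threshold


# Stand-in for the trained model's vocabulary (assumed for illustration).
known_vocab = {"similar", "documents", "corpus", "model"}
recent_docs = [
    ["similar", "documents", "model"],
    ["quantum", "corpus", "entanglement", "model"],
]
print(should_retrain(recent_docs, known_vocab))
```

A high unknown-token share is only a crude signal: new docs can drift in topic while still using known words, so this would complement, not replace, periodic full retraining.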

- Gordon

Nandan Thakur

Apr 26, 2019, 6:35:53 AM

to Gensim

Hi Gordon, Thanks for the reply.

What if the new docs actually introduce new vocabulary/variety and I want to do incremental learning (since retraining from scratch is time-consuming and costly)? Is this possible with doc2vec? If not, should I switch to word2vec, since it supports incremental learning?

Cheers,

Nandan

Gordon Mohr

Apr 26, 2019, 5:09:44 PM

to Gensim

The only supported answer is occasionally retraining with the full document set to understand the new vocabulary.

The `build_vocab(..., update=True)` option, which was added to `Word2Vec` (with a lot of caveats), has always had a crashing bug if attempted with `Doc2Vec`. (Issue <https://github.com/RaRe-Technologies/gensim/issues/1019>.) But note that even in `Word2Vec`, where vocabulary-expansion is possible, it's easy to do wrong: repeated training with small updates, not representative of the full domain, can worsen a model. I don't know of any straightforward guides to good choices in incremental-corpus construction & parameter choices to ensure it's helping, leaving it up to each user to develop their own project-specific practices.

- Gordon

Ilker

May 27, 2019, 7:33:08 AM

to Gensim

Hi Nandan and Gordon,

I need a similar solution, and my understanding is that the best option among those discussed is the Similarity class (https://radimrehurek.com/gensim/similarities/docsim.html), I guess.

This applies especially where we need to compare a small (~1000), continuously updated list of documents (paragraphs).

In the Similarity class, with the corpus parameter we can pass a fasttext or word2vec model (previously prepared, like Google News etc.) to allow semantic similarity comparison between documents, and with the "add_documents" method we can add a new document so that it can be compared with previously added documents on the next query.