Incremental skipgram using gensim

Manuel Jimenez

Oct 26, 2021, 1:53:20 PM

to Gensim

Hi all,

For my master's thesis I need to run some experiments using incremental skip-gram with negative sampling,

https://arxiv.org/abs/1704.03956

so I was trying to replicate the algorithm with gensim, but some doubts came up during development. I know that online training exists for word2vec models...

https://rutumulkar.com/blog/2015/word2vec/

but that approach needs an initial vocabulary before it can update the embeddings, whereas the paper updates the vocabulary (and the weights) as new words arrive, i.e. it follows an incremental learning scheme. Maybe the part that tracks new words with the Misra-Gries algorithm could be dropped, simply ignoring new words once the vocabulary is full. So my question is: is it possible to use the current gensim API to get that modified skip-gram, or would it require changing some package? I read the documentation looking for a data structure that adds new words to the model as they come in, but I didn't find anything like that.

Thanks, all!

Gordon Mohr

Oct 26, 2021, 7:34:47 PM

to Gensim

Gensim's `Word2Vec` has a pretty strong assumption that the vocabulary & effective word-frequencies are each fixed during a training pass.

Even the support for adding to the vocabulary of a prior model, the `build_vocab(..., update=True)` option you mention, relies on a specific step that scans a new corpus, adding all new words at once, before proceeding with traditional training with the new stable vocabulary/frequencies.

Of course anything in the source can be changed, but it would require some pretty deep surgery to enable an update of vocabulary/frequencies with every new training text. It's hard for me to imagine situations where it'd be worth the extra complexity/overhead, compared to just waiting a little longer for a bigger batch of new texts to collect, and doing a smaller number of batch vocab-expansions – or even fresh full retrainings from the new larger corpus, to avoid risks of imbalance/overweighting with regard to later-seen texts.
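As a rough sketch of the "wait for a bigger batch" approach: the helper below just buffers incoming texts and hands them over in groups. `BatchedUpdater`, `batch_size`, and `on_flush` are hypothetical names made up for this example, not part of gensim, and the demo callback only records batch sizes.

```python
from typing import Callable, List

Sentence = List[str]


class BatchedUpdater:
    """Hypothetical helper: collect incoming texts, hand them over in batches."""

    def __init__(
        self,
        batch_size: int,
        on_flush: Callable[[List[Sentence]], None],
    ) -> None:
        self.batch_size = batch_size
        self.on_flush = on_flush
        self.pending: List[Sentence] = []

    def add(self, sentence: Sentence) -> None:
        """Buffer one text; flush automatically once the batch is full."""
        self.pending.append(sentence)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Pass any buffered texts to the callback and clear the buffer."""
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.on_flush(batch)


# Demo with a stand-in callback that only records how big each batch was.
flushed_sizes: List[int] = []
updater = BatchedUpdater(
    batch_size=3,
    on_flush=lambda batch: flushed_sizes.append(len(batch)),
)
for i in range(7):
    updater.add(["text", str(i)])
updater.flush()  # hand over the leftover partial batch
print(flushed_sizes)  # [3, 3, 1]
```

With a real model, the callback would presumably call `model.build_vocab(batch, update=True)` followed by `model.train(batch, total_examples=len(batch), epochs=model.epochs)`, so that each vocabulary expansion covers a whole batch rather than a single text.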