Adding and removing vectors from a gensim KeyedVectors Model


Gerhard Wohlgenannt

May 26, 2017, 8:41:50 AM5/26/17
to gensim
Hi everyone,

My question, in quick and easy terms:
Is it possible (and how?) to add and remove vectors from a word2vec/fastText/whatever model on the fly?
Given that I have two models, model1 and model2, I would like to take selected terms from model2 and then get
the most similar term in model1, where model1 is probably a small subset of model2.
For this I need to somehow add the vector to model1, call .most_similar(), and then remove it again?!

Why do I want to do this?
Given that we have a big initial model (like the pretrained fastText vectors for Wikipedia), and my set of results should come from e.g. 2000 pre-selected terms, we need this small model (a subset of the whole).
Furthermore, the system should be very fast, so always saving the subset model plus the new term to disk and reloading it is too slow.

I played around a bit with adding vectors to vocab, syn0, etc., but it's messy, possibly not working, and not elegant.
Saving and loading from disk for every query would be easier -- but slow.

I hope my explanations are understandable; what I'm basically looking for is an easy way to do:
    model.add_vector(term, nparray)
    model.delete_vector(term)

Cheers, Gerhard

Gordon Mohr

May 26, 2017, 1:08:33 PM5/26/17
to gensim
There's no current support for such incremental adds/removes in KeyedVectors. Any such support would require mutating the `vocab` dict and the `syn0` array, and the latter, as a densely-allocated numpy array, has its own problems with quick/efficient incremental edits. (I think to really support this would require a system for pre-allocating space, or splitting `syn0` into multiple segments, making access to a single index a two-level operation.)
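(For illustration only, not gensim code: the cost difference between naively appending rows to a dense array and writing into pre-allocated space can be seen with plain numpy. Names like `syn0`, `store`, and `n_used` here are just stand-ins for this sketch.)

```python
import numpy as np

dim = 100
syn0 = np.random.rand(2000, dim).astype(np.float32)
new_vec = np.random.rand(dim).astype(np.float32)

# Naive incremental add: np.vstack allocates a brand-new (n+1, dim) array
# and copies everything, so repeated single-vector adds get expensive.
grown = np.vstack([syn0, new_vec])
assert grown is not syn0  # a full copy was made

# Pre-allocation approach: reserve capacity up front and track the fill
# level, so an "add" is just one O(dim) row write plus a counter bump.
capacity, n_used = 4000, 2000
store = np.empty((capacity, dim), dtype=np.float32)
store[:n_used] = syn0
store[n_used] = new_vec  # no reallocation, no full copy
n_used += 1
```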

But, if your motivating need is really just "sometimes use vectors from a larger model2 to find similars in a smaller model1", note that `most_similar()` can take a raw vector, rather than the index/key to a vector. That is, a vector need not be 'in' the model to serve as the target for ranking most-similars. 

However, because `most_similar()` tries to support many different combinations of parameters (including lists of keys, or mixtures of positive/negative examples), to pass a raw vector you should be explicit that it is a single positive example. (Otherwise its list-likeness triggers different handling, and each dimension may be treated as a separate positive example.) Specifically:

    raw_vec = model2['rareobscurekey']                  # vector from the larger model
    similars = model1.most_similar(positive=[raw_vec])  # ranked against the smaller model
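(Conceptually, ranking against a raw vector is just cosine similarity over the model's stored vectors, so the query vector never needs to be "in" the model. A plain-numpy sketch with made-up toy vectors, roughly what the two lines above do:)

```python
import numpy as np

# Toy "model1": three words with 2-d vectors (made-up values for illustration).
vocab = ['apple', 'banana', 'carrot']
syn0 = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0]], dtype=np.float32)

def most_similar_by_vector(vec, vocab, syn0, topn=2):
    """Rank vocab words by cosine similarity to an arbitrary vector."""
    norms = np.linalg.norm(syn0, axis=1)
    sims = syn0 @ vec / (norms * np.linalg.norm(vec))
    order = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in order]

# The query vector is not one of the model's own rows:
query = np.array([1.0, 0.05], dtype=np.float32)
print(most_similar_by_vector(query, vocab, syn0))
# 'apple' and 'banana' rank highest
```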

- Gordon