Add custom words to GoogleNews-vectors-negative300.bin pretrained model


pradeep t

Jun 30, 2023, 6:16:30 AM
to Gensim
I want to get the embedding of a custom word, e.g. 'sakariya', from the word2vec algorithm.

When I used the pretrained model GoogleNews-vectors-negative300.bin, it showed that the word 'sakariya' is OOV, with no word embedding for it.
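Here is a minimal sketch of what I tried (gensim 4.x; the local file path is an assumption):

from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (path is an assumption).
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

word = "sakariya"
if word in kv.key_to_index:
    print(kv[word][:5])
else:
    print(f"'{word}' is out-of-vocabulary (OOV)")  # this branch runs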

How can I add the embeddings of my new custom words to the existing pretrained word2vec (GoogleNews-vectors-negative300.bin) model?

If I train a word2vec model on my own corpus, I lose the general word embeddings.

I want both the general-word embeddings from the pretrained model and the embeddings from my custom corpus.


Please suggest a solution for this.

Gordon Mohr

Jun 30, 2023, 2:04:20 PM
to Gensim
Where do you propose to get a word-vector for this new word, one that's *not* in the 3 million words & word-phrases that Google trained from news articles circa 2012-and-earlier, but will somehow have 300 dimensions that are meaningful with respect to those older words' coordinates?

It's not enough to just train your own new 300-dimensional model that has examples of the new word's usage - its coordinates will not be comparable with those of the separately-trained model, unless you take certain extra, advanced steps to try to achieve that. You can see this by comparing a word that is in both the old model and your new model, and seeing how (wildly) different its coordinates are.
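For example, a quick check along these lines (a sketch; the model paths are hypothetical, and both models are assumed to be 300-dimensional):

import numpy as np
from gensim.models import KeyedVectors, Word2Vec

# goog_kv: the pretrained vectors; my_model: your separately-trained model.
goog_kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
my_model = Word2Vec.load("my_corpus_word2vec.model")  # hypothetical path

shared_word = "computer"  # any word present in both vocabularies
v_old = goog_kv[shared_word]
v_new = my_model.wv[shared_word]

# Cosine similarity between the two models' vectors for the *same* word is
# typically near zero: the two coordinate systems are unrelated.
cos = np.dot(v_old, v_new) / (np.linalg.norm(v_old) * np.linalg.norm(v_new))
print(cos)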

- Gordon

pradeep t

Jul 1, 2023, 10:46:10 PM
to gen...@googlegroups.com
Okay, got it.
So you are saying three things:

1- Using a pretrained GloVe model is meaningful and useful.

2- Training our own GloVe embeddings on our custom corpus is also meaningful and useful.

3- But appending new custom word embeddings to the existing embeddings is not meaningful, and is invalid.



Correct me if I am wrong.


Gordon Mohr

Jul 6, 2023, 5:22:40 PM
to Gensim
Yes, & this is generally the case with word-vectors (word2vec, FastText, GloVe, etc.). Their coordinates only have meaning in comparison to other vectors that were co-trained into the same model.

So you can't just append some vectors from another model into an existing set and expect the various similarities/directions between the added vectors and the original vectors to work, for the usual word-vector benefits.
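Note that gensim will let you do the append mechanically (KeyedVectors.add_vector() exists for that); it's the results that won't be meaningful. A sketch:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Stand-in for a vector taken from some *separately trained* 300-d model.
foreign_vec = np.random.rand(300).astype(np.float32)

kv.add_vector("sakariya", foreign_vec)  # mechanically, this works...

# ...but similarity results mixing the appended vector with the originals are
# meaningless, because 'sakariya' sits in an unrelated coordinate system.
print(kv.most_similar("sakariya", topn=3))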

There are ways people have improvised to force word-vectors into existing coordinate systems. As one example, section 2.2 ("Vocabulary Expansion") of a 2015 paper describes a strategy to learn a projection from one set of vectors to another, using words they have in common, with the benefit of carrying the unique/extra words from one over into the other. Inside Gensim, the TranslationMatrix class offers similar functionality, but I'm not sure of its overall utility for such a purpose.
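A bare-bones illustration of that learn-a-projection idea, using plain least squares over the shared vocabulary (a sketch only; real use needs care in choosing good anchor words and validating the result):

import numpy as np

# Assumes small_kv (your model, containing the new word) and big_kv (the
# GoogleNews vectors) are already-loaded KeyedVectors of equal dimensionality.
shared = [w for w in small_kv.key_to_index if w in big_kv.key_to_index]

X = np.vstack([small_kv[w] for w in shared])  # source coordinates
Y = np.vstack([big_kv[w] for w in shared])    # target coordinates

# Least-squares projection W such that X @ W approximates Y.
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Project the new word into the big model's coordinate system and append it.
projected = (small_kv["sakariya"] @ W).astype(np.float32)
big_kv.add_vector("sakariya", projected)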


Still, training one combined word-vector model, on a generous corpus with good example usages of all words of interest, is likely to be the simplest and most robust approach.
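For example (a sketch with toy placeholder corpora and parameters):

from gensim.models import Word2Vec

# Placeholder corpora: each is a list of tokenized sentences. In practice the
# general corpus would be large (Wikipedia, news, etc.) and the custom corpus
# would contain many real usages of words like 'sakariya'.
general_corpus = [["the", "stock", "market", "rose", "sharply", "today"]]
custom_corpus = [["sakariya", "appeared", "in", "our", "domain", "text"]]

# One training run over the combined corpus yields one coordinate system
# covering both the general and the custom vocabulary.
model = Word2Vec(
    sentences=general_corpus + custom_corpus,
    vector_size=300,  # matches the GoogleNews dimensionality
    window=5,
    min_count=1,      # keep rare words for this toy example
    epochs=10,
)
print(model.wv["sakariya"][:5])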

- Gordon

pradeep t

Jul 7, 2023, 12:05:20 AM
to gen...@googlegroups.com
Thank you so much for the updates



--
Thanks and regards
Pradeep.T