Add custom words to GoogleNews-vectors-negative300.bin pretrained model


pradeep t

Jun 30, 2023, 6:16:30 AM
to Gensim
I want to get the embedding of a custom word, e.g. 'sakariya', from the word2vec algorithm.

When I used the pretrained model GoogleNews-vectors-negative300.bin, it showed that the word 'sakariya' is OOV, with no word embedding for it.
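Here is a minimal sketch of what I tried (gensim 4.x; the local file path is an assumption):

from gensim.models import KeyedVectors

# Load the pretrained Google News vectors (path is an assumption).
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

word = "sakariya"
if word in kv.key_to_index:
    print(kv[word][:5])
else:
    print(f"'{word}' is out-of-vocabulary (OOV)")  # this branch runs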

How can I add the embeddings of my new custom words to the existing pretrained word2vec (GoogleNews-vectors-negative300.bin) model?

If I train a word2vec model on my own corpus, I lose the general word embeddings.

I want both the general-word embeddings from the pretrained model and the embeddings from my custom corpus.


Please suggest a solution for this.

Gordon Mohr

Jun 30, 2023, 2:04:20 PM
to Gensim
Where do you propose to get a word-vector for this new word, one that's *not* in the 3 million words & word-phrases that Google trained from news articles circa 2012-and-earlier, but will somehow have 300 dimensions that are meaningful with respect to those older words' coordinates?

It's not enough to just train your own new 300-dimensional model that has examples of the new word's usage - its coordinates will not be comparable with those of the separately-trained model, unless you take certain extra, advanced steps to try to achieve that. You can see this by comparing a word that is in both the old model and your new model, and seeing how (wildly) different its coordinates are.
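For example, a quick check along these lines (a sketch; the model paths are hypothetical, and both models are assumed to be 300-dimensional):

import numpy as np
from gensim.models import KeyedVectors, Word2Vec

# goog_kv: the pretrained vectors; my_model: your separately-trained model.
goog_kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)
my_model = Word2Vec.load("my_corpus_word2vec.model")  # hypothetical path

shared_word = "computer"  # any word present in both vocabularies
v_old = goog_kv[shared_word]
v_new = my_model.wv[shared_word]

# Cosine similarity between the two models' vectors for the *same* word is
# typically near zero: the two coordinate systems are unrelated.
cos = np.dot(v_old, v_new) / (np.linalg.norm(v_old) * np.linalg.norm(v_new))
print(cos)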

- Gordon

pradeep t

Jul 1, 2023, 10:46:10 PM
to gen...@googlegroups.com
Okay, got it.
So you are saying three things:

1- Using a pretrained GloVe model is meaningful and useful.

2- Training our own GloVe embeddings on our custom corpus is also meaningful and useful.

3- But appending new custom word embeddings to the existing embeddings is not meaningful, and is invalid.



Correct me if I am wrong.


Gordon Mohr

Jul 6, 2023, 5:22:40 PM
to Gensim
Yes, & this is generally the case with word-vectors (word2vec, FastText, GloVe, etc.). Their coordinates only have meaning in comparison to other vectors that were co-trained into the same model.

So you can't just append some vectors from another model into an existing set and expect the various similarities/directions between the added vectors and the original vectors to work, for the usual word-vector benefits.
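Note that gensim will let you do the append mechanically (KeyedVectors.add_vector() exists for that); it's the results that won't be meaningful. A sketch:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Stand-in for a vector taken from some *separately trained* 300-d model.
foreign_vec = np.random.rand(300).astype(np.float32)

kv.add_vector("sakariya", foreign_vec)  # mechanically, this works...

# ...but similarity results mixing the appended vector with the originals are
# meaningless, because 'sakariya' sits in an unrelated coordinate system.
print(kv.most_similar("sakariya", topn=3))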

There are ways people have improvised to force word-vectors into existing coordinate systems. As one example, section 2.2 ("Vocabulary Expansion") of a 2015 paper describes a strategy to learn a projection from one set of vectors to another, using words they have in common, with the benefit of carrying the unique/extra words from one over into the other. Inside Gensim, the TranslationMatrix class offers similar functionality, but I'm not sure of its overall utility for such a purpose.
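A bare-bones illustration of that learn-a-projection idea, using plain least squares over the shared vocabulary (a sketch only; real use needs care in choosing good anchor words and validating the result):

import numpy as np

# Assumes small_kv (your model, containing the new word) and big_kv (the
# GoogleNews vectors) are already-loaded KeyedVectors of equal dimensionality.
shared = [w for w in small_kv.key_to_index if w in big_kv.key_to_index]

X = np.vstack([small_kv[w] for w in shared])  # source coordinates
Y = np.vstack([big_kv[w] for w in shared])    # target coordinates

# Least-squares projection W such that X @ W approximates Y.
W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# Project the new word into the big model's coordinate system and append it.
projected = (small_kv["sakariya"] @ W).astype(np.float32)
big_kv.add_vector("sakariya", projected)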


Still, training one combined word-vector model, on a generous corpus with good example usages of all words of interest, is likely to be the simplest and most robust approach.
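For example (a sketch with toy placeholder corpora and parameters):

from gensim.models import Word2Vec

# Placeholder corpora: each is a list of tokenized sentences. In practice the
# general corpus would be large (Wikipedia, news, etc.) and the custom corpus
# would contain many real usages of words like 'sakariya'.
general_corpus = [["the", "stock", "market", "rose", "sharply", "today"]]
custom_corpus = [["sakariya", "appeared", "in", "our", "domain", "text"]]

# One training run over the combined corpus yields one coordinate system
# covering both the general and the custom vocabulary.
model = Word2Vec(
    sentences=general_corpus + custom_corpus,
    vector_size=300,  # matches the GoogleNews dimensionality
    window=5,
    min_count=1,      # keep rare words for this toy example
    epochs=10,
)
print(model.wv["sakariya"][:5])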

- Gordon

pradeep t

Jul 7, 2023, 12:05:20 AM
to gen...@googlegroups.com
Thank you so much for the updates



--
Thanks and regards
Pradeep.T