(newbie) Word2Vec KeyedVectors to TfidfModel?


S. P.

Jul 12, 2021, 8:17:40 PM
to Gensim
This question might reflect a mistaken approach, but here it is....

I have a corpus, and I was going to transform its bag-of-words using a model from a different (large) corpus (Google News). I was thinking that would be better than (as in the tutorials) using the corpus itself, which is relatively small. I'm not sure this approach is sound, so let me know if I'm off base.

So, I get the model of the big corpus via Word2Vec... which gives me a KeyedVectors object, i.e., a collection of word vectors. Specifically, 3 million vectors of 300 floats each.

Now, I want to take this model, in the form of KeyedVectors, and use it to transform my corpus bag-of-words. But I don't see any way to do that. I see lots of info on how I can use this Word2Vec model to do similarity searches, etc. But I don't want to search that corpus. I want to search my own, smaller corpus. But I want to use this big model to do a better transformation of my bag-of-words.

Any pointers?

Thanks.

Gordon Mohr

Jul 13, 2021, 1:57:18 PM
to Gensim
Using a related-but-larger corpus to establish your full vocabulary & word frequencies could perhaps be a good idea, in some situations. 

But note: to the extent that other corpus uses a different domain's docs/lingo, with different word senses/frequencies, it might not be a good fit for your docs/domain. And, if you have many word slots that don't appear at all in your smaller corpus, you've got lots of 'blanks for future use' that may just widen/complicate your own IR/classification steps. 

For example, the `GoogleNews` set of word-vectors includes 3 million tokens, from a news-article training set circa 2012 with perhaps a hundred billion or more words. But: the actual corpus isn't available. All of the preprocessing & phrase-combination steps Google followed (to create compound phrase tokens) have never been publicly documented. (The best outsiders can do is try to approximate the same steps on their texts.) And, their word-vectors format doesn't include the relative word-frequencies, though you could assume/approximate them by applying a Zipfian distribution to the vocabulary, treating it as a most-to-least-frequent ordered list. 
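That Zipf-based guess can be sketched in a few lines. This is just an illustration under the stated assumptions: that the vocabulary really is ordered most- to least-frequent (as `index_to_key` is for the `GoogleNews` vectors, by convention), and that a 1/rank pseudo-frequency is close enough; the helper name and `total` scale are made up for the example.

```python
# Sketch: assign Zipfian pseudo-frequencies to an ordered vocabulary.
# Assumes the list is sorted most- to least-frequent; `total` is an
# arbitrary overall pseudo-count to normalize against.
def zipf_pseudo_frequencies(ordered_vocab, total=1_000_000_000):
    # Harmonic-number normalization so the pseudo-counts sum to `total`.
    harmonic = sum(1.0 / r for r in range(1, len(ordered_vocab) + 1))
    return {
        word: total * (1.0 / rank) / harmonic
        for rank, word in enumerate(ordered_vocab, start=1)
    }

# With the real vectors you'd pass kv.index_to_key; here, a toy vocab:
freqs = zipf_pseudo_frequencies(["the", "of", "and", "cat"])
```

It's crude, but it gives any frequency-weighted downstream step (e.g. TF-IDF-style weighting, or frequent-word downsampling) something plausible to work with.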

If your smaller corpus only has, say, 50K unique tokens, then a bag-of-words or TF-IDF representation of your documents is only ever going to fill 50K of the 3,000,000 word-slots in that vocabulary, leaving the other 2,950,000 empty. Even with sparse representations, feeding those doc-representations to downstream steps, like classifiers, where over 98% of all slots are null/irrelevant, may be cumbersome. 

So my sense would be: while it's not out of the question to try leveraging someone else's larger model/corpus, the `GoogleNews` word-vectors aren't a great basis for bag-of-words/TFIDF models. (It's even limited, in its age/domain/undocumented-properties, for word-vector applications.) 

There are also good reasons, in BoW/TFIDF models, to start with simple models based just on your definitely-relevant corpus – and then re-model when the corpus grows to include new word usages in relevant contexts. That keeps things manageable, relevant, & fast, at least through getting initial baseline results. Only after getting some results from that simple approach would I check possible enhancements from leveraging other corpora/lexicons/etc. And, gathering definitely-relevant same-domain texts, even if just from related public datasets, may better model your problem domain than something from a more generic or alien domain.

- Gordon

S. P.

Jul 13, 2021, 5:20:33 PM
to Gensim
Thank you. Those are good points.

The only way I can now think of to use this other word2vec model, which is just word KeyedVectors, on my corpus of multiword documents, is to map my documents into the 300-dimensional vector space of the word vectors. That is, I could treat each document, which is a sequence of words, as a sum of word vectors: look up the word2vec vector for each word in the document, then add the vectors up, just like adding vectors in a vector space.

My documents could then become a new KeyedVectors collection, where the key is a document rather than a single word, and I could use that KeyedVectors to do similarity searches. This assumes the 300-dimensional word vectors comprise a genuine vector space, so that a multiword document is a summation of those vectors in that space. I don't know whether the original word vectors were normalized to unit length; I don't think so, but I'm also not sure my new multiword vectors need to be normalized anyway. It's basically treating the concatenation of words into a sequence as vector addition, and then treating the result like an extended word vector of its own.

Other than this, I can't see a way to use the Google News word2vec KeyedVectors to help me do a similarity-based document search over my smaller corpus.

Thanks.
Scobie
