Using a related-but-larger corpus to establish your full vocabulary & word frequencies can be a good idea in some situations.
But note: to the extent the other corpus uses a different domain's docs/lingo, with different word senses & frequencies, it may not be a good fit for your own docs/domain. And if many of its word slots never appear at all in your smaller corpus, you've got lots of 'blanks for future use' that may just widen & complicate your own IR/classification steps.
For example, the `GoogleNews` set of word-vectors includes 3 million tokens, from a news-article training set circa 2012 with perhaps a hundred billion or more words. But: the actual corpus isn't available. The preprocessing & phrase-combination steps Google followed (to create compound phrase tokens) have never been publicly documented. (The best outsiders can do is try to approximate the same steps on their own texts.) And, the word-vectors format doesn't include relative word-frequencies, though you could approximate them by applying a Zipfian distribution to the vectors' most-to-least-frequent ordering.
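That Zipfian approximation can be sketched in a few lines: given a vocabulary ordered most-to-least-frequent (as the `GoogleNews` vectors are), assign each word a relative frequency proportional to `1/rank`, normalized to sum to 1. (The tiny four-word vocabulary below is just a stand-in for illustration.)

```python
def zipf_frequencies(ordered_vocab):
    """Approximate relative frequencies for a most-to-least-frequent
    ordered vocabulary, assuming a Zipfian distribution: freq ∝ 1/rank."""
    harmonic = sum(1.0 / r for r in range(1, len(ordered_vocab) + 1))
    return {word: (1.0 / rank) / harmonic
            for rank, word in enumerate(ordered_vocab, start=1)}

# hypothetical tiny vocabulary, already in frequency order
freqs = zipf_frequencies(["the", "of", "and", "to"])
```

With real `GoogleNews` vectors loaded via `gensim`, you'd pass the model's rank-ordered token list in place of the toy list. It's a rough estimate, but often good enough for frequency-weighted schemes.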
If your smaller corpus only has, say, 50K unique tokens, then a bag-of-words or TF-IDF representation of your documents is only ever going to fill 50K of the 3 million word-slots in the vocabulary it provides, leaving 2,950,000 permanently empty. Even with sparse representations, feeding those doc-representations to downstream steps, like classifiers, where over 98% of all slots are null/irrelevant, may be cumbersome.
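A minimal sketch of that mismatch, using plain-Python bag-of-words counts (the document strings & the padded vocabulary are invented for illustration):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Vocabulary drawn from your own corpus: every slot is used by some doc.
own_vocab = sorted({w for d in docs for w in d.split()})

# Hypothetical borrowed vocabulary, padded with words your corpus never uses
# (standing in for GoogleNews's millions of extra tokens).
big_vocab = own_vocab + [f"unseen_{i}" for i in range(93)]

def bow(doc, vocab):
    """Bag-of-words count vector for one document over a fixed vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

own_vec = bow(docs[0], own_vocab)   # 7 slots, all potentially useful
big_vec = bow(docs[0], big_vocab)   # 100 slots, 93 of them always zero
```

Same non-zero counts either way; the borrowed vocabulary just adds dead columns that every downstream step must carry or explicitly skip.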
So my sense would be: while it's not out of the question to try leveraging someone else's larger model/corpus, the `GoogleNews` word-vectors aren't a great basis for bag-of-words/TF-IDF models. (Given their age, domain, & undocumented preprocessing, they're even limited for word-vector applications.)
There are also good reasons, in BoW/TF-IDF models, to start with simple models based only on your definitely-relevant corpus – and then re-model when the corpus grows to include new word usages in relevant contexts. That keeps things manageable, relevant, & fast, at least through getting initial baseline results. Only after getting some results from that simple approach would I check for possible enhancements from leveraging other corpora/lexicons/etc. And gathering definitely-relevant same-domain texts, even if just from related public datasets, may model your problem domain better than something from a more generic or alien domain.
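The "start simple, re-model as the corpus grows" workflow is cheap because fitting is cheap: you just re-derive the vocabulary & weights from whatever texts you have today. A minimal IDF-fitting sketch (toy documents invented for illustration; a real project might use `sklearn.feature_extraction.text.TfidfVectorizer` instead):

```python
import math
from collections import Counter

def fit_idf(corpus):
    """Fit IDF weights from scratch on the current corpus:
    idf(w) = log(N / doc_count(w))."""
    n = len(corpus)
    doc_counts = Counter(w for doc in corpus for w in set(doc.split()))
    return {w: math.log(n / c) for w, c in doc_counts.items()}

# Start with only your definitely-relevant docs...
idf = fit_idf(["cat sat mat", "dog chased cat"])

# ...then simply re-fit when relevant new texts (and new words) arrive.
idf = fit_idf(["cat sat mat", "dog chased cat", "cat video went viral"])
```

Each re-fit picks up new vocabulary automatically and re-weights old words by their updated document frequencies, with no borrowed vocabulary to reconcile.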
- Gordon