Word2vec corpus adequacy is chiefly a matter of having many varied usage examples for every word of interest. If your 2300 documents are book-length, you may have more than enough data. If they're sentences, you don't.
Lemmatization may help a bit, by coalescing alternate word forms that individually have too few examples to get good word-vectors into a single token with more usage examples. But it also destroys some utility, by hiding distinctions between word forms. The common theme of published work, and what I'd recommend, is gathering more training data. For example, you could add other sources of text from compatible domains (where similar lingo/word-senses are in use).
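A minimal sketch of the coalescing effect, using a tiny hand-written lemma map just for illustration (in practice you'd use a real lemmatizer like NLTK's WordNetLemmatizer or spaCy): three surface forms that each appear once get merged into one token with three usage examples, which matters because word2vec implementations typically discard tokens below a `min_count` threshold.

```python
from collections import Counter

# Hypothetical lemma map for illustration only; a real lemmatizer
# would cover the full vocabulary.
LEMMA_MAP = {"ran": "run", "runs": "run", "running": "run"}

def lemmatize(token):
    return LEMMA_MAP.get(token, token)

corpus = [
    ["she", "ran", "fast"],
    ["he", "runs", "daily"],
    ["they", "were", "running"],
]

raw_counts = Counter(t for sent in corpus for t in sent)
lemma_counts = Counter(lemmatize(t) for sent in corpus for t in sent)

# Each surface form alone may fall below min_count and be dropped,
# while the merged lemma has enough examples to get a vector.
print(raw_counts["ran"], raw_counts["runs"], raw_counts["running"])  # 1 1 1
print(lemma_counts["run"])  # 3
```

The trade-off Gordon mentions is visible here too: after merging, the model can no longer distinguish "ran" from "running".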
- Gordon
On Saturday, November 28, 2020 at 4:46:20 AM UTC-8 andrew wrote: