Is it correct to perform lemmatization before training?


andrew

Nov 28, 2020, 7:46:20 AM
to Gensim
I have a small corpus of roughly 2,300 documents, and I have trained a word2vec model on it.

Would it be correct to perform lemmatization on it before passing the corpus for training the model?

The idea would be to reduce the number of words and the size of the vocabulary, to make up for the small size of the corpus.

Thanks again and best regards,
Andrew

ben.r...@gmail.com

Nov 28, 2020, 6:24:02 PM
to Gensim
Yes, you certainly want to lemmatize it, and because the corpus is so small you may also want to do entity recognition. That further reduces the number of distinct words by changing, for example, all URLs to "URL" when the detailed content of the URL is not important. spaCy is one Python module that does a good job of both. Some more are described at https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
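To make the vocabulary-shrinking effect concrete, here is a minimal stdlib-only sketch. The `lemma` dictionary is a hypothetical hand-made stand-in for what a real lemmatizer (e.g. spaCy's `token.lemma_`) would produce; the toy corpus is invented for illustration:

```python
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    ["the", "cats", "are", "running"],
    ["a", "cat", "ran", "quickly"],
    ["dogs", "run", "and", "cats", "run"],
]

# Hypothetical hand-made lemma map; in practice a lemmatizer
# such as spaCy would compute these mappings for you.
lemma = {"cats": "cat", "running": "run", "ran": "run", "dogs": "dog"}

def lemmatize(doc):
    # Map each token to its lemma, leaving unknown tokens unchanged.
    return [lemma.get(tok, tok) for tok in doc]

raw_vocab = Counter(tok for d in docs for tok in d)
lem_vocab = Counter(tok for d in docs for tok in lemmatize(d))

# Vocabulary shrinks from 11 types to 8, while "run" now has
# 4 usage examples instead of being split across 3 rare forms.
print(len(raw_vocab), len(lem_vocab), lem_vocab["run"])
```

The same `Counter` check on your own corpus before and after lemmatizing is a quick way to see how much vocabulary compression you actually gain.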

Gordon Mohr

Nov 29, 2020, 12:16:05 AM
to Gensim
Word2vec corpus adequacy is chiefly a matter of having many varied usage examples for every word of interest. If your 2,300 documents are book-length, you may have more than enough data. If they're single sentences, you don't.

Lemmatization may help a bit, by coalescing alternate word forms that individually have too few examples to get good word-vectors into a single token with more usage examples. But it also discards some utility, by hiding distinctions between word forms. The consistent theme of published work, and what I'd recommend, is gathering more training data. For example, you could add other sources of text from compatible domains (where similar lingo/word-senses are in use).
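A minimal sketch of the "enough usage examples per word" point, using only the stdlib. The threshold mimics gensim Word2Vec's default `min_count=5` (words seen fewer than 5 times are dropped before training); the token counts are invented for illustration:

```python
from collections import Counter

# Pooled tokens from a hypothetical small corpus: "model" and
# "vector" are too rare to survive gensim's default min_count=5.
small = ["model"] * 3 + ["vector"] * 2 + ["corpus"] * 6
extra = ["model"] * 4 + ["vector"] * 5  # added in-domain text

def kept(tokens, min_count=5):
    # Return the words with at least min_count usage examples,
    # i.e. the words that would actually get trained vectors.
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

print(kept(small))          # only "corpus" survives
print(kept(small + extra))  # all three words now have enough examples
```

Lowering `min_count` instead would keep the rare words, but their vectors would be trained on too few contexts to be reliable, which is why adding compatible in-domain text is usually the better fix.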

- Gordon

ben.r...@gmail.com

Nov 29, 2020, 4:16:43 AM
to Gensim
I agree with Gordon. Andrew said he had a small corpus, so I was assuming each document was short. If the documents are long, lemmatizing might not help much. You always want more data!