Is stopword removal & stemming recommended before applying doc2vec?

Deepak George

Nov 8, 2016, 6:29:36 AM

to gensim

Is stopword removal & stemming recommended before applying doc2vec? What is the recommended practice?

Thanks

Deepak

Gordon Mohr

Nov 8, 2016, 2:45:23 PM

to gensim

The papers I've seen that are primarily focused on "Paragraph Vectors" (Doc2Vec) – like the original "Paragraph Vectors" paper or "Document Embedding With Paragraph Vectors" – don't mention stemming or stop-word removal.

Meanwhile, other projects often seem to do it, perhaps just out of habit. I think I've seen bulk comparisons, which include Doc2Vec among many other methods, that do it too. But they don't explain their motivation, or compare results with and without such preprocessing, so again it may just be out of habit.

So I haven't seen a strong enough trend or published analysis to make a recommendation either way. What's best may vary based on your quality/volume of data.

It's certainly not necessary, especially when you have lots of text. The subsampling (via the 'sample' parameter) of Word2Vec/Doc2Vec is another way to reduce the influence/interference from very common words, including 'stop' words. Some of the original word2vec analogy-accuracy evaluations depend on relating close variants of the same word (tenses/comparatives/adjective-to-adverb/etc), so stemming would get in the way. But on the other hand, with limited data, and for certain end-purposes (especially measures of broad topicality), combining related tokens and de-facto shortening windows (by removing low-meaning filler words) might help! You'd have to try both to know.
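To make "try both" concrete, here is a minimal sketch of producing a raw and a preprocessed token stream from the same text, so the two can be trained and evaluated side by side. The stop-word list and the `naive_stem` suffix stripper are hypothetical stand-ins for illustration only (a real experiment would use a proper stop-word list and stemmer), and `train_doc2vec` assumes gensim 4.x parameter names (`vector_size`, `epochs`); older releases used `size` and `iter`.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "was", "were"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Crude suffix stripper, for illustration only; not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stopwords=False, stem=False):
    """Tokenize, then optionally drop stop words and/or stem each token."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


def train_doc2vec(token_lists, sample=1e-4):
    """Train Doc2Vec on pre-tokenized documents.

    Requires gensim (imported lazily so the helpers above work without it).
    The hyperparameter values here are arbitrary placeholders, not recommendations.
    """
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(token_lists)]
    return Doc2Vec(documents=tagged, vector_size=50, window=5, min_count=1,
                   sample=sample, epochs=20)


if __name__ == "__main__":
    text = "The cats were running in the garden"
    print(preprocess(text))                                    # raw tokens
    print(preprocess(text, remove_stopwords=True, stem=True))  # reduced tokens
```

Training one model per variant with `train_doc2vec` (optionally also varying `sample`) and scoring each on your own downstream task is the comparison suggested above; which variant wins will depend on your data and end-purpose.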