Data preparation for doc2vec

Prasanth Regupathy

Jun 6, 2017, 3:48:11 PM
to gensim
I'm training a doc2vec model to compute document similarity. Even though there are tutorials that explain how to train a doc2vec model, I did not find a detailed article on data preparation.

1. What are the best practices in data preparation for doc2vec?

2. I know it depends on data but generally which one is better? Keeping the punctuations or removing them?

3. How do stop words impact a doc2vec model? Unlike LDA, will a doc2vec model benefit from keeping stop words?

4. Does stemming help in improving the model?

Shiva Manne

Jun 8, 2017, 3:06:10 AM
to gensim
Hi Prasanth,

1. The usual/general practices for data pre-processing are:
  • Stemming/lemmatizing
  • Converting all words to lower case
  • Punctuation removal
  • Stop-word removal
  • Converting numerics to words (1990 to one nine nine zero)

2. Punctuation usually adds noise to the data (warning: this depends on the language) or doesn't carry meaning (e.g. ',', '-', '/', '(' and ')' aren't meaningful when training doc2vec). For example, we would like "I'd", "He'd" etc. to end up close to "I", "He". Thus, getting rid of punctuation is generally better.

3. Stop words do not add any meaningful information to the context when training doc2vec. Keeping stop words would, in a way, dilute the context of words, decreasing the effective window size. Gensim's implementation of doc2vec implicitly handles this by downsampling words according to their frequency (verify this). This again depends on your language: in some languages, stop words might completely change the meaning of a sentence and so add extra information to the context of a word.
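The "diluted context" point can be made concrete with a toy sketch (not gensim code; the `STOP` set and helper below are made up for illustration): with a fixed window size, removing stop words lets content words reach further, more informative neighbors.

```python
# Toy illustration of how stop words dilute a fixed context window.
STOP = {"the", "of", "a", "is", "in"}

def window_context(tokens, target, window=2):
    """Return the tokens within `window` positions of `target`."""
    i = tokens.index(target)
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

sent = "the cat sat in the warm sun".split()
print(window_context(sent, "sat"))      # -> ['the', 'cat', 'in', 'the']

filtered = [t for t in sent if t not in STOP]
print(window_context(filtered, "sat"))  # -> ['cat', 'warm', 'sun']
```

With stop words kept, the window around "sat" is mostly filler; with them removed, the same window=2 now reaches "warm" and "sun".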

4. Again, this depends on your language/task. Stemming ensures that you pool the contexts of all derivationally related words (e.g. bike, bikes, bike's) into the vector for the base word (bike). If you want a different representation for every form (happiness, happy, happily), you are fine without stemming.
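The steps above can be sketched as a single pipeline. This is a minimal stand-in, not gensim's own preprocessing: the stop-word set, the suffix-stripping "stemmer" and the digit-spelling table are toy assumptions (a real project would typically use NLTK or spaCy for stemming/lemmatization and a curated stop list).

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and"}
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def crude_stem(word):
    # Toy stemmer: strip a trailing "s" from longer words.
    # (Real projects would use a Porter stemmer or a lemmatizer.)
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    text = text.lower()                        # lower-casing
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation removal
    tokens = []
    for tok in text.split():
        if tok.isdigit():                      # 1990 -> one nine nine zero
            tokens.extend(DIGIT_WORDS[d] for d in tok)
        elif tok not in STOP_WORDS:            # stop-word removal
            tokens.append(crude_stem(tok))     # stemming
    return tokens

print(preprocess("The Bikes of 1990!"))  # -> ['bike', 'one', 'nine', 'nine', 'zero']
```

Each document would then be passed through `preprocess` before being wrapped in a `TaggedDocument` for training.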

Shiva.

Prasanth Regupathy

Jun 8, 2017, 8:14:05 AM
to gensim
Thanks Shiva. I will try these out.

Gordon Mohr

Jun 8, 2017, 9:50:16 PM
to gensim
While those are all common techniques in related text-processing efforts, the word2vec/doc2vec practices are often different.

For example, the original word2vec paper and evaluations didn't mention any stemming/lemmatization or stop-word removal, and retained punctuation as word-like tokens.

The words in the 3 million Google pre-trained vector set (from GoogleNews stories) aren't stemmed/lemmatized, include stop-words and mixed-case words, and use some other form of numerics-flattening. 

The original 'Paragraph Vector' paper (on which gensim's Doc2Vec is based), and a followup ("Document Embedding with Paragraph Vectors"), seemed to do things similarly – with no mention of extra preprocessing before calculating doc-vectors for their sentiment/topicality evaluations. 

It's possible the influence of stop-words and punctuation may be different (and more positive) in Word2Vec/Doc2Vec training than in other forms of NLP. For example, with respect to bootstrapping word-vectors, these tokens might provide a useful signal by aligning words commonly used with the same stop-words, or with certain shifts in sentences/clauses/etc, and thus indicate some relevant aspect-of-similarity. With a sole focus on modeling document topicality – as for example in pure Doc2Vec PV-DBOW mode – perhaps they'd be less useful. But it may be something worth evaluating with respect to your own corpus/goals.

- Gordon

Andrey Kutuzov

Jun 9, 2017, 8:26:30 AM
to gen...@googlegroups.com
Of course, it all depends on the downstream task you have.
However, the removal of stop words was shown to improve the scores on the analogies task while not harming the semantic similarity task (for English). So, at least with intrinsic evaluation, it seems that stop-word removal is generally beneficial. See this paper: http://www.ep.liu.se/ecp/131/039/ecp17131039.pdf

--
Solve et coagula!
Andrey

Gordon Mohr

Jun 9, 2017, 7:05:43 PM
to gensim
That's an interesting paper! From a quick look, I'd still be reluctant to adopt the idea "removing stop words usually helps" just yet, because:

* analogies performance, while a nifty fast evaluation, isn't everything – for some tasks those word-vecs best at analogies might be suboptimal

* there's no mention in the paper of what (if any) frequent-word downsampling was applied, which might provide similar benefits without requiring any explicit stop-word filtering

* a plausible reason stop-word removal could help on analogies is that its effect is similar to that of using a larger-window – so it'd be interesting to compare stop-word removal against other steps that retain stop-words but also achieve a larger effective window. This could mean trying a larger `window` but then adjusting other parameters, like `min_count` or `sample`, to keep processing time similar. Or trying CBOW with a larger window, because CBOW's performance is less sensitive to larger-windows. (The authors express a desire to evaluate CBOW in the future.)
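For reference, the frequent-word downsampling discussed above (gensim's `sample` parameter) follows, to my understanding, the formula from the original word2vec.c: an occurrence of a word with corpus frequency fraction f is kept with probability (sqrt(f / sample) + 1) · sample / f, capped at 1. A quick sketch:

```python
from math import sqrt

def keep_probability(freq_fraction, sample=1e-3):
    """Probability that one occurrence of a word survives downsampling."""
    p = (sqrt(freq_fraction / sample) + 1) * sample / freq_fraction
    return min(1.0, p)

# Illustrative frequencies (made up): a stop word, a mid-frequency word,
# and a rare word. Only the very frequent word gets thinned aggressively.
for word, f in [("the", 0.05), ("model", 0.001), ("doc2vec", 0.00001)]:
    print(f"{word:8s} f={f:.5f} keep={keep_probability(f):.3f}")
```

At the default `sample=1e-3`, a word making up 5% of the corpus survives only about 16% of the time, which is why downsampling can act as a soft, probabilistic stop-word filter.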

So perhaps: stop-word removal is reasonable, and may help in some modes and for goals similar to the analogies evaluation. But the further one's goals are from maximum-analogies performance, and the more one is willing to try wider meta-parameter experimentation (including CBOW or aggressive downsampling), the more a project-specific evaluation of stop-word/punctuation removal is worth doing.

- Gordon

Andrey Kutuzov

Jun 10, 2017, 5:47:31 AM
to gen...@googlegroups.com
I absolutely agree that analogies performance (or any other intrinsic evaluation, for that matter) is rarely strongly correlated with performance on downstream tasks. So, in the end, you of course want to test on your practical application.

And yes, stop-word removal effectively increases the window size 'for free'. Which is good, I think (considering, of course, all the caveats expressed in your message).

As for downsampling: to me it seems like a kind of poor man's stop-word removal. The idea is the same, but the implementation is obscure: it's difficult to know exactly which words are penalized and how. But yes, it would be interesting to compare explicit stop-word removal and statistical downsampling quantitatively.

Kamal Garg

Apr 25, 2018, 10:38:54 AM
to gensim
I have built a doc2vec model on a Wikipedia dump and it works well. When I search for 'Artificial intelligence', it gives me words related to artificial intelligence. But when I search for 'artificial intelligence', it fails. This is because 'Artificial intelligence' is present in the vocab, not 'artificial intelligence'. Is there a way to convert the doc2vec vocab I made to lowercase, and to remove the '-' between words and replace it with a space? This would be helpful because whenever a user enters anything, I would first convert the string to lowercase, remove '-' and replace it with a space, then search the vocab to get the relevant keywords.
Thanks in advance 
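As far as I know, a trained model's vocabulary can't simply be renamed after the fact, so the usual approach is the reverse of what's asked: apply one normalization function to the training text before building the model, and the same function to user queries, so both sides always agree. A minimal sketch (the function name is hypothetical):

```python
def normalize(text):
    """Lower-case, turn '-' into spaces, and split into tokens.
    Apply this identically to training documents and to user queries."""
    return text.lower().replace("-", " ").split()

print(normalize("Artificial-Intelligence"))  # -> ['artificial', 'intelligence']
```

If the model was trained on text passed through `normalize`, a query run through the same function will always match the vocab's casing and tokenization.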