Text pre-processing steps - best practices?


Myrthe van Dieijen

Sep 8, 2017, 5:26:28 AM
to gensim
Hi,

I've read in many posts that the text pre-processing step has a big impact on the results you can get with NLP methods. I'm using topic models myself (LDA and DTM/LdaSequence) and have cleaned the texts extensively beforehand, but I still have some doubts about the order of the steps I used. I'm hoping someone here can give me some advice on that. The current order is as follows (a rough code sketch follows the list):

1. Lower case all words and clear texts from punctuation
2. Remove stopwords
3. Remove words with a length below 3 characters
4. Lemmatize words
5. Remove words with a length below 3 characters (again, as for example 'doing' will now be 'do' after step 4)
6. Create bigrams via Phrases method (first train it on the texts after step 5 and then apply it to those texts)
7. Prune the dictionary of high-frequency words (words that occur in over 95% of the documents) and then prune the texts by excluding those words
8. Create the corpus
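
For reference, here is a rough sketch of how I implement steps 1 to 6 (simplified; the NLTK stopword list, WordNetLemmatizer and the raw_documents name are just stand-ins for my actual data and cleaning functions):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models.phrases import Phrases, Phraser

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean(doc):
    # 1. Lowercase, strip punctuation, tokenize on whitespace
    tokens = re.sub(r'[^\w\s]', ' ', doc.lower()).split()
    # 2. Remove stopwords
    tokens = [t for t in tokens if t not in stop_words]
    # 3. Remove words shorter than 3 characters
    tokens = [t for t in tokens if len(t) >= 3]
    # 4. Lemmatize
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    # 5. Remove short words again (lemmas can be shorter than the original words)
    return [t for t in tokens if len(t) >= 3]

texts = [clean(doc) for doc in raw_documents]  # raw_documents: list of raw strings

# 6. Train the bigram model on the cleaned texts, then apply it to those same texts
bigram = Phraser(Phrases(texts, min_count=5, threshold=10.0))
texts = [bigram[doc] for doc in texts]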

I know I do not necessarily have to create new texts in step 7, as I can just create a corpus object with the pruned dictionary, but I would like to inspect the texts after all pruning conditions (which isn't possible with a corpus object), hence this step.
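
Concretely, steps 7 and 8 then look something like this (again simplified, continuing from the sketch above):

from gensim.corpora import Dictionary

# 7. Build the dictionary and drop words that occur in more than 95% of documents
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=1, no_above=0.95, keep_n=None)  # no_below=1: keep rare words for now

# ...then prune the texts themselves so I can still inspect them afterwards
kept = set(dictionary.token2id)
texts = [[t for t in doc if t in kept] for doc in texts]

# 8. Create the corpus
corpus = [dictionary.doc2bow(doc) for doc in texts]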

What I'm not sure of is mostly the timing of the bigram selection. I can imagine I could create 'fake bigrams' because the stopwords and short words are removed beforehand. That said, if there is only ever one stopword in between, I was thinking the result could still be informative. For example, 'price to earnings' would become 'price earning' with the current order of the steps, and that is more informative for me than either 'price to' or 'to earnings'. I know I can create trigrams as well, but I'm not keen on doing that, as it will only increase the number of words in the dictionary even more. Moreover, I thought about creating bigrams first (i.e., after step 1), but I don't want things like 'this_is' appearing in my vocabulary, as that's just noise. Are these valid enough reasons for this order, or is this not what is recommended?
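
To illustrate what I mean with a toy example (made-up counts, just for intuition; whether a pair actually gets merged depends on min_count, threshold and the scoring function):

from gensim.models.phrases import Phrases

# After stopword removal, 'price to earnings' is seen as ['price', 'earning'],
# so Phrases can learn 'price_earning' even though the two words were never
# literally adjacent in the raw text.
cleaned = [['price', 'earning', 'ratio']] * 50 + [['share', 'price'], ['earning', 'call']]
bigram = Phrases(cleaned, min_count=5, threshold=0.5, scoring='npmi')
print(bigram[['price', 'earning', 'ratio']])
# -> something like ['price_earning', 'ratio'] with these toy counts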

Finally, would it be ok to remove words with a length below 4 characters instead of 3? Or is that considered to be too much?

I hope someone can tell me what the best practices are. I'm doing the analysis for a scientific paper and if you have any literature that can help with this, feel free to add that as well.

Many thanks in advance for your help!

Myrthe

Ivan Menshikh

Sep 8, 2017, 7:39:14 AM
to gensim
Hi Myrthe,

Your plan looks good; you can also look at this thread for some additional techniques.
About the plan:
- Remove step (3): after lemmatization your words will typically only get shorter, so step (5) already does everything that is needed.
- In (7) you can "relax" the threshold for high-frequency words (from 95% down to, say, 10%, or even 1% if you have a very large dataset), and also prune very rare words.

For bigrams, you can try two variants: build the bigrams before step (2) and then work with the already-"bigrammed" corpus (bigrams like "this_is" can be pruned later), OR keep your current variant.
Removing words shorter than 4 characters can be very strict; look at the 3-character words in your corpus and ask yourself whether they are informative. But I think 3 is enough here.
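
For example, with gensim's Dictionary both cuts are a single call (just a sketch; choose the numbers by looking at your own corpus):

from gensim.corpora import Dictionary

dictionary = Dictionary(texts)  # texts = tokenized documents after your cleaning steps
# no_above: drop tokens appearing in more than this fraction of documents (here 10%)
# no_below: drop tokens appearing in fewer than this many documents (here: in only 1 document)
dictionary.filter_extremes(no_below=2, no_above=0.1, keep_n=None)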

Myrthe van Dieijen

Sep 8, 2017, 8:32:02 AM
to gensim
Hi Ivan,

Thanks so much for the swift reply! Just as additional info: my corpus consists of 1,173 texts, all speeches (over 3 million tokens in total and approximately 24,000 types after removing punctuation and lowercasing).

I don't quite understand what you mean by relaxing the threshold for high-frequency words; do you mean 90% instead of 95%?

Is pruning very rare words considered 'normal'? I was thinking of perhaps omitting words that occur in just one document; is that OK? With all the extra unique words created by the bigram detection I'll be deleting many words that way, so I want to be sure it's appropriate.

To summarize what you propose in terms of ordering, Option 1 would be:

1. Lower case all words and clear texts from punctuation
2. Lemmatize words
3. Remove stopwords
4. Remove words with a length below 3 characters
5. Create bigrams via Phrases method (first train it on the texts after step 4 and then apply it to those texts)
6. Prune the dictionary of high-frequency words (>95% or >90%?) and low-frequency words (docfreq=1?) and then prune the texts by excluding those words
7. Create the corpus

Option 2:

1. Lower case all words and clear texts from punctuation
2. Create bigrams via Phrases method (first train it on the texts after step 1 and then apply it to those texts)
3. Lemmatize words
4. Remove stopwords
5. Remove words with a length below 3 characters
6. Prune the dictionary of high-frequency words (>95% or >90%?) and low-frequency words (docfreq=1?) and then prune the texts by excluding those words
7. Create the corpus
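
In code, Option 2 would roughly look like this (step1_texts is a placeholder for the texts right after step 1; stop_words and lemmatizer are the same placeholders as in the sketch in my first message):

from gensim.models.phrases import Phrases, Phraser

# 2. Train bigrams on the texts right after step 1, i.e. before any words are removed
bigram = Phraser(Phrases(step1_texts, min_count=5, threshold=10.0))

def clean_option2(tokens):  # tokens: one text after step 1 (lowercased, punctuation stripped)
    tokens = bigram[tokens]                                # 2. apply bigrams
    tokens = [lemmatizer.lemmatize(t) for t in tokens]     # 3. lemmatize
    # 4. remove stopwords (merged tokens like 'this_is' are not in the stopword list,
    #    so they survive here and would have to be pruned later via the dictionary)
    tokens = [t for t in tokens if t not in stop_words]
    return [t for t in tokens if len(t) >= 3]              # 5. remove short words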

Which option is best in your expert opinion? I'm leaning slightly towards Option 2, as I'm a bit afraid of creating 'fake bigrams' with Option 1, but I don't know whether that's considered a big problem in general. The downside of creating bigrams sooner is not only that things like 'this_is' will be created (which could probably be pruned later on if I use 90%, for example), but also that bigrams are sometimes created with a word in plural form and other times in singular form (apple_tree and apple_trees), whereas I just want the singular form, and lemmatization doesn't lemmatize bigrams. So basically I'm wondering which of the two is better, given these downsides?

And is it really ok to remove small words (of length 1 or 2)? Or should those words be in the stopwords list? I now have 3 steps where I delete words: stopwords, small words and high (and/or low) frequency words. I just want to be sure it's not too much.

Again, thanks so much for your advice!

Myrthe