Hi Ivan,
Thanks so much for the swift reply! Just as additional info: my corpus consists of 1173 texts, all speeches (over 3 million tokens in total and approximately 24,000 types after removing punctuation and lowercasing).
I don't quite understand what you mean by relaxing the threshold for high-frequency words; do you mean 90% instead of 95%?
Is pruning very rare words considered 'normal'? I was thinking of perhaps omitting words that occur in just one document; is that OK? With all the extra unique tokens created by the bigram detection, I'll be deleting many words that way, so I want to be sure it's appropriate.
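For concreteness, here is roughly how I pictured that pruning step with gensim's Dictionary.filter_extremes (just a sketch; texts is a placeholder for my tokenized speeches, and the no_below/no_above values are exactly the thresholds I'm asking about, not fixed choices):

from gensim.corpora import Dictionary

# texts: the 1173 tokenized speeches (one list of tokens per document)
# after the earlier cleaning steps.
dictionary = Dictionary(texts)

# Drop words appearing in only one document (no_below=2) and words
# appearing in more than 90% of documents (no_above=0.9).
dictionary.filter_extremes(no_below=2, no_above=0.9)

# Keep only the surviving words in the texts themselves.
pruned_texts = [[tok for tok in doc if tok in dictionary.token2id] for doc in texts]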
To summarize what you propose in terms of ordering, Option 1 would be (sketched in code after the list):
1. Lowercase all words and strip punctuation from the texts
2. Lemmatize the words
3. Remove stopwords
4. Remove words shorter than 3 characters
5. Create bigrams via the Phrases method (first train it on the texts from step 4, then apply it to those same texts)
6. Prune the dictionary of high-frequency words (>95% or >90%?) and low-frequency words (docfreq = 1?), then prune the texts by excluding those words
7. Create the corpus
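In code, Option 1 would look roughly like this (only a sketch: raw_texts stands for my loaded speeches, I'm using NLTK's lemmatizer and stopword list purely as stand-ins for whatever ends up being used, and the Phrases parameters are guesses):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import Phrases
from gensim.corpora import Dictionary

lemmatizer = WordNetLemmatizer()              # requires the NLTK wordnet data
stop_words = set(stopwords.words('english'))  # requires the NLTK stopwords data

def clean(text):
    # Step 1: lowercase and keep only alphabetic tokens (drops punctuation).
    return re.findall(r"[a-z]+", text.lower())

# Steps 1-4: lowercase/punctuation, lemmatize, remove stopwords, drop short words.
texts = []
for speech in raw_texts:
    tokens = [lemmatizer.lemmatize(tok) for tok in clean(speech)]
    texts.append([t for t in tokens if t not in stop_words and len(t) >= 3])

# Step 5: train Phrases on the cleaned texts, then apply it to those same texts.
bigram = Phrases(texts, min_count=5, threshold=10.0)
texts = [bigram[doc] for doc in texts]

# Step 6: prune high- and low-frequency words from the dictionary and the texts.
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=2, no_above=0.9)
texts = [[t for t in doc if t in dictionary.token2id] for doc in texts]

# Step 7: create the bag-of-words corpus.
corpus = [dictionary.doc2bow(doc) for doc in texts]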
Option 2 (also sketched in code after the list):
1. Lowercase all words and strip punctuation from the texts
2. Create bigrams via the Phrases method (first train it on the texts from step 1, then apply it to those same texts)
3. Lemmatize the words
4. Remove stopwords
5. Remove words shorter than 3 characters
6. Prune the dictionary of high-frequency words (>95% or >90%?) and low-frequency words (docfreq = 1?), then prune the texts by excluding those words
7. Create the corpus
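Option 2 would only differ in the ordering, roughly like this (again just a sketch; clean, lemmatizer, stop_words and raw_texts are the same placeholders defined in the Option 1 sketch):

from gensim.models import Phrases
from gensim.corpora import Dictionary

# Steps 1-2: only lowercase and strip punctuation before training Phrases,
# so bigrams are detected on words that were actually adjacent in the text.
texts = [clean(speech) for speech in raw_texts]
bigram = Phrases(texts, min_count=5, threshold=10.0)
texts = [bigram[doc] for doc in texts]

# Steps 3-5: lemmatize (bigram tokens such as apple_trees may pass through
# untouched, which is the concern I raise below), remove stopwords, and drop
# words shorter than 3 characters.
texts = [
    [t for t in (lemmatizer.lemmatize(tok) for tok in doc)
     if t not in stop_words and len(t) >= 3]
    for doc in texts
]

# Steps 6-7: prune the dictionary and build the corpus exactly as in Option 1.
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=2, no_above=0.9)
texts = [[t for t in doc if t in dictionary.token2id] for doc in texts]
corpus = [dictionary.doc2bow(doc) for doc in texts]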
Which option is best in your expert opinion? I'm sort of leaning towards option 2, as I'm slightly afraid of creating 'fake bigrams' with option 1 (words that only become adjacent once the stopwords between them are removed), but I don't know whether that's considered a big problem in general. The downside of creating bigrams sooner is not only that things like 'this_is' will be created (those could probably be pruned later on if I use 90%, for example), but also that bigrams are sometimes created with a word in plural form and other times in singular form (apple_tree and apple_trees), whereas I just want the singular form, and lemmatization doesn't lemmatize bigrams. So basically I'm wondering which of the two is better, given these downsides?
And is it really OK to remove short words (of 1 or 2 characters)? Or should those words be in the stopword list instead? I now have three steps where I delete words: stopwords, short words, and high- and/or low-frequency words. I just want to be sure it's not too much.
Again, thanks so much for your advice!
Myrthe