Applying Doc2Vec to small corpus (for classification purpose) - how to improve results further?

Felix Forge

Dec 4, 2019, 6:13:38 AM
to Gensim
Hey there :-)

I've read a lot about finding optimal parameters for Doc2Vec.

My corpus is indeed quite small - roughly 10,000 documents, but each of them between 4,000 and 10,000 tokens long. Using Doc2Vec vectors for a LogisticRegression (simple binary classification, 0 or 1), I get quite poor results (around 55% accuracy).

Now my thoughts:
I set vector_size=200 to better represent the length/nuance of each document, and dbow_words=0 to train faster and only the doc-vectors.

model = gensim.models.Doc2Vec(vector_size=200, dm=0, min_count=1, dbow_words=0, alpha=0.025,  workers=8, window=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=20)

I know most real-world examples set epochs between 10 and 20, but how can I improve my model further (except collecting more data, duh)?
Setting epochs higher helped a bit; I could imagine setting epochs=50, at the risk of overfitting, but anyway...

I could also set min_count higher than 1, to remove words that occur only once - but I've also read that min_count affects the training of the document vectors? Regardless, I tried both a corpus with stopwords and one without, with similar results (~55% accuracy) - and I know most Doc2Vec applications don't remove stopwords.

Does it make sense to set the learning rate higher if I increase my epochs? (Gordon mostly recommends not adjusting the learning rate or setting a min_alpha).

I could play with the parameters a little bit, but does anyone have some suggestions for a 10.000 documents corpus with token length of 5000-10000 each?


Thanks a lot!!


Gordon Mohr

Dec 4, 2019, 3:24:31 PM
to Gensim
Don't fear trying more `epochs` – if that ever hurts, it may be a sign of overfitting, but it's not the true cause, nor is reducing the epochs an appropriate way to prevent overfitting. Instead, fight overfitting with a smaller vector `size`, or smaller vocabulary – so the model has fewer 'free parameters' to memorize idiosyncrasies of the training data.

Wondering what your classes are – sentiment, or something else? And, what's the quality/balance of your training data? Is the 55% accuracy on a randomly held-back test set? 

Do you hold back the test set from both `Doc2Vec` training & the classifier, or just the supervised classifier? (It's somewhat defensible to use all available data, without labels, for the unsupervised Doc2Vec training, even if held back from supervised classifier training. If you can collect other unlabeled data from the same domain, adding it to Doc2Vec training to improve the Doc2Vec model's general vocabulary can also make sense.)

I'd usually expect a higher accuracy, even from very simple techniques, but perhaps the problem is really hard and/or the training data very thin. Still, I'd double check all steps for process errors like mismatches of item-ids across any shuffling/sampling, or mistakenly creating imbalanced training/testing sets, etc. 

The silver lining of a small (& quick) dataset is you can run a broader search across metaparameters. I'd especially try smaller vectors, adding `dbow_words` to `dm=0` training or trying `dm=1`, varied `negative` and `window` sizes, alternate `ns_exponent` values, & a larger `min_count`. 

Though you may hate to throw anything out with such a small corpus, setting `min_count=1` (or any very-low value) can backfire, as all those rare words can't acquire powerful generalizable meanings from single (or few) usage examples, and thus serve as noise interfering with the training of other words (and reservoirs of excess model state to overfit training data). 

Other tokenization strategies could be worth trying. With big datasets, stemming/lemmatization can be superfluous for Word2Vec/Doc2Vec/etc algorithms - there are enough examples of every word variant that they all wind up near each other. But in tiny datasets, the extra hint provided by canonicalizing variants into one token may help.

I would definitely be comparing against other sparser representations – bag of words, and also trying word n-grams or character n-grams – because there might be some individual word/multiword/subword features that are highly indicative for your specific classification, and get 'smoothed out' by the dense embedding featurization. 
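A sparse baseline along these lines is only a few lines with scikit-learn - a toy sketch comparing word n-grams and character n-grams (illustrative data, not the poster's corpus):

```python
# TF-IDF bag-of-n-grams baselines, word-level and character-level.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["service was great and fast", "service was poor and slow"] * 50
labels = [1, 0] * 50

word_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
char_clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
                         LogisticRegression(max_iter=1000))
for name, clf in [("word n-grams", word_clf), ("char n-grams", char_clf)]:
    clf.fit(docs, labels)
    print(name, clf.score(docs, labels))  # training accuracy on toy data
```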

You could also try FastText's supervised classification mode, where word-vectors are trained specifically to be good inputs for predicting the known labels – though that mode is not supported by `gensim`.

- Gordon

Felix Forge

Dec 5, 2019, 5:17:40 AM
to Gensim
The label is kind of similar to sentiment and denotes a certain rating improvement - so 1 if the text "caused" a rating improvement. My assumption is that the label itself might not be predictable, because there is no causal - or let's say significant - relationship between the label and the text (although there should be, but that's what I'm trying to figure out - or it's due to my lack of data). However, I would somehow expect better accuracy on the training set (which is also just around random guessing).

Since my corpus is small, I don't hold back any data from the Doc2Vec training, and I use a random 10% test / 90% training split for the classifier. My labels are also quite balanced: ~55% = 1, ~45% = 0. And I can't collect any more data because there is none - only if more time passes, lol. Anyway...

Thanks a lot for your suggestions

I will try n-grams and lemmatization (haven't thought about that yet) and also smaller vector sizes.
I did run 30 epochs yesterday with a higher min_count, and I will try increasing them further.

Thanks - I will post here if adjusting some parameters/variations of the model leads to an improvement.