Weird Learning curve... How should I interpret that?


Alexis Parenty

Feb 19, 2021, 9:23:20 AM
to Yellowbrick
Dear All,

Many thanks to the contributors of Yellowbrick. Great tool!

I have an imbalanced binary dataset (1:4) and I use the synthetic oversampling technique ADASYN to rebalance before training. 

The learning curve below spikes after about 1500 training instances, and I wonder how I should interpret this behavior. What do you think?

2021-02-19_15-08-39.png
Many thanks and regards,

Alexis

Benjamin Bengfort

Feb 25, 2021, 9:31:55 AM
to Yellowbrick
Hi Alexis, 

Thank you for using Yellowbrick, I'm glad to hear that it's useful to you! 

That learning curve is very strange, and my first guess is that it is related to a feature of the data. Without knowing much about it, I suspect that the oversampling technique appended the synthetic instances to the end of the dataset, and that this may have caused the dramatic change in CV score.
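To make that suspicion concrete, here is a minimal sketch using only the standard library. The label counts are hypothetical (a 1:4 set of 1500 rows plus 900 synthetic minority rows), and it assumes the resampler appends its synthetic rows after the original ones. It shows how the class mix of a prefix changes once the prefix reaches the appended block.

```python
import random

# Hypothetical stand-in for an oversampled dataset: 1200 majority (0) and
# 300 minority (1) labels in random order, i.e. a 1:4 ratio.
original = [0] * 1200 + [1] * 300
random.Random(0).shuffle(original)

# Assumption: the resampler appends its 900 synthetic minority rows at the end.
labels = original + [1] * 900


def minority_fraction(prefix_size):
    """Share of minority labels among the first prefix_size rows."""
    prefix = labels[:prefix_size]
    return sum(prefix) / len(prefix)


# Prefixes inside the original rows keep the imbalanced mix; larger prefixes
# pick up more and more synthetic minority rows.
for size in (500, 1000, 1500, 2000, 2400):
    print(size, round(minority_fraction(size), 2))
```

With these made-up numbers, every prefix up to 1500 rows contains only original data, so a model trained on it never sees a synthetic instance. Larger prefixes become steadily more balanced, which would be consistent with a sudden jump in the score.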

Perhaps you could use shuffle=True in the learning curve; this shuffles the training data before taking prefixes. It isn't done by default to protect time series data, but if you're doing oversampling, it's probably a good step. If you already used shuffle, perhaps you could invert the order of your dataset and not shuffle? If the effect remains, it could be that you simply need far more than 1600 instances.
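As a rough sketch of that check, the snippet below compares cross-validated scores with and without shuffling, using scikit-learn's learning_curve function directly. The dataset, the logistic regression model, the F1 scoring, and the jittered minority copies (a crude stand-in for ADASYN output, not ADASYN itself) are all assumptions for illustration. To my knowledge, Yellowbrick's LearningCurve visualizer accepts the same shuffle and random_state arguments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Hypothetical imbalanced data (roughly 1:4).
X, y = make_classification(n_samples=1500, weights=[0.8, 0.2], random_state=0)

# Crude stand-in for oversampling: jittered copies of minority rows,
# appended after the original rows.
rng = np.random.default_rng(0)
minority = X[y == 1]
extra = minority[rng.integers(0, len(minority), size=900)]
extra = extra + rng.normal(scale=0.1, size=extra.shape)
X_res = np.vstack([X, extra])
y_res = np.concatenate([y, np.ones(len(extra), dtype=y.dtype)])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
sizes = np.linspace(0.1, 1.0, 8)


def mean_cv_scores(shuffle):
    """Mean cross-validated F1 score at each training-set size."""
    _, _, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X_res, y_res,
        train_sizes=sizes, cv=cv, scoring="f1",
        shuffle=shuffle, random_state=0,
    )
    return test_scores.mean(axis=1)


unshuffled = mean_cv_scores(shuffle=False)
shuffled = mean_cv_scores(shuffle=True)
for frac, a, b in zip(sizes, unshuffled, shuffled):
    print(f"{frac:.2f}  unshuffled={a:.3f}  shuffled={b:.3f}")
```

If the two columns differ sharply at the smaller training sizes and converge at the largest one, row order is the likely cause of the spike. If they look alike, the ordering explanation is less plausible.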

Hope that helps, good luck!

Best Regards,
Benjamin Bengfort