Classifying imbalanced data

Eli Gibson

May 16, 2016, 9:52:56 AM
to Caffe Users
If I am training a classifier on an imbalanced dataset, I can either use an infogain loss to weight the less frequent class more heavily, or oversample the less frequent class to make a balanced dataset.

Are these equivalent, or are there theoretical or empirical advantages to either of these approaches?
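
Concretely, here is a rough sketch of the two options as I'm imagining them in pycaffe (binary case; the 90:10 ratio, weights, and file names are placeholders, not my actual setup):

    # Option 1: class reweighting via InfogainLoss. Build a diagonal H
    # matrix that up-weights the rare class and serialize it for Caffe.
    import numpy as np
    import caffe

    H = np.eye(2, dtype='float32')
    H[0, 0] = 0.1  # frequent class (~90% of samples)
    H[1, 1] = 0.9  # rare class (~10% of samples)

    blob = caffe.io.array_to_blobproto(H.reshape((1, 1, 2, 2)))
    with open('infogain_H.binaryproto', 'wb') as f:
        f.write(blob.SerializeToString())

    # The train net would then point the loss layer at this matrix:
    # layer {
    #   name: "loss"  type: "InfogainLoss"
    #   bottom: "prob"  bottom: "label"  top: "loss"
    #   infogain_loss_param { source: "infogain_H.binaryproto" }
    # }

    # Option 2: random oversampling. Duplicate minority-class entries in
    # an ImageData-layer list file ("path/to/img.jpg label") until the
    # classes are roughly 1:1.
    import random

    with open('train_unbalanced.txt') as f:
        lines = [l.strip() for l in f if l.strip()]

    minority = [l for l in lines if l.split()[-1] == '1']
    balanced = lines + minority * 8  # the 10% class appears 9x in total
    random.shuffle(balanced)         # keep minibatches mixed

    with open('train_balanced.txt', 'w') as f:
        f.write('\n'.join(balanced) + '\n')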

-- Eli 

Antonio Paes

May 16, 2016, 10:56:02 AM
to Caffe Users
Hi Eli, maybe this paper can help you: http://link.springer.com/article/10.1007%2Fs10115-014-0794-3

Have you tried using data augmentation only on the underrepresented classes?
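
For example, something along these lines (just a sketch; the paths, label convention, and single horizontal-flip transform are placeholders):

    # Augment only the rare class by writing mirrored copies to disk and
    # appending them to the training list file.
    import os
    from PIL import Image

    with open('train_unbalanced.txt') as f:
        lines = [l.strip() for l in f if l.strip()]

    augmented = list(lines)
    for line in lines:
        path, label = line.rsplit(' ', 1)
        if label != '1':  # skip the frequent class
            continue
        img = Image.open(path)
        flip_path = os.path.splitext(path)[0] + '_flip.jpg'
        img.transpose(Image.FLIP_LEFT_RIGHT).save(flip_path)
        augmented.append(flip_path + ' ' + label)

    with open('train_augmented.txt', 'w') as f:
        f.write('\n'.join(augmented) + '\n')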

Eli Gibson

May 18, 2016, 6:31:20 AM
to Caffe Users
Hi Antonio,

That is a fascinating paper. The take-home message, for my question, seems to be that random oversampling and reweighting perform similarly except under extreme imbalance (a 99:1 ratio), where random oversampling is better. Their reweighting algorithm, MetaCost, is a complicated one that to some extent relies on reasonable predictions from a model trained on the initial unbalanced data, so the results might not apply directly to simple loss reweighting. The paper also suggests that the accuracy penalty for moderate imbalance (at least for shallow networks) is not devastating (~5% for 95:5 ratios and ~2.5% for 90:10 ratios), and that most of that penalty cannot be recovered with either approach. Thanks for pointing me to the paper.

Eli