Using nltk classifiers with imbalanced training/testing set

Gregory Larchev

Sep 28, 2016, 7:40:16 PM
to nltk-users
I'm currently playing around with some of the NLTK classifiers (NaiveBayesClassifier, DecisionTreeClassifier, MaxentClassifier). I have a binary classification problem, where I'm trying to classify a list of sentences into 2 classes (A and B). However, my set contains many more samples of class A (80%) than class B (20%), and the NaiveBayes and DecisionTree classifiers seem to exploit that skew by favoring the majority class (I haven't tried Maxent yet).

Is there a way to address the training/testing set imbalance? Perhaps one could adjust the weights assigned to each class inside the classifier? I could cut down the size of class A, but then I might not have enough training data, and there's not an easy way for me to generate more.
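One general workaround (not a feature of NLTK's classifiers themselves) is to go the other way: instead of cutting class A down, randomly oversample class B until the classes are balanced, then train on the balanced list. A minimal stdlib sketch, using made-up toy feature sets in NLTK's usual `(features_dict, label)` format; the `NaiveBayesClassifier.train` step is only mentioned in a comment:

```python
import random

random.seed(0)

# Toy (features, label) pairs mimicking the 80/20 imbalance described above.
# The feature dicts here are placeholders, not real sentence features.
train = ([({"word": f"a{i}"}, "A") for i in range(80)]
         + [({"word": f"b{i}"}, "B") for i in range(20)])

# Group samples by label.
by_label = {}
for feats, label in train:
    by_label.setdefault(label, []).append((feats, label))

# Random oversampling: resample every class up to the majority-class size.
target = max(len(items) for items in by_label.values())
balanced = []
for label, items in by_label.items():
    balanced.extend(items)
    balanced.extend(random.choices(items, k=target - len(items)))

counts = {lbl: sum(1 for _, l in balanced if l == lbl) for lbl in by_label}
print(counts)  # both classes now have the majority-class count
# The balanced list can then be fed to an NLTK classifier, e.g.:
# classifier = nltk.NaiveBayesClassifier.train(balanced)
```

Duplicating minority samples this way doesn't add new information, but it does stop the classifier from winning 80% accuracy by always predicting A; just make sure you oversample only the training split, never the test set.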

Thanks.

Mark Broomer

Dec 17, 2016, 2:51:08 PM
to nltk-users
Can you pick a better training set, or delete files and see if accuracy improves? I hear you need about 400-600 files to get reasonable accuracy. You could also try the AdaBoost algorithm (I have only read about it, sorry, no personal experience, but I should have some soon).

See: Naveen Kumar Korada, N. Sagar Pavan Kumar, and Y. V. N. H. Deekshitulu, "Implementation of Naive Bayesian Classifier and Ada-Boost Algorithm Using Maize Expert System".
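For anyone curious what AdaBoost actually does before reading up on it: each round it fits a weak learner, then up-weights the training samples that learner got wrong, so later rounds focus on the hard cases. A toy from-scratch sketch with one-dimensional threshold stumps; the data and all names here are invented for illustration, and in practice you would use an existing implementation (e.g. scikit-learn's `AdaBoostClassifier`, which can be wrapped for NLTK via `nltk.classify.SklearnClassifier`):

```python
import math

def train_adaboost(X, y, n_rounds=10):
    """Toy AdaBoost over threshold stumps on 1-D data.

    X: list of floats; y: list of +1/-1 labels.
    Returns a list of (threshold, polarity, alpha) stumps.
    """
    n = len(X)
    w = [1.0 / n] * n  # sample weights, uniform at first
    stumps = []
    for _ in range(n_rounds):
        # Pick the stump (threshold, polarity) with the lowest weighted error.
        best = None
        for thr in sorted(set(X)):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi >= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # stump's vote weight
        stumps.append((thr, pol, alpha))
        # Up-weight the samples this stump misclassified, then renormalize.
        w = [wi * math.exp(-alpha * yi * (pol if xi >= thr else -pol))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return stumps

def predict(stumps, x):
    # Weighted vote of all stumps.
    score = sum(alpha * (pol if x >= thr else -pol)
                for thr, pol, alpha in stumps)
    return 1 if score >= 0 else -1

# Tiny separable dataset: small values are -1, large values are +1.
X = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [-1, -1, -1, 1, 1, 1]
model = train_adaboost(X, y, n_rounds=5)
print([predict(model, x) for x in X])
```

Note that boosting by itself is not a cure for class imbalance, though reweighting variants exist; it mainly helps when individual weak learners underfit.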