I'm currently playing around with some of the nltk classifiers (NaiveBayesClassifier, DecisionTreeClassifier, MaxentClassifier). I have a binary classification problem, where I'm trying to classify a list of sentences into 2 classes (A and B). However, my data set contains far more samples of class A (80%) than class B (20%). The NaiveBayes and DecisionTree classifiers seem to exploit that skew, mostly predicting the majority class (I haven't tried Maxent yet).
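To illustrate why the skew is a problem (using the 80/20 split above; the numbers are just the proportions from my set, not real data): a degenerate classifier that always predicts class A already reaches 80% accuracy, so raw accuracy rewards ignoring class B entirely.

```python
# Toy illustration of the 80/20 imbalance: the trivial "always predict A"
# baseline already scores 80% accuracy without learning anything about B.
labels = ["A"] * 80 + ["B"] * 20
predictions = ["A"] * len(labels)  # degenerate majority-class classifier

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.8
```

This is why a trained classifier that "takes advantage" of the imbalance can look decent on accuracy while being useless on class B; per-class precision/recall makes the problem visible.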
Is there a way to address this imbalance in the training/testing sets? Perhaps one could adjust the weights assigned to each class inside the classifier? I could cut down the size of class A to match class B, but then I might not have enough training data, and there's no easy way for me to generate more.
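For reference, here's a minimal sketch of the alternative to cutting down class A that I've been considering: oversampling class B by duplicating its samples (with replacement) until the classes are balanced. The `({"w": i}, label)` featuresets are placeholders; in practice they'd be the `(featureset, label)` pairs fed to an nltk classifier's `train` method.

```python
import random

random.seed(0)

# Placeholder labeled featuresets: 80% class "A", 20% class "B".
data = [({"w": i}, "A") for i in range(80)] + [({"w": i}, "B") for i in range(20)]

def oversample_minority(labeled, minority_label):
    """Duplicate minority-class samples (with replacement) until both
    classes have the same number of samples."""
    minority = [s for s in labeled if s[1] == minority_label]
    majority = [s for s in labeled if s[1] != minority_label]
    extra = random.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    random.shuffle(balanced)
    return balanced

balanced = oversample_minority(data, "B")
# Both classes now contribute 80 samples each (160 total).
```

The obvious downside is that the duplicated class-B samples add no new information, so the classifier may overfit to them; that's partly why I'm asking whether per-class weighting inside the classifier is possible instead.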
Thanks.