Hi All,
I'm attempting to classify two-class data from an ARFF file with varying degrees of labelling (from 100% labelled to 0%), in order to assess the performance of various data stream classifiers. I first pre-train my classifier on a small, fully labelled training set, then use the trained classifier to make predictions on the unlabelled data. I'm doing this as follows:
I'm reading in fully labelled training data from an ARFF file, and using it to train a Hoeffding tree. I know this tree does not work with unlabelled data; that is something I need to demonstrate for my data.
ArffFileStream trainingStream = new ArffFileStream("path to training set", -1);
ArffFileStream testStream = new ArffFileStream("path to test set", -1);
while (trainingStream.hasMoreInstances())
{
    Instance trainInst = trainingStream.nextInstance();
    learner.trainOnInstance(trainInst);
}
Once I've trained the learner, I then use it to classify the test data. The test ARFF file is unlabelled and has the basic form:
@RELATION TestSet
@ATTRIBUTE Score1 NUMERIC
@ATTRIBUTE Score2 NUMERIC
@ATTRIBUTE class {0,1}
@DATA
152.3,1.0,?
119.4,1.4,?
Before classifying, I pre-load some meta information regarding what labels the classifier should apply for each instance. As each test instance arrives, I know what its classification should be, so then I compare this with what the learner predicts. I do this as I need to retain knowledge of those instances which were incorrectly classified as false negatives and false positives. Based on the result I update summary statistics. Here's the general approach:
while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();
    // I need to know the class predicted.
    double[] votes = learner.getVotesForInstance(testInst);
    int classification = Utils.maxIndex(votes);
    // Decide how to interpret this classification, update
    // the summary statistics, retain info if a FP or FN.
    // realClass comes from the pre-loaded meta information.
    if (classification == 0 && realClass == 1)
    {
        falseNegatives++;
        // write instance to file...
    }
    else if (classification == 1 && realClass == 0)
    {
        ...
    }
    learner.trainOnInstance(testInst);
}
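For illustration, the TP/TN/FP/FN bookkeeping in the loop above could be factored into a small counter. This is only a sketch: the class name ConfusionCounts and its methods are hypothetical and not part of MOA, and it assumes class 1 is the positive class.

```java
/** Hypothetical helper, not part of MOA: tallies a two-class confusion matrix. */
class ConfusionCounts {
    int truePositives;
    int trueNegatives;
    int falsePositives;
    int falseNegatives;

    /** Records one prediction; class 1 is treated as positive (an assumption). */
    void record(int predicted, int actual) {
        if (predicted == 1 && actual == 1) {
            truePositives++;
        } else if (predicted == 0 && actual == 0) {
            trueNegatives++;
        } else if (predicted == 1 && actual == 0) {
            falsePositives++;
        } else if (predicted == 0 && actual == 1) {
            falseNegatives++;
        } else {
            throw new IllegalArgumentException("labels must be 0 or 1");
        }
    }

    /** Number of predictions recorded so far. */
    int total() {
        return truePositives + trueNegatives + falsePositives + falseNegatives;
    }

    /** Fraction of correct predictions, or NaN if nothing was recorded. */
    double accuracy() {
        int n = total();
        return n == 0 ? Double.NaN : (truePositives + trueNegatives) / (double) n;
    }

    public static void main(String[] args) {
        ConfusionCounts counts = new ConfusionCounts();
        int[] predicted = {1, 0, 1, 0, 0};
        int[] actual = {1, 0, 0, 1, 0};
        for (int i = 0; i < predicted.length; i++) {
            counts.record(predicted[i], actual[i]);
        }
        System.out.println("TP=" + counts.truePositives + " TN=" + counts.trueNegatives
                + " FP=" + counts.falsePositives + " FN=" + counts.falseNegatives);
    }
}
```

In the loop, record(classification, realClass) would replace the chain of if/else tests, with the write-to-file step kept alongside it for the FP and FN cases.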
The results I'm getting from this approach are not as expected. As each unlabelled instance has the class "?", when you obtain the class value by calling:
testInst.classValue()
The value returned is NaN.
If you then call,
learner.correctlyClassifies(testInst)
this NaN value is cast to an int, which yields 0. You can see the method from the AbstractClassifier class below.
@Override
public boolean correctlyClassifies(Instance inst) {
    return Utils.maxIndex(getVotesForInstance(inst)) == (int) inst.classValue();
}
This means that my unlabelled instances are denoted as class zero. Obviously this has bad consequences :D.
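The cast behaviour is plain Java and can be reproduced without MOA. The sketch below shows that narrowing NaN to int gives 0, and that Double.isNaN detects the missing label before any cast. The helper labelOrMissing and its -1 sentinel are hypothetical names chosen for this example.

```java
/** Demonstrates why a missing (NaN) class value looks like class 0 after a cast. */
class NanCastDemo {

    /** Returns the label as an int, or -1 when the label is missing (NaN). */
    static int labelOrMissing(double classValue) {
        return Double.isNaN(classValue) ? -1 : (int) classValue;
    }

    public static void main(String[] args) {
        double missing = Double.NaN;
        // Java's narrowing conversion maps NaN to 0.
        System.out.println("(int) NaN = " + (int) missing);
        // Testing for NaN first keeps missing labels apart from class 0.
        System.out.println("isNaN = " + Double.isNaN(missing));
        System.out.println("labelOrMissing(NaN) = " + labelOrMissing(missing));
        System.out.println("labelOrMissing(1.0) = " + labelOrMissing(1.0));
    }
}
```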
So to mitigate this problem, I simply did a test for the presence of NaN:
while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();
    // Get the class of the instance; this is "NaN" if it is unlabelled.
    String clazz = Double.toString(testInst.classValue());
    // I need to know the class predicted.
    double[] votes = learner.getVotesForInstance(testInst);
    int classification = Utils.maxIndex(votes);
    // Decide how to interpret this classification.
    if (clazz.equals("NaN")) // MOA doesn't know the correct label
    {
        // Compare MOA's prediction to the actual class.
        // Update TP,TN,FN,FP as before. Crucially do not train
        // on this instance!
    }
    else // MOA knows the label.
    {
        // Compare MOA's prediction to the actual class.
        // Update TP,TN,FN,FP as before.
        // We CAN train on this instance.
    }
}
Is there a better way to achieve this using MOA? Also, how can I generate the ROC curve and AUROC for this tree, given unlabelled data? Based on the WEKA ROC generation code, I know I must write code approximately like:
// load data
Instances data = new Instances(new BufferedReader(new FileReader(pathToTestData)));
data.setClassIndex(data.numAttributes() - 1);
Classifier cl = new HoeffdingTree();
// train classifier in other method.
// WHAT CODE TO PUT HERE?
ThresholdCurve tc = new ThresholdCurve();
while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();
    // What to do?
    double[] votes = cl.getVotesForInstance(testInst);
}
// plot curve code
....
But I simply don't know what to fill the gaps with. I know I must be missing something very basic here! I fear that, for the unlabelled data, I can't generate the ROC using MOA. I'm prepared to implement the approach described by Tom Fawcett in "ROC Graphs: Notes and Practical Considerations for Researchers" (http://binf.gmu.edu/mmasso/ROC101.pdf) for my purposes, but obviously if any of you know a better way (or can advise on my next steps) I'd be grateful. Apologies if this was long-winded.
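For reference, a hand-rolled version might look roughly like the sketch below. It follows the general threshold-sweep idea (sort by score, step through the ranking, emit a point whenever the score changes) and integrates with the trapezoid rule. It is plain Java with no MOA or WEKA calls: RocSketch and its methods are hypothetical names, the true labels are assumed to come from the pre-loaded meta information rather than from the "?" values in the file, and the vote arrays are assumed to be non-negative, possibly unnormalised and possibly shorter than the number of classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of MOA or WEKA: ROC points and AUC from scores. */
class RocSketch {

    /** Share of the vote mass given to positiveIndex; 0.5 if there are no votes. */
    static double positiveScore(double[] votes, int positiveIndex) {
        double sum = 0.0;
        for (double v : votes) {
            sum += v;
        }
        if (sum <= 0.0) {
            return 0.5; // arbitrary neutral score, an assumption of this sketch
        }
        // A vote array shorter than the class count means no votes for that class.
        double positive = positiveIndex < votes.length ? votes[positiveIndex] : 0.0;
        return positive / sum;
    }

    /** ROC points {falsePositiveRate, truePositiveRate}; labels must be 0 or 1. */
    static List<double[]> rocPoints(double[] scores, int[] labels) {
        if (scores.length != labels.length) {
            throw new IllegalArgumentException("scores and labels differ in length");
        }
        int positives = 0;
        for (int label : labels) {
            if (label == 1) {
                positives++;
            }
        }
        int negatives = labels.length - positives;
        if (positives == 0 || negatives == 0) {
            throw new IllegalArgumentException("need at least one instance of each class");
        }
        // Rank instance indices by descending score.
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        List<double[]> points = new ArrayList<>();
        int tp = 0;
        int fp = 0;
        double previous = Double.NaN; // NaN never equals a score, so (0,0) is emitted first
        for (int idx : order) {
            if (scores[idx] != previous) {
                // Emit a point only when the score changes, so ties share one point.
                points.add(new double[] {fp / (double) negatives, tp / (double) positives});
                previous = scores[idx];
            }
            if (labels[idx] == 1) {
                tp++;
            } else {
                fp++;
            }
        }
        points.add(new double[] {1.0, 1.0});
        return points;
    }

    /** Area under the ROC points by the trapezoid rule. */
    static double auc(List<double[]> points) {
        double area = 0.0;
        for (int i = 1; i < points.size(); i++) {
            double[] a = points.get(i - 1);
            double[] b = points.get(i);
            area += (b[0] - a[0]) * (a[1] + b[1]) / 2.0;
        }
        return area;
    }

    public static void main(String[] args) {
        // Toy vote arrays and true labels, made up for this example.
        double[][] votes = {{1, 9}, {2, 8}, {3, 7}, {4, 6}, {9, 11}, {6, 4}};
        int[] labels = {1, 1, 0, 1, 0, 0};
        double[] scores = new double[votes.length];
        for (int i = 0; i < votes.length; i++) {
            scores[i] = positiveScore(votes[i], 1);
        }
        System.out.println("AUC = " + auc(rocPoints(scores, labels)));
    }
}
```

For the toy data in main, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. In the real loop, the score for each test instance would come from the learner's vote array, paired with the true label from the meta information.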
Rob