ROC Curves and Unlabelled data.

rober...@gmail.com

unread,
Mar 12, 2013, 2:40:32 PM3/12/13
to moa-...@googlegroups.com
Hi All,

I'm hoping someone can offer some advice on the best way to classify unlabelled data and obtain the ROC curve (and AUROC), as I'm encountering difficulties. I've already looked at this question: https://groups.google.com/forum/?fromgroups=#!msg/moa-users/EsdEiTMi2Dc/AGwuBT0cyiMJ but I still couldn't resolve my problem.

I'm attempting to classify two-class data from an ARFF file with varying proportions of labelling (from 100% labelled to 0%), in order to ascertain the performance of various data stream classifiers. Using a small, fully labelled training set, I first pre-train my classifier. I then use the trained classifier to make predictions on the unlabelled data. I'm doing this as follows:

I'm reading in fully labelled training data from an ARFF file and using it to train a Hoeffding tree. I know this tree does not handle unlabelled data; this is something I need to demonstrate for my data.

ArffFileStream trainingStream = new ArffFileStream("path to training set", -1);
ArffFileStream testStream = new ArffFileStream("path to test set", -1);

while (trainingStream.hasMoreInstances())
{
    Instance trainInst = trainingStream.nextInstance();
    learner.trainOnInstance(trainInst);
}

Once I've trained the learner, I then use it to classify the test data. The test ARFF file has the basic form:

@RELATION TestSet

@ATTRIBUTE Score1 NUMERIC
@ATTRIBUTE Score2 NUMERIC
@ATTRIBUTE class {0,1}

@DATA
152.3,1.0,?
119.4,1.4,?

Before classifying, I pre-load some meta-information about the label each instance should receive. As each test instance arrives, I know what its classification should be, so I compare this with what the learner predicts. I do this because I need to retain knowledge of which instances were incorrectly classified as false negatives or false positives. Based on the result, I update summary statistics. Here's the general approach:

while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();

    // I need to know the predicted class.
    double[] votes = learner.getVotesForInstance(testInst);
    int classification = Utils.maxIndex(votes);

    // Decide how to interpret this classification, update
    // the summary statistics, retain info if a FP or FN.
    if (classification == 0 && realClass == 1)
    {
        falseNegatives++;
        // write instance to file...
    }
    else if (classification == 1 && realClass == 0)
    {
        ...
    }

    learner.trainOnInstance(testInst);
}

The results I'm getting from this approach are not as expected. As each unlabelled instance has the class "?", when you obtain the class value by calling:

testInst.classValue()

The value returned is NaN.

If you then call,

learner.correctlyClassifies(testInst)

this NaN value is cast to an integer and becomes 0. You can see the method from the AbstractClassifier class below.

@Override
public boolean correctlyClassifies(Instance inst) {
    return Utils.maxIndex(getVotesForInstance(inst)) == (int) inst.classValue();
}

This means that my unlabelled instances are treated as class zero. Obviously this has bad consequences :D.
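The cast behaviour is easy to reproduce in isolation. This snippet is independent of MOA (the class name is mine); per the Java Language Specification, narrowing NaN to int yields 0, so a missing label silently masquerades as class zero:

```java
public class NaNCastDemo {
    public static void main(String[] args) {
        double missingClass = Double.NaN;   // what classValue() returns for '?'

        // Narrowing conversion: (int) NaN is defined to be 0 in Java.
        int asIndex = (int) missingClass;
        System.out.println(asIndex);        // 0 - indistinguishable from class zero

        // The safe way to detect a missing label:
        System.out.println(Double.isNaN(missingClass)); // true
    }
}
```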
So to mitigate this problem, I simply tested for the presence of NaN:

while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();

    // I need to know the predicted class.
    double[] votes = learner.getVotesForInstance(testInst);
    int classification = Utils.maxIndex(votes);

    // Decide how to interpret this classification.
    if (Double.isNaN(testInst.classValue())) // MOA doesn't know the correct label.
    {
        // Compare MOA's prediction to the actual class.
        // Update TP, TN, FN, FP as before. Crucially, do not train
        // on this instance!
    }
    else // MOA knows the label.
    {
        // Compare MOA's prediction to the actual class.
        // Update TP, TN, FN, FP as before.
        // We CAN train on this instance.
    }
}

Is there a better way to achieve this using MOA? Also, how can I generate the ROC (and AUROC) for this tree given unlabelled data? Based on the WEKA ROC generation code, I know I must write code roughly like:

// load data
Instances data = new Instances(new BufferedReader(new FileReader(pathToTestData)));
data.setClassIndex(data.numAttributes() - 1);

Classifier cl = new HoeffdingTree();
// train classifier in other method.

// WHAT CODE TO PUT HERE?

ThresholdCurve tc = new ThresholdCurve();

while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();
    // What to do?
    votes = classifier.getVotesForInstance(instance);
}

// plot curve code
...

But I simply don't know what to fill the gaps in with. I know I must be missing something very basic here! I fear that, for the unlabelled data, I can't generate the ROC using MOA. I'm prepared to implement the approach described by Tom Fawcett in ROC Graphs: Notes and Practical Considerations for Researchers (http://binf.gmu.edu/mmasso/ROC101.pdf) for my purposes, but if any of you know a better way (or can advise on my next steps) I'd be grateful. Apologies if this was long-winded.
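In case it helps anyone else, here is a minimal, MOA-independent sketch of the Fawcett-style computation: sort instances by positive-class score descending, sweep the threshold, and accumulate the area with the trapezoid rule. The class and method names are my own, and it assumes binary labels coded 0/1:

```java
import java.util.Arrays;
import java.util.Comparator;

public class RocSketch {
    /** AUROC via Fawcett's threshold sweep (scores are P(class = 1)). */
    public static double auc(double[] scores, int[] labels) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Process instances in order of decreasing score.
        Arrays.sort(idx, Comparator.comparingDouble(i -> -scores[i]));

        int p = 0, n = 0;
        for (int l : labels) { if (l == 1) p++; else n++; }

        double area = 0, tp = 0, fp = 0, prevTp = 0, prevFp = 0;
        double prevScore = Double.POSITIVE_INFINITY;
        for (int i : idx) {
            if (scores[i] != prevScore) {
                // Only emit a ROC point when the score changes (handles ties).
                area += (fp - prevFp) * (tp + prevTp) / 2.0; // trapezoid
                prevScore = scores[i];
                prevTp = tp;
                prevFp = fp;
            }
            if (labels[i] == 1) tp++; else fp++;
        }
        area += (fp - prevFp) * (tp + prevTp) / 2.0; // close the curve at (1,1)
        return area / (p * (double) n);              // normalise to [0,1]
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.55, 0.5};
        int[] labels    = {1,   1,   0,   1,   0,    0};
        System.out.println(auc(scores, labels)); // 8/9 for this toy data
    }
}
```

The (fp, tp) pairs captured at each score change, scaled by the negative and positive counts, are exactly the ROC points one would plot.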

Rob

Albert Bifet

unread,
Mar 13, 2013, 1:52:46 PM3/13/13
to moa-...@googlegroups.com
I think that you cannot use unlabelled data to evaluate a classifier.
Are you sure you need to use ROC? Could you use prequential evaluation
or the kappa statistic instead? Are you aware of the controversy around
this measure raised by David Hand?

Cheers, Albert

rober...@gmail.com

unread,
Mar 14, 2013, 2:09:57 PM3/14/13
to moa-...@googlegroups.com, abi...@cs.waikato.ac.nz
Hi Albert,

Thanks for taking the time to look at this for me, it's much appreciated.

I don't think prequential evaluation (test then train on each instance incrementally, if I understand correctly) is going to be suitable for my purposes. Unfortunately, I'm working with data that is overwhelmingly unlabelled. The labelled data I do possess is used only to demonstrate difficulties arising from my problem domain regardless of labelling. Ascertaining classifier performance on these labelled data sets doesn't reflect what my algorithms must face in the real world. Ideally I'd like to be able to gather ROC curves and other statistics that describe what happens as the proportion of labelled instances is varied.

The Kappa statistic doesn't appear to be a sufficient metric for my purposes either, as it is dependent on the prevalence of positive examples (if I've misunderstood, let me know :D). So if we have:

                Predicted
                 -      +
Actual   -    1000    100
         +       5      1

TP = 1
TN = 1,000
FP = 100
FN = 5
TOTAL = 1,106

Random accuracy = 0.904245787
Total accuracy  = 0.905063291

Kappa = 0.008537522

But if I set TP = 15, then Kappa = 0.19781719. I don't think Kappa is descriptive enough for my data.
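To make the arithmetic concrete, this is how I'm computing Kappa (plain Java, nothing MOA-specific; the class and method names are mine). It reproduces both figures above:

```java
public class KappaDemo {
    /** Cohen's kappa from a 2x2 confusion matrix. */
    static double kappa(long tp, long tn, long fp, long fn) {
        double total = tp + tn + fp + fn;
        double po = (tp + tn) / total;                 // observed (total) accuracy
        double pe = ((tn + fn) * (tn + fp)             // chance agreement from the
                   + (tp + fp) * (tp + fn))            // predicted x actual marginals
                   / (total * total);                  // (random accuracy)
        return (po - pe) / (1.0 - pe);
    }

    public static void main(String[] args) {
        System.out.println(kappa(1, 1000, 100, 5));    // ~0.0085
        System.out.println(kappa(15, 1000, 100, 5));   // ~0.1978
    }
}
```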

I wasn't aware of the controversy surrounding AUROC, but I am now, so thank you very much! Unfortunately I will still need the ROC curves though, and it appears that I can't get these from MOA if I have unlabelled data.

My current idea to get around this is to wrap the classification procedure in MOA, as I'm currently doing, to get the desired behaviour. For instance:

I could subclass the Hoeffding tree object and modify the tree nodes so that they record misclassification errors. I could do this by comparing the predicted class for an unlabelled instance at a leaf with the actual class label obtained separately during a pre-test step. I should then be able to construct the ROC using the approach described by Fawcett. If you have an alternative suggestion, I'd be grateful if you could share it.

I'm also struggling to write the code that generates the ROC for the labelled case. So far I have:

// Load data
Instances data = new Instances(new BufferedReader(new FileReader(pathToTestData)));
data.setClassIndex(data.numAttributes() - 1);
Evaluation eval = new Evaluation(data);

Classifier cl = new HoeffdingTree();
// Train classifier in other method.

// Am I missing setup calls?

ThresholdCurve tc = new ThresholdCurve();

while (testStream.hasMoreInstances())
{
    Instance testInst = testStream.nextInstance();
    // What to do?
    votes = classifier.getVotesForInstance(instance);
}

// plot curve code

In another answer you provided a code fragment as follows:

// You should replace eval.predictions() with MOA predictions: so, for each instance tested

if (predictions == null) {
    predictions = new FastVector();
}
votes = classifier.getVotesForInstance(instance);
predictions.addElement(new NominalPrediction(instance.classValue(), votes, instance.weight()));

But I'm unsure as to what the predictions object is supposed to be (a FastVector, presumably), and how I should pass it to the evaluation object.

Finally, I've noticed that a call to learner.measureByteSize() is returning negative values. I must be making the wrong call; I would like to know the size of the tree in bytes and, if available, the number of nodes/leaves.

Thanks again for your help,

Rob

Albert Bifet

unread,
Mar 14, 2013, 2:49:38 PM3/14/13
to moa-...@googlegroups.com
Hi Rob,

What about using the geometric mean of the accuracies for each class?
How do you plan to compute ROC in an evolving setting?

You should use

ThresholdCurve tc = new ThresholdCurve();
int classIndex = 0;
if (predictions == null) {
    predictions = new FastVector();
}
votes = classifier.getVotesForInstance(instance);
predictions.addElement(new NominalPrediction(instance.classValue(), votes, instance.weight()));
Instances result = tc.getCurve(predictions, classIndex);

You are missing setup calls, take a look at Tutorial 2.

If learner.measureByteSize() is negative, it is because you are not
using sizeofag.jar.

Cheers, Albert

rober...@gmail.com

unread,
Mar 14, 2013, 5:24:03 PM3/14/13
to moa-...@googlegroups.com, abi...@cs.waikato.ac.nz
Hi Albert,

I knew I was making a schoolboy error with the call to learner.measureByteSize(); at least now I know how to fix it. I'll also check out Tutorial 2 now.

I currently calculate the geometric mean, and it appears to work well. I may have to rely on this as my most useful single value descriptor.
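For what it's worth, the two-class geometric mean I compute is just the square root of sensitivity times specificity; a minimal sketch, independent of MOA (class and method names are mine):

```java
public class GMeanDemo {
    /** Geometric mean of the per-class accuracies (two-class case). */
    static double gMean(long tp, long tn, long fp, long fn) {
        double sensitivity = tp / (double) (tp + fn); // accuracy on positives
        double specificity = tn / (double) (tn + fp); // accuracy on negatives
        return Math.sqrt(sensitivity * specificity);
    }

    public static void main(String[] args) {
        // Using the confusion matrix from my earlier message:
        System.out.println(gMean(1, 1000, 100, 5)); // ~0.389
    }
}
```

Unlike raw accuracy, this collapses to zero whenever either class is classified entirely wrongly, which is why it behaves well on my heavily imbalanced data.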

In terms of computing the ROC curve in an evolving setting, I think the best I can do is calculate the ROC curve at pre-specified sampling intervals and again once at the end of the test. That way I could see how performance changed during the course of a run on test examples (I retain other statistics at such intervals too). I know this could be difficult given that Hoeffding tree leaves can be pruned... I certainly need to think on this! My other option, in the case of the tree at least, would be to first let it learn under a streaming scenario on a large training set. Then, using a holdout set, evaluate the model without learning (so that the tree becomes a static learner) and compute a ROC that describes the performance.

I know, this needs more consideration :D

Thanks again, I really appreciate your time.

Rob