Lesson 3.2: why doesn't OneR give 100% accuracy on training data with minBucketSize=1?


Jimmie Felidae

Mar 29, 2014, 2:27:16 AM
to wekamooc...@googlegroups.com
If I got it right, minBucketSize=1 means that OneR splits between every single instance. But as Professor Ian Witten demonstrated in Lesson 3.2, OneR achieved only 87.5% accuracy on the diabetes dataset when evaluated on the training data -- shouldn't it be 100%, which would indicate that OneR fits every instance in the training set?
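
For anyone who wants to reproduce the setup programmatically rather than in the Explorer, something like the following sketch using the Weka Java API should do it (the diabetes.arff path is an assumption - point it at wherever your copy lives):

import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OneRTrainingAccuracy {
    public static void main(String[] args) throws Exception {
        // load the diabetes data (path is an assumption - adjust to your copy)
        Instances data = DataSource.read("data/diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // OneR with minBucketSize = 1, as in the lecture
        OneR oneR = new OneR();
        oneR.setMinBucketSize(1);
        oneR.buildClassifier(data);

        // evaluate on the training data itself
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(oneR, data);
        System.out.println(oneR);                   // the rule that was learned
        System.out.println(eval.toSummaryString()); // about 87.5% correct, not 100%
    }
}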

Thanks,
Jimmie

Birone L

Mar 30, 2014, 5:30:23 PM
to wekamooc...@googlegroups.com
I think the key is in the 'min' part. OneR only uses a single attribute, so the one that gives the best accuracy may still have leaves with multiple instances, and these can contain a mixture of the class values. If that makes sense...

I'll look at the data & check tomorrow :)

Birone L

Mar 31, 2014, 2:46:01 AM
to wekamooc...@googlegroups.com
Ok - I see what you mean: because the diabetes dataset has numeric attributes, with a minimum bucket size of one OneR 'should' be able to match individual values of any attribute (assuming the observations are unique) to a single class value.

I thought this might be to do with discretization, but OneR is using bins about 0.001 units wide, and the data for the variable it's splitting on is given to 3 decimal places as well, so it's not that. So I don't know - maybe OneR has a 'maximum number of leaves', so that its rules don't get overly complex?

Jimmie Felidae

Mar 31, 2014, 7:42:29 AM
to wekamooc...@googlegroups.com
I figured it out - it was because there were positive and negative instances that shared exactly the same value of pedi, the attribute that OneR chose. Since OneR works on that attribute alone, there was no way to separate their classes, however accurate the decision rule was. So OneR cannot reach 100% accuracy -- unless some attribute had a unique value for every instance.
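
You can check this with a quick scan over the dataset - roughly something like the sketch below with the Weka Java API (the path is an assumption; "pedi" is the attribute name in the standard diabetes.arff):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PediConflicts {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);
        int pediIdx = data.attribute("pedi").index();

        // collect the set of class labels seen for each distinct pedi value
        Map<Double, Set<String>> classesByPedi = new HashMap<>();
        for (int i = 0; i < data.numInstances(); i++) {
            double pedi = data.instance(i).value(pediIdx);
            String cls = data.instance(i).stringValue(data.classIndex());
            classesByPedi.computeIfAbsent(pedi, k -> new HashSet<>()).add(cls);
        }

        // any pedi value that occurs with more than one class label can never be
        // classified perfectly by a rule that looks at pedi alone
        classesByPedi.forEach((pedi, classes) -> {
            if (classes.size() > 1) {
                System.out.println("pedi = " + pedi + " occurs with classes " + classes);
            }
        });
    }
}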

Birone L

Mar 31, 2014, 10:10:34 AM
to wekamooc...@googlegroups.com
Ah, right... It's actually quite subtle! All instances with pedi < 0.1265 had negative class values except for two pairs of instances with identical pedi values, each pair containing one positive and one negative. So that whole < 0.1265 range was treated as predicting negative.
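
If anyone wants to look at that region themselves, a rough sketch along these lines should print the pedi values below 0.1265 together with their class labels (same caveat as above about the dataset path being an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PediLowRange {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/diabetes.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);
        int pediIdx = data.attribute("pedi").index();

        // list every instance whose pedi value falls below the 0.1265 split point,
        // together with its class label
        for (int i = 0; i < data.numInstances(); i++) {
            double pedi = data.instance(i).value(pediIdx);
            if (pedi < 0.1265) {
                System.out.println("pedi = " + pedi + " -> "
                        + data.instance(i).stringValue(data.classIndex()));
            }
        }
    }
}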