True accuracy on the segment-challenge dataset - from Lesson 2.2 Activity question 4

Tamir Basin

Apr 24, 2015, 2:53:08 AM

to wekamooc...@googlegroups.com

Hi,

These are the (Split, Accuracy) pairs I found:

(10%,88%)

(20%,91%)

(40%,99%)

(60%,99%)

(80%,99%)

(90%,97%)

Why would the estimated 'true accuracy' on the segment-challenge dataset be 95%?

I would guess 90% because more data was used for the training set.

What am I missing?

Cheers,

Tamir

Tamir Basin

Apr 24, 2015, 3:03:25 AM

to wekamooc...@googlegroups.com

I just calculated the average of the accuracy values and got 95%.

If that is the motivation for the estimate, then the penny has dropped :-)

Tamir

Ian Witten

Apr 25, 2015, 11:43:42 PM

to wekamooc...@googlegroups.com

The average is not really what is wanted here. In fact, I did not get the same results as you. For (Split, Accuracy), rounding accuracies to the nearest percent, I got

(10%, 89%)

(20%, 90%)

(40%, 92%)

(60%, 94%)

(80%, 97%)

(90%, 97%)

It looks like -- as one would expect -- the more training data, the better the accuracy, up to an asymptotic accuracy of around 97%. But as the amount of training data increases, the amount of test data decreases, and -- as Activity 2.2 Questions 2 and 3 imply (and as one would expect) -- if there is not much testing data then the test is unreliable.
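The trade-off described above can be sketched numerically. This is only an illustration under stated assumptions: it assumes segment-challenge.arff holds 1500 instances (a figure not given in this thread), approximates how the percentage split divides the data, and uses the simple binomial standard error as a rough measure of how reliable each accuracy estimate is. The names `standard_error` and `split_report` are made up for this sketch.

```python
import math

# Assumption: segment-challenge.arff has 1500 instances (not stated in this thread).
TOTAL_INSTANCES = 1500

# (training split %, accuracy %) pairs as reported in the reply above.
results = [(10, 89), (20, 90), (40, 92), (60, 94), (80, 97), (90, 97)]


def standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimate from n_test instances."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def split_report(results, total):
    """Return (split %, test-set size, accuracy %, standard error %) per split."""
    rows = []
    for split_pct, acc_pct in results:
        # Approximation: whatever is not used for training is used for testing.
        n_test = total - round(total * split_pct / 100)
        se = standard_error(acc_pct / 100, n_test)
        rows.append((split_pct, n_test, acc_pct, 100 * se))
    return rows


for split_pct, n_test, acc_pct, se_pct in split_report(results, TOTAL_INSTANCES):
    print(f"{split_pct:>3}% train  {n_test:>5} test  "
          f"accuracy {acc_pct}%  +/- {se_pct:.1f}%")
```

Under these assumptions the 90% split leaves only about 150 test instances, so its 97% carries a noticeably wider margin than the same figure measured on the 80% split.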

The question asks for the "true accuracy". Of the choices given in the question -- 50%, 90%, 95% and 100% -- I would choose 95%, because it's the closest to 97%. In fact, the file segment-test.arff contains a large number (810) of test instances, independent of the training set, and if you specify this under "Supplied test set" Weka uses the whole of the training set for training, and returns an accuracy of 96%.
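The choice among the given options, and how much weight the 96% figure can bear, can be sketched as follows. This is a rough illustration, not part of the course material: the interval uses the normal approximation to the binomial with a 1.96 multiplier, which is an assumption made here, and the variable names are hypothetical.

```python
import math

choices = [50, 90, 95, 100]   # answer options quoted in the reply above
asymptote = 97                # accuracy the percentage splits level off at

# The option with the smallest absolute distance to the asymptote.
closest = min(choices, key=lambda c: abs(c - asymptote))
print(closest)

# Rough 95% interval for the supplied-test-set result (96% on 810 instances),
# assuming the normal approximation to the binomial is adequate here.
n_test, acc = 810, 0.96
half_width = 1.96 * math.sqrt(acc * (1 - acc) / n_test)
low, high = acc - half_width, acc + half_width
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

With 810 independent test instances the interval is narrow, and 95% falls inside it while 90% and 100% do not, which is consistent with choosing 95%.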

Hope this helps

cheers

ian

Tamir Basin

Apr 28, 2015, 4:56:55 AM

to wekamooc...@googlegroups.com

Thank you, Prof.

Your answer clarified my mistake.

I re-calculated the accuracy and got the right numbers this time :-)

Cheers,

Tamir
