Differences between training set and testing set


Charles Pucheu

Mar 15, 2014, 7:47:45 AM
to wekamooc...@googlegroups.com
Hello, 

Thanks for this course on data mining and this new way of learning. I'm Charles from France and I'm a beginner in data mining.

I'm working on Class 2, especially the part about training and testing sets. I don't exactly understand the difference between them.
I understand that the data given in the example data set is the training set. If we classify these data, there will be a difference between the results and reality, a kind of distortion introduced by the algorithm. To reduce this, we make a testing set out of part of the training set.

In the example of the segment data set, we have 1,500 instances in the training set and 810 instances in the test set (about 50%).
Why would a result be more accurate on a set with fewer instances? And why not just use the training set?
I understand that with a smaller data set (like the example set) we can reduce errors, so it will be easier to divide the data to build a tree. But if errors are reduced, isn't accuracy reduced too?

Is this definition correct?
The training data is the complete set of data we use; the example data set is a part of the training data.

Thanks for your help clarifying these notions.

charles pucheu


Wolfgang Radl

Mar 15, 2014, 8:56:48 AM
to wekamooc...@googlegroups.com
Hi Charles,

The difference between the test and training sets is pretty simple: you use the training set to build your model, and your test set to validate it. Think of the activity related to 2.1, when you were the classifier: you used the training set to create your model (drawing rectangles around different clusters of data points), and then validated your model against a set of test data.

Usually when you want to build a model, you start by collecting data and separating them into training and test sets. It is vital that both come from the same statistical population, so "about equal", if you like (i.e. independent samples). That's what happened in the activity (segment dataset): you built your model on 1,500 instances and validated it against the 810 instances of your test set. This makes perfect sense: the more training data you provide, the greater the classifier's success rate will be (so a smaller training set will not yield better results at all). Now, suppose you include data from the test set in your training set: that's like cheating, because the goal is to predict samples that are still unknown to the classifier.
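To make that concrete, here is a rough sketch in plain Python (not Weka itself; the dataset, labels, and counts below are made up for illustration): shuffle the data once, hold out a fraction for testing, and keep the two halves strictly separate.

```python
import random

def train_test_split(instances, test_fraction=0.35, seed=1):
    """Shuffle once, then hold out a fraction of the data for testing.

    Because both halves are drawn from the same shuffled pool, they come
    from the same statistical population (independent samples).
    """
    data = list(instances)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

# Hypothetical labelled instances; 2,310 = 1,500 + 810 as in the segment data.
dataset = [(i, "sky" if i % 3 else "brickface") for i in range(2310)]
train, test = train_test_split(dataset)
print(len(train), len(test))
```

The model is then built on `train` only, and `test` is used just once to estimate accuracy; moving any test instance into `train` is exactly the "cheating" described above.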

However, Weka is pretty helpful when it comes to creating both training and test sets if you only provide one large set of data, as it will automatically split the data for you via stratified cross-validation. Check out slides 24 and 25 to see how that works.
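A rough sketch of what the stratification step does (plain Python; Weka's own implementation differs, and the labels here are invented): each class's instances are dealt round-robin across the folds, so every fold preserves the class proportions of the full data set.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Return k folds of instance indices whose class proportions
    mirror those of `labels` as closely as possible."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal this class's instances round-robin over the folds.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

# 70 "sky" / 30 "brickface" instances: every fold gets exactly 7 and 3.
labels = ["sky"] * 70 + ["brickface"] * 30
folds = stratified_folds(labels)
print([len(f) for f in folds])  # → [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```

In cross-validation each fold then serves once as the test set while the other k-1 folds train the model, and the k accuracy figures are averaged.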

Hope that helped,
Wolfgang