Hello,
Thanks for this course on data mining and this new way of learning. I'm Charles from France and I'm a beginner in data mining.
I'm working on course 2, especially the part about training and testing sets. I don't exactly understand the difference between them.
I understand that the data given in the example data set is the training set. If we classify this data, there will be a difference between the results and reality: a kind of distortion introduced by the algorithm. To reduce this, we build a testing set from part of the training set.
In the example of the segment data set, we have 1500 instances in the training set and 810 instances in the test set (roughly half the size of the training set).
Why would a result be more accurate on a set with fewer instances? And why not just use the training set?
I understand that with a smaller data set (like the example set) we can reduce errors, so it will be easier to divide the data when building a tree. But if errors are reduced, isn't accuracy reduced too?
Is this definition correct?
The training data is the complete set of data we use. The example data set is a part of the training data.
Thanks for your help in clarifying these notions.
Charles Pucheu