Dear Michael,
In your blog post titled "Setting and using validation data", you explained that the training data is used to optimize models, whereas the validation data is used to test how well models generalize to new data, and that Eureqa also uses the validation data to select the best models to display in the user interface. From my point of view, this seems like a perfect idea. However, the software may not work quite the way I expected.
My data set has about 300 rows. If I set the training set to 90% of the data and the validation set to the remaining 10%, the models I get in the user interface look abnormally simple. I also find that these models fit the validation set perfectly but the training set poorly. Since the training error is no longer visible in the new version of Eureqa, I am wondering whether the training and validation data still work the way you described. If they do, I must say these simple models are meaningless, because they emphasize the validation data too much and largely ignore the training data.
To deal with this bias problem, I split my data manually into a training part and a validation part, but load only the training part into Eureqa. Then I set both the training and validation options to 100% of the loaded data and run the search. Next, I take the top 10 models listed in the user interface and test how well they generalize to the unused validation part. Finally, I choose the one with the best test result for my symbolic regression problem.
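To make the procedure above concrete, here is a minimal Python sketch of the part I do outside Eureqa. Everything in it is a placeholder of my own: the function names (`split_rows`, `mean_abs_error`, `pick_best`), the synthetic data, and the three candidate formulas, which only stand in for the models I would copy by hand from the Eureqa user interface. It does not call Eureqa, and mean absolute error is just one possible choice of test metric.

```python
import random


def split_rows(rows, holdout_fraction=0.1, seed=0):
    """Shuffle the rows and split them into (training part, held-out part)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(round(len(shuffled) * holdout_fraction)))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def mean_abs_error(model, rows):
    """Average absolute difference between model(x) and the observed y."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def pick_best(candidates, holdout_rows):
    """Return the name of the candidate with the lowest held-out error."""
    return min(
        candidates,
        key=lambda name: mean_abs_error(candidates[name], holdout_rows),
    )


# Synthetic stand-in for my data set: 300 rows of (x, y) with y = 2x + 1 + noise.
noise = random.Random(1)
rows = [(i / 10, 2.0 * (i / 10) + 1.0 + noise.gauss(0, 0.1)) for i in range(300)]

# Only train_rows would be loaded into Eureqa; holdout_rows never is.
train_rows, holdout_rows = split_rows(rows, holdout_fraction=0.1)

# Hypothetical formulas standing in for the top models shown by Eureqa.
candidates = {
    "constant": lambda x: 30.0,
    "linear": lambda x: 2.0 * x + 1.0,
    "quadratic": lambda x: 0.07 * x * x,
}

best = pick_best(candidates, holdout_rows)
for name, model in candidates.items():
    print(name, mean_abs_error(model, holdout_rows))
print("chosen:", best)
```

Because the held-out rows are never shown to Eureqa, the error computed on them cannot influence which models Eureqa decides to display, which is the point of doing the split by hand.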
My second question is about the numeric constants that appear in Eureqa's models. I know that in GP it is not easy to create enough constants and make them change properly. How do you achieve this so well in Eureqa? What a brilliant and wonderful job you have done.
Apart from Eureqa, I also use a toolbox named ECJ for the same task, but I can't get results as good as Eureqa's. So I am hoping you can tell me the tricks involved and help me improve my ECJ program. Thank you!
Best wishes to you and your family.