Splitting data into training and validation sets


Sanjiv

Apr 24, 2012, 1:34:46 PM
to Eureqa Group
While using the "Custom mode" in the "Set Target" window of Eureqa II,
the dataset is split properly into training and validation sets
according to the percentages specified in the "Data Setting" window.
I have verified this by manually counting the coloured training and
validation data points in the "Solution fit" plot. However, I
noticed that if we select the "finding the global model" option in the
"Set Target" window, the percentages of training and validation
data points differ significantly from what was specified in the Data
Setting window. For instance, 39 out of a total of 131 points were
selected as training points, whereas the remaining 92 were shown as
validation data. In this case I had specified 50% as the fraction of
first rows (training set) and also of last rows (validation set) in the
Data Setting window. Does this have anything to do with the "Maximum
history data" setting (default 20%)? I need help from the Eureqa gurus
to sort out this data-splitting difficulty. Am I missing
something here? Incidentally, how does the "Maximum history data"
setting work?

Although not relevant to the above query, it would be highly desirable
(if possible) to include in the next version of Formulize a
setting for specifying the seed of the random number generator that
drives the genetic programming operations. Because GP is a stochastic
search method, its performance (like that of a genetic algorithm) can
vary substantially, with all other GP parameters held the same, if a
different sequence of random numbers is used.

Thanks in advance.

Sanjeev

Michael Schmidt

Apr 30, 2012, 5:05:09 PM
to eureqa...@googlegroups.com
Only the "Custom mode" respects the custom data split settings. The others, like "Find global model", use heuristics to pick the training and validation sets automatically based on the total number of rows. I would recommend always configuring the custom mode, but for most people the automatic options should work well.
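
For readers who want to check the counts by hand, here is a minimal sketch of a "first N% / last M%" split like the one configured in custom mode. It is an illustration only, not Eureqa's code: the function name `split_first_last` is hypothetical, and rounding down the row counts is an assumption.

```python
def split_first_last(n_rows, train_frac, valid_frac):
    """Take the first train_frac of rows for training and the last
    valid_frac of rows for validation (hypothetical helper)."""
    # Assumption: fractional row counts are rounded down.
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    train_rows = list(range(0, n_train))
    valid_rows = list(range(n_rows - n_valid, n_rows))
    return train_rows, valid_rows


# The 131-row dataset from the question, with 50% / 50% settings.
train_rows, valid_rows = split_first_last(131, 0.50, 0.50)
print(len(train_rows), len(valid_rows))
```

Under these assumptions the two sets come out at 65 rows each, with one middle row unused, which is a long way from the 39 / 92 split that the automatic mode produced.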

The percent of data used for history only affects searches that use the delay() or sma() building blocks. The setting effectively controls the maximum delay allowed in those functions. They require dedicating a portion of the data to history; otherwise they would reference rows before the first row. If you don't enable the delay() or sma() building blocks, this setting is ignored and has no effect.
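
A rough sketch of why delayed terms need history rows. This is not Eureqa's implementation: the helper names, the rounding, and the exact definitions of delay and moving average used here are assumptions made for illustration.

```python
def max_delay_rows(n_rows, history_frac=0.20):
    """Largest delay allowed if history_frac of the rows may be
    reserved as history (assumption: rounded down)."""
    return int(n_rows * history_frac)


def delay(series, k):
    """Value at row i is series[i - k]. Rows i < k have no value
    because they would reference rows before the first row."""
    return [series[i - k] for i in range(k, len(series))]


def sma(series, w):
    """Simple moving average over the current row and the w - 1 rows
    before it; the first w - 1 rows have no value."""
    return [sum(series[i - w + 1:i + 1]) / w
            for i in range(w - 1, len(series))]


print(max_delay_rows(131))          # rows reserved with the 20% default
print(delay([1, 2, 3, 4, 5], 2))    # two leading rows are lost
print(sma([1, 2, 3, 4], 2))         # one leading row is lost
```

In this sketch, a delay of k costs k usable rows at the start of the data, so capping the history fraction caps the delay the search can use.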

The software uses a time-based seed for random numbers. The only fixed seed is for the data splitting, so that stopping and restarting a search respects the previous split. We could make the seed configurable, though; it's a good idea.
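
The two kinds of seeding can be sketched as follows. Everything here is hypothetical (the constant `SPLIT_SEED`, both function names, and the shuffled split itself); it only shows the general pattern of a fixed seed for the split and a time-based, optionally overridable, seed for the search.

```python
import random
import time

SPLIT_SEED = 12345  # hypothetical fixed seed, not Eureqa's actual value


def split_rows(n_rows, train_frac, seed=SPLIT_SEED):
    """Shuffle row indices with a fixed seed so that repeated calls
    produce the same training/validation split."""
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    n_train = int(n_rows * train_frac)
    return sorted(rows[:n_train]), sorted(rows[n_train:])


def make_search_rng(seed=None):
    """Random generator for the search: time-based by default,
    reproducible when the caller supplies a seed."""
    if seed is None:
        seed = time.time_ns()
    return random.Random(seed)


# Same split every time, regardless of how the search is seeded.
print(split_rows(131, 0.5) == split_rows(131, 0.5))

# A user-supplied seed makes the search's random sequence repeatable.
print(make_search_rng(7).random() == make_search_rng(7).random())
```

A configurable seed of this kind is what would let two runs with identical GP parameters be compared directly.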

Michael




--
Eureqa Formulize ( http://www.nutonian.com )
-------------------------------------------------
View Group: http://groups.google.com/group/eureqa-group
