new strategy for multi-class classification - overlapped intervals - needs improvements

Mihai Oltean

Oct 4, 2017, 2:42:53 AM

to Multi Expression Programming

The existing methods (see previous emails in this group) use multiple genes to generate an answer (for instance, the gene with the maximum value gives the class). A more elegant solution is to have the class computed by a single gene. For instance, if the output is in the [0, 1] interval we say that the item belongs to class #0; if the output is in the (1, 2] interval we say that the item belongs to class #1, and so on. A fitness function for this approach is simple: if the output of a gene for a particular training item falls outside the interval of that item's class, the item is counted as incorrectly classified, and the fitness is equal to the number of incorrectly classified items.
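
To make the fixed-interval idea concrete, here is a minimal Python sketch. The function names are hypothetical (they are not taken from the MEP source code), and treating negative or non-finite outputs as belonging to no class is my assumption.

```python
import math

def fixed_interval_class(output):
    """Map a gene output to a class index: [0, 1] -> 0, (1, 2] -> 1, and so on.

    Assumption: negative or non-finite outputs fall outside every interval,
    so they map to None (no class).
    """
    if not math.isfinite(output) or output < 0:
        return None
    # ceil(output) - 1 gives the index of the half-open interval (k, k + 1];
    # max(0, ...) puts the closed left end 0 into class 0.
    return max(0, math.ceil(output) - 1)

def fixed_interval_fitness(outputs, labels):
    """Number of training items whose output falls outside its class interval."""
    return sum(
        1
        for out, label in zip(outputs, labels)
        if fixed_interval_class(out) != label
    )

# Outputs of one gene for five training items, and the true classes.
outputs = [0.5, 1.0, 1.5, 2.7, -0.3]
labels = [0, 0, 1, 1, 0]
print(fixed_interval_fitness(outputs, labels))  # 2 (the items with 2.7 and -0.3)
```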

There is a big problem with this approach: choosing the limits of the interval associated with each class. It is not advisable to use a predefined set of values for each interval, because shifting (translating and scaling) the outputs into those intervals could be expensive and hard for evolution to discover. A better idea is to let evolution discover the intervals.

We do the following for each instruction in the chromosome; the best instruction then gives the fitness of the chromosome. This is the natural way in which MEP behaves.

We take all the training data and see (for a particular instruction in the MEP chromosome) what the interval of output values is for each class. For instance, all data belonging to class #0 may have the output (generated by the current gene) in the [-3, 2] interval, all data belonging to class #1 may have the output in the [-1, 7] interval, and so on. If the intervals do not overlap, we have a perfect classifier on the training data: each data item can be clearly assigned to a single class.
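
The interval discovery described above can be sketched in Python as follows. The names `class_intervals` and `is_perfect_classifier` are hypothetical, and treating two intervals that share an endpoint as overlapping is my assumption (a shared endpoint means two items of different classes produced the same output).

```python
def class_intervals(outputs, labels):
    """For one gene, return {class: (min_output, max_output)} over the training data."""
    intervals = {}
    for out, label in zip(outputs, labels):
        lo, hi = intervals.get(label, (out, out))
        intervals[label] = (min(lo, out), max(hi, out))
    return intervals

def is_perfect_classifier(intervals):
    """True if no two class intervals overlap (touching endpoints count as overlap)."""
    # After sorting by lower bound, checking neighbours is enough:
    # if every interval ends before the next one starts, all pairs are disjoint.
    ordered = sorted(intervals.values())
    return all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))

# Outputs of one gene for six training items, and the true classes.
outputs = [-3, 2, 0, -1, 7, 4]
labels = [0, 0, 0, 1, 1, 1]
intervals = class_intervals(outputs, labels)
print(intervals)                         # {0: (-3, 2), 1: (-1, 7)}
print(is_perfect_classifier(intervals))  # False, because [-3, 2] and [-1, 7] overlap
```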

Problems appear when the intervals overlap. If the output of a gene for a particular data item falls in two intervals, it is impossible to say which class the item belongs to. In this case we compute the fitness as the sum of the overlap degrees over all pairs of intervals. More exactly, for each pair: if the intervals do not overlap, we add nothing. If one interval is included in the other, we add 100% to the fitness. If the intervals overlap, but not completely, we add the degree to which they overlap (computed as the length of the overlap divided by the sum of the lengths of both intervals).
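
Here is a minimal Python sketch of this overlap-based fitness. The function names are hypothetical, and the handling of intervals that only touch at a single point (they contribute 0 unless one contains the other) is my assumption, since that case is not covered above.

```python
from itertools import combinations

def pair_overlap(a, b):
    """Overlap degree of two closed intervals a = (lo, hi) and b = (lo, hi)."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
    if overlap < 0:
        return 0.0  # disjoint intervals add nothing
    a_in_b = b_lo <= a_lo and a_hi <= b_hi
    b_in_a = a_lo <= b_lo and b_hi <= a_hi
    if a_in_b or b_in_a:
        return 1.0  # one interval included in the other: add 100%
    # Partial overlap: overlap length over the sum of both lengths.
    # The sum is positive here, because two zero-length intervals that
    # overlap are identical and were already handled as inclusion.
    return overlap / ((a_hi - a_lo) + (b_hi - b_lo))

def overlap_fitness(intervals):
    """Sum of overlap degrees over all pairs of class intervals (to be minimized)."""
    return sum(pair_overlap(a, b) for a, b in combinations(intervals.values(), 2))

# The two intervals from the example: overlap [-1, 2] has length 3,
# the lengths are 5 and 8, so the degree is 3 / 13 (about 0.23).
print(overlap_fitness({0: (-3, 2), 1: (-1, 7)}))
```

A fitness of 0 means that no pair of intervals overlaps, which corresponds to the perfect-classifier case.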

We want to minimize the overlap between intervals, so the fitness has to be minimized.

I have tested this method, but currently it does not provide results comparable with the previously tested method ... so we need help to improve it...