Multi Expression Programming for multi-class classification problems

Mihai Oltean

Feb 21, 2016, 10:51:52 AM
to Multi Expression Programming
Dear all,

So far, my implementations of MEP have focused on symbolic regression and binary classification problems. There were already several ways in which MEP could solve classification problems with multiple classes, but I was not very satisfied with those approaches...

Today I pushed to GitHub an implementation of MEP for solving classification problems with multiple classes (2 or more). See the repository https://github.com/mepx/mep; the code is in the mep_multi_class.cpp file inside the src folder.

How is this implemented? Well ... in the MEP way ...

Recall that the main idea of MEP is to encode multiple solutions in the same chromosome ...

Now, back to classification.

Imagine that we have to solve a binary classification problem and we have the following MEP chromosome:

0: a
1: b
2: + 0 1
3: * 0 2
4: + 1 2
5: / 4 3
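
To make the encoding concrete, here is a minimal C++ sketch that computes the value of every gene of the chromosome above for given inputs a and b. The Gene struct and the evaluate function are hypothetical names used only for illustration; they are not taken from mep_multi_class.cpp, and the sketch does not protect against division by zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical gene layout, for illustration only.
struct Gene {
    char op;   // 'a' or 'b' for an input variable; '+', '*', '/' for an operator
    int arg1;  // address of the first operand (-1 for inputs)
    int arg2;  // address of the second operand (-1 for inputs)
};

// The example chromosome from this post.
const std::array<Gene, 6> chromosome = {{
    {'a', -1, -1},  // 0: a
    {'b', -1, -1},  // 1: b
    {'+', 0, 1},    // 2: + 0 1
    {'*', 0, 2},    // 3: * 0 2
    {'+', 1, 2},    // 4: + 1 2
    {'/', 4, 3},    // 5: / 4 3
}};

// Computes v0..v5 in a single pass: each gene only refers to lower addresses,
// so its operands are already available when it is evaluated.
std::array<double, 6> evaluate(double a, double b) {
    std::array<double, 6> v{};
    for (std::size_t i = 0; i < chromosome.size(); ++i) {
        const Gene& g = chromosome[i];
        if (g.op == 'a') { v[i] = a; continue; }
        if (g.op == 'b') { v[i] = b; continue; }
        // Operator genes must point to genes at lower addresses.
        assert(g.arg1 >= 0 && g.arg1 < static_cast<int>(i));
        assert(g.arg2 >= 0 && g.arg2 < static_cast<int>(i));
        const double x = v[g.arg1];
        const double y = v[g.arg2];
        switch (g.op) {
            case '+': v[i] = x + y; break;
            case '*': v[i] = x * y; break;
            case '/': v[i] = x / y; break;  // no protection against y == 0 here
            default:  assert(false && "unknown operator");
        }
    }
    return v;
}
```

For example, with a = 1 and b = 2 the six values are 1, 2, 3, 3, 5 and 5/3.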

Which gene will provide the output for class 0 and which one will provide the output for class 1? Good question ...

What if we say that genes at even addresses (like 0, 2 and 4) will provide the output for class 0 and genes at odd addresses (like 1, 3 and 5) will provide the output for class 1? In such a setup we have multiple genes providing the output ... the MEP way ...

Here is how we do it in more detail:

1. For a particular example from the training data, we compute the value of all expressions in the chromosome. In our particular case we have 6 values: v0, v1, v2, v3, v4 and v5.
2. We find the maximal value (out of those 6 values). If there are multiple maximal values, we take the first one.
3. If the maximal value is at an even address (like 0, 2 or 4 in our example), that training example is said to belong to class 0. Otherwise, it belongs to class 1.
4. We repeat this for every training example and count the incorrectly classified ones ... that count is the fitness.

That is all!
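
Steps 2 to 4 can be sketched in C++ as follows. This is only an illustration under some assumptions: the gene values of each training example (step 1) are taken as already computed, and the function names first_argmax, predict_binary and count_errors are hypothetical, not the ones used in mep_multi_class.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 2: address of the maximal value; if there are ties, the first one wins.
std::size_t first_argmax(const std::vector<double>& values) {
    assert(!values.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (values[i] > values[best]) {  // strict '>' keeps the first maximum
            best = i;
        }
    }
    return best;
}

// Step 3: even address -> class 0, odd address -> class 1.
int predict_binary(const std::vector<double>& values) {
    return static_cast<int>(first_argmax(values) % 2);
}

// Step 4: fitness = number of incorrectly classified examples (lower is better).
// gene_values[k] holds the values v0, v1, ... computed for training example k.
int count_errors(const std::vector<std::vector<double>>& gene_values,
                 const std::vector<int>& targets) {
    assert(gene_values.size() == targets.size());
    int errors = 0;
    for (std::size_t k = 0; k < gene_values.size(); ++k) {
        if (predict_binary(gene_values[k]) != targets[k]) {
            ++errors;
        }
    }
    return errors;
}
```

With the values 1, 2, 3, 3, 5, 1.5 the maximum is at address 4, which is even, so the example is assigned to class 0.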

If we have 3 classes, we can say that genes at addresses 0 and 3 will provide the output for class 0, genes at addresses 1 and 4 will provide the output for class 1, and genes at addresses 2 and 5 will provide the output for class 2.
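
Both the even/odd rule and the 3-class assignment above are consistent with taking the address of the winning gene modulo the number of classes. The sketch below shows that reading in C++; the modulo rule is my generalization of the two examples, and predict_class is a hypothetical name, so the code in the repository may assign genes to classes differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed rule: the gene at address i provides the output for class
// i % num_classes. The class of the first maximal gene value is returned.
int predict_class(const std::vector<double>& values, int num_classes) {
    assert(!values.empty());
    assert(num_classes >= 2);
    std::size_t best = 0;
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (values[i] > values[best]) {  // strict '>' keeps the first maximum
            best = i;
        }
    }
    return static_cast<int>(best % static_cast<std::size_t>(num_classes));
}
```

With 3 classes, a maximum at address 4 gives class 1, and a maximum at address 5 gives class 2; with 2 classes the function reduces to the even/odd rule.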

So far I have made only a few runs on:

1. The Iris dataset, which has 3 classes. I got only 1 example incorrectly classified, which is 1 step from perfect (I'm curious whether a perfect classification is possible here; I must research more).
2. The gene dataset from PROBEN1. I used the entire set for training, and in 1 run I obtained an error of 7.7%. Playing more with the parameters may improve this.

regards,
mihai