Faster symbolic regression with numpy


Vinícius Melo

Oct 21, 2015, 2:32:44 PM
to deap-users
Dear all,

I've been using DEAP in my research and it is a great tool. Unfortunately, it became slow when I started working with larger datasets. I then found the symbreg_numpy example, which should be much faster. However, that example uses a single variable, while my dataset has 50 columns and 10000 rows, and my adaptation of symbreg_numpy to an ndarray was not working properly. I tried map and vectorize (because gp.compile returns a lambda function that needs separate arguments, not a list or array), but the timings were worse. I then used eval(gp.PrimitiveTree(individual)) to calculate the fitness, which gave me around a 55% reduction in computing time. However, I'm pretty sure there is a more appropriate way of calculating the fitness, but I couldn't find it.

So, could someone please provide an example of GP using a numpy matrix? For instance, a random uniform matrix with 3 columns and 10 rows, where y = ARG0**3 + ARG1**2 + ARG2.

Sincerely,

Vinícius.


Marc-André Gardner

Oct 22, 2015, 12:55:40 AM
to deap-users
Hi Vinicius,

Symbolic regression without numpy can indeed be quite slow with large datasets. Fortunately, it is usually easy to vectorize the operations with numpy.

Here's the requested example : https://gist.github.com/mgard/63d07c5bd79f25671310

Basically, the only things I've modified from the original symbreg_numpy example are:
  • Line 39 : tell DEAP that you want to produce individuals with 3 input arguments (instead of 1)
  • Line 44 : renaming of all the arguments (optional)
  • Lines 55-56 : definition of the training set. Be careful to avoid defining it in your evaluation function, or you will experience a lot of overhead (the example is correct on this aspect; it is just a remark)
  • Line 63 : the call to the individual is modified to allow the use of a matrix as input. Basically, I want each column of the samples matrix to be fed as one argument, so I transpose the matrix and unpack it with *. Of course, the dataset could be created in such a way as to simplify this line (in particular, to avoid the transposition), but I wanted to stick to your problem definition.

And that's all: you are now able to use multidimensional inputs. In my research, I used DEAP for symbolic regression on datasets with up to 500 features and it works like a charm.


By the way, I don't know your exact setup, but don't forget that numpy's performance is crucially tied to the backend library it was built against (BLAS/OpenBLAS/ATLAS/MKL, etc.).
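If you are unsure which backend your numpy build uses, numpy can report its own build configuration:

```python
import numpy

# Prints the BLAS/LAPACK configuration numpy was compiled against
# (e.g. OpenBLAS, ATLAS, or MKL)
numpy.show_config()
```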


Have fun with DEAP,


Marc-André

Vinícius Melo

Nov 19, 2015, 7:16:04 AM
to deap-users
Hi Marc-André,

thank you for the prompt answer and sorry for my delay.

Your code is working great. My mistake was that I didn't transpose the matrix before calling func.

However, in my setup my code using eval is faster than compiling and using numpy. Could you check that, please?



import numpy
from deap import gp

# Rename the numpy ufuncs so that eval() can resolve the primitive names
from numpy import add as vadd
from numpy import subtract as vsub
from numpy import multiply as vmul

# Create references to the columns, so the terminals (x, y, and z)
# are actual variables in the global environment.
# A loop could do the same for many columns:
#     for i in range(samples.shape[1]):
#         globals()["ARG" + str(i)] = samples[:, i]
x = samples[:, 0]
y = samples[:, 1]
z = samples[:, 2]

def evalSymbReg(individual):
    # Evaluate the individual, for instance vadd(x, vmul(y, z))
    res = eval(str(gp.PrimitiveTree(individual)))
    # Sum of squared differences between the expression
    # and the real function values: x**3 + y**2 + z
    diff = numpy.sum((res - values)**2)
    return diff,

Timings using func: %timeit %run deap_numpy_multiplevar_symbreg.py
1 loops, best of 3: 898 ms per loop

Timings using eval: %timeit %run deap_numpy_multiplevar_symbreg-eval.py
1 loops, best of 3: 744 ms per loop

Best regards,

Vinícius

Marc-André Gardner

Nov 19, 2015, 8:47:31 AM
to deap-users
Hi Vinicius,

I'm not sure I understand: what are you comparing? Are you using eval() on trees with non-numpy primitives, and compile() on trees with numpy primitives? I ask because, technically, gp.compile(tree) and eval(str(tree)) are basically the same thing.

Could you provide the code you're using for your tests, so I can understand better (you can strip your data and replace it with randomly generated matrices of the same shape if you want)? Numpy should give you a clear edge, especially when dealing with functions taking about 1 second to be evaluated...

Regards,

Marc-André

Vinícius Melo

Nov 23, 2015, 10:38:48 AM
to deap-users
Hi Marc-André.

I uploaded the source code to http://pastebin.com/p4hUdjJM

As you can see, it is a slightly modified version of the code you provided, so the individuals don't take 1 second to be evaluated. Maybe a bigger dataset would give numpy an advantage?

Best,

Vinícius.

Marc-André Gardner

Nov 23, 2015, 5:30:43 PM
to deap-users
Hi Vinicius,

Actually, you're using numpy in both cases :) What matters is not the evaluation function but the primitive set, and in both cases you are using numpy functions when you do something like pset.addPrimitive(numpy.add, 2, name="vadd").

The difference is that when you use eval(), you rely on global variables and global functions. Hence, you must define vadd and the others by explicitly importing them (from numpy import add as vadd), and you must define your inputs in the global scope with predefined names (in your case, x, y, and z). When using compile, you receive a standalone function which can process any arbitrary arguments.

Not only is this more trouble, it is also error-prone. For instance, if you have a test dataset on which you want to assess the performance of the best individual at the end of the evolution, you must redefine x, y, and z in the global scope prior to calling its evaluation. With compile, you just receive a function which can evaluate any arbitrary input data.

That said, there is indeed a slight difference in timing between eval() and compile(), in favor of the former. However, this difference is quite small (about 0.1 seconds on my computer for the whole evolution) and does not depend on the dataset size, so it does not really matter. It is interesting from a Python point of view (why is access to global variables faster than calling a function?) but does not change much in practical cases. In particular, note that your individuals do not take 1 second each to be evaluated; the whole evolution process takes 1 second to complete, which is quite a different thing :)

Do not hesitate if you have any remaining questions,

Marc-André

Vinícius Melo

Nov 25, 2015, 9:48:53 AM
to deap-users
Hi Marc-André,

I understand. Thank you for the explanation.

I had the same question about the performance of global vs. local variables. Regarding that, my current implementation does not create x, y, and z as globals, but uses that loop inside evalSymbReg. As the whole process takes almost one minute to finish, eval takes around 3 seconds less than compile, but I should probably go with compile to avoid the other errors you mentioned.

Anyway, this is just to let you know about that performance issue. I will open another thread with another question.

Best regards,

Vinícius.