DEAP performance


Mark Lefebvre

Apr 23, 2016, 6:13:35 PM
to deap-users
I am doing symbolic regression on a time series and having real trouble getting it to run any faster.
PyPy seems to make things about 10x slower (even the latest version), for reasons I don't fully understand.

Is there an easy way to profile DEAP so that I can verify which part is taking up all the time?

Since I'm fairly sure that the issue is the floating point calculations, is there a known way to speed that up?
Has anyone used anything like Numba to improve performance?

What other tools/tactics are applicable to improving Symbolic Regression performance with DEAP?

Thanks

-Mark Lefebvre

Marc-André Gardner

Apr 24, 2016, 1:36:55 AM
to deap-users
Hi Mark,

Indeed, PyPy is not very good at all with GP problems, because of how its internal JIT compilation works. If you're curious, there's a blog post explaining this in detail.

DEAP can be profiled with any Python profiler, including the cProfile module bundled with Python. However, I suspect that you will just see most of the execution time spent in the primitive functions.

Are you using numpy functions as primitives? For floating point computations on medium to large datasets, this makes a huge difference, because all the samples are evaluated at the same time, in a single Python function call. You can have a look at the symbreg_numpy example. Going from pure Python primitives to numpy primitives is usually quite easy.
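
For reference, here is a rough sketch of what the numpy flavour looks like, in the spirit of the symbreg_numpy example (the quartic target below is just that example's toy problem - substitute your own primitives and data):

import numpy
from deap import base, gp

# Primitives are numpy ufuncs, so one call evaluates every sample at once.
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(numpy.add, 2, name="add")
pset.addPrimitive(numpy.subtract, 2, name="sub")
pset.addPrimitive(numpy.multiply, 2, name="mul")
pset.addPrimitive(numpy.negative, 1, name="neg")
pset.renameArguments(ARG0="x")

toolbox = base.Toolbox()
toolbox.register("compile", gp.compile, pset=pset)

# The whole dataset lives in arrays; there is no per-sample Python loop.
x = numpy.linspace(-1, 1, 10000)
y = x**4 + x**3 + x**2 + x

def eval_symbreg(individual):
    func = toolbox.compile(expr=individual)
    diff = func(x) - y            # one call over all 10000 samples
    return (numpy.mean(diff**2),)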

If you're already using numpy, then let us know and we'll try to find other ways to speed up your program.

Have fun with DEAP,

Marc-André

Mark Lefebvre

Apr 24, 2016, 1:40:27 PM
to deap-users
I had taken a swing at numpy and did not see a difference.
However, I'm not entirely sure I had everything properly in a numpy array the way I should have... my code at this point is poorly written organic code, I'm afraid. Far too much patching from examples and too many changes of GP libraries :)
Let me take another shot at the numpy example and see if I can get all the data types right all the way through.

It was very logical to me that JIT wouldn't help - I was perhaps being optimistic in trying every new version as it came out. :)

I really liked the idea of using x86 opcodes directly, which was blogged about here: http://multigrad.blogspot.com/
but it looks like that work was never finished (nor do I have the skill to finish it), and I thought the most likely candidate for a speedup was that last compile step anyway.
My thought was that if something like Numba could just do the code generation instead of the compile() step, that would be pretty quick...

Mark Lefebvre

Apr 25, 2016, 12:50:37 AM
to deap-users
Well, something is up with my implementation - it somehow appears to overwrite numpy (after failing to even execute) - on two different machines.
However, I did fresh installs to find out how much of a speed increase could be expected with numpy, and symbreg.py seems to run faster than symbreg_numpy.py on both of my machines as well.
Has anyone else tested this? Are these two examples even the same? They don't appear to be: the numpy version runs out of tree depth at 90 if you run it for over 100 generations, whereas the normal one I ran for 2k generations without errors.

Is there some other example I can use to test the expected performance increase with numpy?

I'm using Python 2.7.11 (fresh install on both machines) and the latest numpy from pip - I've also tried downloading the latest DEAP and installing it manually, since pip gives me a slightly older version (1.0.1, I believe).

One of the machines is running Mint (don't ever run Mint) and the other is a nice Gentoo box (why I ever installed anything else is beyond me).

As for profiling my stuff, I did try this, but was getting all kinds of errors (I have done this with other libraries in the past, however), so I just was not sure whether someone else had done it, and how.

Any advice would be greatly appreciated, as I'm completely baffled by how I'm running into so many problems with such a simple performance test.

Thanks

-Mark Lefebvre

Marc-André Gardner

Apr 25, 2016, 1:46:15 AM
to deap-users
Hi Mark,

symbreg_numpy.py and symbreg.py solve the same problem (quartic polynomial regression), but with different datasets. In particular, you can notice that the numpy version evaluates 10000 points for each individual instead of 20 for the "standard" version. Since the numpy version takes (on my computer) about twice the time needed by the standard version, one can deduce that numpy is actually 10000 / 20 / 2 = 250 times more efficient per point, which is quite significant. This also explains the different results obtained, by the way.

As for the profiling issues: you can use the built-in Python profiler, so this should work (it works on the symbreg examples, at least):
python -m cProfile -s time symbreg.py
Of course, you can also output the profiling info and generate nice graphics with tools like gprof2dot, but this is a different topic.

I'm not sure I understand what you mean by "it appears to overwrite numpy", but given a sufficiently large dataset (> 500 samples with numerous features, at least), numpy should give you a good speedup. Take care, however, to pass the whole dataset at once, so as to take advantage of numpy's vectorization. Do not feed it sample by sample (say, in a Python loop), or you will likely end up with bad performance.
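
Here is a toy illustration of the difference (func below is only a stand-in for a compiled individual, and the data are random):

import numpy

def func(a, b):
    # stand-in for toolbox.compile(expr=individual); illustrative only
    return numpy.add(numpy.multiply(a, a), b)

data = numpy.random.rand(2000, 30)    # roughly the dataset size you describe
target = numpy.random.rand(2000)

# Sample by sample: 2000 Python-level calls, numpy barely helps.
errors = [func(row[0], row[1]) - t for row, t in zip(data, target)]
mse_loop = numpy.mean(numpy.square(errors))

# Whole columns at once: a single call, numpy does the looping in C.
mse_vec = numpy.mean(numpy.square(func(data[:, 0], data[:, 1]) - target))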

If you're willing to share some basic code (e.g. the parts relevant to the evaluation) then we could have a look at it and tell you whether you're doing it right or not, or how you could improve the efficiency.

Have a good day,

Marc-André

Mark Lefebvre

Apr 25, 2016, 10:26:33 AM
to deap-users

Thanks for the details, I hadn't realized the numpy version was doing 10000 points.  I'll do some modifications and see what I can come up with.

As for loops (plus, I still have to get my data into an ndarray from my CSV file), I am using one to create a sliding window.
I was never concerned about this before because it wasn't slow in my previous attempts, but I can see why I'd want to remove it entirely, or at least make it efficient.

Here is my sliding window loop in my eval_func (I use a0, b0, c0, etc as ARGs for easier tracking):

   for row in range(start, len(datali)):
      if row > 5:
         a5, b5, c5, ... = datali[row - 5]
         a4, b4, c4, ... = datali[row - 4]
         a3, b3, c3, ... = datali[row - 3]
         a2, b2, c2, ... = datali[row - 2]
         a1, b1, c1, ... = datali[row - 1]
         a0, b0, c0, ... = datali[row]
         # not using all the terms from above in the calculation, i.e. there are
         # holes in my matrix; the first row is shorter than the rest
         terms = [a0, b2, c3, ...]
         evaluated.append(code_comp(*terms))
      <do more stuff>

I would imagine that this may need to change...
I can picture doing more efficient sliding window operations; this just was never slow enough to worry about before.
Other than that, my entire dataset is a single array of 2000 rows or so and somewhat fewer than 30 columns.

Thanks for your help.

-Mark Lefebvre

Yannick Hold-Geoffroy

Apr 26, 2016, 6:11:03 PM
to deap-users
Hello Mark,

Without proper profiling information, I cannot say for sure where the speed issue is. However, I believe your implementation may be slowed down by all the memory allocation it must do in evaluated.append(), because of the growth of that variable.

If you could do this operation in place, with something like uniform_filter1d from scipy (depending on what code_comp() does), you would probably see a big gain in performance. Vectorizing your "for row" loop by letting a numpy/scipy operator do the looping would likely boost the program's efficiency. Even just preallocating your `evaluated` variable with numpy.empty() and inserting the values at their correct positions may improve performance, as sketched below.
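
A rough sketch of the preallocation idea, with stand-in names (your term selection and code_comp will of course differ):

import numpy

datali = numpy.random.rand(2000, 30)      # stand-in for your dataset

def code_comp(*terms):                    # stand-in for the compiled individual
    return sum(terms)

n_rows = len(datali)
evaluated = numpy.empty(n_rows - 6)       # one slot per window, allocated once

for i, row in enumerate(range(6, n_rows)):
    # hypothetical term selection; keep your own a0, b2, c3, ... picks here
    terms = (datali[row, 0], datali[row - 2, 1], datali[row - 3, 2])
    evaluated[i] = code_comp(*terms)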

If your dataset contains a lot of zeros, I believe scipy's sparse matrix module may help you obtain a better memory footprint and sometimes better performance on some operations.

Be sure that you load the data from your csv file only once and not at every evaluation.

Also, it may be worth looking into cProfile and a visualization tool like gprof2dot.

I cannot help you further than that without profiling information or a small working example demonstrating the slowness you describe.

Hope it helped,
Yannick Hold

Mark Lefebvre

Apr 26, 2016, 8:12:53 PM
to deap-users
Yannick,

This is good information. I was already on the trail of append once I realized that numpy is not any faster (actually, it is slower, and it spits out errors that the math functions did not). I have no idea why numpy is slightly slower - at least for the actual GP work; oddly, not when I use it outside of DEAP to do some mean and stddev calculations afterwards.

I saw an excellent talk by one of the PyPy folks that pointed out when Python copies memory.
I had never given that much thought before, since the last time I profiled this code it was the floating point evaluation that was taking all the time.
I'll give cProfile another go and see if I can get it to work... but fixing the sliding window, and perhaps append, would likely give me some speed increases.

I do, in fact, only read my CSV once; I also do the calculations once and then use that appended list (in case I want to modify it as well, though I haven't tried that yet) for my final calculations. There are no zeroes in my data set - it is too dense, if anything.

I find that small changes in precision make a huge difference in results, however, which is a different problem!  Then again, I have so many issues it's hard to know where to start. :)

This is a good starting point, however, so thanks to everyone so far.

If  anyone has any good sliding window code or a way around my append issue, I'd love to incorporate that instead. In the meantime, I'll keep scouring for more ways to speed this up.

-Mark Lefebvre

François-Michel De Rainville

Apr 26, 2016, 8:27:25 PM
to deap-users
I don't know if it is even possible in your case, but using convolution might help you a lot. I had to compute the standard deviation over a sliding window of adjacent pixels; transforming the sliding window into two convolutions gained me roughly a 100-1000x speedup.
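
In 1-D the same trick looks roughly like this (made-up series and window length; the windowed standard deviation falls out of two running means):

import numpy

x = numpy.random.rand(2000)        # made-up 1-D series
w = 5                              # window length
kernel = numpy.ones(w) / w

mean = numpy.convolve(x, kernel, mode='valid')           # E[x] per window
mean_sq = numpy.convolve(x * x, kernel, mode='valid')    # E[x^2] per window
std = numpy.sqrt(numpy.maximum(mean_sq - mean**2, 0))    # sqrt(E[x^2] - E[x]^2)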


Mark Lefebvre

Apr 27, 2016, 3:44:13 PM
to deap-users

It seems like even minor changes to the sliding window have an impact, so that is likely my culprit here.
I just realized that the code_comp definition is not in my example; it's just one line above (so much for precise cut and paste):

code_comp = toolbox.compile(expr=individual)

All I am doing is running the result of the evolution on each row of my file, with a 5-row lookback (hence the sliding window).
The post-processing I'm doing after this is pretty minor (and was sped up with numpy).
I never had a better way to do the sliding window, but even with faster hardware this method may no longer be fast enough.

Not sure if convolution is applicable here; I haven't yet found a clear example that I can apply directly.

Mark Lefebvre

May 2, 2016, 12:23:45 AM
to deap-users
Finally got cProfile to work (it doesn't like multiprocessing...) and it produced the attached graph.
Interestingly, I'm seeing more invocations of eaMuPlusLambda than I thought I would.
It is hard to nail down and will take me more time and some experiments, but my assumption was that I'd get toolbox.population + LAMBDA * GEN of these; I don't appear to - the number is higher.
More disturbingly, this number seems to go up as I adjust GEN, and the discrepancy grows by more than the increase in GEN.
I am running DEAP 1.1 (on this machine), but I can try other versions as well with virtualenv if need be.

Thanks

-Mark Lefebvre
output.png

François-Michel De Rainville

May 2, 2016, 6:36:55 AM
to deap-users

I'm not sure if it's only on my mobile device but the image has been shrunk and I can't read it.

François-Michel De Rainville

May 2, 2016, 8:37:51 AM
to deap-users
According to your graph, eaMuPlusLambda is called once; it is the eval function that is called 1005 times.

[inline image]

Mark Lefebvre

May 5, 2016, 12:40:47 PM
to deap-users
Sorry, I meant eval_func (my evaluation function), not eaMuPlusLambda.
I thought it would be toolbox.population + LAMBDA * GEN calls - but that is not what this number is, and it increases by more than GEN when GEN goes up... I expected this to be linear and easy to calculate.

This might be expected.  In any case, small changes are helping somewhat to speed things up.
I haven't found a way to make the sliding window faster, with all those variable assignments, which seem to be the big cost now.
Post-processing is mostly checks for things - I may be able to speed that up with some more efficient use of variables or something.

Is anyone else using a sliding window which is more efficient?

thanks

-Mark Lefebvre

Mk

May 6, 2016, 2:57:15 AM
to deap-users
I have seen a book, Python for Data Analysis, which I think would help you understand the concepts behind numpy, pandas, etc.
Pandas at least contains rolling window functions, for example as sketched below. If that does not help, you can still turn to Cython.
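
For example, something along these lines (the file and column names are placeholders):

import pandas

df = pandas.read_csv("data.csv")

# Lagged copies of a column, similar to the a0..a5 variables in the loop above:
for lag in range(1, 6):
    df["a%d" % lag] = df["a0"].shift(lag)

# Or rolling aggregates over a 5-row window:
df["a0_mean5"] = df["a0"].rolling(window=5).mean()

df = df.dropna()    # drop the first rows that lack a full window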

François-Michel De Rainville

May 6, 2016, 6:41:22 AM
to deap-users

What is the population size and lambda?

Mark Lefebvre

May 6, 2016, 9:54:32 AM
to deap-users
   MU = 50
   LAMBDA = 100
   XOVER = 0.7
   MUTATE = 0.25
   GEN = 10
   row = 0
   count = 0
   random.seed(318)
   #pool = multiprocessing.Pool()
   #toolbox.register("map", pool.map)
   pop = toolbox.population(n=50)
   hof = tools.ParetoFront()
   algorithms.eaMuPlusLambda(pop, toolbox, MU, LAMBDA, XOVER, MUTATE, GEN, halloffame=hof, verbose=False)

Mark Lefebvre

Jun 2, 2016, 10:35:52 PM
to deap-users

Removed some extra variables and bumped this up to 1000 generations - now things look quite different.
Numpy makes everything slower: I put my entire dataset in a numpy array, but it is still much slower than using the plain Python array and math operations.

Any ideas why numpy makes things slower?
There must be some way to speed this up.
bench3.png

François-Michel De Rainville

Jun 3, 2016, 5:22:44 AM
to deap-users
Are you still seeing your unexpected number of evaluations? What is the wrapper around varOr?

With numpy, the trick to making things faster is to reduce the number of calls through vectorization, as in the sketch below.
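
For your sliding window, that could mean building the lagged columns once with array slicing and then making a single call per individual. A rough sketch with made-up names and the 5-row lookback you described:

import numpy

datali = numpy.random.rand(2000, 30)    # stand-in for your dataset
n = len(datali)

# lag[k][i] corresponds to datali[row - k] for row = i + 5, so each lag is
# just a shifted slice of the same array; nothing is rebuilt per row.
lag = {k: datali[5 - k : n - k] for k in range(6)}

# Pick out the columns the evolved expression actually uses (a0, b2, c3 are
# placeholders for your real term selection):
a0 = lag[0][:, 0]
b2 = lag[2][:, 1]
c3 = lag[3][:, 2]

def code_comp(a0, b2, c3):              # stand-in for toolbox.compile(expr=individual)
    return a0 * b2 + c3

# One vectorized call per individual instead of one Python call per row.
evaluated = code_comp(a0, b2, c3)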

Mark Lefebvre

Jun 7, 2016, 1:28:45 AM
to deap-users
I have no idea where that number of evaluations comes from; it clearly isn't mu + lambda * generations. The code that I'm using to set those values is in this thread, but I changed generations to 1000. I don't call eval myself; it seems to only be called from eaMuPlusLambda (so why is it called 332900 times?). I don't wrap varOr; all I did was take the DEAP example code for symbolic regression, change it to eaMuPlusLambda, and add my code for the sliding window (which I pasted in this thread).

As for numpy - I'm not sure how it would ever make DEAP itself faster for symbolic regression. Clearly, it isn't helping to vectorize the variables in the sliding window - I'm not sure why.

Is there a more efficient way to set the DEAP variables so that they are a matrix? The sliding window in my example is at least a partial matrix (the first row has fewer elements than the other rows), so I would expect that it could be quicker. I'm not sure what DEAP does with the values once I set the variables.

François-Michel De Rainville

Jun 7, 2016, 6:33:08 AM
to deap-users
Can you share a minimal example reproducing your bug using gist or pastebin (or similar)?

Mark Lefebvre

Jun 25, 2016, 9:32:54 AM
to deap-users

I had some time to play with this today. What I discovered is that with your symbreg example, eval doesn't seem to be called the number of times I'd expect, whether I use the example as-is or switch eaSimple out for eaMuPlusLambda. Is this a problem with my Python configuration?

François-Michel De Rainville

Jun 26, 2016, 8:26:07 AM
to deap-users

Doesn't it call the function the number of individuals times the number of points?

Mark Lefebvre

Jun 26, 2016, 2:48:21 PM
to deap-users
At this point, I'm trying to get some baseline data.
How many times (as a number) should it be called?
When I run symbreg.py from your examples, I get this:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6913    0.313    0.000    0.317    0.000 {eval}
76752/19188    0.188    0.000    0.747    0.000 copy.py:145(deepcopy)
     6913    0.178    0.000    0.281    0.000 gp.py:87(__str__)
   145173    0.173    0.000    0.396    0.000 symbreg.py:59(<genexpr>)
58164/38676    0.109    0.000    0.285    0.000 creator.py:160(initType)
    76752    0.085    0.000    0.101    0.000 copy.py:267(_keep_alive)
     7174    0.083    0.000    0.124    0.000 gp.py:152(height)
    49721    0.045    0.000    0.055    0.000 random.py:273(choice)
    19188    0.044    0.000    0.662    0.000 gp.py:55(__deepcopy__)


Which looks wrong to me.
If I increase generations by 5, I get this:
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7773    0.386    0.000    0.391    0.000 {eval}
     7773    0.233    0.000    0.372    0.000 gp.py:87(__str__)
86516/21629    0.211    0.000    0.838    0.000 copy.py:145(deepcopy)
   163233    0.198    0.000    0.508    0.000 symbreg.py:59(<genexpr>)
65487/43558    0.123    0.000    0.320    0.000 creator.py:160(initType)
     8115    0.107    0.000    0.160    0.000 gp.py:152(height)
    86516    0.096    0.000    0.115    0.000 copy.py:267(_keep_alive)
493224/478759    0.054    0.000    0.058    0.000 {len}

What do you get?
Once I can confirm that this is correct, I'll modify symbreg.py to use NSGA2 and eaMuPlusLambda and see whether the count changes as expected.

François-Michel De Rainville

Jun 30, 2016, 11:35:09 AM
to deap-users
Are you mistaking the Python built-in eval function for your evaluation function? DEAP's GP module uses eval internally to build and evaluate trees.

Mark Lefebvre

Jul 3, 2016, 1:15:47 PM
to deap-users

Nothing in my code calls eval - however, I believe your library does use it.
At this point, all I am trying to determine is whether this call is being made too often, given the number of generations and the population sizes involved.
Again, this number seems to grow non-linearly as the number of generations increases. I do not know why this should be true; perhaps it is intentional.

Have you tested any of this?

Is there a unit test harness that I can use to validate my findings against the example code?

The findings that I have posted above - are those correct?  That is just using the examples provided in your distribution.
If I cannot find this same behavior there, then there is some possibility it is related to the (minor) changes I've made.
This seems unlikely to me, however, so I'd like some independent verification in order to debug further.

Thanks

-Mark Lefebvre

François-Michel De Rainville

Jul 3, 2016, 1:46:32 PM
to deap-users
With the git master, I get the exact same numbers (6913 and 7773). I still don't understand what you find abnormal about those numbers (eval is called by compile, which is called during evaluation). The number of evaluations is approximately given by:

population size 300 x 40 generations x ~0.55 variation rate = 6600
population size 300 x 45 generations x ~0.55 variation rate = 7425

My estimate of the variation rate is probably off by a bit, but the numbers are of the same order. I don't see any problem there.
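
A quick check in code (the 0.55 variation rate is only an estimate, and the initial population is counted as well, since it is also evaluated):

pop_size = 300
var_rate = 0.55    # approximate fraction of individuals re-evaluated per generation

for gens, observed in ((40, 6913), (45, 7773)):
    estimate = pop_size + pop_size * gens * var_rate
    print(gens, int(estimate), observed)    # ~6900 vs 6913, ~7725 vs 7773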

Best regards,
