Duplicate individuals in the population?


Jim Leek

unread,
Apr 22, 2016, 3:11:30 PM4/22/16
to deap-users
I've put together a pretty nice little symbolic regression tool using DEAP 1.1.0, so thanks guys.

It uses a Pareto front with two objectives, "fitness" and "tree size," to try to keep bloat from overwhelming everything.

I recently added demes over MPI: I use multiprocessing to parallelize evaluation, but each deme has its own MPI process (I can't use SCOOP on my cluster).  However, this made me notice something: I was getting a lot of duplicates in the migrating population.  Sometimes every single migrant was the same function.

So I investigated the populations, and I found that, indeed, there were a lot of duplicates.  With NSGA-II, half the individuals were sqrt(X), which was the best current match for the function.  So when the best migrants were selected, they were ALL sqrt(X).  SPEA2 did a better job of maintaining diversity, but there were still a lot of duplicates.  (Maybe 10% were sqrt(X) at the same point in the computation.)

Does anyone have a suggestion for controlling this problem?  I can think of a couple of options:
1. Check for duplicates in every generation, and replace them with new random expressions.
2. Use the fortin2013 NSGA-II version. https://github.com/DEAP/experimental/blob/master/fortin2013/fortin2013.py
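Option 1 can be sketched in a few lines. This is a hypothetical helper, not code from the thread; it assumes the usual DEAP convention that `toolbox.individual()` creates a fresh random tree, and uses `str(ind)` as the deduplication key (which works for DEAP GP trees, since their string form encodes the whole expression):

```python
# Sketch of option 1: replace duplicate individuals with fresh random ones.
# str(ind) serves as the dedup key; toolbox.individual() is assumed to
# build a new random expression (standard DEAP setup).
def dedupe(population, toolbox):
    seen = set()
    out = []
    for ind in population:
        key = str(ind)
        if key in seen:
            ind = toolbox.individual()  # fresh random replacement
        seen.add(str(ind))
        out.append(ind)
    return out
```

Running this once per generation before selection would at least guarantee the migrants drawn from the population are distinct expressions.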

Thanks,
Jim

Marc-André Gardner

unread,
Apr 22, 2016, 3:23:19 PM4/22/16
to deap-users
Hi Jim,

Actually, if the only reason to go multiobjective is to avoid bloat, then maybe you could consider other bloat control techniques. Indeed, if you peek at the literature, you will see that multiobjective has been shown to be not really useful for bloat control. Some bloat control techniques are already implemented in DEAP: one of the simplest is the static height limit (available as a decorator), but the Double Tournament option is also quite efficient.

That being said, I can see another potential source for the problem: the replication rate. Usually in EAs we use a relatively high replication rate, but in GP it is often detrimental. What are your crossover and mutation rates? The sum of these two rates should be at least around 0.8, even 0.9.

Have a good day,

Marc-André

EBo

unread,
Apr 22, 2016, 3:27:47 PM4/22/16
to deap-...@googlegroups.com
When I ran into this on my hydrologic model calibration work, whenever I found a duplicate (not only in the current population, but among all previously evaluated individuals -- I cached each and every one), I simply passed that individual off to a mutation function. After the mutation, I once again tested it against the archived list. As a note, the fitness evaluation of each individual in my example typically takes several CPU-hours, so a full calibration run may represent 10,000 CPU-hours on one of our clusters.
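The scheme described above can be sketched as follows. This is a hypothetical reconstruction, not EBo's actual code: `mutate` and `evaluate` stand in for whatever the real toolbox provides, and the archive doubles as a fitness cache so the expensive evaluation runs only once per distinct expression:

```python
# Sketch of the archive scheme: every evaluated expression is cached.
# A duplicate is mutated until novel (up to max_tries); if it is still
# a known individual, the cached fitness is reused instead of paying
# for another expensive evaluation.
def evaluate_with_archive(ind, archive, mutate, evaluate, max_tries=10):
    key = str(ind)
    for _ in range(max_tries):
        if key not in archive:
            break
        ind = mutate(ind)        # push the duplicate somewhere new
        key = str(ind)
    if key in archive:           # still a known individual: reuse cache
        return ind, archive[key]
    fit = evaluate(ind)          # expensive call happens once per key
    archive[key] = fit
    return ind, fit
```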

Hope that helps,

EBo --

François-Michel De Rainville

unread,
Apr 22, 2016, 3:33:22 PM4/22/16
to deap-users
The keywords here are rediversification and avoiding early convergence/diversity loss. There are lots of papers out there on techniques to battle these in EAs.

Cheers,

Jim Leek

unread,
Apr 22, 2016, 3:45:59 PM4/22/16
to deap-users
I am actually using the static height limit, but that's pretty crude, and I do want to allow the solutions to get large, as the functions may get complex.  Fitness vs. complexity seemed like a reasonable trade-off anyway.  I'll read up on the Double Tournament.

Let's see, currently crossover is 0.7 and mutation is 0.2.  But I'm using varAnd, so I suppose the actual replication rate is (1 - 0.7) * (1 - 0.2) = 0.24.  I suppose if the bloat got too high I could see replication causing it, but I don't see any indication of that.
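For readers following along: in DEAP's varAnd, crossover (probability cxpb) and mutation (probability mutpb) are applied independently in sequence, so an offspring passes through both untouched, i.e. is a pure replication, with probability (1 - cxpb)(1 - mutpb). A quick check with the rates above:

```python
# varAnd applies crossover and mutation independently, so the chance an
# individual survives both stages unchanged is (1 - cxpb) * (1 - mutpb).
cxpb, mutpb = 0.7, 0.2
replication = (1 - cxpb) * (1 - mutpb)
print(round(replication, 4))  # 0.24
```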

I'll look into rediversification and avoiding early convergence/diversity loss as well.  Thanks,
Jim

François-Michel De Rainville

unread,
Apr 22, 2016, 3:50:43 PM4/22/16
to deap-users
I know Marc-André doesn't like to self-promote but I'd really look at this :

Gardner, M.-A., Gagné, C., & Parizeau, M. (2015). Controlling code growth by dynamically shaping the genotype size distribution. Genetic Programming and Evolvable Machines, 16(4), 455-498.

Plus it is already implemented in DEAP :D


Jim Leek

unread,
Apr 22, 2016, 5:04:20 PM4/22/16
to deap-users
I should've looked at how long that was before I printed it out, but it's really great!  Thanks a lot.

Jim Leek

unread,
Apr 23, 2016, 7:54:36 PM4/23/16
to deap-users
Do you happen to have any recommendations for rediversification and avoiding early convergence/diversity loss?  I have been reading through a couple of papers, but being new to the field it's not obvious whether there is a well-known solution.  I liked "An improvement of the standard genetic algorithm fighting premature convergence in continuous optimization" (2001) by J. Andre and co.  It at least showed that a few of the early things I was going to try were not going to be terribly effective.  So far the best approach seems to be to adapt the mutation rates as the run progresses, and to have some way to keep an eye on population diversity (which I currently don't have).  EBo's suggestion is at least a good first step, although my evaluation function is thankfully much faster on this particular problem.  Maybe not for the next one, though.

On the topic of bloat, I guess it's also possible that the multi-objective approach is actually working against me, since the sqrt(x) solution is so simple that it dominates everything on the size axis.  I'll set up a single-objective run on Monday, starting with double tournament selection I suppose.

It also occurred to me today that I need a different error function.  I'm currently still just using average absolute error, which is dumb on the data set I'm currently looking at.  I'm pretty sure the high numbers are absolutely dominating everything else.  It won't take me long to fix that on Monday.
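The fix described above might look something like the following. This is a hypothetical sketch, not Jim's actual code: it contrasts mean absolute error, where the largest targets dominate the sum, with a mean relative error that weights every sample comparably (the `eps` guard against near-zero targets is my own assumption):

```python
# Mean absolute error lets large target values dominate the fitness;
# relative error scales each residual by the target's magnitude so
# every sample contributes comparably.
def mean_absolute_error(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def mean_relative_error(pred, target, eps=1e-12):
    # eps guards against division by targets at or near zero (an
    # illustrative choice, not from the thread).
    return sum(abs(p - t) / (abs(t) + eps)
               for p, t in zip(pred, target)) / len(target)
```

For example, with targets [1, 1000] and predictions [2, 1010], absolute error sees only the big target's residual of 10, while relative error correctly flags the 100% miss on the small target.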

Thanks guys,
Jim