My first attempt at GP using deap - seeking advice


Lionel C

Jun 22, 2014, 7:15:03 PM
to deap-...@googlegroups.com
Hello,

I am trying to apply Genetic Programming to stock selection. My goal is to find a model that uses fundamental financial data (i.e. accounting data from financial statements) to rank stocks. The model will be used to construct a portfolio of the highest-ranked stocks.

I have relied on deap to run my program (see attached). I have found the framework very easy to implement so far, although I may have misused it. This is where I would like to get some feedback and advice.

Data set:
To train and validate the model, I have assembled a large dataset: more than 100 financial measures (of which 50 are ratios that are commonly used to evaluate stocks), across 20 years for close to 2000 stocks. Some stocks do not have data for the whole period, so the number of observations (a given stock for a given year) is around 17000. The data set also shows the performance of the stock in the subsequent 12-month period. The model should therefore be able to predict which stocks are most likely to outperform, based on their fundamentals.

I have split the dataset in two: half for training, half for validation.
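A minimal sketch of such a 50/50 split (the function name is mine, not from the attached script; a random split per observation is shown, though splitting by date instead would avoid look-ahead between the halves):

```python
import random

def split_half(observations, seed=0):
    """Shuffle and split observations into equal training/validation halves."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]
```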

Fitness function:
I have tried two fitness functions so far. The first compares the model's output (i.e. the score) against the performance of two observations a and b. If score(a) > score(b) and perf(a) > perf(b), the pair is ranked "correctly"; likewise if score(a) < score(b) and perf(a) < perf(b). The fitness function returns the percentage of pairs that were ranked correctly. The second fitness function relies on Spearman's rho, which measures rank correlation.
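The two fitness ideas described above could be sketched like this (function names and the tie handling are mine, not taken from the attached script; `scores` and `perfs` are parallel lists of model score and subsequent 12-month performance):

```python
from itertools import combinations

def pairwise_accuracy(scores, perfs):
    """Fraction of pairs (a, b) ranked in the same order by score and perf."""
    correct = total = 0
    for a, b in combinations(range(len(scores)), 2):
        s = scores[a] - scores[b]
        p = perfs[a] - perfs[b]
        if s * p != 0:          # skip ties in either ranking
            total += 1
            if s * p > 0:       # same sign: pair ranked correctly
                correct += 1
    return correct / total if total else 0.0

def spearman_rho(scores, perfs):
    """Spearman rank correlation (no tie correction, for brevity)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=xs.__getitem__)
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rs, rp = ranks(scores), ranks(perfs)
    n = len(scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, rp))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Either function can be returned (as a one-element tuple) from a DEAP evaluation function.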

Now for the questions:
- the program takes quite a long time to run on my laptop. I have tried to use SCOOP, but I get error messages (division by zero, even though I use protected division). For the time being, I am therefore reduced to training the model on a small subset of the data and for a small number of generations.
Question: do you see any obvious way to improve my program's performance, or is my data set just too large?

- I run GP several times in succession, to train a model on a period (say 1994-2000) and validate the result on the subsequent year (2001). I then train the model on 1994-2001 and validate on 2002, and so on. However, I get strange results: with every generation, the average fitness increases, but the min and max stay relatively constant. I would have expected the max and min to increase as well.
Question: Is there a way to log details on each generation's population to better understand these statistics?
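For reference, the expanding-window scheme described above can be sketched as follows (`run_gp` and `evaluate` are placeholders for the actual DEAP training loop and out-of-sample validation in the attached script):

```python
def walk_forward(years, first_train_end, run_gp, evaluate):
    """Train on years up to `end`, validate on `end + 1`, then expand the window."""
    results = {}
    for end in range(first_train_end, years[-1]):
        train_years = [y for y in years if y <= end]
        model = run_gp(train_years)                   # e.g. train on 1994..end
        results[end + 1] = evaluate(model, end + 1)   # validate on the next year
    return results
```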

LC



experimenting33.py
terminal.txt

Mark Vicuna

Jun 23, 2014, 8:21:39 AM
to deap-...@googlegroups.com
Hi,

  I can only help with the division by zero: I have noticed that if I use NumPy floating-point types, the example protectedDiv doesn't catch the numerical errors.
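To illustrate: the symbolic regression example's protectedDiv catches ZeroDivisionError, but NumPy scalars don't raise it; np.float64(1) / np.float64(0) returns inf with a runtime warning instead. A sketch of a variant that handles both cases (falling back to 1.0 like the example):

```python
import numpy as np

def protected_div(left, right):
    # Coerce to NumPy floats so both Python and NumPy inputs behave the same;
    # suppress the divide/invalid warnings NumPy emits instead of raising.
    with np.errstate(divide='ignore', invalid='ignore'):
        result = np.float64(left) / np.float64(right)
    # x/0 gives inf and 0/0 gives nan; both fall back to 1.0
    return float(result) if np.isfinite(result) else 1.0
```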

Mark.

Yannick Hold-Geoffroy

Jun 26, 2014, 4:24:55 PM
to deap-...@googlegroups.com
Hello Lionel,

I am curious about the error you ran into while using SCOOP. A division by zero seems like an odd error to me...

Can you post a full traceback of a SCOOP run of your program (including the header)? Using either 1 or 2 workers ( -n 1 or -n 2 ) with the verbose flag ( -vvv ) would be good to generate enough debug data but not too much.

To answer your question on a more general basis, individual evaluation can be quite time consuming. To better understand where the time goes in your program, I recommend using a profiler. A good visual profiling method I recently discovered is gprof2dot, which generates a call graph showing the time spent in each function.
In a similar vein, a developer of DEAP recently wrote this blog article, which speeds up GP evaluation in DEAP by representing individuals directly as Python bytecode. This is more food for thought than a practically usable approach for now.

For your second question, I see that you are using eaSimple as the evolutionary loop. If you want to perform live debugging or checkpointing, I recommend copying the original eaSimple function into your program and modifying it to your needs. DEAP strives to be as transparent as possible and encourages its users to try their own ideas and algorithms (as shown in the documentation). That way, a simple "import pdb; pdb.set_trace()" will do the trick to help you understand what's going on. If you only want basic statistics recording, a whole section of the documentation covers that. You can register any statistical function you want (even one you defined yourself) while keeping the predefined algorithms; it suffices to pass a "stats" parameter with the right functions registered.

Have a nice day,
Yannick Hold


