Hello,
I am trying to apply Genetic Programming to stock selection. My goal is to find a model that uses fundamental financial data (i.e. accounting data from financial statements) to rank stocks. The model will be used to construct a portfolio of the highest-ranked stocks.
I have used DEAP to run my program (see attached). I have found the framework very easy to use so far, although I may have misused it. This is where I would like to get some feedback and advice.
Data set:
To train and validate the model, I have assembled a large data set: more than 100 financial measures (of which 50 are ratios that are commonly used to evaluate stocks), across 20 years, for close to 2,000 stocks. Some stocks do not have data for the whole period, so the number of observations (a given stock in a given year) is around 17,000. Each observation also records the stock's performance over the subsequent 12-month period. The model should therefore be able to predict which stocks are most likely to outperform, based on their fundamentals.
I have split the dataset in two: half for training, half for validation.
Fitness function:
I have tried two fitness functions so far. The first takes pairs of observations a and b and compares the ordering of the model's outputs (i.e. the scores) with the ordering of the subsequent performances. If score(a) > score(b) and perf(a) > perf(b), the pair is "correct". Likewise if score(a) < score(b) and perf(a) < perf(b). The fitness function returns the percentage of pairs that were scored (i.e. ranked) correctly. The second fitness function relies on Spearman's rho, which measures the correlation between the two rankings.
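To make the two fitness functions concrete, here is a minimal, standard-library-only sketch of both. It is not the code from the attached program: the function names, the choice to enumerate every pair, and the treatment of ties (a tied pair is never counted as correct; tied values share an average rank) are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_accuracy(scores, perfs):
    """Fraction of pairs (a, b) whose score order matches their performance order."""
    correct = 0
    total = 0
    for a, b in combinations(range(len(scores)), 2):
        total += 1
        score_diff = scores[a] - scores[b]
        perf_diff = perfs[a] - perfs[b]
        # Correct only if both differences are strictly positive or strictly
        # negative; ties are counted as incorrect (assumption).
        if score_diff * perf_diff > 0:
            correct += 1
    return correct / total if total else 0.0


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(scores, perfs):
    """Pearson correlation of the two rank vectors (Spearman's rho)."""
    rank_s = average_ranks(scores)
    rank_p = average_ranks(perfs)
    n = len(rank_s)
    mean_s = sum(rank_s) / n
    mean_p = sum(rank_p) / n
    cov = sum((x - mean_s) * (y - mean_p) for x, y in zip(rank_s, rank_p))
    var_s = sum((x - mean_s) ** 2 for x in rank_s)
    var_p = sum((y - mean_p) ** 2 for y in rank_p)
    if var_s == 0 or var_p == 0:
        return 0.0  # a constant ranking carries no information (assumption)
    return cov / (var_s * var_p) ** 0.5
```

Both functions take the model's scores and the realised performances as two equal-length lists. Enumerating every pair is quadratic in the number of observations, so on a large data set the pairwise version is far more expensive than the rank correlation.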
Now for the questions:
- The program takes quite a long time to run on my laptop. I have tried to use SCOOP, but I get error messages (division by zero, even though I use protected division). For the time being, I am therefore reduced to training the model on a small subset of the data and for a small number of generations.
Question: Do you see any obvious way to improve my program's performance, or is my data set just too large?
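For reference, a protected division primitive of the kind mentioned above is usually written along the following lines. This is only a sketch and not necessarily identical to the attached program: the name protected_div and the fallback value of 1.0 are assumptions.

```python
def protected_div(numerator, denominator):
    """Divide, but return a fixed fallback instead of raising on a zero denominator."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Fallback value is an assumption; 1.0 is a common choice.
        return 1.0
```

A guard like this only covers the division primitive itself. A division by zero raised anywhere else (for example inside the fitness computation) would not be caught by it.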
- I run GP several times in succession, to train a model on a period (say, 1994 - 2000) and validate the result on the subsequent year (2001). I then proceed to train the model on 1994 - 2001 and validate on 2002, etc. However, I get strange results: with every generation, the average fitness increases, but the min and max stay relatively constant. I would have expected the min and max to increase as well.
Question: Is there a way to log details on each generation's population to better understand these statistics?
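The train-then-validate scheme described above (train on 1994 - 2000 and validate on 2001, then train on 1994 - 2001 and validate on 2002, and so on) can be sketched as follows. The function names and the layout of an observation (a dict with a "year" key) are hypothetical and chosen only for illustration; the attached program may organise the data differently.

```python
def expanding_windows(first_year, first_validation_year, last_validation_year):
    """Yield (training_years, validation_year) pairs with a growing training window."""
    for validation_year in range(first_validation_year, last_validation_year + 1):
        # Training uses every year from first_year up to the year before validation.
        yield list(range(first_year, validation_year)), validation_year


def split_by_year(observations, training_years, validation_year):
    """Split observations (dicts with a hypothetical 'year' key) into two lists."""
    training_set = set(training_years)
    train = [obs for obs in observations if obs["year"] in training_set]
    valid = [obs for obs in observations if obs["year"] == validation_year]
    return train, valid


if __name__ == "__main__":
    # Tiny made-up observations, one per year, just to show the shape of the loop.
    observations = [{"year": year, "stock": "XYZ"} for year in range(1994, 2004)]
    for training_years, validation_year in expanding_windows(1994, 2001, 2003):
        train, valid = split_by_year(observations, training_years, validation_year)
        # A GP run would be trained on `train` and its best model checked on `valid`.
        print(training_years[0], training_years[-1], validation_year, len(train), len(valid))
```

Each pass through the loop corresponds to one independent GP run, so the per-generation statistics of one run are not directly comparable with those of the next, whose training window is one year longer.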
LC