In the last part of this series introducing genetic programming (GP), I described a simple experiment in symbolic regression. In that experiment, a formula was discovered based on data observations. This formula, which models the underlying data generating process (DGP), can be used to predict out-of-sample values. As the DGP was quite simple, the formula was consistently discovered within a handful of generations.
An example of a slightly more complicated DGP is shown below:
This series is described by the formula:
To model the DGP for this series, I set up a GP run to perform symbolic regression using the following parameters:
| Parameter | Value |
| --- | --- |
| Elitist | true |
| Fitness | Mean Error |
| Fitness direction | Ascending |
| Population Size | 10000 |
| Max Initial Depth | 3 |
| Max Depth | 17 |
| Mutation % | 8 |
| Crossover % | 87 |
| Training Generations | 100 |
| Selection Strategy | Tournament |
| Tournament Size | 4 |
| Functions | add, subtract, multiply, divide, sin, cos, square root, exponentiation, natural logarithm |
| Terminals | randomInteger(-1..5), variable(x) |
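The GP framework used for these runs is not shown in the post, so purely as an illustration, here is a minimal Python sketch of how the function and terminal sets in the table might be defined; the names are hypothetical, and the "protected" operators follow a common GP convention for functions that can fail on invalid inputs.

```python
import math
import random

# Protected operators: return a safe default instead of raising,
# so that randomly generated trees never crash during evaluation.
def p_div(a, b):  return a / b if abs(b) > 1e-9 else 1.0
def p_sqrt(a):    return math.sqrt(abs(a))
def p_log(a):     return math.log(abs(a)) if abs(a) > 1e-9 else 0.0
def p_pow(a, b):
    try:
        return math.pow(abs(a), b)
    except (OverflowError, ValueError):
        return 1.0

# Function set matching the table above: (callable, arity).
FUNCTIONS = {
    "add":  (lambda a, b: a + b, 2),
    "sub":  (lambda a, b: a - b, 2),
    "mul":  (lambda a, b: a * b, 2),
    "div":  (p_div, 2),
    "sin":  (math.sin, 1),
    "cos":  (math.cos, 1),
    "sqrt": (p_sqrt, 1),
    "pow":  (p_pow, 2),
    "log":  (p_log, 1),
}

# Terminal set: an ephemeral random integer constant in [-1, 5] or the input variable x.
def random_terminal():
    return random.randint(-1, 5) if random.random() < 0.5 else "x"
```

Protecting division, square root, logarithm, and exponentiation is what allows a large random population to be evaluated without runtime errors.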
A video of the run is shown below. The red series is the target equation. The blue series is the best fit after each generation. If all goes well, the blue series will eventually converge to the red series.
You can view the log of the run [Symbolic Regression Log 3.txt].
While the mean error is not zero (the model is not exact), GP does a reasonable job of approximating the target series, with a final mean error of 0.011952. The mean error is the average, over the 100 sample points, of the error between the predicted y value and the known y value at each x.
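For concreteness, here is that error measure as a short Python sketch, assuming the per-point error is the absolute difference between predicted and known y (the post does not state the exact formula):

```python
def mean_error(predicted, actual):
    """Average absolute error across paired sample points (here, 100 of them)."""
    assert len(predicted) == len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```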
As I did in the prior blog post, I will demonstrate that the GP methodology is doing the work. In the second experiment, I set the tournament size to 1, essentially eliminating the "fittest" from survival of the fittest. Elitism is still used, so any fit individual uncovered through mutation and crossover will be retained. However, these mutations and crossovers do not necessarily operate on fitter individuals.
| Parameter | Value |
| --- | --- |
| Tournament Size | 1 |
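To see why a tournament size of 1 removes selection pressure, here is a minimal sketch of tournament selection (not the author's implementation): with a size of 4, fitter individuals tend to win and become parents, while with a size of 1 the single contestant always "wins", so parents are chosen at random regardless of fitness.

```python
import random

def tournament_select(population, fitness, size):
    """Draw `size` individuals at random and return the fittest (lowest mean error)."""
    contestants = random.sample(population, size)
    return min(contestants, key=fitness)

# size=4: the best of four random draws is returned, biasing reproduction toward fitter trees.
# size=1: the lone contestant is returned unconditionally, i.e. purely random selection.
```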
This run does a much worse job than the first example, giving a mean error of 0.70398.
A video of the run is shown below. Very little progress is made over the 100 generation run.
You can view the log of the run [Symbolic Regression Log 4.txt].
I executed ten runs comparing models built with tournament sizes of 1 and 4. The average mean error at each generation is shown in the chart below.
In the real world, the series we choose to model are generally not as simple as the target series described above. Chaotic series are one such example; one is shown below.
The equation for this series is:
Chaotic series are generally described as series that appear random but are highly dependent on their initial conditions, in this case Y(0) = 0.9. A different initial value might yield a series that looks entirely different. Because these series are actually deterministic, they can in principle be modeled accurately. In practice, however, they are very hard to model and predict, as even a small error in the estimated initial value leads to a much larger error in overall accuracy.
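The target equation itself is not reproduced here, so purely as a stand-in, the logistic map (a textbook chaotic series, not the series used in this post) illustrates how a tiny change in the initial value quickly yields a completely different trajectory:

```python
def logistic_map(y0, r=4.0, n=30):
    """Iterate y[t+1] = r * y[t] * (1 - y[t]); a stand-in chaotic series, not the post's target equation."""
    ys = [y0]
    for _ in range(n - 1):
        ys.append(r * ys[-1] * (1.0 - ys[-1]))
    return ys

a = logistic_map(0.9)       # the initial condition used in the post
b = logistic_map(0.9001)    # a perturbation of one part in ten thousand
print(max(abs(x - y) for x, y in zip(a, b)))  # the trajectories diverge to order 1 within ~30 steps
```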
In the first experiment on this series, I use the parameters below, which are essentially the same as those used in the earlier experiments.
| Parameter | Value |
| --- | --- |
| Elitist | true |
| Fitness | Mean Error |
| Fitness direction | Ascending |
| Population Size | 10000 |
| Max Initial Depth | 5 |
| Max Depth | 17 |
| Mutation % | 8 |
| Crossover % | 87 |
| Training Generations | 100 |
| Selection Strategy | Tournament |
| Tournament Size | 4 |
| Functions | add, subtract, multiply, divide, sin, cos, square root, exponentiation, natural logarithm |
| Terminals | randomInteger(-1..5), randomDouble(-1..1), variable(x) |
The results are not very good: the mean error of this run is 0.21638. GP has an extremely hard time modeling this chaotic series because the setup lacks sufficiency, meaning the supplied functions and terminals are not sufficient to express the actual solution.
A video of the run is shown below.
You can view the log of the run [Symbolic Regression Log 5.txt].
Looking back at the formula describing the DGP, we can see that the current value depends on prior values of the series. This is typical of chaotic series. By including two additional terminals that provide the prior two values of the series, a much better approximation is found.
| Parameter | Value |
| --- | --- |
| Terminals | randomInteger(-1..5), randomDouble(-1..1), variable(x), offsetValueFixed(1), offsetValueFixed(2) |
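The offsetValueFixed terminal names are taken from the table above; exactly how the author's framework supplies them is not shown, but conceptually they expose lagged values of the series to each evaluated tree, roughly like this hypothetical sketch:

```python
def build_training_rows(series):
    """Pair each target value with the two lagged values that the
    offsetValueFixed(1) and offsetValueFixed(2) terminals would expose."""
    rows = []
    for t in range(2, len(series)):
        rows.append({
            "x": t,                 # the input variable
            "lag1": series[t - 1],  # offsetValueFixed(1): previous value
            "lag2": series[t - 2],  # offsetValueFixed(2): value two steps back
            "target": series[t],    # the value the evolved formula must predict
        })
    return rows
```

Giving the evolved trees access to these autoregressive inputs is what restores sufficiency for this series.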
This run achieved a mean error of 0.002874, much better than the initial run.
A video of the run is shown below.
You can view the log of the run [Symbolic Regression Log 6.txt].
I executed the second set of runs ten times, with and without the two additional autoregressive terminals. The results are shown in the following figure.
The results of these experiments indicate that while GP is highly parameter-dependent, as long as sufficiency is satisfied, the results tend to converge toward the correct or optimal solution.
As these demos did not take very long to run, it is certainly possible to get better results by using larger population sizes or running for more generations. In practice, we would certainly do this. However, for series whose DGP changes continually (think stocks), increasing these parameters could lead to over-fitting, or at least difficulty in reacting to such changes. The next entry in this series will discuss such a situation, also known as regime change.