This article demonstrates using genetic programming (GP) to solve a symbolic regression problem. The goal of symbolic regression is to find a program or expression that models observed data. This is closely related to linear regression, except the format of the data and the final model is not constrained in any way, other than by the provided primitives. Symbolic regression is probably the simplest example of genetic programming and is often used in most textbooks as a first problem.
In this example, I simulate experimental data with the equation
This equation represents the underlying data generation process (DGP)—the process causing or controlling the observed data. This function is graphed below:
In this case, the DGP is a simple nonlinear function of a single variable x. In a real-world case, the DGP is unknown and is observed as a stream of (x, y) pairs. Real world data is simulated for this experiment by providing a sequence of values y=f(x) for x in [1..100]. GP will be trained on these values and will. ideally, determine the actual DGP.
The first step in the GP process is to define the parameters and the primitive set. The values used in this experiment are shown in the table below:
|Max Initial Depth||5|
|*Functions||add, subtract, divide, multiply|
|*Terminals||randomInteger(-1 5), variable(x)|
*Problem specific; **Not a standard GP parameter
Note that the reproduction percentage is implied to be 5% (100 – mutation% – crossover%). The target function is used to provide the DGP. This parameter is specific to this experiment’s domain. The function and terminals need to be chosen with respect to the problem at hand. This simple example contains a limited set of primitives. Future examples will contain more complex mathematical and/or financial functions.
Recall from the last blog entry, the GP algorithm involves the following four steps. Most of these steps are standard to GP and don’t vary no matter the problem domain:
- Initialize population of programs
- Calculate each program’s fitness
- Build next generation of programs
- Repeat until solution found or max generations reached
Step 1 makes use of the available primitives and randomly generates a population of individuals subject to any parameterized constraints such as initial tree depth. The population size is fixed by corresponding parameter.
The fitness evaluation in step 2 is domain specific. The fitness function is a measure of how well any individual solves the target problem. In this case, each program to be evaluated on the same sequence of 100 (x,y) pairs generated by the DGP equation. To calculate the fitness of any individual program, use the program to calculate the value for y for each value of x. As we know the actual DGP formula (x3-x2+x-4), we can find the error for each of the 100 calculations. Fitness is the mean error over all (x,y) pairs.
In this experiment, as we know what the actual answer should be, we can stop if an exact solution is found. Otherwise, the process continues until the maximum number of generations is reached.
Step 3, evolution of the next generation, is done in the standard GP manner, as described in the last blog entry. One or two individuals are selected based on their fitness and are either mutated, mated, or reproduced to generate a single individual for the next generation. When the required number of new individuals (per the population size) are generated, the process repeats.
A video of the actual run is shown below. In this case, the video is not very interesting, as the algorithm finds a reasonably close solution almost immediately, likely due to the relatively large population size.
The final result from the GP run is the following expression:
(- (+ (+ (* x (* x x)) (- x (+ 3 1))) (+ x x)) (* x (+ x 2)))
This expression is the equivalent of the target equation: x3-x2+x-4.
A log of this run is available [Symbolic Regression Log 1.txt].
Here you can observe the fitness and best result after each generation of evolution.
Though there are several third-party GP implementations available, the program I am demonstrating in this and future blogs is a Java program, and early precursor of Evolutionary Signals, written as part of my PhD dissertation.
From the above discussion, one might get the notion that GP is just a highly parallel random search , trying different combinations of programs until a good one is found. Looking at the log of the experiment above, even the best result from the first (randomly created) generation has a reasonable fitness score (mean error = 4).
Yes, a blind search will find the answer in a simple problem such as the one above, but evolutionary search will find the answer much quicker. I can demonstrate this in a manner I have not seen this done anywhere else.
Let’s run the experiment again, but using a tournament size of 1 instead of 2. Doing this eliminates the “fittest” component of “survival of the fittest”. This is quite close to a random search. I still used Elitism, though this perhaps introduces a level of fitness into the selection process.
For this experiment, I also raised the maximum training generations from 100 to 10000. The first experiment found the correct answer before the maximum generations were reached, though for more complicated problems this generally won’t be the case.
The modified run took 83 generations to find the correct answer, compared to 12 generations in the first test.
You can watch the run in the video below.
The log of the run is available [Symbolic Regression Log 2.txt].
To get a broader set of results, I ran each experiment ten times. Both methods found the correct solution, but the random approach takes considerable more generations to do so.
|Generations needed to find solution|
|Run 1||Run 2|
The results of these experiments could also be used to predict future values. Assume we were asked to predict the value for x=101. As we found the correct equation for this, we just apply that equation to the new point to give the predicted value. Such a prediction approach can be applied to market prediction if the series under consideration is a stock series. Of course, stock series don’t follow such neat and clean formulas (if only they did). However, that doesn’t mean we can’t approximate the underlying DGP and make reasonable predictions.
In the next blog, I will address prediction of more complicated and chaotic time series.