ECJ |
An Evolutionary Computation and Genetic Programming System
Bob Orchard (bob.orchard@nrc-cnrc.gc.ca)
National Research Council of
Institute for Information Technology
There are a number of topics that fall under the general category of Evolutionary Computing. These include Genetic Algorithms (GA) and Genetic Programs (GP). Gene Expression Programming (GEP) is a ‘new’ evolutionary algorithm (developed by Cândida Ferreira) that evolves computer programs in a way similar to GP. But the representation of the programs is more efficient and leads to less complex implementation of the genetic operators. The individuals of gene expression programming are encoded in linear chromosomes which are expressed or translated into expression trees when evaluation is required. Thus, in GEP, the genotype (the linear chromosomes) and the phenotype (the expression trees) are different entities (both structurally and functionally).
As in nature, the linear chromosomes consist of the genetic material that is passed on with modification to the next generation. Therefore, in GEP, all the genetic modifications take place in the chromosomes, and only the chromosomes are transmitted in the process of reproduction. After reproduction the new chromosomes are expressed forming the body or expression trees (ETs).
The ETs are
computer programs evolved to solve a particular problem and are selected
according to their capabilities in solving the problem at hand. With time,
populations of such computer programs discover new traits and become better
adapted to a particular selection environment (for instance, a set of
experimental results) and, hopefully, a good solution evolves.
This is variation
of Genetic Programming and not a ‘new’ type of evolutionary computing.
However, due to the genotype/phenotype representation and to the multigenic
organization of GEP chromosomes, this new algorithm generally leads to better
performance. It may also generate better solutions in many cases.
For all the
details of this new algorithm, see the seminal GEP paper
by Ferreira (published in Complex Systems) where the algorithm is fully
described and applied to a set of problems, including symbolic regression, boolean concept learning, and cellular automata.
Now much of the
above is taken directly from the Wikipedia definition of GEP and as it says,
the details can be found in Ferreira’s paper and in much greater detail in her
excellent book, Gene Expression programming: mathematical modeling by an
artificial intelligence. 2nd Edition.
Springer-Verlag: 2006. ISBN 3-540-32796-7. There is
also a commercial implementation of the algorithm by the company founded by
Ferreira, Gepsoft (www.gepsoft.com). The
product is called GeneXproTools. You might also look at the documentation
provided with the fully functioning (but limited) download of GeneXproTools to
gain much more insight and detail about gep, or just view the short tutorial (http://www.gepsoft.com/tutorial002.htm).
The intent of this ECJ implementation of the gep algorithm is to provide a free, and hopefully useful and accurate implementation of the software, with most of the major functionality. And, since the source code is provided, researchers and other interested parties can experiment with ways to enhance or improve some aspect of this approach to evolutionary computing.
One doesn’t need to understand the detailed implementation of gep in ECJ to use the system but it is necessary to know about some things so that the appropriate parameters can be set and your data can be processed.
Please NOTE that the output from any examples in this documentation may be different than the output you see when you execute the problems. The code is changing and this can cause changes to the output even with the same input (especially if code changes affect the way or order in which random numbers are allocated).
You need to supply 3 things to run an experiment: a parameter
file that defines how to do the evolution, a very simple Java
class (actually you don
The example is to determine a model (equation) that satisfies the following set of data (stored in the file test1.txt). The variables x and y are independent variables and z is the dependent variable, so z = f(x,y).
x
, y, z
0.5,
1.0, 1.25
1.0,
2.0, 3.0
2.0,
4.0, 8.0
3.0, 12.0, 21.0
You might quickly see that this is ‘perfect’ sample data for
the function z = x*x + y. Data files that are currently supported are the
so called CSV, comma separated value, text files. NOTE: gzip
Generally one will use either a
comma, a space or a tab to be the field separator. The example above uses a
comma. The first row of the file must identify the independent and dependent
variable names with the last column being the dependent variable. You’ll see
later that for time series problems there is another file format that is used
with ‘raw’ time series data, but for most problems the training data, when in a
file, has this format. So here we have 3 columns that identify the 2
independent variables x and y plus the dependent variable z.
There are 4 rows of numeric data that we will use as
The parameter file that tells the gep system how to behave as it generates models and proceeds from generation to generation is test1.param. It has the following content:
# This parameter file
supports a simple test example to solve x^2 + y.
#
# The data (including variable names) is in the file test1.txt.
# My parent:
parent.0 =
../../../gep/gep.params
# Problem
#===============================
eval.problem = ec.app.gep.test1.test1
# run for 100 generations, quit prematurely if I find something ideal
generations =
100
quit-on-run-complete =
true
# ec.Subpopulation
# ==============================
# subpop size is 50 individuals
pop.subpop.0.size =
50
# ec.Species
# ==============================
gep.species.numgenes = 2
gep.species.gene-headsize = 5
gep.species.gene-linking-function = +
# Problem type must be one of: functionfinding, classification,
timeseries, logical
# Set default to be
gep.species.problemtype = functionfinding
# specify the symbols ... functions and terminals (variables)
#
# the terminal symbols (variable names) will be obtained from the data
file (1st row)
gep.species.symbolset.terminalfilename = ec/app/gep/test1/test1.txt
# 4 functions + - * / are used in this problem
gep.species.symbolset.functionsize = 4
gep.species.symbolset.function.0 = Add
gep.species.symbolset.function.1 = Sub
gep.species.symbolset.function.2 = Mul
gep.species.symbolset.function.3 = Div
# Statistics
# ==============================
# output statistics to the file "test1.stat" in the directory
# the run was started in
stat =
ec.gep.GEPSimpleStatistics
stat.file =
test1.stat
Let..
/../../gep/gep.params
). This is the parameter file that holds all of the default settings for a
gep problem. You can over-ride any of the settings in this file by including
new values in your own param file (in this case, test1.param). But your param
file must reference this file so that the settings that you are not concerned
about are properly defined. If you looked at the gep.params file you
Next we see that the parameter eval.problem is set
to ec.app.gep.test1.test1
.
This is the name of the class that contains the code we need to supply for our
problem. We
The next 2
parameters identify the maximum number of generations (in this case 100) that
we should run when trying to find a solution to the problem and whether we
should quit when we find a
The following set of parameters that start with
gep.species identify various things that are very specific to gep problems.
Each individual in the population is a chromosome that can be composed of 1 or
more genes (there is an expectation that you
There are 4 types of problems that can be handled by the gep
approach (at this time). In this case we are doing a function finding problem.
Genes can be composed of various symbols that represent the functions and
terminals (independent variables) that can be used. There can also be constants
but more on that later. We need to identify these symbols so that the system
can build appropriate genes. The functions are identified by telling how many
there are (functionsize) and what they are (function.0, function.1, etc.). In
this problem we have decided to allow the genes to use the 4 functions:
addition, subtraction, multiplication and addition. There are a very large
number of supported functions. These can be found in the ec/gep/symbols
directory. It is quite simple to add more functions as needed. Just follow the
example of a similar type of function and create a new class. The terminal or
independent variable symbols can also be defined in the parameter file but for
this example we have a training data file that contains the terminal symbols in
the 1st row of the file. So we specify that gep.species.symbolset.terminalfilename
is ec/app/gep/test1/test1.txt
.
As shown above this will identify the 2 terminal symbols x and y (as well as
the dependent variable symbol z).
The final entries in the parameter file describe what we want in the way of output from the system. This is called the statistics output. In this example we use the class ec.gep.GEPSimpleStatistics to generate our output and specify that the output will go to the file stats1.txt. Users may want to build their own statistics classes to control the type of information that is generated.
Finally we need to provide a simple program that tells the system how to evaluate the fitness of chromosomes so decisions can be made as the individuals are discarded or moved to the next generation (possibly with modifications -- mutations, inversions, and various forms of recombination and transposition). The code for the class ec.gep.test.test1 is shown below.
import ec.*;
import ec.gep.*;
import ec.simple.*;
/**
* The problem is to find the equation x*x + y
*
* Data and variable names are in the data file test1.txt
*/
public class test1 extends GEPProblem implements SimpleProblemForm
{
static double IDEAL_FITNESS_MINIMUM = 999.9999;
public void evaluate(EvolutionState state, Individual ind, int
threadnum)
{
if (!ind.evaluated) // don
{
// Mean Squared Error (MSE) fitness is
normalized between 0 and 1000 (1000 * (1/(1+MSE))
double fitness =
GEPFitnessFunction.MSEfitness(true, (GEPIndividual)ind);
// the fitness better be SimpleFitness!
SimpleFitness f = ((SimpleFitness)ind.fitness);
f.setFitness(state,(float)fitness, fitness
>= IDEAL_FITNESS_MINIMUM);
ind.evaluated = true;
if (fitness >= IDEAL_FITNESS_MINIMUM)
{
((GEPIndividual)
}
}
}
}
This is the basic format that should be used in the user
program. The outer test just says that if we
Note: the 1st argument to a fitness function evaluation is a boolean that specifies whether training data (true) or testing data (false) should be used when calculating the fitness of the function. The training data will always be used when measuring the fitness of functions (models) created during the evolutionary process. The fitness of the model with testing data will normally be used to validate that ‘good’ models selected from the training (evolutionary) exercise are still good when subjected to other data.
So now we have all of the
required pieces to define our problem and to execute it we can use the command
line or the ECJ console. To run from a command line execute something like:
D:\ecj>java -classpath
./;./start/javacsv.jar ec.Evolve -file ec/app/gep/test1/test1.params
You
| ECJ
| An evolutionary computation system (version 15)
| By Sean Luke
| Contributors: L. Panait, G. Balan, S. Paus, Z. Skolicki,
|
J. Bassett, R. Hubley, and A. Chircop
| URL: http://cs.gmu.edu/~eclab/projects/ecj/
| Mail: ecj-help@cs.gmu.edu
| (better: join ECJ-INTEREST at URL
above)
| Date: April 4, 2006
| Current Java: 1.5.0_10 / Java HotSpot(TM) Client VM-1.5.0_10-b03
| Required Minimum Java: 1.3
Threads: breed/1 eval/1
Seed: 4357
Job: 0
Setting up
Initializing Generation 0
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.0.weight
ALSO: gep.species.symbolset.function.0.weight
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.1.weight
ALSO: gep.species.symbolset.function.1.weight
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.2.weight
ALSO: gep.species.symbolset.function.2.weight
WARNING:
Weight for GEP Function must be > 0; defaulting to 1)
PARAMETER: pop.subpop.0.species.symbolset.function.3.weight
ALSO: gep.species.symbolset.function.3.weight
Generation 1
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x
(x*x) + y
Found Ideal Individual
The warnings about weight for the functions can be ignored.
There is an option to identify weights for the functions so that the ones with
higher weights are selected more often when building and evolving genes. The
default of 1 was used by not specifying the weights. In the code we chose to
display the best individual when it was found and thus the output showing
the details for the chromosome with an ideal fitness of 1000. It identifies the
linking function used (+), the contents of the 2 genes (in Karva format ... see
the gep documentation in papers by Ferreira, etc.) and the expression of the
chromosome in a mathematical formula:
(x*x) + y
This is clearly the solution we were looking for. You
will also find the statistics for the run in the file test1.stat. This should
contain the following information:
Seed(s) used in this job: 4357
Problem Type: functionfinding
Generation: 0
Best Individual:
Evaluated: T
Fitness: 274.6781
Linking function: +
Gene 0
-.x.-.y.y.y.y.x.y.y.y
Gene 1
+.x.y.y.-.x.y.y.y.y.x
(x-(y-y)) + (x+y)
Generation: 1
Best Individual:
Evaluated: T
Fitness: 1000.0<small><code>
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x
(x*x) + y
Best Individual of Run:
Found at Generation: 1
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x
(x*x) + y
Size of program: 4
Variables used(count): x(2), y(1)
***** TRAINING data results *****
Statistics:
MSE: 0.0 RAE:
0.0 RSE: 0.0
RMSE: 0.0 MAE:
0.0 RRSE: 0.0
Observed Computed
1.25 1.25
3.0 3.0
8.0 8.0
21.0 21.0
The output shows
the best individual found in each generation. There were only 2 generations
needed to find a solution to this problem since it was a simple one. Notice
that in the first generation (generation 0), the best solution found was
(x-(y-y) + (x+y). It had a fitness value of 274.6781, well below the perfect
value of 1000 that we get with the best solution of generation 1. The
statistics also provide some other information. The size of the best solution
found during the run was 4. This means that there were 4 symbols used in the
expression (excluding the linking function). So we had 2 x terminal symbols, 1
y terminal symbol and the * function symbol in the solution. We also display some
statistical values including mean squared error (MSE), root mean squared error
(RMSE), etc. Finally we display the values expected or observed (from our
training data) and the values produced by the best model in the run, in this
case a perfect model.
Note that in more
recent versions of ECJ-GEP you can request that the mathematical expressions
found be simplified. In which case you would see in the results and statistics
file something like:
MATH
(y+(x*x))
MATH
(SIMPLIFIED)
x^2+y
But so note that in most cases it is best to do your experiments without simplification turned on since the software system used to do the simplification of expressions (meditor) has been know to cause crashes or non terminating loops for some expressions. Also the meditor.jar file must be in the ‘start’ directory. To turn on simplification the following line must be in the parameter file:
pop.subpop.0.species.ind.simplify-expressions = true
So, that
gep.species.symbolset.terminalsize = 1
gep.species.symbolset.terminal.0 = x
gep.species.symbolset.functionsize = 4
gep.species.symbolset.function.0 = Add
gep.species.symbolset.function.0.weight = 1
gep.species.symbolset.function.1 = Sub
gep.species.symbolset.function.1.weight = 1
gep.species.symbolset.function.2 = Mul
gep.species.symbolset.function.2.weight = 1
gep.species.symbolset.function.3 = Div
gep.species.symbolset.function.3.weight = 1
Here we
package ec.gep.test;
import ec.*;
import ec.gep.*;
import ec.simple.*;
public class test2 extends GEPProblem implements SimpleProblemForm
{
public static final double IDEAL_FITNESS_MINIMUM =
0.9999999999;
public double xvalues[] = { 9.500366, -6.130432,
3.252685, 7.88797, 9.090484,
1.485199,
-3.950531, 10.003326, -0.607453, 5.469299};
public double zvalues[] = { 9103.54918799079,
1213.47815867463, 160.18146840485,
4432.23529247904, 7671.79392962917, 11.8327154232663, 193.570365560718,
11124.3786278048, -0.326443121119579, 1093.78835980998}; // x^4 + x^3 +
x^2 + x
public double[] getDataValues( String label )
{
if (label.equals("x"))
return
(xvalues);
else if
(label.equals("dependentVariable")) // always called
return
(zvalues);
else
return
null;
}
public void evaluate(EvolutionState state, Individual
ind, int threadnum)
{
if (!ind.evaluated) //
don
{
//
Mean Squared Error (MSE) fitness is normalized between 0 and 1000 (1000 *
(1/(1+MSE))
double
fitness = GEPFitnessFunction.MSEfitness((GEPIndividual)ind);
// the
fitness better be SimpleFitness!
SimpleFitness f = (true, (SimpleFitness)ind.fitness);
f.setFitness(state,(float)fitness,
fitness >= IDEAL_FITNESS_MINIMUM);
ind.evaluated = true;
if (fitness >=
IDEAL_FITNESS_MINIMUM)
{
((GEPIndividual)
}
}
}
}
The main
difference here is that we have to provide a method getDataValues to supply the
data values for the terminal symbols identified in the param file. In this case
that is the x values. The ECJ system will call this function during
initialization to get the values, passing the symbol "x" as an
argument. The user program must provide a double[]
result. In the previous example the data was in a file and the first row had
the symbols for the independent variable and the dependent variable. In the
parameter file we don
Seed(s) used in this job: 4357
Problem Type: functionfinding
Generation: 0
Best Individual:
Evaluated: T
Fitness: 0.004819467
Linking function: +
Gene 0
-.-.-.x.*.*.-.x.x.x.x.x.x.x.x.x.x
Gene 1
*.+.-.x./.x././.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*.x.x.x.x.x.x.x.x.x.x.x
((x-(x*x))-((x*x)-(x-x))) + ((x+((x/x)/x))*(x-(x/x))) +
((x+(x*x))*((x*x)+x))
Generation: 1
Best Individual:
Evaluated: T
Fitness: 410.7618
Linking function: +
Gene 0
-.-.*.x.*.*.-.x.x.x.x.x.x.x.x.x.x
Gene 1
*.+.-.x./.x././.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x
((x-(x*x))-((x*x)*(x-x))) + ((x+((x/x)/x))*(x-(x/x))) +
((x+(x*x))*((x*x)+(x/x)))
Generation: 6
Best Individual:
Evaluated: T
Fitness: 500.0
Linking function: +
Gene 0
-.-.*.x.x.x.x.x.x.x.x.x.x.x.x.x.x
Gene 1
*.+./.x./.x././.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x
((x-x)-(x*x)) + ((x+((x/x)/x))*(x/(x/x))) + ((x+(x*x))*((x*x)+(x/x)))
Generation: 14
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
-.-.*.x.x.x.x.x.x.x.x.x.x.x.x.x.x
Gene 1
*.x.x.x.x.x.x.+.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x
((x-x)-(x*x)) + (x*x) + ((x+(x*x))*((x*x)+(x/x)))
Best Individual of Run:
Found at Generation: 14
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
-.-.*.x.x.x.x.x.x.x.x.x.x.x.x.x.x
Gene 1
*.x.x.x.x.x.x.+.x.x.x.x.x.x.x.x.x
Gene 2
*.+.+.x.*.*./.x.x.x.x.x.x.x.x.x.x
((x-x)-(x*x)) + (x*x) + ((x+(x*x))*((x*x)+(x/x)))
Size of program: 23
Variables used(count): x(13)
***** TRAINING data results *****
Statistics:
MSE: 7.182716450022835E-24
RAE: 4.832470396077297E-16
RSE: 4.344516272977063E-31
RMSE: 2.680059038533076E-12
MAE: 1.7715995337397316E-12 RRSE:
6.591294465411982E-16
Observed Computed
9103.54918799079 9103.549187990786
1213.47815867463 1213.4781586746294
160.18146840485 160.18146840484988
4432.23529247904 4432.235292479036
7671.79392962917 7671.793929629167
11.8327154232663 11.832715423266338
193.570365560718 193.57036556071805
11124.3786278048 11124.378627804794
-0.326443121119579 -0.3264431211195794
1093.78835980998 1093.7883598099795
In this case it
took 14 generations to solve the problem with an ideal (fitness = 1000)
solution. The solution could use some simplification so that we see the first
term is -x*x, the second is x*x (which cancel each other) and the third is
(x+x*x)(x+1) or x^4+x^3+x^2+x, the desired solution.
The errors in the statistics are not exactly zero but are quite small. This is
due to loss of precision as the complex (not simplified) expression is
evaluated. It also demonstrates why we use double and not float for all
calculations. One final thing to note is that we don
Again, if
simplification of math expressions was turned on you would see something like
the following in the output:
MATH
((x-x)-(x*x)) + (x*x) +
((x+(x*x))*((x*x)+(x/x)))
MATH (SIMPLIFIED)
x+x^2+x^3+x^4
Test3 is just test1 again, but this time we do not have to supply a Java
program at all, we
# Problem
#===============================
eval.problem = ec.gep.GEPDefaultUserProg
eval.problem.fitness-function = RH
eval.problem.fitness-function-arg0 = 2.0
Instead of
specifying some user program like ec.gep.test.test3 we use
ec.gep.GEPDefaultUserProg. But because we do this we need to identify which
fitness function to use. There are more than 25 fitness functions provided and
most require a single argument, the individual being evaluated. Some require
one or 2 extra numeric arguments. In the case of test3 we
100*(abs(expected-computed)/expected) < precision
The maximum number of hits possible is the number of sets of training data. The GEPDefaultUserProg uses the RH function to determine fitness. Check out the GEPFitnessFunction class to see the selection of functions available and to add your own if so inclined. So for test3 we only had to create a parameter file and a testing data file. The results of the run should look something like:
Seed(s) used in this job: 4357
Problem Type: functionfinding
Generation: 0
Best Individual:
Evaluated: T
Fitness: 3.0
Linking function: +
Gene 0
*.-.-.+.y.y.x.x.y.x.x
Gene 1
-.y.-.y.y.y.y.y.x.y.x
(((x+y)-y)*(y-x)) + (y-(y-y))
Generation: 1
Best Individual:
Evaluated: T
Fitness: 4.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x
(x*x) + y
Best Individual of Run:
Found at Generation: 1
Evaluated: T
Fitness: 4.0
Linking function: +
Gene 0
*.x.x.y.y.x.x.y.x.x.x
Gene 1
y./.x.+.+.x.x.x.x.x.x
(x*x) + y
Size of program: 4
Variables used(count): x(2), y(1)
***** TRAINING data results *****
Statistics:
MSE: 0.0 RAE:
0.0 RSE: 0.0
RMSE: 0.0 MAE:
0.0 RRSE: 0.0
Observed Computed
1.25 1.25
3.0 3.0
8.0 8.0
21.0 21.0
As you can see
the final result is the same, we get the x*x+y equation that the data supports.
The maximum fitness value in this case was 4, since there were only 4 training
sets of data.
This concludes the introduction to using the ECJ GEP extension. The next
section identifies a number of other examples that have been implemented,
showing more complex examples from the four types of problems: function
finding, logical, classification and time series. We
These examples are similar to the test1, test2 and test3 examples we explored
in the previous section. We
1. Acot Example - this is an attempt to discover the ArcCotangent
function using a set of 100 training cases. There is only 1 gene, 30
individuals in the population and the RRSE (Root Relative Squared Error) fitness
function is used. There is one independent variable, d0, specified in the
Acot.param file. There are 17 functions used (+, -, *, /, Sqrt, Exp,
Floor, Ceiling, Abs, Neg, Sin,
gep.species.use-constants
= true
gep.species.numconstantspergene = 2
gep.species.integer-constants
= false
gep.species.constants-lowerlimit = -10
gep.species.constants-upperlimit = 10
gep.species.rnc-mutation-prob
= 0.01
gep.species.dc-mutation-prob
= 0.044
gep.species.dc-inversion-prob
= 0.1
gep.species.dc-istransposition-prob = 0.1
First we identify that we are going to use constants and that there will be 2
constants allocated per gene. We are using floating point constants since the
integer-constants parameter is set to false. The values of the constants are
allowed to range between -10.0 and +10.0. The other parameters relate to how
constants evolve from generation to generation. This again is covered in
Ferreira
After running the example we get the output in the Acot.stat file.
Seed(s) used in this job: 4357
Problem Type: functionfinding
Generation: 0
Best Individual:
Evaluated: T
Fitness: 499.85913
Linking function: +
Gene 0
ceil.tan.C1.tan.d0.C0.tan.acos.d0.d0.d0.C1.C1.C1.d0.C1.C1
C0: -8.147180372252937
C1: 5.652730175877368
ceiling(tan(5.652730175877368))
Generation: 10
Best Individual:
Evaluated: T
Fitness: 503.19714
Linking function: +
Gene 0
/.cos.C1.tan.tan.sin.d0.d0.d0.C1.C0.C1.C0.C0.d0.C0.d0
C0: -1.9238205968468058
C1: 5.576872726631521
(cos(tan(tan(sin(d0))))/5.576872726631521)
Generation: 14
Best Individual:
Evaluated: T
Fitness: 614.31683
Linking function: +
Gene 0
/.cos.d0.floor.C1.d0.floor.d0.d0.C1.C0.C1.C1.C0.d0.C1.d0
C0: -8.147180372252937
C1: 5.652730175877368
(cos(floor(5.652730175877368))/d0)
Generation: 21
Best Individual:
Evaluated: T
Fitness: 873.75574
Linking function: +
Gene 0
sin././.cos./.neg.*.floor.d0.C1.C0.C0.C1.C1.d0.d0.d0
C0: -8.147180372252937
C1: 5.652730175877368
sin((((floor(5.652730175877368)/d0)/(-5.652730175877368))/cos(((-8.147180372252937)*(-8.147180372252937)))))
Generation: 366
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
atan././.d0.d0.d0./.C1.d0.d0.C1.C0.C0.d0.C0.C1.C1
C0: -8.147180372252937
C1: -0.24815065988408236
atan(((d0/d0)/d0))
Best Individual of Run:
Found at Generation: 366
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
atan././.d0.d0.d0./.C1.d0.d0.C1.C0.C0.d0.C0.C1.C1
C0: -8.147180372252937
C1: -0.24815065988408236
atan(((d0/d0)/d0))
Size of program: 6
Variables used(count): d0(3)
***** TRAINING data results *****
Statistics:
MSE: 6.5223928682607005E-31
RAE: 1.4189629717074661E-15
RSE: 5.2194614261805544E-30
RMSE: 8.076133275436148E-16
MAE: 3.738676035425215E-16 RRSE:
2.284614065040429E-15
Observed
Computed
-0.204002518946359 -0.2040025189463591
-0.339365885046348 -0.3393658850463476
-0.387522092113786 -0.3875220921137857
-0.23995309898349 -0.23995309898349046
0.515703851271719 0.5157038512717189
0.310364838997948 0.31036483899794765
...
0.146141817137733 0.14614181713773297
-0.124890152806863 -0.1248901528068634
-0.182368282843355 -0.18236828284335482
0.160813070832659 0.1608130708326594
0.139880646214395 0.13988064621439456
It took 366 generations in this case (with the seed provided) and found the
expression atan(((d0/d0)/d0)) which is equivalent to atan(1/d0), the definition
of acot(d0). There were improvements in the best fitness value at generations
10, 14, 21, and 366. So there was a lot of exploration of models through
evolution to improve between generation 21 and 366.
2. Data mining example - This is an interesting variation of the test2
problem described above. This high-dimensional toy problem illustrates the use
of a great deal of extraneous information being filtered during evolution.
In this example 15 out of 16 independent variables are meaningless. As in
test2 the target function is the quartic polynomial f(x) = x^4+x^3+x^2+x. The
data contained 161 sets with 15 independent variable values plus the dependent
variable values. As we can see from the (partial) output shown below, it takes
1596 generations (with the given seed as the starting point) to get to the best
solution.
Seed(s) used in this job: 4357
Problem Type: functionfinding
Generation: 0
Best Individual:
Evaluated: T
Fitness: 459.41364
Linking function: +
Gene 0
+./.-.*.*./.-.*.d0.d2.d15.d13.d4.d12.d10.d6.d6
Gene 1
+.d6.*.-.+.d5./.*.d3.d12.d3.d13.d0.d8.d14.d12.d7
Gene 2
-.*.-./.-.d8.*./.d0.d13.d8.d7.d12.d0.d11.d4.d5
((((d6*d6)*d0)/(d2*d15))+((d13/d4)-(d12-d10))) +
(d6+((d5-(d12/d3))*((d13*d0)+d3))) + ((((d0/d11)/d0)*(d13-d8))-(d8-(d7*d12)))
Generation: 3
Best Individual:
Evaluated: T
Fitness: 461.31992
Linking function: +
Gene 0
+.-.-.*.*./.-.*.d0.d2.d15.d13.d4.d12.d10.d6.d6
Gene 1
+.d6.*.-.+.d5./.*.d3.d12.d3.d13.d0.d8.d14.d12.d7
Gene 2
-.*.-./.-.d8.*./.d0.d13.d8.d7.d12.d0.d11.d4.d5
((((d6*d6)*d0)-(d2*d15))+((d13/d4)-(d12-d10))) + (d6+((d5-(d12/d3))*((d13*d0)+d3)))
+ ((((d0/d11)/d0)*(d13-d8))-(d8-(d7*d12)))
...
Generation: 1347
Best Individual:
Evaluated: T
Fitness: 998.6959
Linking function: +
Gene 0
+.*.d6.*.d0.d0.d0.-.d5.d12.d7.d5.d8.d4.d10.d5.d4
Gene 1
-.*.d1.d0.+.*.d0.*.d0.d0.d0.d10.d11.d8.d10.d5.d10
Gene 2
+.-.d0.d1.+.d6.d1.d1.d3.d6.d0.d8.d5.d8.d0.d2.d14
(((d0*d0)*d0)+d6) + ((d0*(((d0*d0)*d0)+d0))-d1) + ((d1-(d6+d1))+d0)
Generation: 1596
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
+.-.d0.d1.d6.-.d1.*.d0.d12.d2.d9.d5.d6.d9.d12.d12
Gene 1
-.*.d1.+.d0.*.d0.*.d0.d0.d0.d7.d5.d15.d9.d12.d1
Gene 2
+.*.d6.*.*./.d0.d0.d0.d0.d0.d10.d5.d8.d5.d8.d7
((d1-d6)+d0) + (((((d0*d0)*d0)+d0)*d0)-d1) + ((((d0/d0)*d0)*(d0*d0))+d6)
Best Individual of Run:
Found at Generation: 1596
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
+.-.d0.d1.d6.-.d1.*.d0.d12.d2.d9.d5.d6.d9.d12.d12
Gene 1
-.*.d1.+.d0.*.d0.*.d0.d0.d0.d7.d5.d15.d9.d12.d1
Gene 2
+.*.d6.*.*./.d0.d0.d0.d0.d0.d10.d5.d8.d5.d8.d7
((d1-d6)+d0) + (((((d0*d0)*d0)+d0)*d0)-d1) +
((((d0/d0)*d0)*(d0*d0))+d6)
Size of program: 27
Variables used(count): d0(11), d1(2), d6(2)
***** TRAINING data results *****
Statistics:
MSE: 7.99571803020474E-23
RAE: 1.0804622056054273E-15 RSE:
4.6576363300204115E-30
RMSE: 8.941877895724556E-12 MAE:
3.3320664912135643E-12 RRSE:
2.1581557705643983E-15
Observed Computed
1418.74707069507 1418.7470706950726
75.3238688522697 75.3238688522697
11019.4252335609 11019.425233560933
1526.50067316805 1526.500673168051
11309.522753932 11309.522753931999
6755.18648304066 6755.186483040658
...
537.603328324592 537.603328324592
533.236668818159 533.2366688181589
3742.26191483226 3742.261914832265
6664.98407320731 6664.984073207315
8312.11655791031 8312.116557910313
You
d1-d6+d0 + d0^4+d0^2-d1
+ d0^3+d6
or
d0 + d0^4 + d0^2 +
d0^3
the expected solution. There is a second version of the program,
dataMining2 that has the data in a CSV text file.
3. Production example - In this example we have real-world production
data from 446 firms. The goal is to explain the Production as a function of
Labor, Material, and Capital. There is nothing new in the parameter file that
we haven
Best Individual of Run:
Found at Generation: 2999
Evaluated: T
Fitness: 878.4472
Linking function: +
Gene 0
/.*./.-.-./.material./.material.capital.material.C0.labour.material.material.C0.C0
C0: 7.249937058678704
C1: -0.5794351361736538
Gene 1
material.material.material./.C0.+.labour.labour.capital.capital.labour.C1.material.material.capital.capital.C1
C0: 3.697751463839939
C1: 2.8929911331385565
Gene 2
*./././.-.C1./.-.material.C0.material.C1.material.material.labour.labour.labour
C0: -3.2099741094356737
C1: -0.49638400660778004
((((material/material)-material)*(capital-material))/((7.249937058678704/labour)/material))
+ material +
((((material-labour)/material)/((-3.2099741094356737)-material))*((-0.49638400660778004)/((-0.49638400660778004)/material)))
Size of program: 31
Variables used(count): labour(2), material(10), capital(1)
4. Analog Circuit Design Example - The goal is to find the transfer
function expressing the yield of an analog circuit in terms of three
parameter tolerances. The training set, consisting of 40 pairs of
tolerances and their corresponding yields, was obtained from n runs of
Zielinski, L. and J. Rutkowski, 2004. Design Tolerancing with Utilization of Gene
Expression Programming and Genetic Algorithm. In Proceedings
of the International
Conference on Signals and Electronic Systems,
Again this is similar to other function finding problems we
Classification examples we haven
1. Iris Virginica Example - In this problem the goal is to classify
three different types of irises based on four measurements: sepal length,
sepal width, petal length, and petal width. The original iris dataset
contains fifty examples each of three types of iris:
Iris setosa, Iris versicolor, and Iris virginica. Here the sub-problem
Virginica versus Not Virginica is analyzed, where 100 randomly chosen
samples are used for training and the remaining 50 for testing. The new
things in the parameter file (IrisVirginica.param) are shown below.
# Problem type must be one of:
functionfinding, classification, timeseries, logical
# Set default to be
gep.species.problemtype = classification
# if the problem is a classification type problem a threshold value (used
# to convert real values to 0 or 1 during fitness calculations) must be
# specified (do not specify this unless it IS a classification type
problem).
gep.species.classification-threshold = 0.5
gep.species.symbolset.terminalfilename =
ec/app/gep/IrisVirginica/IrisVirginica.txt
gep.species.symbolset.testingdatafilename =
ec/app/gep/IrisVirginica/IrisVirginicaTest.txt
We see that the problem type is set to classification.
With classification problems we expect to see a classification threshold
specified, In this case the value is set to 0.5. The data is in CSV files. The
training data set is in IrisVirginica.txt. There are 100 sets of data in the
file along with the first record that holds the names of the terminals
(variables) as d0, d1, d2, and D.V. We see also that there is a testing data
set file specified as IrisVirginicaTest.txt. This file holds the 50 testing sets
of data and does not have the names of the variables in the first row!
When the best model is determined from the run, the gep system will
automatically access the test data and compare the results of using the model
with this data against the expected values. This will help in the assessment of
the quality of the model that was chosen. Let
Seed(s) used in this job: -1368642168
Problem Type: classification
Classification rounding threshold: 0.5
Generation: 0
Best Individual:
Evaluated: T
Fitness: 801.24774
Linking function: +
Gene 0
*./.+.+.*./.-.d2.d3.d0.d1.d3.d3.d3.d1.d3.d3
C0: 4.19396659531718
C1: 4.2865486803954775
Gene 1
+./.d3.-.+./.d2.d1.d1.d3.d0.C1.d2.d1.C1.d2.d1
C0: 2.4700381059064878
C1: 8.534256473201435
Gene 2
/.d1.*.-.*.d1.d2.d1.d0.d2.d2.d0.C1.d1.d0.C0.C0
C0: 4.031580228177383
C1: 1.4026131599950642
(((d2+d3)/(d0*d1))*((d3/d3)+(d3-d1))) + ((((d3/d0)-d2)/(d1+d1))+d3) +
(d1/((d1-d2)*(d1*d0)))
...
Generation: 9
Best Individual:
Evaluated: T
Fitness: 969.69696
Linking function: +
Gene 0
-.d3./.+.d2.*./.d0.d1.d1.d1.d1.d3.d1.d2.d2.d2
C0: 0.2103929964396789
C1: 4.2865486803954775
Gene 1
-.*./.-.d3.d3.C1.d2.d1.d3.d2.C1.d0.d2.d3.d2.d1
C0: 2.4700381059064878
C1: 8.534256473201435
Gene 2
/.d1.*.-.*.d1.d2.d1.C1.d1.d3.d3.C0.d3.d0.C1.C0
C0: 4.031580228177383
C1: 1.4026131599950642
(d3-(((d0*d1)+(d1/d1))/d2)) + (((d2-d1)*d3)-(d3/8.534256473201435)) +
(d1/((d1-d2)*(d1*1.4026131599950642)))
Generation: 421
Best Individual:
Evaluated: T
Fitness: 984.8485
Linking function: +
Gene 0
-.-././.d3.d0.d3.*.d1.d1.d2.C0.d2.d3.C0.d2.d2
C0: 9.274259134965515
C1: 4.873682980602775
Gene 1
-.d3./.+.d2.*.C1.d0.d1.d2.d2.d2.d2.C0.d2.C1.d0
C0: 3.9855714658085004
C1: 8.534256473201435
Gene 2
-.d2././.d2.d2.C1.-.C0.C0.d1.d0.d0.d0.C1.d3.d0
C0: 1.8621216876649604
C1: 6.2762163390852646
((((d1*d2)/d1)-d3)-(d0/d3)) + (d3-(((d0*d1)+8.534256473201435)/d2)) +
(d2-((d2/6.2762163390852646)/d2))
Best Individual of Run:
Found at Generation: 421
Evaluated: T
Fitness: 984.8485
Linking function: +
Gene 0
-.-././.d3.d0.d3.*.d1.d1.d2.C0.d2.d3.C0.d2.d2
C0: 9.274259134965515
C1: 4.873682980602775
Gene 1
-.d3./.+.d2.*.C1.d0.d1.d2.d2.d2.d2.C0.d2.C1.d0
C0: 3.9855714658085004
C1: 8.534256473201435
Gene 2
-.d2././.d2.d2.C1.-.C0.C0.d1.d0.d0.d0.C1.d3.d0
C0: 1.8621216876649604
C1: 6.2762163390852646
((((d1*d2)/d1)-d3)-(d0/d3)) + (d3-(((d0*d1)+8.534256473201435)/d2)) +
(d2-((d2/6.2762163390852646)/d2))
Size of program: 27
Variables used(count): d0(2), d1(3), d2(5), d3(3)
***** TRAINING data results *****
Confusion Matrix:
Predicted Value
| Yes | No
|----------------
Yes| 34 | 0
Actual Value |----------------
No
| 1 | 65
|----------------
***** TEST data results *****
Number of Calculation Errors: 0 out of 50 test sets
Confusion Matrix:
Predicted Value
| Yes | No
|----------------
Yes| 16 | 0
Actual Value |----------------
No
| 0 | 34
|----------------
Observed Computed
1.0 1.0
1.0 1.0
0.0 0.0
...
We see from the
2. Breast Cancer Example - In this example of
diagnosis of breast cancer the task is to classify a tumor as either benign (0)
or malignant (1) based on nine different cell analysis (clump thickness,
uniformity of cell size, uniformity of cell shape, marginal adhesion,
single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and
mitoses). Data obtained from PROBEN1 (Prechelt, L., 1994. PROBEN1 - A set of
neural network benchmark problems and benchmarking rules. Technical Report 21/94, Univ.
Nothing new here except that there is a large amount of data (training and testing). Run the problem or check out the results in BreastCancer.stat.
3. DNA MicroArrays Example - The objective of this
problem is to distinguish between 2 types of leukemia, ALL (0) and AML
(1). There are 38 training data sets and 34 testing data sets. The major
difference of this example is the number of attributes. There are 7129
independent variables! One minor change from the other examples is that we
specified a tab as the separator in the data files rather than the default
comma (gep.species.symbolset.terminalfileseparator = tab). The data sets are
available at:
http://sdmc.lit.org.sg/GEDatasets/Data/ALL-AML_Leukemia.zip. Note that
these files have very long lines and many editors will have difficulty reading
or editing them. With the default seed value an ideal solution is found after
2924 generations. But as you can see from the output the model did not perform
well with the test data. A good model for the training data doesn
Best Individual of Run:
Found at Generation: 2924
Evaluated: T
Fitness: 1000.0
Linking function: +
Gene 0
-.d2452.*.d4739.*.*.+.*.d4572.d4791.d1108.d4210.d670.d4458.d1985.d5621.d5374
Gene 1
+.d6935.*.+.+.*.+.d4555.d221.d1529.d3911.d2841.d1838.d2397.d4492.d6959.d1045
Gene 2
*.-.*.*.d6801.-.*.*.d11.d749.d6727.d2841.d7054.d5751.d6694.d436.d6498
(d2452-(d4739*(((d4210*d670)*d4572)*(d4791+d1108)))) +
(d6935+(((d1529*d3911)+(d2841+d1838))*(d4555+d221))) +
((((d5751*d6694)*d11)-d6801)*((d749-d6727)*(d2841*d7054)))
Size of program: 41
Variables used(count): d11(1), d221(1), d670(1), d749(1), d1108(1),
d1529(1), d1838(1), d2452(1), d2841(2), d3911(1), d4210(1), d4555(1), d4572(1),
d4739(1), d4791(1), d5751(1), d6694(1), d6727(1), d6801(1), d6935(1), d7054(1)
***** TRAINING data results *****
Confusion Matrix:
Predicted Value
| Yes | No
|----------------
Yes| 11 | 0
Actual Value |----------------
No
| 0 | 27
|----------------
***** TEST data results *****
Number of Calculation Errors: 0 out of 34 test sets
Confusion Matrix:
Predicted Value
| Yes | No
|----------------
Yes| 4 | 10
Actual Value |----------------
No
| 2 | 18
|----------------
Also see the publication:
"Molecular Classification of Cancer: Class Discovery
and Class Prediction by
Gene Expression Monitoring". Science, 286:531-537,
October 1999.
The special thing about time
series problems (as found in gep) is that the data is a single sequence of
data. From a training sequence of time series data we
timeseries-delay - This integer value determines how to use the values
in the timeseries data. If the value is 1, then use every time series value, if
it is 2, then use every other one, etc.
timeseries-embeddingdimension - This
integr value determines the number of timeseries points to use as
independent variables when transforming the set of time series data.
Another data point is used as the dependent variable value. So the time
series
timeseries-testingpredictions - an integer that
specifies the number of sets of data to devote to testing.
So if we had the time series data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
and delay was 1 and timeseries-embeddingdimension was 4 then we
iv1 iv2 iv3 iv4 dv
1
2 3 4 5
2
3 4 5 6
3
4 5 6 7
. . .
14 15
16 17 18
15 16
17 18 19
16 17
18 19 20
17 18
19 20 21
If delay was 2 then 7 sets would be formed as:
iv1 iv2 iv3 iv4 dv
1
3 5 7 9
3
5 7 9 11
. . .
9 11
13 15 17
11 13
15 17 19
13 15
17 19 21
Of course the user can provide the data already preprocessed into the 2
data sets as we have done before, but that could require a lot more effort. The
timeseries data file must not have any information regarding the names of the
independent variables, since these are automatically created.
1. Dow Jones Example - This example looks at the monthly closings of the
Dow-Jones industrial index, Aug. 1968 - Aug. 1981. (Source: Hipel and
Mcleod (1994). Time Series Modelling of Water Resources and
Environmental Systems, Elsevier. The original time series data file DJ.DAT
came from:
Hyndman, R.J. (n.d.) Time Series Data Library,
http://www.robhyndman.info/TSDL/.
Accessed on June 3, 2006.
For the integral time series used in this example, each value from the original
raw timeseries was replaced by its moving average using a smoothing period p =
10.
Of interest in the example is that there are 40 functions used to build the
models and that the 3 timeseries values are set to:
gep.species.timeseries-delay = 1
gep.species.timeseries-embeddingdimension = 10
gep.species.timeseries-testingpredictions = 10
Some of the statistics output:
Seed(s) used in this job: 4357
Problem Type: timeseries
Generation: 0
Best Individual:
Evaluated: T
Fitness: 862.8322
Linking function: +
Gene 0
gau.comp.*.v6.C1.*.v7.mul3.v6.C0.v3.v3.v6.v8.v8.v3.v5.v6.v6.v9.v6.v1.C0.v0.v4.v2.v0.v2.v9.v3.C1.v8.v9
C0: -1.8928414198253574
C1: -9.528899601444923
Gene 1
max2.X3.v9.C0.exp.-.v0.v9.v0.v8.v3.v9.v6.v2.v1.v7.v6.v6.v0.v2.v3.v7.v6.v7.v3.v3.v1.v8.v7.v3.v8.v3.v7
C0: 5.353542014548175
C1: -9.236307382044629
Gene 2
pi.v4.X5.v0.*.comp.min2.v7.v3.C1.v9.v3.v5.v1.v2.v9.v7.C0.v7.v8.v6.v5.v3.v8.v9.v1.v8.v0.v9.v2.v5.v0.v1
C0: 9.111412390259325
C1: 9.94532215973475
exp(-pow((1-(v6*(-9.528899601444923))),2)) +
max((5.353542014548175^3),v9) + pi
Generation: 2
Best Individual:
Evaluated: T
Fitness: 866.1386
Linking function: +
Gene 0
gau.comp.*.v7.C1.*.v7.mul3.v6.C0.v4.v3.v6.v8.v8.v3.v5.v6.v6.v9.v6.v5.C0.v0.v4.v2.v0.v2.v9.v3.C1.v8.C1
C0: -1.8928414198253574
C1: -9.528899601444923
Gene 1
max2.X3.v9.C0.exp.-.v0.v9.v0.v8.v3.v9.v6.v2.v1.v7.v6.v6.v0.v2.v3.v7.v6.v7.v3.v3.v1.v8.v7.v3.v8.v3.v7
C0: 5.353542014548175
C1: -9.236307382044629
Gene 2
gau.comp.*.v6.C1.*.v7.mul3.v6.C0.v1.v2.v6.v8.v3.v3.v5.v2.v6.v9.v6.v1.C1.v0.v4.v2.v0.v2.v9.v3.C0.v8.v9
C0: 9.111412390259325
C1: 9.94532215973475
exp(-pow((1-(v7*(-9.528899601444923))),2)) +
max((5.353542014548175^3),v9) + exp(-pow((1-(v6*9.94532215973475)),2))
...
Generation: 238
Best Individual:
Evaluated: T
Fitness: 934.1794
Linking function: +
Gene 0
-.v9.v8.sech.v1.v7.v9.v4.v7.C1.v2.v0.v3.v8.v3.v1.v5.v8.v8.v2.v8.v6.v2.v9.v2.v2.v1.v7.v8.v7.v0.v0.C0
C0: -1.8928414198253574
C1: 4.140535916050434
Gene 1
sin.min3.avg2.v2.v9.v8.v5.v0.v5.v8.v2.v9.v8.v8.v3.v0.v5.v5.v3.C1.v2.v0.v6.v9.v9.v2.v7.v5.C1.v8.v7.v0.v6
C0: 7.570619662652462
C1: -1.542568610489699
Gene 2
max2.v9.v9.acsc.v7.v8.add3.tan.v4.v0.v0.v0.v8.C1.v1.v3.v5.v2.v8.v2.v5.v9.v2.v5.v4.v1.v2.v8.v7.v6.v0.v5.C0
C0: -9.226903090150953
C1: 0.7626090735460895
(v9-v8) + sin(min(((v8+v5)/2),v2,v9)) + max(v9,v9)
...
Best Individual of Run:
Found at Generation: 4293
Evaluated: T
Fitness: 936.5758
Linking function: +
Gene 0
-.v9.v8.v0.v9.v7.acsc.v8.v2.v0.v3.v2.C1.v4.C0.v5.v9.v3.v5.v0.v2.v7.v7.v5.v9.v6.v2.v9.v2.v3.v9.v5.v5
C0: 5.233037180754705
C1: -7.143261425650365
Gene 1
sin.min3.mul3.v2.v6.cot.v6.mul4.v3.v6.v9.v8.v9.v5.v3.v0.v0.v0.v9.v5.v6.v0.v8.v4.v2.v8.v5.v4.v1.v0.C1.v5.v5
C0: 3.8794116272712955
C1: 7.460702927280668
Gene 2
v9.*.v8.v9.v8.v2.v7.*.v6.v4.v2.v7.v1.v9.v8.v8.v2.v6.v1.v8.v3.C0.v3.v3.v2.C0.C0.v9.v4.v2.v2.v7.v0
C0: 1.24681550366709
C1: -0.13544436934751047
(v9-v8) + sin(min((cot(v3)*v6*(v6*v9*v8*v9)),v2,v6)) + v9
Size of program: 17
Variables used(count): v2(1), v3(1), v6(3), v8(2), v9(4)
***** TRAINING data results *****
Statistics:
MSE: 24.239096329510474
RAE: 0.06769687388092159
RSE: 0.004585892048274829
RMSE: 4.923321676420349 MAE:
4.017510790472716 RRSE:
0.06771921476416298
***** TEST data results *****
Number of Calculation Errors: 0 out of 10 test sets
Statistics:
MSE: 16.72184789340559
RAE: 0.11520996098236828 RSE:
0.019535079218796565
RMSE: 4.089235612361508 MAE:
3.011454736524365 RRSE:
0.13976794775196696
Observed Computed
811.63994 808.827861961812
797.92192 797.2229981918084
786.41891 785.1324338242816
771.09691 774.3511732579256
763.19992 756.108499923999
...
858.957 853.8591859872175
869.964 871.6444114907323
878.539 880.63390351821
890.288 886.6045236016611
900.373 901.3867552973694
916.525 910.849758599105
932.277 932.416393052961
947.579 947.2086536125466
960.562 962.9643013882095
966.205 973.9101611582124
970.634 970.9536648254193
972.626 974.1782314966911
968.324 975.5769759596939
2. Exchange Rate Example - This example follows the exchange rate of the
Australian dollar against the US dollar. The original time series data file
USEXCHM.DAT can be obtained from:
Hyndman, R.J. (n.d.) Time Series Data Library,
http://www.robhyndman.info/TSDL/.
Accessed on June 3, 2006.
For the integral time series used in this run, each value from the last 100
records of the original raw series was replaced by its moving average
using a smoothing period p = 12.
3. Earthquake Data Example - This example follows the number of
earthquakes per year of magnitude 7.0 or greater from 1900 to 1998. The source
is the US National Earthquake Information Center and the original time series
data file EARTHQ.DAT can be obtained from:
Hyndman, R.J. (n.d.) Time Series Data Library,
http://www.robhyndman.info/TSDL/. Accessed on June 3, 2006.
For the integral time series used in this run, each value from original raw
series was replaced by its moving average using a smoothing period p = 10.
4. Sunspost Data Example - This example uses the Wolfer sunspots data
that can be found at http://mason.gmu.edu/~csutton/sun554.html or do a search
on the internet. For this example we have 3 variations to show how the
timeseries files can be accessed. The Sunspots.param variation gets the data
from a timeseries text file, Sunspots.txt. Sunspots2.param gets the data from a
user program. Sunspots3.param get the data from a text file but it is
preprocessed already, so one should not specify the delay, or other 2
parameters that are required when the data must be processed.
Logical problems have data values that are all logical,
i.e. true or false (represented by 1 and 0 in gep). They can use only logical
functions such as and, or, not, etc. and the linking
function must be a logical function as well (from the list: and, or,
nand, nor, xor, nxor).
1. XOR example - Given a set of x, y and result values (where the result
is the exclusive or of the x and y values) we try to discover an
equivalent function using only the functions and, or and not.
The Xor.param file contains:
gep.species.numgenes = 2
gep.species.gene-headsize = 8
gep.species.problemtype = logical
gep.species.symbolset.terminalfilename = ec/gep/test/Xor.txt
gep.species.symbolset.functionsize = 3
gep.species.symbolset.function.0 = Not
gep.species.symbolset.function.0.weight = 1
gep.species.symbolset.function.1 = And
gep.species.symbolset.function.1.weight = 1
gep.species.symbolset.function.2 = Or
gep.species.symbolset.function.2.weight = 1
The resulting statistic file shows that an expression was
discovered that matches the requirement exactly.
Seed(s) used in this job: 4357
Problem Type: logical
Generation: 0
Best Individual:
Evaluated: T
Fitness: 500.0
Linking function: and
Gene 0
not.and.d1.and.d1.d0.not.d0.d0.d0.d0.d1.d1.d0.d0.d1.d1
Gene 1
not.and.d0.d1.d0.d1.d0.d1.d0.d0.d0.d0.d1.d0.d1.d0.d0
(not (d1 and (d1 and d0))) and (not (d0 and d1))
Generation: 2
Best Individual:
Evaluated: T
Fitness: 1000.0
Linking function: and
Gene 0
not.and.d0.d1.d0.and.d1.not.d0.d1.d1.d0.d0.d1.d1.d0.d0
Gene 1
or.d0.or.d1.d1.not.not.not.d1.d0.d0.d1.d0.d0.d1.d0.d0
(not (d0 and d1)) and (d0 or (d1 or d1))
Best Individual of Run:
Found at Generation: 2
Evaluated: T
Fitness: 1000.0
Linking function: and
Gene 0
not.and.d0.d1.d0.and.d1.not.d0.d1.d1.d0.d0.d1.d1.d0.d0
Gene 1
or.d0.or.d1.d1.not.not.not.d1.d0.d0.d1.d0.d0.d1.d0.d0
(not (d0 and d1)) and (d0 or (d1 or d1))
Size of program: 9
Variables used(count): d0(2), d1(3)
***** TRAINING data results *****
Confusion Matrix:
Predicted Value
| Yes | No
|----------------
Yes| 2 | 0
Actual Value |----------------
No
| 0 | 2
|----------------
Observed Computed
0.0 0.0
1.0 1.0
1.0 1.0
0.0 0.0
2. Majority Function Example - This example looks for a logical
expression that can determine if given three inputs (x, y and z) there are at
least 2 true (1) values. The solution uses only the operators not, and
and or.
3. Odd 3-Parity Example - This example looks at 3 boolean
input variables (x, y, and z) and looks for a logical expression that can
decide if there are an odd number of true (1) values. The solution uses
only the operators not, and and or.
4. 6 Bit Multiplexer Example - This example is an attempt to
discover a 6 bit multiplexer function using and, or and not
operators. It looks at 6 boolean variables
representing 2 control bits and 4 data bits. The 1st 2 bits are the
control bits and they identify which of the next 4 input bits will be in
the output. For example, if the control bits are 00 then the 3rd bit value
is value of the output; if the control bits are 01 then the 4th bit value
is the output value; if the control bits are 10 then the 5th bit value is
the output value; if the control bits are 11 then the 6th bit value is the
output value. This can be expressed as:
A&~c1&~c2 v B&~c1&c2 v C&c1&~c2 v
D&c1&c2 = output
where c1 is control bit 1 and c2 is control bit 2
A, B, C, D are
the 4 input bits
So if we have 6 bits labelled d0, d1, d2, d3, d4, d5 and d0=c1, d1=c2, A=d2,
B=d3, C=d4, D=d5 then an excellent solution to the problem will be:
output = d2&~d0&~d1 v d3&~d0&d1 v
d4&d0&~d1 v d5&d0&d1
The file ec/gep/gep.params has most of the required details in comments. But
here we list the parameter settings that are of concern for gep problems.
eval.problem - The user code to support the problem. For
example, ec.app.gep.test2.test2. These classes will be of type
GEPProblem. If the user wants to just take a default route and not have to
create any code at all he can use the default GEPProblem class,
ec.gep.GEPDefaultUserProb. Then he will need to also specify a
evalthreads and breedthreads -
should both be 1 always. It
checkpoint - not well tested but it should work
OK. If not let me know. May not be important since rerunning
a problem may be fast enough.
state - Should be
ec.simple.SimpleEvolutionState; do not change this!
init - Should be ec.gep.GEPInitializer;
probably should not need to change this.
finish - Should be ec.simple.SimpleFinisher;
probably should not need to change this.
exch - Should be ec.simple.SimpleExchanger;
probably should not need to change this.
breed -Should be ec.simple.SimpleBreeder; do
not change this.
eval - Should be ec.simple.SimpleEvaluator; do not
change this without some serious thought.
generations - As for standard ecj, maximum
number of generations to run
quit-on-run-complete - As for standard ecj, true if will quit when
an
fitness-function-ideal-percent-threshold
- specifies a percentage of the maximum fitness value (MaxThreshold) that must
be reached to be considered an ideal fitness value; the default value is
99.999999; so if the maximum fitness value is 1000 and the fitness-function-ideal-percent-threshold
is set to 99.9 then a fitness value >= 999 will be considered ideal; by
default a value >= 999.99999 will be ideal.
pop - Should be ec.Population. - As for
standard ecj.
pop.subpops - must be set to 1 (do not change since only 1 subpop
makes sense).
pop.subpop.0 - must be ec.Subpopulation. Do not change.
pop.subpop.0.duplicate-retries - should be 0? Haven
pop.subpop.0.size - application dependent; number of individuals in each
generation
pop.subpop.0.species - must be ec.gep.GEPSpecies.
pop.subpop.0.species.fitness - Should be ec.simple.SimpleFitness. Could change this as required (for example to support a more
complex fitness or to use doubles rather than floats).
pop.subpop.0.species.ind - must be ec.gep.GEPIndividual.
The following are the standard defaults for the evolutionary parameters as
suggested by Ferreira. These can be changed as required. I called these
probabilities but they are actually
gep.species.inversion-prob
= 0.1
gep.species.mutation-prob
= 0.044
gep.species.istransposition-prob = 0.1
gep.species.ristransposition-prob = 0.1
gep.species.onepointrecomb-prob = 0.3
gep.species.twopointrecomb-prob = 0.3
gep.species.generecomb-prob
= 0.1
gep.species.genetransposition-prob = 0.1
Constants can not be used with Logical problem types and will be ignored if
specified.
gep.species.use-constants - set to true if
using constants else false.
gep.species.numconstantspergene - Number of constants generated for each
gene.
gep.species.integer-constants - Set to true if
using integer constants or false to use floating point constants.
gep.species.constants-lowerlimit - Minimum
value of the constants chosen.
gep.species.constants-upperlimit - Maximum
value of the constants chosen.
These are default to these values as suggested by Ferreira. They can be changed
as required for the problem, but will need to understand what they mean (read
the information in Ferreira
gep.species.rnc-mutation-prob
= 0.01
gep.species.dc-mutation-prob
= 0.044
gep.species.dc-inversion-prob
= 0.1
gep.species.dc-istransposition-prob = 0.1
gep.species.numchromosomes - The number of chromosomes to be used for
the problem. This is not part of Ferreira
gep.species.numgenes - The number of genes to be used for the problem.
gep.species.gene-headsize - The length of the head of
each gene.
The linking-function
is the function to use to combine the values of the expressions for each gene.
In early versions the linking-function could be one of +, -, * or / for
numerical problem types and for logic problems it could be one of: and,
or, nand, nor, xor, nxor. This was as provided in Ferreira’s implementation.
These choices are still available, but now any function defined as a
GEPFunctionSymbol (such as Add,
Sub, Mul, Div, Ifgt4, Ifeq4, Pow, etc) that has
arity of 2 or more can be used. It must be specified with the same ‘case’ as
the class name and there must be a ‘correct’ number of genes that suit the
arity of the function. For a function with arity of 2 any number of genes is
OK; for arity 3 there must be 3, 5, 7, 9, etc. genes; for arity 4 there must be
4, 7, 10, 13, etc. genes; For 1 gene then no function is needed.
gep.species.gene-linking-function - The
function to use when combining the values the expressions for each gene.
Problem type must be one of: functionfinding, classification, timeseries, logical. Set default to be
gep.species.problemtype = unknown
gep.species.classification-threshold - If the problem is a classification
type problem then a threshold value (used to convert real values to 0 or 1
during fitness calculations) must be specified.
If the problem type is timeseries then the user needs to specify a number of
parameters. See ec.gep.gep.params for full details.
gep.species.timeseries-delay
gep.species.timeseries-embeddingdimension
gep.species.timeseries-testingpredictions
gep.species.symbolset - Must be ec.gep.GEPSymbolSet; do NOT
change.
gep.species.symbolset.name - can be any string but must be specified;
not actually used (at one point it seemed we might have multiple symbols sets
but this will not likely happen0; should be removed at some point.
Examples for specifiying the functions to be used in the
problem and their weights. The set of functions can be found by looking
at the directory ec/.gep/symbols.
gep.species.symbolset.functionsize = 4
gep.species.symbolset.function.0 = Add
gep.species.symbolset.function.0.weight = 1
gep.species.symbolset.function.1 = Sub
gep.species.symbolset.function.1.weight = 1
gep.species.symbolset.function.2 = Mul
gep.species.symbolset.function.2.weight = 1
gep.species.symbolset.function.3 = Div
gep.species.symbolset.function.3.weight = 1
The number and names of the terminal symbols (independent variables) can be
specified as shown below:
gep.species.symbolset.terminalsize = 1
gep.species.symbolset.terminal.0 = x
Or for the terminal set, the info can be in a text file where the file has
the names of the terminal symbols and the dependent variable in the first
record/line (comma separated, space or tab separated, etc) and the
training values for the terminal symbols in the other records; the number
of terminals is computed from the number of terminal symbols in the 1st record.
See details in ec/gep/gep.params and look at some examples.
gep.species.symbolset.terminalfilename = filename
gep.species.symbolset.terminalfileseparator = ,
or for a tab
gep.species.symbolset.terminalfileseparator = tab
or for a space
gep.species.symbolset.terminalfileseparator = space
If there is testing data as well it can be specified. Again see full details in
ec/gep/gep.params since there are some slightly complex situations (timeseries
data etc.).
gep.species.symbolset.testingterminalfilename = testDataFileName
For gep systems breeding should be as shown below. Changing
theseis possible but will not be as per original gep by Ferreira (as in
GeneXProTools) and will require some in-depth knowledge!
gep.species.pipe = ec.gep.breed.DcGeneticOperatorsPipeline
gep.dcgeneticoperators.source.0 = ec.gep.breed.GenerecombinationPipeline
gep.generecombination.source.0 =
ec.gep.breed.TwopointrecombinationPipeline
gep.twopointrecombination.source.0 =
ec.gep.breed.OnepointrecombinationPipeline
gep.onepointrecombination.source.0 =
ec.gep.breed.GenetranspositionPipeline
gep.genetransposition.source.0 = ec.gep.breed.RIStranspositionPipeline
gep.RIStransposition.source.0 = ec.gep.breed.IStranspositionPipeline
gep.IStransposition.source.0 = ec.gep.breed.InversionPipeline
gep.inversion.source.0 = ec.gep.breed.MutationPipeline
# FitProportionateSelection (roulette selection) is used exclusively by
Ferreira - could be changed
gep.mutation.source.0 = ec.select.FitProportionateSelection
breed.elite.0 - Should be set to 1 since this is the only option used by
Ferreira; however you might want to experiment with this to see if it improves
the evolution for your problem or just to force the best n individuals to be
carried forward.
We supply a not so simple GEPSimpleStatistics class to handle the basic capture
of information from a gep run. For the basic setting of statistics output you
can do something like the following.
stat = ec.gep.GEPSimpleStatistics
stat.file = ourProblem.stat
stat.file-observed-versus-computed = ourOvsC.stat
stat.file-observed-versus-computed-test = ourOvsCtest.stat
stat.detail-to-log = change
stat.number-of-best-to-log = 1
This contains all of the parameters that you might like to use but variations
can be done depending what you want to capture. As shown above,
GEPSimpleStatistics has a number of parameters it can use to control the
output:
stat.file - The main output is directed to the file specified by the
file parameter.
file-observed-versus-computed - this file, if
specified, gets the results of the best of run model by listing the
observed values versus the model
file-observed-versus-computed-test. If file-observed-versus-computed
is not specified it prints to the main output file, if specified,
or to the console.
file-observed-versus-computed-test - If
specified, this identifies a file with only the computed test data values.
If not specified it prints to the console.
no-observed-versus-computed - If true
then don
detail-to-log - the user can specify the detail
of the output to the main log file (param
all - display the best model
from every generation (can be a large amount of output)
change - logs the best model of
the generation when the fitness improves (default)
final - logs only the best of run
model information
number-of-best-to-log - this determines the number of best individuals
to display in the final results. By default it is 1.
For the multiple chromosome case there is another not so simple
GEPSimpleStatisticsMultipleChromosomes class to handle the capture of
information from a gep run. For the basic setting of statistics output you can
do something like the following.
stat =
ec.gep.GEPSimpleStatisticsMultipleChromosomes
stat.file = ourProblem.stat
stat.file-computed-values = ourOvsC.stat
stat.no-computed = false
stat.detail-to-log = change
stat.number-of-best-to-log = 1
These are similar to the parameters in the GEPSimpleStatistics case but we don
More complex things can be done as per standard ECJ. Consider the example below
examples for setting multiple statistic packages.
stat.num-children = 3
stat.child.0 = ec.gep.test.SimpleXYSeriesChartStatistics
stat.child.0.title = Best of Generation
stat.child.0.x-axis-label = generation
stat.child.0.y-axis-label = fitness
stat.child.1 = ec.simple.SimpleShortStatistics
stat.child.1.file = simpleshortstats.txt
stat.child.1.gather-full = true
stat.child.2 = ec.gep.GEPSimpleStatistics
stat.child.2.file = simplestats.txt
stat.child.2.file-observed-versus-computed = simplestatsOvsC.txt
Note that as revisions are made, new parameters may be added or changed, so it
is best to also look at the gep.params file for details.
state.evalthreads = evalthreads;
state.breedthreads =
breedthreads;
state.seeds = seeds; // this line was added
state.breedthreads = breedthreads;
state.randomSeedOffset =
randomSeedOffset;
state.seeds = seeds; // this line was added
/**
* The seeds used for the current job
*/
public int seeds[];
package
ec.gep.symbols;
import ec.EvolutionState;
import ec.gep.GEPFunctionSymbol;
import ec.gp.GPData;
import ec.gp.GPIndividual;
import ec.util.Parameter;
public class Add extends GEPFunctionSymbol {
public Add()
{
super("+", 2);
}
public double eval(double params[])
{
//should check that there are 2
params
return (params[0] + params[1]);
}
public boolean isLogicalFunction()
{
return false;
}
public String getMathExpressionAsString( String p[] )
{
return "(" + p[0] +
"+" + p[1] + ")";
}
}
The constructor will simply call the super constructor and pass the symbol to
be used when printing the Karva representation of the expressions formed in the
genes. It will also pass the getMathExpressionAsString
contains
the string representations of its parameters.
Adding new fitness functions is done by adding some methods to the class
ec.gep.GEPFitnessFunction. For each of the fitness functions there are 3
methods that need to be supplied: one to calculate the fitness (with a
value from 0 to some maximum); one to give the raw fitness value, the value
prior to being mapped between 0 and the maximum value; and one to provide
the maximum value that the fitness value can take for that fitness function. Actually
the raw value is not always needed (see for example the specialized fitness
function WCorrRMSE (Weighted correlation coefficient and Root Mean Squared
Error). There is no raw value for this since it combines to other fitness
values. but since some users might want access to the
value before it is converted to a value from 0 to a maximum value, it seemed
prudent to make this available whenever possible. By convention we use the
names XXXXfitness, XXXXrawFitness and XXXXmaxFitness for all of the
fitness functions. XXXX is the pneumonic for the fitness function (e.g.
RRSE, MAE, etc.). This allows a user to specifiy a fitness function to
be used in the default user program by specifying only the short pneumonic
for the fitness function (they don
GEPFitnessFunction.MSEfitness(useTrainingData, gepindividual
);
- calculates the fitness
using the individual
- it uses either the training date (useTrainingData
is true) or the testing data (useTrainingData
is false)
- it gets the raw fitness first from MSErawFitness (the mean
squared error
of the predicted values versus the text/expected
values)
- then it normalizes the result between 0 and 1000
(1000 * (1/(1 + raw MSE))
GEPFitnessFunction.MSErawFitness(useTrainingData, gepindividual,
chromosomeNum);
- sum((predicted valuei - expected
valuei)**2)/n
GEPFitnessFunction.NHmaxFitness( gepindividual );
- in this case max is always 1000 but some fitness functions
have a maximum that depends on the number of training data sets
Again let
/**
* Calculates the
* fitness (before the normalization from 0 to max value is done).
* @param useTrainingData true if using training data, else use testing data
* @param
* @param chromosomeNum which chromosome in the individual to use to calc the raw RAE
* @return the
*/
public static double RAErawFitness(boolean
useTrainingData, GEPIndividual
{
double
sumOfAbsoluteError = 0.0;
double expectedResult;
double result;
double error;
GEPDependentVariable dv;
if
(useTrainingData)
dv = GEPDependentVariable.trainingData;
else
dv = GEPDependentVariable.testingData;
double dvValues[] =
dv.getDependentVariableValues(chromosomeNum);
double
dvSumOfAbsoluteError = dv.getDependentVariableSumOfAbsoluteError(chromosomeNum);
for (int i=0;
i<dvValues.length; i++)
{
expectedResult
= dvValues[i];
result =
ind.eval(chromosomeNum, useTrainingData, i);
error =
result - expectedResult;
sumOfAbsoluteError
+= Math.abs(error);
}
if
(dvSumOfAbsoluteError == 0.0)
{
dvSumOfAbsoluteError = RELATIVE_ERROR_ZERO_FACTOR;
System.err.println("Warning:
sum of error for dependent variable is 0 in RAE fitness calculation. Adjusting
to avoid division by zero.");
}
// the raw fitness ... RAE
return
(sumOfAbsoluteError/dvSumOfAbsoluteError);
}
/**
* Calculates the fitness for the RAE (Relative Absolute Error) type fitness.
* Gets the raw fitness and then normalizes between 0 and max value as (maxValue * (1/(1+RAE)).
* @param useTrainingData true if using training data, else use testing data
* @param
* @return the fitness value after normalization from 0 to max value
*/
public static double RAEfitness(boolean
useTrainingData, GEPIndividual
{
double RAE = RAErawFitness(useTrainingData,
// raw fitness is normalized between 0
and 1000 (1000
* (1/(1+RAE))
return
(1000.0)/(1.0+RAE);
}
/**
* The max value for this type of fitness is always 1000.
* @param
* @return value 1000.0
*/
public static double
RAEmaxfitness(GEPIndividual
{
// always 1000
return 1000.0;
}
The code is fairly straight forward. the only thing
to note is that at one point we accessed the sum of the absolute error of the
values of the dependent variable data in the training set using -
GEPDependentVariable.getDependentVariableSumOfAbsoluteError(). When the
training and testing data for the program is obtained, a number of values are
calculated and stored statically in the class GEPDependentVariable. This is so
these values, which are used by several fitness functions, don
/** Mean value of the training values for the
dependent variable */
static double dependentVariableMean = 0.0;
/** Sum of the absolute error between dependent
variables training values and their mean value. The
* absolute error
is Abs(dv[i] - dvMean).
*/
static double dependentVariableSumOfAbsoluteError =
0.0;
/** Sum of the squares of the absolute errors between
dependent variables training values and their mean
* value. The
absolute error is Abs(dv[i] - dvMean).
*/
static double
dependentVariableSumOfSquaredAbsoluteError = 0.0;
/** Sum of the relative errors between dependent
variables training values and their mean value.
* The relative
error is Abs((dv[i] - dvMean)/dvMean).
*/
static double dependentVariableSumOfRelativeError =
0.0;
/** Sum of the squares relative errors between
dependent variables training values and their mean value.
* The relative
error is Abs((dv[i] - dvMean)/dvMean).
*/
static double
dependentVariableSumOfSquaredRelativeError = 0.0;
These values are available for both the training data as shown above and for the test data as well. See the class ec.gep.GEPDependentVariable.