ECJ has been around for almost eight years, for much of the length of Java's public existence. Though ECJ was overengineered and overwrought from the start, this overengineering has granted it a great deal of flexibility and the toolkit has, I think, proven unusually adaptable.
But ECJ has a lot of warts, many of which could be cleaned up with some work, though I'm worried about backward compatibility. Keep in mind that ECJ's design was begun in 1997. For those of you old enough to remember :-), Java 1.1 was released in February 1997 -- and it was buggy -- and Java 1.2 wasn't released until December 1998. Lots of installations were still using Java 1.0. JITs were only just coming out, and HotSpot didn't exist. At the time, Java had a number of missing features and efficiency problems which influenced ECJ's design.
Here are some warts I've identified. Send me mail if you'd like some others discussed here.
setup(...) in most objects is fairly complex, consisting of a lot of checks to make sure everything is kosher. These checks aren't necessary for quick hacks -- I wrote them into ECJ because it's library code and should be relatively bulletproof. But unfortunately they lead people to think that setup(...) methods need to be complicated, when in fact they can be quite trivial.
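To illustrate just how trivial a setup(...) can be, here is a sketch using hypothetical stand-ins for ECJ's EvolutionState and Parameter classes (the real library versions are richer, but the shape of the method is the same):

```java
// Hypothetical, simplified stand-ins for ECJ's Setup machinery -- just
// enough to show that a setup(...) body can be a one-liner.
interface Setup { void setup(EvolutionState state, Parameter base); }

class Parameter {
    final String name;
    Parameter(String name) { this.name = name; }
    Parameter push(String s) { return new Parameter(name + "." + s); }
}

class EvolutionState {
    java.util.Map<String, String> parameters = new java.util.HashMap<>();
    int getInt(Parameter p, int def) {
        String v = parameters.get(p.name);
        return v == null ? def : Integer.parseInt(v);
    }
}

// A quick-hack setup(...): pull one parameter, store it, done.  No
// elaborate bulletproofing required.
class MyMutator implements Setup {
    int numTries;
    public void setup(EvolutionState state, Parameter base) {
        numTries = state.getInt(base.push("tries"), 1);
    }
}
```

The bulletproof checks in ECJ's own setup(...) methods are a feature of library code, not a requirement of the interface.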
ECJ also needed to make large numbers of copies of objects; but the clone() method was protected and could not be unprotected in an interface. Thus protoClone() was born to copy objects off of Prototypes.
protoCloneSimple() came about from the perception that, for the early Java virtual machines, there was a cost incurred by wrapping something in try { ... }. Thus protoClone would be declared to throw the baloney CloneNotSupportedException, and protoCloneSimple() would catch it in those cases where it was convenient to have such a wrapper. This notion of slowness in early try { ... } constructs may have been mistaken, though I seem to recall it was correct. At any rate, were we to do this again, protoCloneSimple would be gone.
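The pattern can be sketched with simplified stand-in classes (not ECJ's real Prototype):

```java
// Sketch of the protoClone/protoCloneSimple pattern.  Object.clone() is
// protected and an interface can't expose it, so the interface declares
// a public protoClone() instead.
interface Prototype extends Cloneable {
    Object protoClone() throws CloneNotSupportedException;
}

class Gene implements Prototype {
    double value;
    // protoClone just exposes the protected Object.clone() publicly,
    // dutifully declaring the exception it will never actually throw.
    public Object protoClone() throws CloneNotSupportedException {
        return clone();
    }
    // protoCloneSimple buries the try { ... } so callers needn't bother.
    public Object protoCloneSimple() {
        try { return protoClone(); }
        catch (CloneNotSupportedException e) { throw new InternalError(e.toString()); }
    }
}
```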
To make matters even more complex, protoClone was originally conceived without a definition as to whether it was a deep clone or not. This is problematic because GPTrees sometimes need to be light-cloned and sometimes deep-cloned. So Individual has a deep-clone method as well as a protoClone method. Instead we should have made ALL protoClones deep and created a separate "light clone" method for GPIndividual.
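The light-versus-deep distinction looks roughly like this sketch (hypothetical simplified classes, not ECJ's actual GPIndividual):

```java
// Sketch of light vs. deep cloning.  The tree stands in for a real
// GP node tree; names are hypothetical.
class GPTree {
    double[] nodes = new double[4];
}

class GPIndividual implements Cloneable {
    GPTree tree = new GPTree();
    // lightClone: copy the individual but SHARE the tree.
    GPIndividual lightClone() {
        try { return (GPIndividual) super.clone(); }
        catch (CloneNotSupportedException e) { throw new InternalError(); }
    }
    // deepClone: copy the tree's contents too.
    GPIndividual deepClone() {
        GPIndividual i = lightClone();
        i.tree = new GPTree();
        i.tree.nodes = tree.nodes.clone();
        return i;
    }
}
```

An underspecified protoClone leaves the caller guessing which of these it will get -- hence the suggestion that protoClone should always have been deep, with light cloning as an explicitly separate method.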
Cliques proved to be useful but not as an interface. Quite a number of ECJ objects are Cliques, and they all follow the same pattern: a central repository which is globally accessible, storing a small number of objects. This repository should have been moved into Clique itself, making Clique a class. But it wasn't. Additionally, the repository was global (a static variable). This has prevented ECJ from being entirely modularized until recently, when we moved all these variables into Initializers for lack of a better place to put them.
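Had Clique been a class owning its repository, it might have looked something like this sketch (hypothetical names, not ECJ's actual code):

```java
// Sketch: Clique as a class holding its own, non-static repository of a
// small number of named members.
class Clique<T> {
    private final java.util.HashMap<String, T> repository = new java.util.HashMap<>();

    public void register(String name, T member) { repository.put(name, member); }
    public T lookup(String name) { return repository.get(name); }
    public int size() { return repository.size(); }
}
```

Because the repository is an instance variable rather than a static one, two independent evolutionary runs in the same JVM could each carry their own Clique instances -- exactly the modularity that the global-variable version prevented.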
Singleton never had a reason to be an interface -- it's totally empty and there's nothing special about it. Ultimately we might delete it and just have those objects be Setups.
ParameterDatabase sits on top of PropertyList rather than XML. For good reason: XML didn't exist at the time. But I don't view this as a wart really: it's a feature. XML was never meant to be a data transfer format, and its misuse as one produces huge amounts of typing and syntactic errors. PropertyLists are much simpler and I'm glad I went with them. The disadvantage of PropertyLists is that they're flat. As a result, you see huge, long dotted parameters like "pop.subpop.0.species.pipe.source.0 = ec.gp.koza.CrossoverPipeline". A tree format like XML would solve this, sort of.
I wound up using Output's logging to write Individuals to stdout and to various statistics files. I think this was an error. Later the need arose to write Individuals to Writers and read them from Readers anyway. As a result, Individuals have methods both for writing themselves to streams AND for logging themselves via Output.
For a long time, ECJ has had two different ways for writing individuals out: one which is human-readable only, and one which can be read by humans and by machines. This allows GPIndividuals to be dumped in a "pretty" format and a "read-in" format. To pull off the trick of a human AND computer readable format, I constructed an encoding mechanism for numbers which included both their human-readable format and raw bits. It's called Code. A Coded double (2.932341) looks like this:
d4613785463717479022|2.932341|
Perhaps this might be better described as "human-decodable" rather than "human-readable". It worked well, but parsing it back in is slow (fine for files, bad for sockets). And the reading mechanism (a DecodeReturn tokenizer) is overly complex. Use of Code, Output, and other gunk makes ECJ's printFoo and readFoo methods quite difficult to understand.
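The double encoding above can be reproduced with Java's standard bit-conversion methods. This is a simplified sketch of the idea, not ECJ's actual Code class:

```java
// Sketch of Code-style encoding for doubles: raw bits for the machine,
// decimal text for the human.  Simplified; not ECJ's real Code class.
class DoubleCode {
    static String encode(double d) {
        // The raw bits are exact; the decimal rendering is just for people.
        return "d" + Double.doubleToLongBits(d) + "|" + d + "|";
    }
    static double decode(String s) {
        // Only the bits between 'd' and the first '|' matter when reading.
        long bits = Long.parseLong(s.substring(1, s.indexOf('|')));
        return Double.longBitsToDouble(bits);
    }
}
```

Decoding recovers the double bit-for-bit from the integer, so no precision is lost to the decimal rendering -- but as noted, pulling longs out of delimited text is far slower than just reading eight raw bytes.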
The original code for island models used printIndividual and readIndividual to send individuals across streams. This is grotesquely inefficient, necessitating the creation of lots of Strings and lots of parsing. A much better approach would be to use DataInputStream and DataOutputStream, but I was concerned that this would result in even MORE ways to print and read Individuals. Well, we're going to bite the bullet and do it, for island models and the client/server evaluator at least.
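A binary write/read pair along those lines might look like this sketch, assuming a hypothetical individual whose genome is just a double array (not ECJ's actual Individual API):

```java
import java.io.*;

// Sketch of binary genome I/O via DataOutputStream/DataInputStream --
// no Strings, no parsing.  Hypothetical simplified genome.
class BinaryIO {
    static byte[] writeGenome(double[] genome) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(genome.length);                 // length header first
            for (double g : genome) out.writeDouble(g);  // then raw doubles
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    static double[] readGenome(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            double[] genome = new double[in.readInt()];
            for (int i = 0; i < genome.length; i++) genome[i] = in.readDouble();
            return genome;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

Over a socket, the same streams wrap the socket's input and output streams directly; each double costs eight bytes and zero parsing.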
GPNode's eval(...) method has an enormous signature:

public abstract void eval(final EvolutionState state, final int thread, final GPData input, final ADFStack stack, final GPIndividual individual, final Problem problem);
... it should have been something like
public abstract void eval(final GPData input, final GPContext context);
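A context object bundling the bookkeeping arguments might look roughly like this (hypothetical names and stub stand-in types -- not a definition of any actual ECJ class):

```java
// Stub stand-ins for ECJ's real classes, just to make the sketch compile.
class EvolutionState { }
class ADFStack { }
class GPIndividual { }
class Problem { }
class GPData { }

// Sketch of the proposed context object: everything from the old
// six-argument signature except the GPData, gathered into one place.
class GPContext {
    EvolutionState state;
    int thread;
    ADFStack stack;
    GPIndividual individual;
    Problem problem;
}

abstract class GPNode {
    public abstract void eval(GPData input, GPContext context);
}
```

The context can be built once per evaluation and threaded through the whole tree, rather than pushing six arguments onto the stack at every node.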