Say that we have the following observational data:
Planet | Aphelion (000 km) | Perihelion (000 km) | Orbit time (days) |
Mercury | 69,816 | 46,001 | 88 |
Venus | 108,942 | 107,476 | 225 |
Earth | 152,098 | 147,098 | 365 |
Mars | 249,209 | 206,669 | 687 |
Jupiter | 816,520 | 740,573 | 4,332 |
Saturn | 1,513,325 | 1,353,572 | 10,760 |
Uranus | 3,004,419 | 2,748,938 | 30,799 |
Neptune | 4,553,946 | 4,452,940 | 60,190 |
Pluto | 7,311,000 | 4,437,000 | 90,613 |
These are the minimal and maximal distances between each planet and the Sun (both in thousands of kilometres) and the number of (Earth) days for one revolution around the Sun. The table above is purely empirical data; it states no binding algorithm among the three quantities. The rules of celestial mechanics which relate them go by the name of Kepler's laws. Can those rules be (re)invented by a computer program, and how?
The following program code will be put into a simulator:
//declarations of the integer type variables
$DECLAREINT bad perihelion aphelion orbit guess dif temp zero temp1
//table with the known data in a simulator-friendly format
$INVAR perihelion(46001) aphelion(69816) orbit(88)
$INVAR perihelion(107476) aphelion(108942) orbit(225)
$INVAR perihelion(147098) aphelion(152098) orbit(365)
$INVAR perihelion(206669) aphelion(249209) orbit(687)
$INVAR perihelion(740573) aphelion(816520) orbit(4332)
$INVAR perihelion(1353572) aphelion(1513325) orbit(10760)
$INVAR perihelion(2748938) aphelion(3004419) orbit(30799)
$INVAR perihelion(4452940) aphelion(4553946) orbit(60190)
$INVAR perihelion(4437000) aphelion(7311000) orbit(90613)
//the variables orbit and bad can't be touched by the simulator,
//to avoid a degeneration to a triviality
$RESVAR orbit bad
//do NOT use if or while clauses, do not set direct numbers, do not increment/decrement ...
$RESCOM if while val_operation inc_dec
//bad is the variable by which the whole program will be judged
//a big value of bad is bad. By this criterion programs will be wiped out
//from their virtual existence. A kind of anti-fitness
$PENVAL bad
//do show the following variables when simulating
$SHOWVAR bad,orbit,guess,dif
//penalize every command with 0 points (nothing) and every line with 1 point
$WEIGHTS commands=0 lines=1
//minimize the whole program to 20 lines or less
$MINIMIZE lines 20
$BES
//the arena, where algorithms will be
//created and only the fittest will survive
$EES
//the testing area, to which the simulator has no write access
//here bad (the penalized variable) is calculated
//the bigger the difference between the known orbit and the variable guess,
//the worse the evolved algorithm is
dif=orbit-guess;
dif=abs(dif);      //absolute error
bad=dif;
temp=dif;
temp*=10000;       //fixed-point scale factor
temp1=temp/orbit;  //relative error, in units of 0.01%
temp=temp1*temp1;  //squared relative error
bad=bad+temp;      //anti-fitness = absolute error + squared relative error
//end of the testing area
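For one concrete row: if orbit is 365 and the evolved code guesses 358, then dif = 7, temp1 = 70000/365 = 191 (integer division), and bad = 7 + 191*191 = 36488. Except for perfect guesses, the squared relative term dominates the penalty.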
After several hours, the following C code had evolved inside the $BES - $EES segment.
aphelion=perihelion+aphelion;
aphelion=aphelion+aphelion;
aphelion=aphelion+aphelion;
guess=12;
aphelion=aphelion>>guess;
temp=aphelion/guess;
aphelion=aphelion-temp;
dif=sqrt(aphelion);
aphelion=guess|aphelion;
aphelion=aphelion*dif;
aphelion=guess^aphelion;
guess=aphelion/guess;
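The snippet can be lifted out of the simulator and run as ordinary C with a small harness like the one below (a sketch: the integer sqrt is approximated here by truncating the floating-point result, and Critticall's exact integer semantics may round differently, so the output need not match the simulator digit for digit).

#include <stdio.h>
#include <math.h>

//the evolved formula, transcribed verbatim from the arena segment
long guess_orbit(long perihelion, long aphelion)
{
    long guess, temp, dif;
    aphelion = perihelion + aphelion;
    aphelion = aphelion + aphelion;
    aphelion = aphelion + aphelion;
    guess = 12;
    aphelion = aphelion >> guess;
    temp = aphelion / guess;
    aphelion = aphelion - temp;
    dif = (long)sqrt((double)aphelion);  //integer square root
    aphelion = guess | aphelion;
    aphelion = aphelion * dif;
    aphelion = guess ^ aphelion;
    return aphelion / guess;
}

int main(void)
{
    //Earth's row from the table above
    printf("guessed orbit: %ld days\n", guess_orbit(147098, 152098));
    return 0;
}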
What does the simulator do? It bombards the arena segment with random C commands. Usually it then just notices a syntax error and reverts everything to the last working version. If everything is syntactically good, the simulator interprets the program and checks whether the mutated version causes any run-time error, like a division by zero, a memory leak and so on. In the case of such an error it reverts to the last good version. Otherwise it checks whether the variable called "bad" is at least as small as it has ever been before. If it is, a new version has just been created, and it is stored.
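At its core this is a mutate-evaluate-revert loop. Below is a minimal sketch of that loop (an illustration, not Critticall's code: to keep it short, the mutated "program" is collapsed to a single integer coefficient c in a fixed formula, and the penalty from the testing area is summed over all nine rows).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static const long peri[]  = {46001,107476,147098,206669,740573,1353572,2748938,4452940,4437000};
static const long aph[]   = {69816,108942,152098,249209,816520,1513325,3004419,4553946,7311000};
static const long orbit[] = {88,225,365,687,4332,10760,30799,60190,90613};

//the anti-fitness from the testing area, summed over all nine planets;
//the candidate formula is guess = c * a^1.5 / 10^9, a = semi-major axis
long long badness(long long c)
{
    long long bad = 0;
    int i;
    for (i = 0; i < 9; i++) {
        long long a = (peri[i] + aph[i]) / 2;
        long long guess = (long long)(c * pow((double)a, 1.5) / 1e9);
        long long dif = llabs(orbit[i] - guess);
        long long rel = dif * 10000 / orbit[i];
        bad += dif + rel * rel;
    }
    return bad;
}

int main(void)
{
    long long c = 1, best = badness(c), mutant, b;
    long step;
    srand(1);
    for (step = 0; step < 1000000; step++) {
        mutant = c + rand() % 2001 - 1000;        //random mutation
        b = badness(mutant);
        if (b <= best) { c = mutant; best = b; }  //keep it, otherwise revert
    }
    printf("c = %lld, bad = %lld\n", c, best);
    return 0;
}

The real simulator mutates the instruction stream itself, which is what lets it invent a formula rather than merely tune one.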
The evolutionary pressure works toward ever better code, which guesses the orbit times of the nine planets increasingly well. In this case the "orbit" variable has been put under the $RESVAR clause, and the "guess" variable is then tested against the "orbit" variable. Had there been no "$RESVAR orbit" statement, a trivial "guess=orbit;" would quickly evolve. Had there been no "$RESVAR bad" statement, a simple "bad=-1000000;" could derail the process.
Many thousands of algorithms are born and die every second inside this simulator on a standard Windows PC. A million or a billion generations later, the digital evolution is still running, even if an excellent solution has already been found.
And how good an approximation of Kepler's (Newton's) celestial mechanics of the Solar system do we have here?
This is how good it is for the nine planets on which the code evolved:
Planet | Error % |
Mercury | 0.00 |
Venus | 0.44 |
Earth | 0.27 |
Mars | 0.29 |
Jupiter | 0.16 |
Saturn | 0.65 |
Uranus | 0.10 |
Neptune | 0.79 |
Pluto | 1.08 |
And this is how good it is for the control group of one comet and six asteroids:
Asteroid/Comet | Error % |
Halley | 1.05 |
Hebe | 1.37 |
Astraea | 1.99 |
Juno | 3.19 |
Pallas | 1.66 |
Vesta | 2.49 |
Ceres | 2.02 |
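For reference, the law the code is rediscovering is Kepler's third law, T^2 ∝ a^3, where a is the semi-major axis (perihelion + aphelion)/2. Calibrated on Earth's row of the table, it looks like the sketch below; the constants come straight from the data above, nothing else is assumed.

#include <stdio.h>
#include <math.h>

//Kepler's third law, T^2 proportional to a^3, calibrated on Earth's row:
//a_earth = (152098 + 147098)/2 = 149598 thousand km, T_earth = 365 days
long kepler_orbit(long perihelion, long aphelion)
{
    double a = (perihelion + aphelion) / 2.0;  //semi-major axis, in 000 km
    return (long)(365.0 * pow(a / 149598.0, 1.5) + 0.5);
}

int main(void)
{
    printf("Mercury: %ld days\n", kepler_orbit(46001, 69816));  //prints 88
    return 0;
}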
It could be even much better after another billion generations, and maybe with even more $INVAR examples. Generally, you can pick any three columns from any integer-valued table you want and see this way how they are related algorithmically. It can also be more than three columns.
The name of the simulator (evoluator) is Critticall and it is available at http://www.critticall.com
The distinction between data and code is mostly useful, but in some situations we intentionally cross the boundary. If you have a text editor, the separation is valid -- the program is code, and the text documents are data. But if you have a compiler, the compiler is code, and the compiled program is data -- though on a different level it is also code. If you have a dynamically loaded library, it is data while you load it, but then you start treating it as code by running it. If you have an interpreter, which is code, it emulates a program which on one level is data and simultaneously, on another level, is code. If you have a decompiler or debugger, which is code, it takes other code and treats it as data. A bootloader loads data and then starts it as code. Etc.
This means that treating data as code, or code as data, is not necessarily a mistake or a confusion of levels, and sometimes it is unavoidable. How else could you, for example, start a program which originally exists only as a sequence of bytes in a file on a disk? You treat it as data when loading the file into memory, and then you treat it as code.
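A concrete illustration (a sketch assuming Linux on x86-64; the writable-and-executable mapping is exactly the kind of level-crossing that hardened systems with W^X policies forbid):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    //x86-64 machine code for: mov eax, 42; ret
    unsigned char payload[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    //step 1: treat the bytes as data - copy them into an executable page
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return 1;
    memcpy(page, payload, sizeof payload);

    //step 2: treat the same bytes as code - jump to them
    int (*func)(void) = (int (*)(void))page;
    printf("%d\n", func());  //prints 42
    return 0;
}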
Some maths here: let X be a problem we want to solve, and let's suppose the problem is algorithmically solvable. Further, let P be the set of all programs, PX the set of programs that correctly solve problem X, U a unit test, and PU the set of programs that pass the unit test.
We will call U a correct unit test for problem X if each program from PX belongs to PU (a correct program always passes the unit test) and some programs in P-PX don't belong to PU (some incorrect programs don't pass the unit test). In other words, PX is a subset of PU, which is a strict subset of P.
A random code generator will provide a random distribution R of programs. Now the question is whether R ∩ PX is non-empty; in other words, whether the generator is capable of generating a correct solution, assuming it is incredibly lucky. This depends on the generator and on the problem. For example, if the generator only creates random programs of up to 100 lines of code, but the shortest code that solves the problem has 150 lines, then the generator cannot create a correct program. But if the generator is capable of creating a program of any length -- more precisely, capable of creating any program -- then, since the problem is algorithmically solvable, it will generate a correct program with non-zero probability. Just imagine that you wrote the correct program: with some non-zero probability, the output of the random generator will be exactly the same text. The only problem is that p(x in PX | x in R) is very small. Sometimes so small that even if we could reliably and automatically test whether a program belongs to PX, we just wouldn't have enough time to wait until the first such program is generated. But let's assume it is a simple enough program, so that if we generate a few million programs, we can expect some of them to belong to PX.
Yeah, we have a couple of assumptions here. Again, they are: (1) a correct program exists, and (2) some correct program has a Kolmogorov complexity small enough (i.e. has a short code, if written optimally) that we can realistically wait until a random generator generates it. Actually, also (3) some correct program with small Kolmogorov complexity runs fast enough to pass the unit test within the given time limit.
Assuming all this, we have a set of generated random programs R, some of them belong to PX, and we are trying to find them. If we replace the set R with R ∩ PU, we get a smaller set, and R ∩ PX is a subset of R ∩ PU, so we did not lose any solution. In probabilistic terms, p(x in R ∩ PX | x in R ∩ PU) is greater than p(x in PX | x in R); in human speech: if we choose a random program that passes the unit test, we have a better chance than if we choose any random program.
The question is how much the test U reduces the set of all programs. If only 1% of programs pass the unit test, then p(x in R ∩ PX | x in R ∩ PU) = 100×p(x in R ∩ PX | x in R), so by applying the unit test we have increased our luck 100 times... though the result may still be pretty close to zero. We can filter the results by another test, but then we need only a few of the programs that passed the previous test to pass the next one. Which may be difficult, and also pretty difficult to estimate in advance; we can only measure afterwards how many of the programs that passed the first test were filtered out by the second one. The idea is: if a generated program has a 1:10^10 chance of being correct, and we can make a sequence of 5 unit tests where each one filters out 99% of the programs that passed the previous tests, we have a decent chance at the end.
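To spell out the arithmetic in that example (under the assumption that every correct program passes every test): each filter keeps 1% of the surviving pool, so passing all five keeps a fraction of 0.01^5 = 10^-10 of the generated programs, while each filter multiplies our chance by 100. The 1:10^10 prior thus becomes 10^-10 × 100^5 = 1, i.e. practically every program surviving all five filters is correct. In reality the factors won't be exactly 100 each, which is why they can only be measured afterwards.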
Of course, to make all this work we need a decent generator, one which assigns reasonable prior probabilities to programs; therefore it would be better to generate parse trees than source texts, etc. We could even invent a new language that allows us to write shorter programs. For example, it could have an instruction "forAllItemsInSet(SET) { X | CODE }" made of 4 symbols, instead of "for (X = 1; X < SET.size; X++) { CODE }" with 11 symbols, etc.
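Generating parse trees instead of character strings guarantees, for instance, that every candidate is at least syntactically valid, so no samples are wasted on syntax errors. A toy sketch (an assumed expression grammar over perihelion, aphelion and small constants -- not Critticall's actual instruction set):

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char op;              //'p', 'a', 'c' for leaves; '+', '-', '*' for operators
    long val;             //constant value, used when op == 'c'
    struct Node *l, *r;
} Node;

//every tree this returns is a syntactically valid expression by construction
Node *gen(int depth)
{
    Node *n = malloc(sizeof *n);
    if (depth == 0 || rand() % 3 == 0) {   //leaf: variable or constant
        n->op = "pac"[rand() % 3];
        n->val = rand() % 16;
        n->l = n->r = NULL;
    } else {                               //internal node: binary operator
        n->op = "+-*"[rand() % 3];
        n->l = gen(depth - 1);
        n->r = gen(depth - 1);
    }
    return n;
}

long eval(Node *n, long p, long a)
{
    switch (n->op) {
        case 'p': return p;
        case 'a': return a;
        case 'c': return n->val;
        case '+': return eval(n->l, p, a) + eval(n->r, p, a);
        case '-': return eval(n->l, p, a) - eval(n->r, p, a);
        default:  return eval(n->l, p, a) * eval(n->r, p, a);
    }
}

int main(void)
{
    srand(1);
    Node *candidate = gen(4);
    printf("%ld\n", eval(candidate, 147098, 152098));  //evaluate on Earth's row
    return 0;
}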