This post describes a toy formal model that helps me think about self-modifying AIs, motivationally stable goal systems, paperclip maximizers and other such things. It's not a new result, just a way of imagining how a computer sitting within a world interacts with the world and with itself. I hope others will find it useful.
(EDIT: it turns out the post does imply a somewhat interesting result. See my exchange with Nesov in the comments.)
A cellular automaton like the Game of Life can contain configurations that work like computers. Such a computer may contain a complete or partial representation of the whole world, including itself via quining. It may also have "actuators", e.g. a pair of guns that can build interesting things using colliding sequences of gliders, depending on what's in the return-value register. The computer's program reasons about its model of the world axiomatically, using a proof checker like in my other posts, with the goal of returning a value whose presence in the return-value register would cause the actuators to affect the world-model in interesting ways (the axiom tying the register's contents to the actuators' behavior is what I call the "coupling axiom"). Then that thing happens in the real world.
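To make the setup concrete, here's a minimal Python sketch of that loop, under heavy simplifications of my own: the axiomatic proof search is replaced by bounded forward simulation of the computer's world-model, and `step`, `write_register`, `candidate_values` and `goal` are hypothetical stand-ins for the real machinery.

```python
def choose_return_value(world_model, step, write_register,
                        candidate_values, goal, horizon=10_000):
    """Pick the first return value whose modeled consequences satisfy the goal.

    world_model      -- the computer's representation of the whole universe,
                        including itself via quining (e.g. a set of live cells)
    step             -- one tick of the automaton's transition rule
    write_register   -- the coupling axiom made concrete: how a given return
                        value, once in the register, changes the configuration
    candidate_values -- enumerable space of possible return values
    goal             -- predicate on the final modeled configuration
    """
    for r in candidate_values:
        state = write_register(world_model, r)  # couple the value into the model
        for _ in range(horizon):
            state = step(state)                 # run the modeled physics forward
        if goal(state):
            return r  # by the coupling axiom, returning r makes this happen for real
    return None       # no goal-achieving value found within the search bound
```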
The first and most obvious example of what the computer could want is suicide. Assuming the "actuators" are flexible enough, the program could go searching for a proof that putting a certain return value in the register eventually causes the universe to become empty (given that at the start it was empty except for the computer). Then it returns that value and halts.
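In the sketch above, the suicide goal is just a predicate on the modeled end state (representing the configuration as a set of live-cell coordinates is my assumption, not part of the post):

```python
def universe_is_empty(state):
    """The suicide goal: no live cells remain anywhere."""
    return len(state) == 0
```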
The second example is paperclipping. If the universe is finite, the program could search for a return value that results in a stable configuration of the entire universe containing as many copies as possible of some still life, e.g. the "block". If the universe is infinite, it could search for patterns with high rates of paperclip production (limited by lightspeed in the automaton). In our world this would create something like the "energy virus" imagined by Sam Hughes - a rare example of a non-smart threat that sounds scarier than nanotech.
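For the finite-universe paperclipper, a toy objective might count isolated 2x2 blocks in the final configuration; the program would then search for the return value whose modeled outcome maximizes this count (or, plugged into the sketch above, meets some target). Again a hypothetical illustration, assuming the set-of-live-cells representation:

```python
def count_blocks(state):
    """Count still-life blocks: 2x2 squares of live cells with no live neighbours."""
    blocks = 0
    for (x, y) in state:
        square = {(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)}
        if square <= state:
            halo = {(i + dx, j + dy) for (i, j) in square
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1)} - square
            if not (halo & state):   # isolated, so it's a genuine block still life
                blocks += 1
    return blocks
```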
The third is sensing. Even though the computer lacks sensors, it will make them if the goal calls for it, so sensing is "emergent" in this formalization. (This part was a surprise for me.) For example, if the computer knows that the universe is empty except for a specified rectangle containing an unknown still-life pattern, and the goal is to move that pattern 100 cells to the right and otherwise cause no effect, the computer will presumably build something that has sensors, but we don't know what kind. Maybe a very slow-moving spaceship that can "smell" the state of the cell directly touching its nose, and stop and resume motion according to an internal program. Or maybe shoot the target to hell with glider-guns and investigate the resulting debris. Or maybe something completely incomprehensible at first glance, which nevertheless manages to get the job done. The Game of Life is unfriendly to explorers because it has no conservation of energy, so putting the wrong pieces together may lead to a huge explosion at lightspeed, but automata with forgiving physics should permit more efficient solutions.
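The goal in this example can still be written as a simple predicate; what makes it interesting is that the program never gets to evaluate it on the real pattern, so it has to pick a single return value that works for every still life that could occupy the rectangle, and that universal quantifier is where the sensing machinery comes from. A hypothetical spelling-out:

```python
def shifted_by_100(initial_pattern, final_state):
    """Goal: whatever still life the rectangle contained ends up moved 100 cells
    to the right, and nothing else is left anywhere in the universe."""
    target = {(x + 100, y) for (x, y) in initial_pattern}
    return final_state == target
```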
You could go on and invent more elaborate examples where the program cares about returning something quickly, or makes itself smarter in Gödel machine style, or reproduces itself... And they all share a curious pattern. Even though the computer can destroy itself without complaint, and even salvage itself for spare parts if matter is scarce, it never seems to exhibit any instability of values. As long as its world-model (or, more realistically, its prior about possible physics) describes the real world well, the thing will maximize what we tell it to, as best it can. This indicates that value stability may depend more on getting the modeling+quining right than on any deep theory of "goal systems" that people seem to want. And, of course, encoding human values in a machine-digestible form for Friendliness looks like an even harder problem.
Well, obviously there has to be some amount of undisturbed calculation at the start - the AI hardly stands a chance if you nuke it while its Python interpreter is still loading up. But the first (and only) action that the AI returns may well result in the construction of another AI that's better shielded from uncertainty about the universe and shares the same utility function. (For example, our initial AI has no concept of "time" - it outputs a result that's optimized to work with the starting configuration of the universe as far as it knows - but that second-gen AI will presumably understand time, and other things besides.) I think that's what will actually happen if you run a machine like the one I described in our world.
ETA. So here's how you get goal stability: you build a piece of software that can find optima of utility functions, feed it your preferred utility function and your prior about the current state of the universe along with a quined description of the machine itself, give it just-barely-powerful-enough actuators, and make it output one single action. Wham, you're done.
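As a signature, that recipe might look something like the following (the names and the `expected_outcome` helper are mine, purely illustrative):

```python
def one_shot_ai(utility, prior, quined_self, candidate_actions, expected_outcome):
    """Output the single action with the highest expected utility, then stop.

    utility          -- your preferred utility function over world states
    prior            -- your prior about the current state of the universe
    quined_self      -- a description of this very machine and its actuators
    expected_outcome -- stand-in for the modeling/proof machinery that predicts
                        what the universe does once this machine takes an action
    """
    return max(candidate_actions,
               key=lambda a: utility(expected_outcome(a, prior, quined_self)))
```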
More generally, you can think of the AI as a one-decision construction that never observes anything and just outputs a program to be run next. It's up to the AI to design a good next program; you, as its designer, only need to make sure that the AI constructs a highly optimized one while running on protected hardware. This way, knowledge of physics, physical protection of the AI's hardware, and self-modification are not your problems, they're the AI's.
The problem with this plan is that your AI needs to be able not just to construct an optimized next program...