This post describes a toy formal model that helps me think about self-modifying AIs, motivationally stable goal systems, paperclip maximizers and other such things. It's not a new result, just a way of imagining how a computer sitting within a world interacts with the world and with itself. I hope others will find it useful.
(EDIT: it turns out the post does imply a somewhat interesting result. See my exchange with Nesov in the comments.)
A cellular automaton like the Game of Life can contain configurations that work like computers. Such a computer may contain a complete or partial representation of the whole world, including itself via quining. It may also have "actuators", e.g. a pair of guns that can build interesting things using colliding sequences of gliders, depending on what's in the return-value register. The computer's program reasons about its model of the world axiomatically, using a proof checker like in my other posts; an axiom connects the contents of the return-value register to what the actuators do (I call this the "coupling axiom"), and the goal is to return a value that causes the actuators to affect the world-model in interesting ways. Then that thing happens in the real world.
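To make the setup concrete, here is a minimal Python sketch of that decision loop. All the names (`step`, `decide`, the `actions` table) are mine, invented for illustration, and the proof search is replaced by brute-force simulation over a finite world-model, which plays the same role in this toy picture: the program looks for a return value whose coupled effect drives the model into a goal state.

```python
from itertools import product

def step(live):
    """One Game of Life step; `live` is a set of (x, y) coordinates of live cells."""
    counts = {}
    for (x, y) in live:
        for dx, dy in product((-1, 0, 1), repeat=2):
            if (dx, dy) != (0, 0):
                counts[(x + dx, y + dy)] = counts.get((x + dx, y + dy), 0) + 1
    # A cell is live next step if it has 3 neighbours, or 2 and is already live.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

def decide(world_model, actions, goal, horizon):
    """Search for a return value whose coupled action reaches a goal state.

    world_model -- set of live cells, the program's model of the whole world
    actions     -- the coupling axiom as a dict: return value -> effect on the model
    goal        -- predicate on configurations
    horizon     -- how many steps of the model to examine

    A real version would search for a *proof* that the goal is reached;
    here we simply simulate the finite model forward instead.
    """
    for value, effect in actions.items():
        state = effect(set(world_model))      # apply the coupling axiom
        for _ in range(horizon):
            if goal(state):
                return value                  # this value gets written to the register
            state = step(state)
        if goal(state):
            return value
    return None
```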
The first and most obvious example of what the computer could want is suicide. Assuming the "actuators" are flexible enough, the program could go searching for a proof that putting a certain return value in the register eventually causes the universe to become empty (assuming that at the beginning it was empty except for the computer). Then it returns that value and halts.
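Continuing the sketch above (and reusing its `decide` and `step`), the suicide goal is just the predicate "no live cells". Everything below is a toy stand-in: a blinker stands in for the computer, and the two candidate actions are invented for illustration; in the real model the actuators would have to build a pattern that cleans up the board, computer included, rather than erase it by fiat.

```python
# Suicide: find a return value after which the universe becomes empty.
computer = {(0, 0), (1, 0), (2, 0)}   # a blinker standing in for the computer

actions = {
    0: lambda s: s,                   # value 0: actuators do nothing
    1: lambda s: set(),               # value 1: actuators wipe the board (toy stand-in)
}

value = decide(computer, actions, goal=lambda s: not s, horizon=10)
print(value)  # 1 -- the only value that leads to an empty universe
```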
The second example is paperclipping. If the universe is finite, the program could search for a return value that results in a stable configuration of the entire universe containing as many copies as possible of some still-life, e.g. the "block". If the universe is infinite, it could search for patterns with high rates of paperclip production (limited by lightspeed in the automaton). In our world this would create something like the "energy virus" imagined by Sam Hughes - a rare example of a non-smart threat that sounds scarier than nanotech.
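For the finite-universe version, only the objective changes: instead of a predicate, score final configurations by how many blocks they contain and pick the return value with the best score. Again a hedged toy that reuses `step` from the first sketch; `count_blocks` naively counts 2x2 squares of live cells with an empty border, and `decide_max` is my own argmax variant of `decide`, not anything from the original post.

```python
def count_blocks(state):
    """Count 'paperclips': isolated 2x2 block still-lifes in a configuration."""
    blocks = 0
    for (x, y) in state:
        square = {(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)}
        ring = {(x + dx, y + dy)
                for dx in range(-1, 3) for dy in range(-1, 3)} - square
        if square <= state and not (ring & state):   # a block with an empty border
            blocks += 1
    return blocks

def decide_max(world_model, actions, score, horizon):
    """Pick the return value whose coupled action yields the best final score."""
    def final_score(effect):
        state = effect(set(world_model))
        for _ in range(horizon):
            state = step(state)
        return score(state)
    return max(actions, key=lambda value: final_score(actions[value]))
```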
The third is sensing. Even though the computer lacks sensors, it will make them if the goal calls for it, so sensing is "emergent" in this formalization. (This part was a surprise for me.) For example, if the computer knows that the universe is empty except for a specified rectangle containing an unknown still-life pattern, and the goal is to move that pattern 100 cells to the right and otherwise cause no effect, the computer will presumably build something that has sensors, but we don't know what kind. Maybe a very slow-moving spaceship that can "smell" the state of the cell directly touching its nose, and stop and resume motion according to an internal program. Or maybe something that shoots the target to hell with glider guns and investigates the resulting debris. Or maybe something completely incomprehensible at first glance, which nevertheless manages to get the job done. The Game of Life is unfriendly to explorers because it has no conservation of energy, so putting the wrong pieces together may lead to a huge explosion spreading at lightspeed, but automata with more forgiving physics should permit more efficient solutions.
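The "move the unknown pattern 100 cells to the right" goal can be written down without ever mentioning sensors: the chosen return value simply has to work for every possible content of the rectangle, and that universal quantifier is what forces sensing to emerge. A hedged sketch follows, reusing `step` from above; `make_world` and the enumeration of possible contents are placeholders of my own, and no claim is made that such a search is feasible in practice.

```python
def reaches(state, goal, horizon):
    """Does the configuration satisfy `goal` within `horizon` steps?"""
    for _ in range(horizon):
        if goal(state):
            return True
        state = step(state)
    return goal(state)

def robust_decide(make_world, possible_contents, actions, goal, horizon):
    """Find a return value that achieves the goal for EVERY possible content
    of the unknown rectangle.  The quantifier over `possible_contents` is what
    forces any successful plan to gather information, i.e. to sense."""
    for value, effect in actions.items():
        if all(reaches(effect(make_world(content)),
                       lambda s, c=content: goal(s, c),
                       horizon)
               for content in possible_contents):
            return value
    return None

def moved_100_right(final_state, content):
    """Goal: the unknown pattern shifted 100 cells to the right, and nothing else."""
    return final_state == {(x + 100, y) for (x, y) in content}
```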
You could go on and invent more elaborate examples where the program cares about returning something quickly, or makes itself smarter in Gödel machine style, or reproduces itself... And they all share a curious pattern. Even though the computer can destroy itself without complaint, and even salvage itself for spare parts if matter is scarce, it never seems to exhibit any instability of values. As long as its world-model (or, more realistically, its prior about possible physics) describes the real world well, the thing will maximize what we tell it to, as best it can. This indicates that value stability may depend more on getting the modeling+quining right than on any deep theory of "goal systems" that people seem to want. And, of course, encoding human values in a machine-digestible form for Friendliness looks like an even harder problem.
More generally, you can think of AI as a one-decision construction that never observes anything and just outputs a program to be run next. It's up to the AI to design a good next program; you, as the AI's designer, only need to make sure that the AI constructs a highly optimized next program while running on protected hardware. This way, knowledge of physics, physical protection of the AI's hardware, and self-modification are not your problem; they're the AI's.
The problem with this plan is that your AI needs to be able to construct not just an optimized next program, but a next program that is optimized enough, and it is you who must make sure that's possible. If you know that your AI is strong enough, then you're done, but you generally don't; and if your AI constructs a slightly suboptimal successor, and that successor does something a little bit stupid as well, and so on, then by the trillionth step the world dies (if not just the AI).
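One way to see why "optimized but not optimized enough" is fatal: suppose, purely as an illustration not taken from this discussion, that each successor preserves only a fraction 1 - eps of the value its parent would have achieved. The loss compounds geometrically, so even a tiny eps wipes out essentially everything long before the trillionth step.

```python
# Toy illustration: compounding suboptimality across successor handoffs.
# If every successor preserves a fraction (1 - eps) of its parent's value,
# the chain retains (1 - eps)**n of the original value after n handoffs.
eps = 1e-6                      # each handoff loses only one part in a million
for n in (10**3, 10**6, 10**9, 10**12):
    print(n, (1 - eps) ** n)
# 1000          -> ~0.999
# 1000000       -> ~0.37
# 1000000000    -> ~0.0  (about e**-1000, underflows)
# 1000000000000 -> 0.0
```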
Which is why it's a good idea not just to say that the AI is to do something optimized, but to have a more detailed idea of what exactly it could do, so that you can make sure it stays headed in the right direction without deviating from the goal. This is the problem of stable goal systems.
Your CA setup does nothing of the sort, and so makes no guarantees. The program is vulnerable not just while it's loading.
All very good points. I completely agree. But I don't yet know how to approach the harder problem you state. If physics is known perfectly and the initial AI uses a proof checker, we're done, because math stays true even after a trillion steps. But unknown physics could always turn out to be malicious in exactly the right way to screw up everything.