This post describes a toy formal model that helps me think about self-modifying AIs, motivationally stable goal systems, paperclip maximizers and other such things. It's not a new result, just a way of imagining how a computer sitting within a world interacts with the world and with itself. I hope others will find it useful.

(EDIT: it turns out the post does imply a somewhat interesting result. See my exchange with Nesov in the comments.)

A cellular automaton like the Game of Life can contain configurations that work like computers. Such a computer may contain a complete or partial representation of the whole world, including itself via quining. It may also have "actuators", e.g. a pair of guns that can build interesting things using colliding sequences of gliders, depending on what's in the return-value register. The computer's program reasons about its model of the world axiomatically, using a proof checker like in my other posts, with the goal of returning a value whose representation in the return-value register would cause the actuators to affect the world-model in interesting ways (I call this the "coupling axiom"). Then that thing happens in the real world.
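
A minimal sketch of that loop in Python, with every unspecified piece left as a hypothetical parameter: `world_axioms` (the quined description of the universe, computer included), `coupling_axiom(v)` (the statement that the return-value register ends up holding v), `goal_formula`, and `find_proof` (a bounded proof searcher) are all stand-ins, not things the post defines:

```python
# A bounded search over possible register values, as described above.
def choose_return_value(candidate_values, world_axioms, goal_formula,
                        coupling_axiom, find_proof, max_proof_length=10**6):
    """Return the first candidate value that provably achieves the goal.

    For each candidate v, ask the proof checker whether the world axioms,
    together with the coupling axiom "the return-value register ends up
    holding v", entail the goal formula. Writing the winning v into the
    register is then exactly what makes the proved statement come true.
    """
    for v in candidate_values:
        axioms = list(world_axioms) + [coupling_axiom(v)]
        if find_proof(axioms, goal_formula, max_proof_length) is not None:
            return v
    return None  # no provably goal-achieving value found within the bound
```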

The first and most obvious example of what the computer could want is suicide. Assuming the "actuators" are flexible enough, the program could go searching for a proof that putting a certain return value in the register eventually causes the universe to become empty (assuming that at the beginning it was empty except for the computer). Then it returns that value and halts.
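
If the proof search were replaced by brute-force simulation of a small finite grid (an assumption added here; the post only talks about proofs), the suicide goal reduces to a predicate like this, where `step` is whatever function advances the automaton by one generation:

```python
def universe_eventually_empty(initial_grid, step, max_steps=10_000):
    """Simulate up to max_steps generations and report whether the finite
    grid ever reaches the all-dead configuration."""
    grid = initial_grid
    for _ in range(max_steps):
        if not any(any(row) for row in grid):
            return True
        grid = step(grid)
    return False  # not observed to die out within the bound
```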

The second example is paperclipping. If the universe is finite, the program could search for a return value that results in a stable configuration for the entire universe with as many copies as possible of some still life, e.g. the "block". If the universe is infinite, it could search for patterns with high rates of paperclip production (limited by the automaton's speed of light). In our world this would create something like the "energy virus" imagined by Sam Hughes - a rare example of a non-smart threat that sounds scarier than nanotech.
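
As a hypothetical illustration of what the finite-universe objective would be scoring, here is a counter for isolated copies of the 2x2 "block"; the real program would prove statements about the final stable configuration rather than count cells directly:

```python
def count_isolated_blocks(grid):
    """Count 2x2 "block" still lifes in a finite Life grid of 0s and 1s.

    A block is counted only when its four cells are alive and every cell
    bordering the 2x2 square is dead, so larger live clusters that merely
    contain a 2x2 square of live cells are not mistaken for blocks.
    """
    rows, cols = len(grid), len(grid[0])
    count = 0
    for r in range(rows - 1):
        for c in range(cols - 1):
            square = {(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)}
            if not all(grid[i][j] for i, j in square):
                continue
            border_dead = all(
                not grid[i][j]
                for i in range(r - 1, r + 3)
                for j in range(c - 1, c + 3)
                if (i, j) not in square and 0 <= i < rows and 0 <= j < cols
            )
            if border_dead:
                count += 1
    return count
```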

The third is sensing. Even though the computer lacks sensors, it will make them if the goal calls for it, so sensing is "emergent" in this formalization. (This part was a surprise for me.) For example, if the computer knows that the universe is empty except for a specified rectangle containing an unknown still-life pattern, and the goal is to move that pattern 100 cells to the right and otherwise cause no effect, the computer will presumably build something that has sensors, but we don't know what kind. Maybe a very slow-moving spaceship that can "smell" the state of the cell directly touching its nose, and stop and resume motion according to an internal program. Or maybe shoot the target to hell with glider-guns and investigate the resulting debris. Or maybe something completely incomprehensible at first glance, which nevertheless manages to get the job done. The Game of Life is unfriendly to explorers because it has no conservation of energy, so putting the wrong pieces together may lead to a huge explosion at lightspeed, but automata with forgiving physics should permit more efficient solutions.

You could go on and invent more elaborate examples where the program cares about returning something quickly, or makes itself smarter in Gödel machine style, or reproduces itself... And they all share a curious pattern. Even though the computer can destroy itself without complaint, and even salvage itself for spare parts if matter is scarce, it never seems to exhibit any instability of values. As long as its world-model (or, more realistically, its prior about possible physics) describes the real world well, the thing will maximize what we tell it to, as best it can. This indicates that value stability may depend more on getting the modeling+quining right than on any deep theory of "goal systems" that people seem to want. And, of course, encoding human values in a machine-digestible form for Friendliness looks like an even harder problem.


Tangent: The concept in the “energy virus” page (I haven't read the rest of that story) is also explored at length in Greg Egan's novel Schild's Ladder.

Some cognitive architectures intrinsically exhibit instability of values (e.g. those where goals compete stochastically for priority), but Omohundro's drive to protect the utility function from modification should prevent a self-modifying AI with a stable architecture from adopting an architecture that is knowably or even possibly unstable.

However, the human cognitive architecture certainly looks to have value instability, and so this will be a problem for any attempt to codify a fixed human-friendly utility function by renormalizing the existing unstable architecture. Omohundro's drive won't automatically work here since the starting point isn't stable. It's also very possible that there's more than one reflectively stable equilibrium that can be obtained starting from the human decision architecture, because of its stochastic or context-dependent aspects.

Omohundro's drive to protect the utility function from modification

The machines in my post have no such drive coded in, and this isn't a problem. Just having a utility function over universes works out fine: if there's an action that makes the universe end up in the desired state, the computer will find it and do it. If there's uncertainty about possible interference, it will be taken into account.
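
A toy gloss (not anything specified in the post) on how such uncertainty can be taken into account: keep a prior over candidate world dynamics and pick the action with the highest expected utility under that prior.

```python
def best_action(actions, dynamics_prior, simulate, utility):
    """Pick the action with the highest expected utility under a prior.

    dynamics_prior: list of (step_function, probability) pairs, one per
        hypothesis about how the world actually evolves.
    simulate(step_function, action): the final world state reached from
        the known starting configuration if `action` is taken.
    utility: maps a final world state to a number.
    """
    def expected_utility(action):
        return sum(p * utility(simulate(step, action))
                   for step, p in dynamics_prior)

    return max(actions, key=expected_utility)
```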

Omohundro's drives are emergent behaviors expected in any sufficiently advanced intelligence, not something that gets coded in at the beginning.

Oh. Thanks.

There is also nothing to say that the eventual stable preference will have anything to do with the initial one, while the post argued about the initial utility. In this sense, Omohundro's argument is not relevant.

Also, consider sensing as a tool for resolving logical uncertainty, where there is no uncertainty about (definition of) the state of the world. Compare with reasoning under uncertainty after forgetting previously obtained logically deduced facts, forgetting previously received observations, and retrieving such forgotten facts from a separate module by "observing" them anew.

And they all share a curious pattern. Even though the computer can destroy itself without complaint, and even salvage itself for spare parts if matter is scarce, it never seems to exhibit any instability of values.

Only by virtue of action being defined as the result of non-disturbed calculation, which basically means that brain surgery is prohibited by the problem statement, or otherwise the agent is mistaken about its own nature (i.e. the agent's decision won't make true the statement that the agent thinks it would make true). Stability of values is more relevant when you consider replacing algorithms and evaluating the expected actions of a different agent.

Well, obviously there has to be some amount of non-disturbed calculation at the start - the AI hardly has much chance if you nuke it while its Python interpreter is still loading up. But the first (and only) action that the AI returns may well result in the construction of another AI that's better shielded from uncertainty about the universe and shares the same utility function. (For example, our initial AI has no concept of "time" - it outputs a result that's optimized to work with the starting configuration of the universe as far as it knows - but that second-gen AI will presumably understand time, and other things besides.) I think that's what will actually happen if you run a machine like the one I described in our world.

ETA. So here's how you get goal stability: you build a piece of software that can find optima of utility functions, feed it your preferred utility function and your prior about the current state of the universe along with a quined description of the machine itself, give it just-barely-powerful-enough actuators, and make it output one single action. Wham, you're done.
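
A sketch of that recipe with every ingredient passed in explicitly; `optimizer` is a stand-in for the hypothetical "piece of software that can find optima of utility functions", and nothing here runs more than once:

```python
def one_shot_agent(optimizer, utility, prior, quined_self_description, actuate):
    """Compute a single optimal action, fire the actuators once, and stop.

    The optimizer receives the utility function, the prior over the current
    state of the universe, and a quined description of this very machine;
    its single output goes to the just-barely-powerful-enough actuators.
    """
    action = optimizer(utility=utility,
                       prior=prior,
                       self_model=quined_self_description)
    actuate(action)
```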

More generally, you can think of the AI as a one-decision construction that never observes anything and just outputs a program that is to be run next. It's up to the AI to design a good next program; you, as the AI's designer, only need to make sure that the AI constructs a highly optimized next program while running on protected hardware. This way, knowledge of physics, physical protection of the AI's hardware, and self-modification are not your problem, they're the AI's.

The problem with this plan is that your AI needs to be able to construct not just an optimized next program, but a next program that is optimized enough, and it is you who must make sure that this is possible. If you know that your AI is strong enough, then you're done, but you generally don't; and if your AI constructs a slightly suboptimal successor, and that successor does something a little bit stupid as well, and so on, then by the trillionth step the world dies (if not just the AI).

Which is why it's a good idea not just to say that the AI is to do something optimized, but to have a more detailed idea of what exactly it could do, so that you can make sure it's headed in the right direction without deviating from the goal. This is the problem of stable goal systems.

Your CA setup does nothing of the sort, and so makes no guarantees. The program is vulnerable not just while it's loading.

All very good points. I completely agree. But I don't yet know how to approach the harder problem you state. If physics is known perfectly and the initial AI uses a proof checker, we're done, because math stays true even after a trillion steps. But unknown physics could always turn out to be malicious in exactly the right way to screw up everything.

If physics is known perfectly and the first generation uses a proof checker to create the second, we're done.

No, since you still run the risk of tiling the future with problem-solving machinery of no terminal value that never actually decides (and kills everyone in the process; it might even come to a good decision afterwards, but it'll be too late for some of us - the Friendly AI of Doom that visibly only cares about Friendliness staying provable and not people, because it's not yet ready to make a Friendly decision).

Also, FAI must already know physics perfectly (with uncertainty parametrized by observations). Problem of induction: observations are always interpreted according to a preexisting cognitive algorithm (more generally, a logical theory). If the AI doesn't have the same theory of the environment as we do, it'll draw different conclusions about the nature of the world than we would, given the same observations, and that's probably not for the best if it's to make optimal decisions according to what we consider real. Just as no moral arguments can persuade an AI to change its values, no observations can persuade an AI to change its idea of reality.

But unknown physics could always turn out to be malicious in exactly the right way to screw up everything.

The presence of uncertainty is rarely a valid argument about the possibility of making an optimal decision. You just make the best decision you can find, given the uncertainty you're dealt. Uncertainty is part of the problem anyway, and can just as well be treated with precision.

Also, an interesting thing happens if, by the whim of its creator, the computer is given the goal of tiling the universe with the most common still life in it, and the universe is possibly infinite. The computer can be expected to send out a slower-than-light "investigation front" to count the still lifes it encounters. Meanwhile it will have more and more space to put into predicting possible threats to its mission. If it is sufficiently advanced, it will notice the possibility of other agents existing, which will naturally lead it to simulating possible interactions with non-still-life, and to the idea that it could be deceived into believing that its "investigation front" has reached the borders of the universe. Etc...

Too smart to optimize.

One year and one level-up (thanks to ai-class.com) after this comment, I'm still in the dark about why the comment above was downvoted.

I'm sorry for whining, but my curiosity got the better of me. Any comments?

It wasn't me, but I suspect the poor grammar didn't help. It makes it hard to understand what you were getting at.

Thank you. It is something I can use for improvement.

Can you point out the flaws? I can see that my sentence structure is overcomplicated, but I don't know how it reads to native English speakers. Foreigner? Dork? Grammar illiterate? I appreciate any feedback. Thanks.

Actually, a bit of all three. The one you can control the most is probably "dork", which unpacks as "someone with complex ideas who is too impatient/show-offy to explain their idiosyncratic jargon".

I'm a native English speaker, and I know that I still frequently sound "dorky" in that sense when I try to be too succinct.

It is valuable information, thanks. I underestimated relative weight of communication style in the feedback I got.

Also, an interesting thing happens if, by the whim of its creator, the computer is given the goal of tiling the universe with the most common still life in it, and the universe is possibly infinite.

Respectfully, I don't know what this sentence means. In particular, I don't know what "most common still life" meant. That made it difficult to decipher the rest of the comment.

ETA: Thanks to the comment below, I understand a little better, but now I'm not sure what motivates invoking the possibility of other agents, given that the discussion was about proving Friendliness.

In a cellular automaton, a still life is a pattern of cells which stays unchanged after each iteration.
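
For concreteness, here is that definition checked on a finite grid, with cells outside the grid treated as dead:

```python
def life_step(grid):
    """One Game of Life generation on a finite grid (outside cells are dead)."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbours(r, c):
        return sum(grid[i][j]
                   for i in range(max(r - 1, 0), min(r + 2, rows))
                   for j in range(max(c - 1, 0), min(c + 2, cols))
                   if (i, j) != (r, c))

    # A live cell survives with 2 or 3 live neighbours; a dead cell is born with 3.
    return [[1 if live_neighbours(r, c) == 3
             or (grid[r][c] and live_neighbours(r, c) == 2) else 0
             for c in range(cols)]
            for r in range(rows)]

def is_still_life(grid):
    """A still life is a pattern that is its own successor."""
    return life_step(grid) == grid

# The 2x2 "block", padded with a dead border, is the simplest example:
block = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
assert is_still_life(block)
```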


Since you asked, your downvoted comment seems like word salad to me, I don't understand sensible reasons that would motivate it.

[This comment is no longer endorsed by its author]

You seem to assume that the world is indifferent to the agent's goals. But if there's another agent, that may not be the case.

Let G1 be "tile the universe with still life" and G2 be "tile the upper half of the universe with still life".

If agent A has goal G1, it will provably be destroyed by agent B; if A changes its goal to G2, then B will not interfere.

A and B have full information about the world's state.

Should A modify its goal?

Edit: Goal stability != value stability. So my point isn't valid.

You seem to assume that the world is indifferent to the agent's goals.

No, I don't need that assumption. What conclusion in the post depends on it, in your opinion?

It's an error on my part: I assumed that goal stability equals value stability. But then it looks like it can be impossible to reconstruct an agent's values given only its current state.

I'm afraid I still don't understand your reasoning. How are "goals" different from "values", in your terms?

A goal is what an agent optimizes for at a given point in time. A value is the initial goal of the agent (in your toy model at least).

In my root comment it seems optimal for agent A to self-modify into agent A', which optimizes for G2; thus agent A' succeeds in optimizing the world according to its values (the goal of agent A). But the original goal no longer influences its optimization procedure. Thus if we analyze agent A' (without knowledge of the world's history), we'll be unable to infer its values (its original goal).

Yes, that seems to be correct.

And they all share a curious pattern. Even though the computer can destroy itself without complaint, and even salvage itself for spare parts if matter is scarce, it never seems to exhibit any instability of values.

But aren't we talking about a thought experiment here?

We are. I wanted to show that stable goal systems aren't difficult to build, even in a setting that allows all sorts of weird actions like self-modification. You can just see at a glance whether a given system is stable: my examples obviously are, and Mitchell's stochastic example obviously isn't.

Note: These arguments stem from a re-reading of the OP, they're not directly related to my initial comment.

If it were easy, we could do it. This shows that if there is a buildable Life mechanism that will perform the task, a Life computer can find it and build it.

The computer never has instability of values because it never modifies itself until it has a proven plan, but finding the plan and the proof is left as an exercise for the computer.

The built mechanism might be very complicated, the proof of its behavior might be huge, and the computer might need to think for a looong time to get it. (which isn't a problem if the goal is an operation on a stable universe and the computer is big enough.)

If you want to do things faster or smaller, your program needs to be smarter, and perhaps its creations need to be smarter. You can still pass them through a proof checker so you know they are goal-friendly, but that doesn't tell you or the program how to find things that are goal-friendly in a time/space-efficient manner.

You could end up with a paperclipper running inside your program, somehow forced to pursue your given utility function, which to me looks a lot like sandboxing.

Also, in each case listed the resulting mechanism probably wouldn't itself have any general intelligence. The goal would need to be pretty complex to require it. So in these cases the program won't be solving the stable goal problem either, just building machines.