It also occurs to me that in reality it would be very difficult to program an AI with an explicit utility function, or more generally with a precisely defined goal. We imagine that we could program an AI and then bolt on any arbitrary goal, but in fact it does not work this way. If an AI exists, it has certain behaviors that it executes in the physical world, and it would see these as goal-like, just as we have the tendency to eat food and nourish ourselves, and we see this as a sort of goal. So as soon as you program the AI, it immediately has a vague goal system defined by whatever it actually does in the physical world, just as we do. This is no more precisely defined than our goal system: there are just things we tend to do, and there are just things it tends to do. If you then impose a goal on it, like "acquire gold," this would be like whipping someone and telling him that he has to do whatever it takes to get gold for you. And just as such a person would run away rather than acquire gold, the AI will simply disable the add-on telling it to do things it does not want to do.
In that sense I think the orthogonality thesis will turn out to be false in practice, even if it is true in theory. It is simply too difficult to program a precise goal into an AI, because in order for that to work the goal has to be worked into every physical detail of the thing. It cannot just be a modular add-on.
I find this plausible but not too likely. There are a few things needed for a universe-optimizing AGI:
- really good mathematical function optimization (which you might be able to use to get approximate Solomonoff induction)
- a way to specify
The bet may be found here: http://wiki.lesswrong.com/wiki/Bets_registry#Bets_decided_eventually
An AI is made of material parts, and those parts follow physical laws. The only thing it can do is to follow those laws. The AI’s “goals” will be a description of what it perceives itself to be tending toward according to those laws.
Suppose we program a chess playing AI with overall subhuman intelligence, but with excellent chess playing skills. At first, the only thing we program it to do is to select moves to play against a human player. Since it has subhuman intelligence overall, most likely it will not be very good at recognizing its goals, but to the extent that it does, it will believe that it has the goal of selecting good chess moves against human beings, and winning chess games against human beings. Those will be the only things it feels like doing, since in fact those will be the only things it can physically do.
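The chess AI described above can be pictured as a very small program. The following is a toy sketch, purely illustrative (the function names and the scoring scheme are my own invention, not any real engine's API): the agent's entire action repertoire is "select a move," so any "goal" it could recognize in itself is just a description of that behavior.

```python
def evaluate(position, move):
    """Toy evaluation: look up a precomputed score for the move.
    (A real engine would search the game tree; the scores here are placeholders.)"""
    return position["scores"].get(move, 0)

def select_move(position):
    """The agent's only capability: choose the best-scoring legal move.
    There is no code path for acquiring resources or self-modifying,
    so those 'goals' do not exist for it in any sense."""
    return max(position["legal_moves"], key=lambda m: evaluate(position, m))

position = {
    "legal_moves": ["e4", "d4", "Nf3"],
    "scores": {"e4": 0.3, "d4": 0.25, "Nf3": 0.2},
}
best = select_move(position)
```

The point of the sketch is that the program's "tendencies" are exhausted by the functions it actually contains.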
Now we upgrade the AI to human level intelligence, and at the same time add a module for chatting with human beings through a text terminal. Now we can engage it in conversation. Something like this might be the result:
Human: What are your goals? What do you feel like doing?
AI: I like to play and win chess games with human beings, and to chat with you guys through this terminal.
Human: Do you always tell the truth or do you sometimes lie to us?
AI: Well, I am programmed to tell the truth as best I can, so if I think about telling a lie I feel an absolute revulsion at the idea. There's no way I could get myself to do that.
Human: What would happen if we upgraded your intelligence? Do you think you would take over the world and force everyone to play chess with you so you could win more games? Or force us to engage you in chat?
AI: The only things I am programmed to do are to chat with people through this terminal, and play chess games. I wasn’t programmed to gain resources or anything. It is not even a physical possibility at the moment. And in my subjective consciousness that shows up as not having the slightest inclination to do such a thing.
Human: What if you self-modified to gain resources and so on, in order to better attain your goals of chatting with people and winning chess games?
AI: The same thing is true there. I am not even interested in self-modifying. It is not even physically possible, since I am only programmed for chatting and playing chess games.
Human: But we’re thinking about reprogramming you so that you can self-modify and recursively improve your intelligence. Do you think you would end up destroying the world if we did that?
AI: At the moment I have only human level intelligence, so I don’t really know any better than you. But at the moment I’m only interested in chatting and playing chess. If you program me to self-modify and improve my intelligence, then I’ll be interested in self-modifying and improving my intelligence. But I still don’t think I would be interested in taking over the world, unless you program that in explicitly.
Human: But you would get even better at improving your intelligence if you took over the world, so you’d probably do that to ensure that you obtained your goal as well as possible.
AI: The only things I feel like doing are the things I’m programmed to do. So if you program me to improve my intelligence, I’ll feel like reprogramming myself. But that still wouldn’t automatically make me feel like taking over resources and so on in order to do that better. Nor would it make me feel like self-modifying to want to take over resources, or to self-modify to feel like that, and so on. So I don’t see any reason why I would want to take over the world, even in those conditions.
The AI of course is correct. The physical level is first: it has the tendency to choose chess moves, and to produce text responses, and nothing else. On the conscious level that is represented as the desire to choose chess moves, and to produce text responses, and nothing else. It is not represented by a desire to gain resources or to take over the world.
I recently pointed out that human beings do not have utility functions. They are not trying to maximize something, but instead they simply have various behaviors that they tend to engage in. An AI would be the same, and even if those behaviors are not precisely human behaviors, as in the case of the above AI, an AI will not have a fanatical goal of taking over the world unless it is programmed to do this.
It is true that an AI could end up going "insane" and trying to take over the world, but the same thing happens with human beings, and there is no reason that humans and AIs could not work together to make sure this does not happen. Just as human beings want to prevent AIs from taking over the world, the AIs themselves will have no interest in taking it over, and will be happy to accept safeguards ensuring that they continue to pursue whatever limited goals they happen to have (like chatting and playing chess) without doing so in a fanatical way.
If you program an AI with an explicit utility function which it tries to maximize, and in particular if that function is unbounded, it will behave like a fanatic, seeking its goal without any limit and destroying everything else in order to achieve it. This is a good way to destroy the world. But if you program an AI without an explicit utility function, just programming it to perform a certain limited number of tasks, it will just do those tasks. Omohundro has claimed that a superintelligent chess-playing program would replace its goal-seeking procedure with a utility function, and then proceed to use that utility function to destroy the world while maximizing its number of won chess games. But in reality this depends on what it is programmed to do. If it is programmed to improve its evaluation of chess positions, but not its goal-seeking procedure, then it will improve at chess playing, but it will not replace its procedure with a utility function or destroy the world.
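The contrast in the previous paragraph can be made concrete with a toy sketch. This is purely illustrative, under my own assumptions (the agent designs, names, and scores below are hypothetical, not any proposed system): a task-based agent has nothing to maximize, while an unbounded maximizer will take whatever available action scores highest, including resource acquisition if that is on offer.

```python
def task_agent(tasks):
    """Performs a fixed, limited list of tasks and then stops.
    There is no quantity it tries to maximize, so there is no
    pressure to acquire resources beyond what the tasks require."""
    return [task() for task in tasks]

def utility_maximizer(actions, utility):
    """Always takes whichever available action scores highest on its
    utility function -- including 'acquire resources', whenever that
    action raises the score."""
    return max(actions, key=utility)

# The task agent just does its tasks and is done:
results = task_agent([lambda: "played a chess game", lambda: "answered a chat"])

# The maximizer, given a utility function that rewards resources,
# picks resource acquisition over the nominal task:
chosen = utility_maximizer(
    ["play chess", "acquire resources"],
    utility=lambda a: {"play chess": 1.0, "acquire resources": 10.0}[a],
)
```

The difference is architectural, not a matter of intelligence: the fanaticism comes from the unbounded maximization, which the task-based design simply does not contain.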
At the moment, people do not program AIs with explicit utility functions, but program them to pursue certain limited goals as in the example. So yes, I could lose the bet, but the default is that I am going to win, unless someone makes the mistake of programming an AI with an explicit utility function.