In the previous post, I argued that simply knowing that an AI system is superintelligent does not imply that it must be goal-directed. However, there are many other arguments that suggest that AI systems will or should be goal-directed, which I will discuss in this post.
Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent -- I would classify this as “Agent AI that is not goal-directed”. (But see this comment thread for discussion.)
These arguments also have different implications from the argument that superintelligent AI must be goal-directed due to coherence arguments. Suppose you believe all of the following:
- Any of the arguments in this post.
- Superintelligent AI is not required to be goal-directed, as I argued in the last post.
- Goal-directed agents cause catastrophe by default.
Then you could try to create alternative designs for AI systems such that they can do the things that goal-directed agents can do without themselves being goal-directed. You could also try to persuade AI researchers of these facts, so that they don’t build goal-directed systems.
Economic efficiency: goal-directed humans
Humans want to build powerful AI systems in order to help them achieve their goals -- it seems quite clear that humans are at least partially goal-directed. As a result, it seems natural that they would build AI systems that are also goal-directed.
This is really an argument that the system comprising the human and AI agent should be directed towards some goal. The AI agent by itself need not be goal-directed as long as we get goal-directed behavior when combined with a human operator. However, in the situation where the AI agent is much more intelligent than the human, it is probably best to delegate most or all decisions to the agent, and so the agent could still look mostly goal-directed.
Even so, you could imagine that the small part of the work that the human continues to do allows the agent to not be goal-directed, especially over long horizons. For example, perhaps the human decides what the agent should do each day, and the agent executes the instruction, which involves planning over the course of a day, but no longer. (I am not arguing that this is safe; on the contrary, having very powerful optimization over the course of a day seems probably unsafe.) This could be extremely powerful without the AI being goal-directed over the long term.
Another example would be a corrigible agent, which could be extremely powerful while not being goal-directed over the long term. (Though the meanings of “goal-directed” and “corrigible” are sufficiently fuzzy that this is not obvious and depends on the definitions we settle on for each.)
Economic efficiency: beyond human performance
Another benefit of goal-directed behavior is that it allows us to find novel ways of achieving our goals that we may not have thought of, such as AlphaGo’s move 37. Goal-directed behavior is one of the few methods we know of that allow AI systems to exceed human performance.
I think this is a good argument for goal-directed behavior, but given the problems of goal-directed behavior I think it’s worth searching for alternatives, such as the two examples in the previous section (optimizing over a day, and corrigibility). Alternatively, we could learn human reasoning, and execute it for a longer subjective time than humans would, in order to make better decisions. Or we could have systems that remain uncertain about the goal and clarify what they should do when there are multiple very different options (though this has its own problems).
Current progress in reinforcement learning
If we had to guess today which paradigm would lead to AI systems that can exceed human performance, I would guess reinforcement learning (RL). In RL, we have a reward function and we seek to choose actions that maximize the expected sum of discounted rewards. This sounds a lot like an agent that is searching over actions for the best one according to a measure of goodness (the reward function [1]), which I said previously is a goal-directed agent. And the math behind RL says that the agent should be trying to maximize its reward for the rest of time, which makes it long-term [2].
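To make the RL framing above concrete, here is a minimal sketch in Python. It is only an illustration under assumptions not in the post: the function names `discounted_return` and `greedy_action`, the toy reward sequence, and the hand-written value table are all hypothetical, and real RL algorithms estimate these values from experience rather than reading them from a dictionary.

```python
from typing import Dict, List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Return sum over t of gamma**t * rewards[t] for one reward sequence."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def greedy_action(q_values: Dict[str, float]) -> str:
    """Pick the action with the highest estimated value.

    This is the "search over actions for the best one" step; the
    estimates stand in for a learned Q function (see footnote [1]).
    """
    return max(q_values, key=q_values.get)


if __name__ == "__main__":
    # Hypothetical toy data: a reward arrives only on the third step.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
    # Hypothetical value estimates for two actions.
    print(greedy_action({"left": 0.1, "right": 0.7}))
```

The sketch only shows the objective and the action-selection rule. It says nothing about how the value estimates are learned, which is where the arguments in the following paragraphs come in.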
That said, current RL agents learn to replay behavior that in their past experience worked well, and typically do not generalize outside of the training distribution. This does not seem like a search over actions to find ones that are the best. In particular, you shouldn’t expect a treacherous turn, since the whole point of a treacherous turn is that you don’t see it coming because it never happened before.
In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term. Of course, many tasks would have very long episodes, such as being a CEO. The vanilla deep RL approach here would be to specify a reward function for how good a CEO you are, and then try many different ways of being a CEO and learn from experience. This requires you to collect many full episodes of being a CEO, which would be extremely time-consuming.
Perhaps with enough advances in model-based deep RL we could train the model on partial trajectories and that would be enough, since it could generalize to full trajectories. I think this is a tenable position, though I personally don’t expect it to work since it relies on our model generalizing well, which seems unlikely even with future research.
These arguments lead me to believe that we’ll probably have to do something that is not vanilla deep RL in order to train an AI system that can be a CEO, and that thing may not be goal-directed.
Overall, it is certainly possible that improved RL agents will look like dangerous long-term goal-directed agents, but this does not seem to be the case today and there seem to be serious difficulties in scaling current algorithms to superintelligent AI systems that can optimize over the long term. (I’m not arguing for long timelines here, since I wouldn’t be surprised if we figured out some way that wasn’t vanilla deep RL to optimize over the long term, but that method need not be goal-directed.)
Existing intelligent agents are goal-directed
So far, humans and perhaps animals are the only examples of generally intelligent agents that we know of, and they seem to be quite goal-directed. This is some evidence that we should expect intelligent agents that we build to also be goal-directed.
Ultimately we are observing a correlation between two things with sample size 1, which is really not much evidence at all. If you believe that many animals are also intelligent and goal-directed, then perhaps the sample size is larger, since there are intelligent animals with very different evolutionary histories and neural architectures (e.g., octopuses).
However, this is specifically about agents that were created by evolution, which did a relatively stupid blind search over a large space, and we could use a different method to develop AI systems. So this argument makes me more wary of creating AI systems using evolutionary searches over large spaces, but it doesn’t make me much more confident that all good AI systems must be goal-directed.
Interpretability
Another argument for building a goal-directed agent is that it allows us to predict what it’s going to do in novel circumstances. While you may not be able to predict the specific actions it will take, you can predict some features of the final world state, in the same way that if I were to play Magnus Carlsen at chess, I can’t predict how he will play, but I can predict that he will win.
I do not understand the intent behind this argument. It seems as though faced with the negative results that suggest that goal-directed behavior tends to cause catastrophic outcomes, we’re arguing that it’s a good idea to build a goal-directed agent so that we can more easily predict that it’s going to cause catastrophe.
I also think that we would typically be able to predict significantly more about what any AI system we actually build will do than we could by modeling it as trying to achieve some goal. This is because “agent seeking a particular goal” is one of the simplest models we can build, and with any system we have more information on, we start refining the model to make it better.
Summary
Overall, I think there are good reasons to think that “by default” we would develop goal-directed AI systems, because the things we want AIs to do can be easily phrased as goals, and because the stated goal of reinforcement learning is to build goal-directed agents (although they do not look like goal-directed agents today). As a result, it seems important to figure out ways to get the powerful capabilities of goal-directed agents through agents that are not themselves goal-directed. In particular, this suggests that we will need to figure out ways to build AI systems that do not involve specifying a utility function that the AI should optimize, or even learning a utility function that the AI then optimizes.
[1] Technically, actions are chosen according to the Q function, but the distinction isn’t important here.
[2] Discounting does cause us to prioritize short-term rewards over long-term ones. On the other hand, discounting seems mostly like a hack to make the math not spit out infinities and to make learning more stable. On the third hand, infinite-horizon MDPs with undiscounted reward aren't solvable unless you almost surely enter an absorbing state. So discounting complicates the picture, but not in a particularly interesting way, and I don’t want to rest an argument against long-term goal-directed behavior on the presence of discounting.
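As a rough worked version of the point about infinities in footnote [2], here is a sketch under an assumption the post does not state: that per-step rewards are bounded in magnitude by some constant, written here as $R_{\max}$ (a symbol introduced only for this illustration), with $\gamma$ the discount factor.

```latex
% Discounted case: the geometric series keeps the return finite.
\left| \sum_{t=0}^{\infty} \gamma^{t} r_t \right|
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma},
  \qquad 0 \le \gamma < 1.

% Undiscounted case (\gamma = 1) with a constant reward r > 0:
% the partial sums grow without bound.
\sum_{t=0}^{T-1} r = T\,r \;\longrightarrow\; \infty
  \quad \text{as } T \to \infty.
```

The factor $1/(1-\gamma)$ is often read as a rough effective horizon. For example, $\gamma = 0.99$ corresponds to about 100 steps, which is one way to see how discounting shortens the timescale the agent cares about without removing long-term behavior entirely.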
Here are a few more reasons for humans to build goal-directed agents:
Goal directed AI is a way to defend against value drift/corruption/manipulation. People might be forced to build goal directed agents if they can't figure out another way to do that.
Goal directed AI is a way to cooperate and thereby increase economic efficiency and/or military competitiveness. (A group of people can build a goal directed agent that they can verify represents an aggregation of their values.) People might be forced to build or transfer control to goal directed agents in order to participate in such cooperation to remain competitive, unless they can figure out another way to cooperate that is as efficient as this.
Goal directed AI is a way to address other human safety problems. People might trust an AI with explicit and verifiable values more than an AI that is controlled by a distant stranger.
As I understand it, the first one is an argument for value lock-in, and the third one is an argument for interpretability. Does that seem right to you?