Goal-directedness is the term the AI Safety community uses to point at a specific property: following a goal. The term comes from Rohin Shah's post in his sequence, but the underlying intuition pervades many safety issues and current AI approaches. Yet goal-directedness still lacks a formal definition, or even a decomposition into more-or-less formal subcomponents.
The sort of definition we're looking for depends on which questions we want to answer about goal-directed systems. Rohin asks two main questions in his posts:
- Are non-goal-directed or less goal-directed systems inherently safer than fully goal-directed ones?
- Can non-goal-directed or less goal-directed systems be competitive with fully goal-directed ones?
Answering these will also answer the really important meta-question: should we put resources into non-goal-directed approaches to AGI?
Notice that both questions above are about predicting properties of the system from its goal-directedness. The properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn't, two systems with identical behavior (and thus identical safety and competitiveness) could have different goal-directedness, breaking the predictive link.
Strictly speaking, this argument assumes that our predictor is injective: it sends different "levels" of goal-directedness to different values of the properties. I agree with this intuition, given how much safety and competitiveness seem to vary with goal-directedness, but I wanted to make the assumption explicit.
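To make this explicit, here is a rough formalization (the notation is introduced just for this paragraph, and nothing hinges on it). Write $B(s)$ for the complete behavior of a system $s$, $G(s)$ for its goal-directedness, and $P(s)$ for the properties we care about (safety, competitiveness). The two assumptions are

$$P(s) = f(G(s)) \text{ with } f \text{ injective}, \qquad \text{and} \qquad P(s) = h(B(s)).$$

If two systems behave identically, $B(s_1) = B(s_2)$, then $P(s_1) = P(s_2)$, so $f(G(s_1)) = f(G(s_2))$, and injectivity of $f$ gives $G(s_1) = G(s_2)$. Under these assumptions, goal-directedness is determined by behavior alone.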
Reiterating the point of the post: goal-directedness is a property of behavior, not of internal structure. By this I mean that, given the complete behavior of a system over all environments, its goal-directedness is independent of what's inside the system. Or equivalently: if two systems always behave in the same way, they have the same goal-directedness, regardless of whether one contains a big lookup table and the other a homunculus.
This is not particularly original: Dennett's intentional stance pretty much says the same thing. (The Intentional Stance, p 15)
> Then I will argue that any object -- or as I shall say, any system -- whose behavior is well predicted by this strategy [considering it as moving towards a goal] is in the fullest sense of the word a believer. What it is to be a true believer is to be an intentional system, a system whose behavior is reliably and voluminously predictable via the intentional strategy.
Why write a post about it, then? I'm basically saying that our definition should depend only on observable behavior, which is pretty obvious, isn't it?
Well, "goal" is a very loaded term. It is one of the mental states we readily attribute to human beings and other agents, but are reluctant to grant to anything else. See how I never used the word "agent" before in this post, preferring "system" instead? That was me trying to limit this instinctive thinking about what's inside. And here is why I think this post is not completely useless: when looking for a definition of goal-directedness, the first intuition is to look at the internal structure. It seems obvious that goals should live somewhere "inside" the system, and thus that what really matters is the internal structure.
But as we saw above, goal-directedness should probably depend only on the complete behavior of the system. That is not to say that the internal structure is unimportant or useless here. On the contrary, this structure, in the form of source code for example, is usually the only thing we have at our disposal. It serves to compute goal-directedness, not to define it.
We thus have this split:
- Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
- Computing goal-directedness: depends on the internal structure, and more specifically on what information about the complete behavior can be extracted from this structure (see the toy sketch after this list).
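To make the split concrete, here is a toy sketch in Python. Everything in it (the names, the types, the string encoding of behaviors) is made up for illustration; it is not a proposal for an actual measure.

```python
from typing import Callable, Iterable

# Toy encoding: a system's complete behavior is a policy from observation
# histories to actions. The *definition* of goal-directedness operates on
# this idealized object, never on what is inside the system.
Behavior = Callable[[str], str]

def goal_directedness(behavior: Behavior, environments: Iterable[str]) -> float:
    """Definition side: assumes full access to the behavior over all
    environments (and, implicitly, unlimited compute). Placeholder only:
    this post does not propose a concrete measure."""
    raise NotImplementedError

def behavior_of(source_code: str) -> Behavior:
    """Computation side: in practice we only have the internal structure
    (source code, weights). Interpretability and formal methods would try
    to recover an approximation of the complete behavior from it.
    Placeholder only."""
    raise NotImplementedError

def estimate_goal_directedness(source_code: str, environments: Iterable[str]) -> float:
    """The internal structure serves to *compute* goal-directedness, not to
    define it: we go structure -> (approximate) behavior -> measure."""
    return goal_directedness(behavior_of(source_code), environments)
```

The only point of the sketch is the type signatures: the definition takes a behavior as input, and the internal structure shows up solely as a means of (approximately) recovering that behavior.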
What I see as a mistake here, a mistake I personally made, is to look for the definition in the internal structure: to look at some neural net, or some C program, and try to find where the goals are and what makes the program follow them. Instead, I think we should define and formalize goal-directedness in the ideal context where the full behavior of the system is known, and then use interpretability and formal methods to extract what is relevant to this definition from the internal structure.
Thanks to Jérémy Perret for feedback on the writing, and to Joe Collman, Michele Campolo and Sabrina Tang for feedback on the idea.
I have two large objections to this.
First, the two questions considered are both questions about goal-directed AI. As I see it, the most important reason to think about goal directedness is not that AI might be goal directed, but that humans might be goal directed. The whole point of alignment is to build AI which does what humans want; the entire concept of "what humans want" has goal directedness built into it. We need a model in which it makes sense for humans to want things, in order to even formulate the question "will this AI do what humans want?". That's why goal directedness matters.
If we think about goal-directedness in terms of figuring out what humans want, then it's much less clear that it should be behaviorally defined.
Second, think about the implied logic in these two sentences from the post:
> The properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system.
Here's an analogous argument, to make the problem more obvious: I want to predict whether a system is foo based on whether it is bar. Foo-ness depends only on how big the system is, not on how red it is. Thus it makes sense to consider that bar-ness should also only depend on how big the system is, not on how red it is.
If I were to sketch out a causal graph for the implied model behind this argument, it would have an arrow/path big-ness -> foo-ness, with no other inputs to foo-ness. The claim "therefore bar-ness should also depend only on how big the system is" effectively assumes that bar-ness is on the path between big-ness and foo-ness. Assuming bar-ness is on that path, it shouldn't have a side input from red-ness, because then red-ness would be upstream of foo-ness. But that's not the only possibility; in the goal-directedness case it makes more sense for bar-ness to be upstream of big-ness - i.e. goal-directedness determines behavior, not the other way around.
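Spelling out the two candidate graphs under the intended mapping (bar-ness as goal-directedness, big-ness as behavior, foo-ness as the safety/competitiveness properties):

$$\text{post's implicit model:} \quad \text{behavior} \to \text{goal-directedness} \to \text{properties}$$

$$\text{alternative:} \quad \text{goal-directedness} \to \text{behavior} \to \text{properties}$$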
Anyway, moving on...
I disagree with this. See Alignment as Translation: goal-directedness is a sufficient condition for a misaligned AI to be dangerous, not a necessary condition. AI can be dangerous in exactly the same way as nukes: it can make big irreversible changes too quickly to stop. This relates to the previous objection as well: it's the behavior that makes AI dangerous, and goal-directedness is one possible cause of dangerous behavior, not the only possible cause. Goal-directedness causes behavior, not vice-versa.
Overall, I'm quite open to the notion that goal-directedness must be defined behaviorally, but the arguments in this post do not lend any significant support to that notion.
You make a good point. Actually, I think I answered a bit too fast, maybe because I was on the defensive (given the content of your comment). We probably are indeed trying to capture the intuitive notion of goal-directedness, in the sense that many of our examples, use-cases, intuitions and counter-examples draw on humans.
What I reacted against is a focus solely on humans. I do think that goal-directedness should capture/explain humans, but I also believe that studying simpler settings/systems will provide many insights that would be lost in the complexity of human...