Goal-directedness is the term used by the AI Safety community to point to a specific property: following a goal. It comes from Rohin Shah's post in his sequence, but the intuition pervades many safety issues and current AI approaches. Yet it lacks a formal definition, or even a decomposition into more or less formal subcomponents.
The questions we want to answer about goal-directed systems determine the sort of definition we're looking for. There are two main questions that Rohin asks in his posts:
- Are non-goal-directed systems or less goal-directed ones inherently safer than fully goal-directed ones?
- Can non-goal-directed systems or less goal-directed ones be competitive with fully goal-directed ones?
Answering these will also answer the really important meta-question: should we put resources into non-goal-directed approaches to AGI?
Notice that both questions above are about predicting properties of the system based on its goal-directedness. These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.
Actually, this assumes that our predictor is injective: it sends different "levels" of goal-directedness to different values of the properties. I agree with this intuition, given how much performance and safety issues seem to vary according to goal-directedness. But I wanted to make it explicit.
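To make the shape of this argument explicit, here is a minimal sketch of it in symbols. All the names are notation introduced for this sketch and do not appear in Rohin's posts: $\mathrm{beh}(S)$ is the complete behavior of a system $S$, $\mathrm{prop}(S)$ stands for the properties we care about (safety, competitiveness), $\mathrm{gd}(S)$ is its goal-directedness, $f$ is whatever maps behavior to properties, and $P$ is the predictor. The sketch also assumes, for simplicity, that goal-directedness alone determines the prediction.

```latex
\begin{align*}
\mathrm{prop}(S) &= f(\mathrm{beh}(S))
  && \text{(properties depend only on behavior)} \\
\mathrm{prop}(S) &= P(\mathrm{gd}(S))
  && \text{(goal-directedness predicts the properties)} \\
\mathrm{beh}(S_1) = \mathrm{beh}(S_2)
  &\implies P(\mathrm{gd}(S_1)) = P(\mathrm{gd}(S_2))
  && \text{(combine the two lines above)} \\
  &\implies \mathrm{gd}(S_1) = \mathrm{gd}(S_2)
  && \text{(injectivity of } P\text{)}
\end{align*}
```

The last step is the only place where injectivity is used: without it, two systems with identical behavior could still be assigned different levels of goal-directedness.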
Reiterating the point of the post: goal-directedness is a property of behavior, not internal structure. By this I mean that given the complete behavior of a system over all environments, goal-directedness is independent of what's inside the system. Or equivalently, if two systems always behave in the same way, their goal-directedness is the same, regardless of whether one contains a big lookup table and the other a homunculus.
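As a toy illustration of the lookup table case, here is a minimal Python sketch. Everything in it is an assumption made for the example: the tiny line world, the two policies, and especially `toy_goal_directedness`, which is a hypothetical stand-in for a behavioral measure and not a proposed definition. The only point is that any measure taking the complete behavior as its sole input cannot tell the two systems apart.

```python
from typing import Callable, Dict, List, Tuple

Observation = Tuple[int, int]  # (position, target) on a 5-cell line
Action = int                   # -1, 0, or +1
Policy = Callable[[Observation], Action]

# Every situation the systems can face in this toy world.
OBSERVATIONS: List[Observation] = [(p, t) for p in range(5) for t in range(5)]


def computed_policy(obs: Observation) -> Action:
    """Works out the step toward the target on the fly."""
    position, target = obs
    return (target > position) - (target < position)


# A system with a different inside: a stored table with the same input-output map.
LOOKUP_TABLE: Dict[Observation, Action] = {
    obs: computed_policy(obs) for obs in OBSERVATIONS
}


def table_policy(obs: Observation) -> Action:
    """Only reads the stored answer; no computation about targets happens here."""
    return LOOKUP_TABLE[obs]


def always_right(obs: Observation) -> Action:
    """A system that ignores the target entirely."""
    return 1


def complete_behavior(policy: Policy) -> Dict[Observation, Action]:
    """The full behavior: the action taken in every possible situation."""
    return {obs: policy(obs) for obs in OBSERVATIONS}


def toy_goal_directedness(behavior: Dict[Observation, Action]) -> float:
    """Hypothetical measure: fraction of situations where the action gets closer
    to the target, or stays put when already on it. It sees only behavior."""
    def is_toward(obs: Observation, action: Action) -> bool:
        position, target = obs
        if position == target:
            return action == 0
        return abs(target - (position + action)) < abs(target - position)

    hits = sum(1 for obs, action in behavior.items() if is_toward(obs, action))
    return hits / len(behavior)


same = complete_behavior(computed_policy) == complete_behavior(table_policy)
print(same)                                                      # True
print(toy_goal_directedness(complete_behavior(table_policy)))    # 1.0
print(toy_goal_directedness(complete_behavior(always_right)))    # 0.4
```

Because `toy_goal_directedness` never receives the policy itself, the equality of the first two scores does not depend on which measure is chosen, only on the measure being a function of behavior.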
This is not particularly original: Dennett's intentional stance says pretty much the same thing (The Intentional Stance, p. 15):
Then I will argue that any object -- or as I shall say, any system -- whose behavior is well predicted by this strategy [considering it as moving towards a goal] is in the fullest sense of the word a believer. What it is to be a true believer is to be an intentional system, a system whose behavior is reliably and voluminously predictable via the intentional strategy.
Why write a post about it, then? I'm basically saying that our definition should depend only on observable behavior, which is pretty obvious, isn't it?
Well, goal is a very loaded term. It is a part of the set of mental states we attribute to human beings, and other agents, but that we are reluctant to give to anything else. See how I never used the word "agent" before in this post, preferring "system" instead? That was me trying to limit this instinctive thinking about what's inside. And here is the reason why I think this post is not completely useless: when looking for a definition of goal-directedness, the first intuition is to look for the internal structure. It seems obvious that goals should be somewhere "inside" the system, and thus that what really matters is the internal structure.
But as we saw above, goal-directedness should probably depend only on the complete behavior of the system. That is not to say that the internal structure is not important or useful here. On the contrary, this structure, in the form of source code for example, is usually the only thing we have at our disposal. It serves to compute goal-directedness, instead of defining it.
We thus have this split:
- Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
- Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.
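The split above can be sketched with a toy example in Python. All of it is hypothetical and chosen only so that both sides fit in a few lines: the `ThresholdSystem`, the score, and the closed-form shortcut are assumptions of this sketch, not a proposed formalization. `defined_score` plays the role of the definition, since it looks only at behavior and enumerates every input. `computed_score` plays the role of the computation, since it never runs the system and instead reads the same quantity off the internal structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdSystem:
    """Toy system on the integers; `threshold` is its whole internal structure."""
    threshold: int

    def act(self, x: int) -> int:
        # Step left above the threshold, step right otherwise.
        return -1 if x > self.threshold else 1


def defined_score(system: ThresholdSystem, n: int) -> float:
    """Definition side: uses behavior only, on every nonzero x in [-n, n].
    Cost grows with the number of inputs."""
    inputs = [x for x in range(-n, n + 1) if x != 0]
    # An action moves toward 0 exactly when its sign is opposite to x's sign.
    toward_zero = sum(1 for x in inputs if system.act(x) * x < 0)
    return toward_zero / len(inputs)


def computed_score(system: ThresholdSystem, n: int) -> float:
    """Computation side: reads the same number off the structure, without
    calling `act` at all."""
    t = system.threshold
    if t >= 0:
        wrong = min(t, n)        # x in 1..t steps right, away from 0
    else:
        wrong = min(-t - 1, n)   # x in t+1..-1 steps left, away from 0
    return (2 * n - wrong) / (2 * n)


system = ThresholdSystem(threshold=3)
print(defined_score(system, 1000) == computed_score(system, 1000))  # True
```

In this toy case the structure gives an exact shortcut. For a neural net or a C program the extraction would presumably be approximate, which is where interpretability and formal methods would come in.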
What I see as a mistake here, a mistake I personally made, is to look for the definition in the internal structure. To look at some neural net, or some C program, and try to find where the goals are and what makes the program follow them. Instead, I think we should define and formalize goal-directedness from the ideal context of knowing the full behavior of the system, and then use interpretability and formal methods to extract what's relevant to this definition from the internal structure.
Thanks to Jérémy Perret for feedback on the writing, and to Joe Collman, Michele Campolo and Sabrina Tang for feedback on the idea.
This seems like a bad argument to me, because goal-directedness is not meant to be a complete determinant of safety and competitiveness; other things matter too. As an analogy, one property of my internal cognition is that sometimes I am angry. We like to know whether people are angry because (amongst other things) it helps us predict whether they are safe to be around - but there's nothing inconsistent about two people with the same level of anger being differently safe (e.g. because one of them is also tired and decides to go sleep instead of starting a fight).
If we tried to *define* anger in terms of behaviour, then I predict we'd have a very difficult time, and end up not being able to properly capture a bunch of important aspects of it (like: being angry often makes you fantasise about punching people; or: you can pretend to be angry without actually being angry), because it's a concept that's most naturally formulated in terms of internal state and cognition. The same is true for goal-directedness - in fact you agree that the main way we get evidence about goal-directedness in practice is by looking at, and making inferences about, internal cognition. If we think of a concept in cognitive terms, and learn about it in cognitive terms, then I suspect that trying to define it in behavioural terms will only lead to more confusion, and similar mistakes to those that the behaviourists made.
On the more general question of how tractable and necessary a formalism is - leaving aside AI, I'd be curious if you're optimistic about the prospect of formalising goal-directedness in humans. I think it's pretty hopeless, and don't see much reason that this would be significantly easier for neural networks. Fortunately, though, humans already have very sophisticated cognitive machinery for reasoning in non-mathematical ways about other agents.
After talking with Evan, I think I understand your point better. What I didn't understand was that you seemed to argue that something other than the behavior mattered for goal-directedness. But as I understand it now, what you're saying is that, yes, the behavior is what matters, but extracting the relevant information from the behavior is really hard. And thus you believe that computing goal-directedness in any meaningful way will require normative assumptions about the cognition of the system, at an abstract level.
If that's r...