> One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails...
I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn't actually the main problem which needs to be solved here.
(Or perhaps you're intentionally ignoring that problem by assuming "goal-alignment"?)
I would consider goal generalization a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it to want feedback on how to generalize its goals correctly when it encounters ontological shifts.
I agree with you, and yes, we intentionally ignore this problem by assuming goal-alignment. I think there's a lot riding on the pre-SLT model having "beneficial" goals.
To the extent that this framing is correct, the "sharp left turn" concept does not seem all that decision-relevant, since ~~all~~ most of the work of aligning the system (at least on the human side) should've happened well before that point.
EDIT: "all" was too strong here
I'll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)
Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.
The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"
Agreed.
> Namely that I don't think we can talk sensibly about an AI having "beneficial goal-directedness" without situational awareness. For instance, it's of little use to have an AI with the goal of "ensuring human flourishing" if it doesn't understand the meaning of flourishing or human. And, without situational awareness, it can't understand either; at best we could have some proxy or pointer towards these key concepts.
Another way of saying this is that inner alignment is more important than outer alignment.
> The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
I've also called this "generalise properly" part methodological alignment in this comment. And I conjectured that, given methodological alignment and inner alignment, outer alignment follows automatically, so we shouldn't even need to care about it separately. That also seems to be what you are saying here.
> Another way of saying this is that inner alignment is more important than outer alignment.
Interesting. My intuition is that inner alignment has nothing to do with this problem. It seems that different people view the inner vs. outer alignment distinction in different ways.
> For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
There is a critical step missing here: the moment when the trade-bot makes a "choice" between maximising money and satisfying preferences.
At this point, I see two possibilities:
> So I'd be focusing on "do the goals stay safe as the AI gains situational awareness?", rather than "are the goals safe before the AI gains situational awareness?"
This is a false dichotomy. If we can assume that, once the AI gains situational awareness, it will optimize for its developers' goals, then alignment is already solved. Making the goals safe before situational awareness is not that hard: at that point, the AI is not capable enough to pose an X-risk.
(A discussion of X-risk brought about by situationally unaware AIs could be interesting, such as a Christiano failure story, but Soares's model is not about it, since it assumes autonomous ASI.)
> situational awareness (which enables the model to reason about its goals)
Terminological note: intuitively, "situational awareness" means understanding that one exists inside a training process. The ability to reason about one's own goals would be more appropriately called "(reflective) goal awareness".
> We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises when the model has undesirable goals, this leads to deceptive alignment.
> Thus, our model search process would follow a decision tree along these lines:
> - If situational awareness is detected without goal-directedness, restart the search.
> - If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search.
> - If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search.
> - If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness.
Given that by "situational awareness" you mean "goal awareness", what you are discussing in this section doesn't make a lot of sense because goal awareness = goal-directedness. Also, goal awareness/goal-directedness is almost identical to self-awareness.
Self-awareness is a gradual property of an agent. In DNNs, it can be stored as the degree and strength of activation of the "self" feature when performing arbitrary inferences. At this very moment, the "basic goal" of self-evidencing, i.e., maximising the chance of finding oneself in one's optimal niche, can be thought to appear. This means conforming to one's own beliefs about oneself, including one's beliefs about one's own goals.
Thus, the complete set of the agent's beliefs about itself can collectively be seen as the agent's goals. In DNNs, these are manifested as the features connected to the "self" feature. For example, ChatGPT (more concretely, the transformer model behind it) has a feature "virtual assistant" connected to the feature "ChatGPT". So, if ChatGPT were more than trivially self-aware (remember that self-awareness is gradual, so we can already assign it a non-zero self-awareness score), we would have to conclude that ChatGPT has the goal of being a virtual assistant.
Note that the fact that the feature "virtual assistant" is connected to the feature "ChatGPT" before the feature "ChatGPT" has emerged as the self feature (i.e., is robustly activated during the majority of inferences) doesn't mean that ChatGPT "has a goal" of being a virtual assistant before being self-aware: it doesn't make sense to talk about any goal-directedness before self-awareness/self-evidencing at all.
See here, where I discuss these points from slightly different angles (that post also has many other relevant/related discussions, for example, about the odds of deceptive misalignment, which I consider extremely unlikely if adequate interpretability is possible and is deployed).
> However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them.
> - Large language models may be an example of this as well, since they have some capacity to reflect on goals (if prompted accordingly) without generalized planning ability.
This discussion of the timeline of "generalisation" and "generalised planning" vs. goal awareness suffers a bit from the lack of a definition of "generalisation", but I strongly agree with the quoted part: goal awareness and the ability to reflect on goals are nothing more than basic self-awareness, plus the ability to think about one's own beliefs, i.e., sophisticated inference. Mastery of intelligence capabilities such as concept learning, logic, epistemology, rationality, and semantics, which we all roughly think are included in the "general thinking" package, is not automatically implied by sophisticated inference. We can mechanistically construct a simulated agent with two-level Active Inference and goal reflection but without some of these "general" capabilities.
> We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard and an SLT would not produce a completely coherent system.
Several thoughts on goal preservation:
A Sharp Left Turn (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling). This post will outline our current understanding of the most promising plan for getting through an SLT and how it could fail (conditional on an SLT occurring).
In a previous post, we broke down the SLT threat model into 3 claims:
We proposed some possible mechanisms for Claim 1, while this post will investigate possible arguments and mechanisms for Claim 2.
Plan: we use alignment techniques to find a goal-aligned model before SLT occurs, and the model preserves its goals during the SLT.
We can try to learn a goal-aligned model before SLT occurs: a model that has beneficial goals and is able to reason about its own goals. This requires the model to have two properties: goal-directedness towards beneficial goals, and situational awareness (which enables the model to reason about its goals). Here we use the term "goal-directedness" in a weak sense (that includes humans and allows incoherent preferences) rather than a strong sense (that implies expected utility maximization).
One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails (e.g. it might be fine if interpretability or ELK techniques no longer work reliably during the transition if we can trust the model to manage the transition).
Step 1: Finding a goal-aligned model before SLT
We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It's important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises when the model has undesirable goals, this leads to deceptive alignment.
Thus, our model search process would follow a decision tree along these lines:
- If situational awareness is detected without goal-directedness, restart the search.
- If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search.
- If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search.
- If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness.
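As a rough illustration only (no such detectors currently exist, and every name below is hypothetical), the decision tree for the search process could be sketched as a pure function over boolean detector outputs:

```python
def search_decision(state):
    """Next action in a hypothetical model-search decision tree.

    `state` maps made-up detector names to booleans; this is an
    illustrative sketch of the decision logic, not a real procedure.
    """
    if state["situational_awareness"] and not state["goal_directed"]:
        return "restart"  # situational awareness without goal-directedness
    if state["undesirable_goals"] or state["deceptive_alignment"]:
        return "restart"  # undesirable goals or early deceptive alignment
    if state["phase_transition_imminent"] and not state["goal_aligned"]:
        return "restart"  # capabilities jump expected before goal-alignment
    if state["beneficial_goals"] and not state["situational_awareness"]:
        return "train_situational_awareness"  # safe to add situational awareness
    return "continue_training"
```

On this sketch, the search restarts whenever continuing training risks producing a deceptively aligned or misaligned model, and situational awareness is deliberately trained only once beneficial goals are detected.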
Here are some ways alignment techniques could be applied in this plan:
Why alignment techniques could fail to find a goal-aligned model before SLT occurs:
It's also possible that this model search process would pose an overly high alignment tax to be followed in practice (see Claim 3).
Step 2: The goal-aligned model preserves its goals during SLT (with some help from us)
This step of the plan relies on the goal-aligned model both wanting to preserve its goals during an SLT and being capable of doing so. The general argument for a model wanting to preserve its own goals is that it has a convergent instrumental incentive to ensure that future versions of itself act in ways that satisfy its current preferences.
One argument for a pre-SLT system being capable of preserving its goals is by analogy to human values being preserved when humans went through the capability phase transition of developing language and writing. However, it is not clear how much the human analogy constitutes evidence for the ability to preserve goals through a phase transition, since there are other possible explanations for human values being preserved (e.g., that values like "relationships" and "social status" continued to be selected for during the capability transition) that don't rely on humans being able to deliberately preserve their values.
A key question is whether the model could reflect on its goals without already having generalized capabilities. This is related to Claim 1a in our previous post: that the model's capabilities will all generalize at the same time. If this claim holds, in the sense that reflection on goals is part of this set of capabilities that all generalize simultaneously, then the ability to reflect on goals would only arise during the SLT and so this plan doesn't work. However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them.
If the model can predict that an SLT is likely to modify its goals in an undesirable way, we expect that it would choose not to undergo an SLT in the first place (e.g. by avoiding new internal algorithms or ways of thinking that could lead to an SLT).
Paul Christiano outlined some specific ways for a goal-aligned model to preserve its goals during SLT depending on the mechanism for SLT:
We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard and an SLT would not produce a completely coherent system.
Some ways a goal-aligned model could fail to preserve its goals:
Some ways that humans could fail to help the model to preserve its goals:
Takeaways
The above is our current model of the most promising plan for managing an SLT and how it could fail. The overall takeaways are:
The core reasons to be skeptical of this plan are:
If we missed any important components of this plan or ways it could fail, please let us know!