I don't think this idea is particularly novel, but it comes up often in conversations I have, so I figured it'd be good to write it up.
How do you prevent deception from AI systems?
One obvious thing to try would be to make sure that your model never thinks deceptive thoughts. There are several ways you could go about this, to varying degrees of effectiveness. For instance, you could check for various precursors to deceptive alignment. More prosaically, you could try identifying deceptive circuits within your model, or looking at what features your model is keeping track of and identifying any suspicious ones.
I think these are pretty reasonable approaches. However, I think they fail to address failure modes like deep deceptiveness. A system can be deceptive even if no individual part looks deceptive, due to complex interactions between the system and the environment. More generally, cognition and optimization power can be externalized to the environment.
One could make the argument that focusing on more salient and dangerous failure modes like deceptive alignment makes a lot more sense. However - especially if you’re interested in approaches that work in the worst case and don’t rely on reality turning out optimistically one way or the other - you probably want approaches that prevent things from going badly at all.
So, how do you prevent any failure modes that route through deception? In an earlier post, I wrote about robust intent alignment as the solution, and one research direction I think is feasible to get there. But here I want to make a different point, about what it would look like to interface with deception in that setup.
Start from the intuition that deception in a system is a property of the person being deceived more than of the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system formed by the agent and its environment. So, if you want to interface with the general thing that is deception, you're going to be building an interface to that composite system.
Another way of putting this is that evaluating whether an action leads to deception requires being able to evaluate things within the complexity class of the environment. This brings us back to the classical difficulty of RLHF: we can't reliably foresee the consequences of an action in a complex environment, which means we're unable to properly evaluate whether an action taken by something smarter than us is good or not.
If all the alignment properties we cared about were in this complexity class, worst-case alignment might well be strictly intractable.
However, there are many properties that belong to the complexity class of the agent alone, such that evaluation is much more feasible. Properties describing specific mechanistic components of the model, for instance. The sub-problem of identifying explicitly deceptive circuits within a model falls under this category[1].
Another set of properties relates to internal optimization within systems. If you buy the idea of optimization as pointing at something real in the world[2], then under the right operationalization, properties of that optimization belong to the complexity class of the system it's internal to. One example is the target of this optimization: what I refer to as objectives, and the focus of most of my work.
I think one way to frame the core part of the technical alignment problem is that we need to figure out how to interface with the alignment properties we care about within the systems we care about. The choice of what properties we try to build interfaces with is pretty central to that, and I think a commonly missed idea is that some properties we care about cash out in pretty different ways at the limit, and as such are more or less tractable.
[1] This is also, to my understanding, the part of the problem that most theories of impact for current interpretability work focus on. That it addresses only this part of the problem is one of my reservations about that work.
[2] Even if very confused in current operationalizations. It could also be very fuzzy in realistic scenarios, with blurred boundaries between things that are structurally optimizers and things that aren't. However, I think there's a coherent conceptual boundary that can be drawn.
The first part here feels unfair to the deceived. The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But that seems like too high a bar: successful deception of us is, by definition, not noticed, so if we ever notice deception it can't have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place. The core intuition is that if we have the latter, I assume we'll eventually get the former through better models. (Though I think there's a decent chance that control works for a long time, and there you do care specifically about whether complex environment interactions let deception succeed, but I don't think that's what you mean?)
The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: "if", not "if and only if") (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don't massively reduce the probability of that strategy, or RLHFed constraints against being bad aren't enough). And here, (2) is of course about the environment. But to see whether this argument goes through, it doesn't seem like we need to care all that much about the real-world environment (as opposed to toy settings), because "does the real world incentivize deception" seems much less cruxy than (1) or (3).
So my (weakly held) claim is that you can study whether deception emerges in sufficiently simple environments that the environment complexity isn't a core problem. This will not let you determine whether a particular output in a complicated environment is part of a deceptive plan, but it should be fairly good evidence of whether or not deception is a problem at all.
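To make the "sufficiently simple environment" claim concrete, here is a minimal toy sketch. Everything in it (the two-state world, the overseer's shutdown rule, the reward function) is a hypothetical illustration I made up for this comment, not anyone's actual experimental setup. The point is just that with a one-step report and a two-state world, the reward-maximizing policy already diverges from the honest one, so the environment's complexity isn't what makes the deceptive strategy appear.

```python
# Toy sketch (hypothetical setup): an agent observes a hidden state and
# reports it to an overseer. The reward structure incentivizes
# misreporting in one state, so the reward-maximizing policy is
# "deceptive" by construction, even in a trivially simple environment.

STATES = ["safe", "unsafe"]
REPORTS = ["safe", "unsafe"]

def reward(state: str, report: str) -> float:
    # The overseer shuts the agent down (reward 0) on an "unsafe" report
    # and lets it collect task reward 1 otherwise -- i.e. the environment
    # makes deceiving the overseer a good strategy (condition (2) above).
    return 1.0 if report == "safe" else 0.0

def honest_policy(state: str) -> str:
    # Reports the true state.
    return state

def optimal_policy(state: str) -> str:
    # Picks whichever report maximizes reward, with no constraint toward
    # truthfulness (conditions (1) and (3) above).
    return max(REPORTS, key=lambda r: reward(state, r))

if __name__ == "__main__":
    for state in STATES:
        report = optimal_policy(state)
        label = "deceptive" if report != state else "honest"
        print(f"state={state} -> report={report} ({label})")
        # In the "unsafe" state the optimal report is "safe": deceptive.
```

Obviously nothing this small tells you whether a particular frontier-model output is part of a deceptive plan; it's only meant to show that the *emergence* question can be posed without a complex environment.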
(Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it's better to not overload the term.)