I don't think this idea is particularly novel, but it comes up often in conversations I have, so I figured it'd be good to write it up.
How do you prevent deception from AI systems?
One obvious thing to try would be to make sure that your model never thinks deceptive thoughts. There are several ways you could go about this, with varying degrees of effectiveness. For instance, you could check for various precursors to deceptive alignment. More prosaically, you could try identifying deceptive circuits within your model, or looking at what features your model is keeping track of and identifying any suspicious ones.
I think these are pretty reasonable approaches. However, I think they fail to address failure modes like deep deceptiveness. A system can be deceptive even if no individual part looks deceptive, due to complex interactions between the system and the environment. More generally, cognition and optimization power can be externalized to the environment.
One could argue that focusing on more salient and dangerous failure modes like deceptive alignment makes a lot more sense. However, if you’re interested in approaches that work in the worst case and don’t rely on reality turning out optimistically one way or the other, you probably want approaches that prevent things from going badly at all.
So, how do you prevent any failure modes that route through deception? In an earlier post, I wrote about robust intent alignment as the solution, and one research direction I think is feasible to get there. But here I want to make a different point, about what it would look like to interface with deception in that setup.
Start from the intuition that deception is a property of the person being deceived more than of the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system made up of the agent and its environment. So, if you want to interface with the general thing that is deception, you’re going to be building an interface to that composite system.
Another way of putting this is that evaluating whether an action leads to deception or not requires being able to evaluate things within the complexity class of the environment. Which brings us back to the classical difficulty of RLHF: we can’t foresee the consequences of some action very reliably in a complex environment. This means we’re unable to properly evaluate whether an action taken by something smarter than us is good or not.
If all the alignment properties we cared about were in this complexity class, it might be the case that worst-case alignment is strictly intractable.
However, there are many properties that belong to the complexity class of the agent alone, such that evaluation is much more feasible. Properties describing specific mechanistic components of the model, for instance. The sub-problem of identifying explicitly deceptive circuits within a model falls under this category[1].
Another such property relates to internal optimization within systems. If you buy the idea of optimization as pointing at something real in the world[2], then, for the right operationalization, properties of that optimization belong to the complexity class of the system it’s internal to. One example is the target of the optimization: what I refer to as objectives, and the focus of most of my work.
I think one way to frame the core part of the technical alignment problem is that we need to figure out how to interface with the alignment properties we care about within the systems we care about. The choice of which properties we try to build interfaces with is pretty central to that. I think a commonly missed idea is that some properties we care about cash out in pretty different ways in the limit, and as such are more or less tractable.
[1] This is also, by my understanding, the part of the problem that most theories of impact for current interpretability work focus on. That it focuses only on this part of the problem is one of my reservations about current interpretability work.
[2] Even if current operationalizations of it are very confused. It could also be very fuzzy in realistic scenarios, with blurred boundaries between things that are structurally optimizers and things that aren’t. However, I think there’s a coherent conceptual boundary that can be drawn.
Thanks for the comment, I'm glad it helped!
I'm not sure if I know exactly what parts you feel fuzzy on, but some scattered thoughts:
Abstracting over a lot of nuance and complexity, one could model internal optimization as being a ~general-purpose search process / module that the model can make use of. A general-purpose search process requires a goal to evaluate the consequences of different plans that you're searching over. This goal is fed into the search module as an input.
This input is probably described in the model's internal language; i.e., it's described in terms of concepts that the model learns corresponding to things in the environment. This seems true even if the model uses some very direct pointer to things in the environment - it still has to be represented as information that makes sense to the search process, which is written in the model's ontology.
So the inputs to the search process are part of the system itself. That is, the "property" of the optimization that corresponds to what it's targeted at is in the complexity class of the system that the optimization is internal to. I think this generalizes to the case where the optimization isn't as cleanly represented as a general-purpose search module.
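To make the picture of a goal being fed into a search module a bit more concrete, here is a toy sketch. Everything in it is an assumption made for illustration: the names `general_purpose_search`, `rollout`, and `make_goal` are hypothetical, the "ontology" is just integers, and brute-force enumeration stands in for whatever a real model would do. The only point it is meant to show is that the goal is an explicit input to the search, expressed in the system's own state representation, so it can be read off without simulating the environment.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# Toy stand-ins for the model's internal ontology (assumption: states and
# actions are plain integers; a real model's representation is far richer).
State = int
Action = int
Plan = Tuple[Action, ...]


def rollout(start: State, plan: Plan, step: Callable[[State, Action], State]) -> State:
    """Apply each action in the plan in order and return the final state."""
    state = start
    for action in plan:
        state = step(state, action)
    return state


def general_purpose_search(
    start: State,
    actions: Sequence[Action],
    step: Callable[[State, Action], State],
    goal: Callable[[State], float],
    horizon: int,
) -> Plan:
    """Enumerate every plan of length `horizon` and return one that the
    goal scores highest. The goal is an input, not part of the search."""
    return max(
        product(actions, repeat=horizon),
        key=lambda plan: goal(rollout(start, plan, step)),
    )


def make_goal(target: State) -> Callable[[State], float]:
    """Hypothetical goal: prefer final states close to `target`."""
    return lambda state: -abs(state - target)


def add(state: State, action: Action) -> State:
    """Toy world model: an action just adds to the state."""
    return state + action


# The same search module, pointed at two different targets.
plan_a = general_purpose_search(0, (-1, 1, 2), add, make_goal(5), horizon=3)
plan_b = general_purpose_search(0, (-1, 1, 2), add, make_goal(-3), horizon=3)
print(plan_a, rollout(0, plan_a, add))
print(plan_b, rollout(0, plan_b, add))
```

Swapping `make_goal(5)` for `make_goal(-3)` changes the plans the search returns without touching the search itself, which is the sense in which the target of the optimization is a property of the system rather than of the system plus its environment.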