David Duvenaud - LessWrong

As the person who wrote most of the text of that paragraph (distilling discussions with the other authors), maybe I can elaborate:

I agree that in some narrow settings such as pure-cyberattacks on systems that are totally controlled and time-reversible, something like your proposal would probably work. The problem is that this is a constrained attack surface, especially compared to one that involves gaining trust, setting precedents, throwing out red herrings, or weakening defences for later. That's why we mention human decision-makers - I imagine that the richest part of the attack surface for most LLM deployments will be human expectations.

For example, models could build a precedent of spinning up other copies of itself to delegate to, versus perhaps a more transparent habit of writing scripts and offering to check them with a trusted model. Or, proposing a command-and-control system that makes later auditing or monitoring easier or harder. These sorts of precedent-setting might be actual the pivotal actions in a takeover, but would look relatively innocent at the time.

So after trying to think through what kinds of deployment could safely incorporate sometimes-rogue agents trying to demonstrate sabotage, we decided that it would have to be a pretty narrow deployment context to avoid the possibility of undetected attacks. Now I wish that paragraph was clearer that this might still be feasible for some narrow classes of decisions.

Sabotage Evaluations for Frontier Models

David Duvenaud11d30

Good questions. For all of them, I'm afraid I have to say "It depends under what mitigations we deploy them.". If there's a crisis at one of the labs, I expect it to look like: "In order to be confident that even a misaligned model can't sabotage us, we need to deploy it with expensive mitigations, unacceptable levels of oversight, or mitigations that depend on how the users incorporate AI output into their decision-making processes, which we can't measure or control. Our competitors are going to deploy something similar nonetheless - what do we do?"

I.e. I think there is lots of potential for labs to come to incorrect conclusions or be unable to be confident about the potential for sabotage. This issue can be mostly addressed by expensive-enough mitigations, but this introduces a safety tax that might be too expensive to pay.

That's one reason why these kinds of evals are valuable - they can, in principle, reduce the safety tax by helping us find the least-expensive mitigations. They also incentivize red-teaming, which can produce scary demos to help coordinate regulation. They also force labs to think through the implications of how models are deployed and how much they're trusted.

Relatedly, one problem with caring about whether or not "actually the humans were misled" is that that's a matter of degree, and I think in most situations, human decision-makers are going to be misled about at least some important aspects of whatever situation they're in, even after extensive consultation with an LLM. We can try to refine this criterion to "misled about something that would make them make the wrong decision" but then we're basically just asking the model to convince the human to make the right decision according to the LLM. In that case, we've lost a lot of the opportunities for corrigibility that explanations would provide.

I think there's probably still a useful middle ground, but I think practical uses of LLMs will look pretty paternalistic for most users most of the time, in a way that will be hard to distinguish from "is the user misled".

Sabotage Evaluations for Frontier Models

David Duvenaud1mo32

I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above. Another relevant "failure" state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.

Sabotage Evaluations for Frontier Models

David Duvenaud1mo42

does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage ... "Underperforming on dangerous-capability evaluations" would arguably include this

Yes, we want to address the possibility of sandbagging on sabotage evaluations too. Thanks for noting that our phrasing is ambiguous on this point.

I'm very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking "inside" the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions.

We can easily incorporate white-box monitoring into any of these evaluations. This version only included black-box monitoring for simplicity, but all of the evals except for sandbagging involve a monitor producing a suspiciousness score. That score could be informed by a white-box / internals-based side channel without changing anything else.

Can that idea, "not powerful enough" be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the treat model)? Its not clear to me whether that is possible.

Great question. We didn't discuss it in the paper, because it was long enough already, but we've certainly thought about this problem when designing the evals and related safety cases. A rough sketch of the argument looks something like this: "We're confident that our anti-refusal training is effective because it uses supervised fine-tuning. To double-check this empirically, our sandbagging capabilities eval indicates that even a model trained to sandbag can't effectively do so after our anti-refusal / capabilities elicitation procedure.".

One problem with this argument is that our sandbagging eval didn't include the exact sabotage tasks that we'd like to make claims about - they couldn't yet because the evals in the paper were developed in parallel. Probably we'll want future versions of the sandbagging eval to either cover these skills and formats more closely, or include tests to demonstrate consistent generalization across domains.

Fixed Point: a love story

David Duvenaud1y31

Bravo. I'm really glad you're writing on this theme. Where I've been stuck lately in thinking about the future is trying to figure out what is the best we can coherently hope for. Which, as your story notes, seems like it involves tradeoffs between remaining human and having meaningful autonomy.

What can we learn from Lex Fridman’s interview with Sam Altman?

David Duvenaud2y94

We’ve got the [molecular?] problem

I think he was saying "Moloch", as in the problem of avoiding a tragedy of the commons.

LESSWRONG
LW

Posts

Wiki Contributions

Comments