schroederdewitt

Replying to Secret Collusion: Will We Know When to Unplug AI?

Thanks, Nathan, these are fascinating questions!

"What if there was an offender who wasn't sure there was anyone 'out there' to collaborate with, but they felt pretty confident they could sneak messages past the monitor. They might just go 'fishing' with a constant stream of hidden messages, most of low importance or nonsense. Then there'd be no pattern of innocent - steganography - innocent - steganography... the steganography would be a constant underlying everything."

This is an interesting thought. Certainly, if steganography could be seen as an innocuous activity by itself, then its detection alone would not be very helpful and, instead, one would actually have to be able to decode the hidden messages to...

Replying to Secret Collusion: Will We Know When to Unplug AI?

schroederdewitt, 1y

That's a great question. I am not quite sure, but WMDP-cyber does look relevant. If you are interested in working on a new benchmark for unlearning and secret collusion, do reach out to us!

Replying to Secret Collusion: Will We Know When to Unplug AI?

schroederdewitt, 1y

Interesting! I hope our work will catch his attention.

Secret Collusion: Will We Know When to Unplug AI?

schroederdewitt

schroederdewitt, srm, MikhailB, Lewis Hammond, chansmi, Angira Sharma

TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Although current AI models aren't yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, indicating that safety and security risks from steganographic collusion are increasing. These results highlight growing challenges for AI governance and policy, suggesting institutions such as the EU AI Office and AI safety bodies in the UK and US should conduct cryptographic and steganographic evaluations of...

Take SCIFs, it’s dangerous to go alone

latterframe

latterframe, Jeffrey Ladish, schroederdewitt

Coauthored by Dmitrii Volkov$^{1}$, Christian Schroeder de Witt$^{2}$, Jeffrey Ladish$^{1}$ ($^{1}$Palisade Research, $^{2}$University of Oxford).

We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate near-term safety risks stemming from AI R&D process compromise. In the near term, such risks include model weight leaks and backdoor insertion; in the longer term, they include loss of control.

We discuss the Mistral and LLaMA model leaks as motivating examples and propose two classic opsec mitigations: performing AI audits in secure reading rooms (SCIFs) and using locked-down computers for frontier AI research.

Mistral model leak

In January 2024, a high-quality 70B LLM leaked from Mistral. Reporting suggests the model leaked through an external evaluation or...
