Coauthored by Dmitrii Volkov¹, Christian Schroeder de Witt², Jeffrey Ladish¹ (¹Palisade Research, ²University of Oxford).
We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate safety risks stemming from compromise of the AI R&D process. In the near term, such risks include model weight leaks and backdoor insertion; in the longer term, loss of control.
We discuss the Mistral and LLaMA model leaks as motivating examples and propose two classic opsec mitigations: performing AI audits in secure reading rooms (SCIFs) and using locked-down computers for frontier AI research.
Mistral model leak
In January 2024, a high-quality 70B LLM leaked from Mistral. Reporting suggests the model leaked through an external evaluation or...