Adam Karvonen — LessWrong

Building Better Activation Oracles

by ceselder, Jan Bauer, Niclas Luick, Adam Karvonen, and Neel Nanda

Work done for our MATS 10.0 Sprint project, mentored by Neel Nanda and Adam Karvonen. Links: Hugging Face, GitHub. TL;DR: We have improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following the approach by Niclas Luick) and making a...

Jun 4 · 62

Realistic Evaluations Will Not Prevent Evaluation Awareness

One Minute Summary: I think there's a fundamental limit to behavioral alignment evaluations that gets worse as models improve: Humans control all inputs to a model and can snapshot, replay, or fabricate any scenario at will. An intelligent model could realize this and rationally treat every interaction as a potential...

Feb 2438

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

by Sam Marks, Adam Karvonen, James Chua, Subhash Kantamneni, Euan Ong, Julian Minder, Clément Dumas, and Owain_Evans

TL;DR: We train LLMs to accept LLM neural activations as inputs and answer arbitrary questions about them in natural language. These Activation Oracles generalize far beyond their training distribution, for example uncovering misalignment or secret knowledge introduced via fine-tuning. Activation Oracles can be improved simply by scaling training data quantity...

Dec 18, 2025 · 154

Defending Against Model Weight Exfiltration Through Inference Verification

by Roy Rinberg, Adam Karvonen, dreuter, and Keri Warr

Authors: Roy Rinberg, Adam Karvonen, Alex Hoover, Daniel Reuter, Keri Warr. Arxiv paper link. One Minute Summary: Anthropic has adopted upload limits to prevent model weight exfiltration. The idea is simple: model weights are very large, text outputs are small, so if we cap the output bandwidth, we can make...

Dec 15, 2025 · 120

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, and Neel Nanda

Summary
* We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
* We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
* It can also reduce sensitivity to spurious cues, even when they are present...

Jul 23, 2025 · 79

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find...

Jul 2, 2025 · 191

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI...

Apr 14, 2025 · 165