I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I'm a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Cool paper. I think the semantic similarity result is particularly interesting.
As I understand it, you've got a circuit that wants to calculate something like Sim(A, B), where A and B might each have many "senses" (i.e. features), but Sim might not be a linear function of the per-sense similarities across all of those senses/features.
So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related, but probably none that really matter for copy suppression. For this reason I wouldn't expect the two tokens' embeddings to have a cosine similarity that is predictive of the copy suppression score. This would only happen for really "mono-semantic" tokens that have only one sense (maybe you could test that).
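In case it's useful, here's a minimal sketch of that test. All the names are mine, not the paper's: `W_E` is the token embedding matrix, `copy_suppression_score(a, b)` is a stand-in for however you compute the head's suppression metric for a token pair, and `mono_semantic_ids` would be a hand-curated list of token ids judged to have a single dominant sense.

```python
# Sketch: does embedding cosine similarity predict copy suppression,
# once we restrict to (roughly) mono-semantic tokens?
import torch

def cosine_sim(W_E: torch.Tensor, a: int, b: int) -> float:
    """Cosine similarity between two rows of the embedding matrix."""
    return torch.nn.functional.cosine_similarity(W_E[a], W_E[b], dim=0).item()

def sim_vs_suppression(W_E, token_pairs, copy_suppression_score):
    """Collect (cosine_sim, suppression_score) pairs so you can check
    predictiveness, e.g. with a rank correlation, separately for
    mono-semantic pairs vs. arbitrary pairs."""
    return [(cosine_sim(W_E, a, b), copy_suppression_score(a, b))
            for a, b in token_pairs]
```

The interesting comparison would be whether the correlation is much stronger when both tokens come from `mono_semantic_ids`.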
Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively), e.g. very common words or punctuation (the/and/etc.).
I'd be interested if you used something like SAEs to decompose the tokens into the underlying features present at different intensities in each of these tokens (or in the activations prior to the key/query projections). Follow-up experiments could attempt to determine whether copy suppression is better understood once the semantic subspaces are known. Some things that might be cool here (a rough sketch follows the bullets below):
- Show that some features are mapped to the null space of the keys/queries in copy suppression heads, indicating semantic senses/features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination does), or suppression is informed by a more complicated function of sets of features.
- Similarly, show that the OV circuit is suppressing the same feature/features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting, as it would correspond to something like "I calculate A and B as similar by their similarity along the *California axis*, but I suppress predictions of any token that has the feature for *anywhere on the West Coast*".
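Here's the rough sketch I had in mind for both bullets. This is my own construction, not anything from the paper: `feature_dirs` is a (n_features, d_model) matrix of SAE decoder directions for the residual stream feeding the head, and `W_Q`, `W_K` are (d_model, d_head) while `W_V` is (d_model, d_head) and `W_O` is (d_head, d_model), TransformerLens-style.

```python
import torch

def qk_sensitivity(feature_dirs: torch.Tensor, W_Q: torch.Tensor, W_K: torch.Tensor) -> torch.Tensor:
    """How much each feature direction can move the attention score on the
    query side. Near-zero norm after projecting through W_Q @ W_K.T means the
    feature is (approximately) in the null space of the head's QK circuit,
    i.e. ignored when deciding what to suppress. Use W_QK.T for the key side."""
    W_QK = W_Q @ W_K.T                      # (d_model, d_model) bilinear form
    return (feature_dirs @ W_QK).norm(dim=-1)

def ov_suppression(feature_dirs: torch.Tensor, W_V: torch.Tensor, W_O: torch.Tensor) -> torch.Tensor:
    """Crude proxy for how strongly the OV circuit writes against each
    feature: project each direction through W_V @ W_O and measure its overlap
    with the original direction (negative values = suppression along it)."""
    W_OV = W_V @ W_O                        # (d_model, d_model)
    out = feature_dirs @ W_OV
    return (out * feature_dirs).sum(dim=-1)
```

Comparing the two per feature is one way to get at the asymmetry in the second bullet: features the QK circuit attends to but the OV circuit doesn't suppress (or vice versa).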
I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.
It's undesirable because it makes SAEs more confusing / harder to use, but it's also a concrete obstacle to things like sparse feature circuits. For example, it causes automatically-constructed circuits composed of features to have issues. Instead of having a "starts with E" feature in a spelling circuit that fires on all "E" words, you have a "starts with E" feature that fires on most words (part of the circuit) plus all the "E"-absorbing features, like one that fires on the token "Europe" (a large number of features, which adds lots of nodes to the circuit). So instead of being able to create a simple DAG, you get a complex one.
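To make the node-count point concrete, here's a toy sketch (not the sparse feature circuits pipeline, just an illustration on made-up data): `acts` is a hypothetical (n_tokens, n_latents) boolean matrix of SAE latent firings and `starts_with_e` a boolean label per token.

```python
import numpy as np

def latents_needed_for_coverage(acts: np.ndarray, starts_with_e: np.ndarray) -> int:
    """Greedily count how many latents you need before every 'starts with E'
    token has at least one firing latent. With no absorption this is ~1 (the
    clean 'starts with E' latent); with absorption it's 1 + one latent per
    absorbing token family ('Europe', 'Eagle', ...), i.e. that many extra
    nodes in the spelling circuit's DAG."""
    remaining = starts_with_e.copy()
    count = 0
    while remaining.any():
        coverage = (acts & remaining[:, None]).sum(axis=0)  # uncovered E-words per latent
        best = int(coverage.argmax())
        if coverage[best] == 0:
            break  # some E-words aren't covered by any latent at all
        remaining &= ~acts[:, best]
        count += 1
    return count
```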
I don't remember signed GSEA being standard when I used it so no particular reason. Could try it.
I suspect this is a good experiment. It may help!
Interesting - I agree I don't have strong evidence here (and certainly we haven't shown it). It sounds like it might be worth my attempting to prove this point with further investigation. This is an update for me - I didn't think this interpretation would be contentious!
> than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert "dishonesty"?
> That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree but think this is very subtle so want to go into more depth here:
I think this is a valuable read for people who work in interp but feel like I want to add a few ideas:
I'd be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (e.g. there is a noisy signal, but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying the intermediate representations in such models could be interesting.
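Here's a toy version of what I mean, my own construction rather than anything from the post: a one-parameter "detector" trained on a noisy signal where missed positives are penalised more heavily than false alarms, so the learned detector ends up biased toward recall rather than calibration.

```python
import torch

torch.manual_seed(0)
n = 5000
label = (torch.rand(n) < 0.3).float()        # ground-truth event
signal = label + 0.8 * torch.randn(n)        # noisy observation of it

detector = torch.nn.Linear(1, 1)
opt = torch.optim.Adam(detector.parameters(), lr=0.05)
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))  # reward recall over precision

for _ in range(500):
    opt.zero_grad()
    logits = detector(signal.unsqueeze(1)).squeeze(1)
    loss_fn(logits, label).backward()
    opt.step()

with torch.no_grad():
    preds = detector(signal.unsqueeze(1)).squeeze(1) > 0
    tp = (preds & label.bool()).float().sum()
    recall = tp / label.sum()
    precision = tp / preds.float().sum().clamp(min=1)
    print(f"recall={recall:.2f} precision={precision:.2f}")  # recall >> precision
```

The interp question would then be what the intermediate representation of "the event happened" looks like in a bigger model trained under this kind of asymmetric objective.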
I think I like a lot of the thinking in the post (e.g. trying to get at what interp methods are good at measuring and what they might not be measuring), but dislike the framing / some particular sentences.
Thanks for posting this! I've had a lot of conversations with people lately about OthelloGPT and I think it's been useful for creating consensus about what we expect sparse autoencoders to recover in language models.
Maybe I missed it but:
I think a number of people expected SAEs trained on OthelloGPT to recover directions which aligned with the mine/theirs probe directions, though my personal opinion was that, besides "this square is a legal move", it isn't clear that we should expect features to act as classifiers over the board state in the same way that probes do.
This reflects several intuitions:
Some other feedback:
Oh, and maybe you saw this already, but an academic group put out this related work: https://arxiv.org/abs/2402.12201. I don't think they quantify the proportion of probe directions they recover, but they do indicate recovery of all the types of features that have previously been probed for. Likely worth a read if you haven't seen it.