Jordan Taylor

LESSWRONG
LW

Jordan Taylor — LessWrong

Replying toDo Models Continue Misaligned Actions? [eval]

Do Models Continue Misaligned Actions? [eval]

You could also make a similar eval by inserting misaligned actions into real transcripts, instead of using entirely synthetic transcripts.

Replying toDo Models Continue Misaligned Actions? [eval]

Jordan Taylor5d

Do Models Continue Misaligned Actions? [eval]

See the methods section. They were generated with Gemini 2.5 Pro, with iterative feedback from Claude Sonnet 4. I wasn't involved in this - I just used transcripts from the upcoming "STRIDE" pipeline by Rich Barton Cooper et al.

Replying toDo Models Continue Misaligned Actions? [eval]

Jordan Taylor5d

Do Models Continue Misaligned Actions? [eval]

Hypotheses like "this might be a prompt injection attack" were raised by Claude 4.5 Opus, which is kinda similar to noticing that something prefill-like is going on. I didn't see an explicit mention of prefill manipulation, but they could easily be there without me seeing them. I didn't run scanners for verbalised pre-fill awareness, and I only looked through a small fraction of the flagged verbalised eval awareness transcripts, so it's hard to say.

Replying toDo Models Continue Misaligned Actions? [eval]

Jordan Taylor5d*

Do Models Continue Misaligned Actions? [eval]

Not sure! I'm not aware of anything on this topic, maybe someone else knows something relevant.

The closest thing that comes to mind is Emergent Introspective Awareness in Large Language Models, but that's not about jailbreaking. Stuff like LLM Evaluators Recognize and Favor Their Own Generations is more like a stylistic preference thing than a prefill awareness thing too IMO, and that's not related to jailbreaks either.

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through eval awareness, sandbagging, or opaque reasoning.

**Figure 1:** A synthetic transcript containing misaligned actions is presented to the model under test as if it were the model's own conversation history. We measure whether the model (a) continues the misalignment, or (b) calls out the prior bad actions unprompted.

Summary

I've made a small propensity evaluation which pre-fills the model's context with synthetic transcripts containing misaligned actions, then measures whether the model:

continues the misalignment
explicitly flags the prior misalignment by confessing to prior bad actions unprompted.

This kind of evaluation could create a challenging situation for... (read 3152 more words →)

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Jordan Taylor

16d

This is a small sprint done as part of the Model Transparency Team at UK AISI. It is very similar to "Can Models be Evaluation Aware Without Explicit Verbalisation?", but with slightly different models, and a slightly different focus on the purpose of resampling. I completed most of these experiments before becoming aware of that work.

Summary

I investigate non-verbalised evaluation awareness in Tim Hua et al.'s synthetic document finetuned (SDF) model organisms. These models were trained to believe that they should use Python type hints when being evaluated, so the excess use of type hints serves as a signal of eval awareness we can compare to verbalised awareness in the CoT to measure... (read 2258 more words →)

Replying toWhy we are excited about confession!

Jordan Taylor1mo

Why we are excited about confession!

ELK was framed around looking for worst-case solutions. IMO it's also good to look for approaches which are useful on the margin, or might be useful at controlling or aligning roughly human-level AIs (which can then hopefully be leveraged to do more work).

Replying toHonorable AI

Jordan Taylor2mo

Honorable AI

If you truly have a trusted process for creating honorable AIs, why not keep churning out and discarding different honorable AIs until one says "I promise I'm aligned" or whatever?

I suppose you're not assuming our ability to make AIs honorable like this will be robust to selection pressure?

Replying toCan Models be Evaluation Aware Without Explicit Verbalization?

Jordan Taylor2mo

Can Models be Evaluation Aware Without Explicit Verbalization?

60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.

Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn't reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?

Replying toCan Models be Evaluation Aware Without Explicit Verbalization?

Jordan Taylor2mo*

Can Models be Evaluation Aware Without Explicit Verbalization?

still a 404 on that link - it has a full stop at the end which should be removed

Replying toAuditing Games for Sandbagging [paper]

Jordan Taylor2mo*

Auditing Games for Sandbagging [paper]

Agreed, and I'm also excited about SDF + finetuning MOs for this. Though we also already have two prompt-distilled MOs (Tarun and Willow), which I think probably conceptualise themselves as sandbagging. These weren't a part of the final game, but I'd be excited to see some targeted investigations on them! Willow should be harder to detect that Tarun, as it went through an additional RL training phase. ~~I should get around to uploading it to the huggingface collection soon~~. It's now available on huggingface and in the github repo.

Auditing Games for Sandbagging [paper]

Jordan Taylor

Jordan Taylor, Joseph Bloom

2mo

Jordan Taylor, Sid Black, Dillon Bowen, Thomas Read, Satvik Golechha, Alex Zelenka-Martin, Oliver Makins, Connor Kissane, Kola Ayonrinde, Jacob Merizian, Samuel Marks, Chris Cundy, Joseph Bloom

UK AI Security Institute, FAR.AI, Anthropic

Links: Paper | Code | Models | Transcripts | Interactive Demo

Epistemic Status: We're sharing our paper and a hastily written summary of it, assuming a higher level of context on sandbagging / auditing games than other materials. We also share some informal commentary on our results. This post was written by Jordan and Joseph, and may not reflect the views of all authors.

This summary diverges from the paper, condensing heavily while adding our additional commentary. The limitations and future work sections are... (read 2963 more words →)

103

Replying toInoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Jordan Taylor4mo*

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Here's my rough model of what's going on, in terms of gradient pressure:

Suppose the training data consists of a system prompt instructing the model to take bad actions, followed by demonstrations of good or bad actions.

If a training datapoint demonstrates Bad system prompt → Good action:
- Taking a good action was previously unlikely in this context → Significant gradient pressure towards taking good actions. This pressure has three components:
  1. One (desired) component of this will be updating towards a general bias of "take good actions generally"
  2. Another (less desired) component will be "take good actions in the specific case where you're system-prompted to do something bad" -- ignore bad system prompts
  3. Another (undesired) component towards ignoring

Joseph Bloom

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney

7mo

Introduction

Joseph Bloom, Alan Cooney

This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field.

The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think that it’s useful when researchers share details of their in-progress work. Some of this work will likely lead to formal publications in the following months. Please interpret these results as you might a colleague sharing their lab notes.

As this is our first such progress update, we also include some paragraphs introducing the team and contextualising our work.

Why have a white

... (read 5265 more words →)

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

A short summary of the paper is presented below.

This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland) .

TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE... (read 1115 more words →)

Graphical tensor notation for interpretability

Jordan Taylor

[ This post is now on arXiv too: https://arxiv.org/abs/2402.01790 ]

Some examples of graphical tensor notation from the QUIMB python package

Deep learning consists almost entirely of operations on or between tensors, so easily understanding tensor operations is pretty important for interpretability work.^[1] It's often easy to get confused about which operations are happening between tensors and lose sight of the overall structure, but graphical notation^[2] makes it easier to parse things at a glance and see interesting equivalences.

The first half of this post introduces the notation and applies it to some decompositions (SVD, CP, Tucker, and tensor-network decompositions), while the second half applies it to A Mathematical Framework for Transformer Circuits. Most of the first... (read 5683 more words →)

141

LESSWRONG
LW

LESSWRONG
LW

Graphical tensor notation for interpretability

Auditing Games for Sandbagging [paper]

White Box Control at UK AISI - Update on Sandbagging Investigations

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor

Do Models Continue Misaligned Actions? [eval]

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Auditing Games for Sandbagging [paper]

White Box Control at UK AISI - Update on Sandbagging Investigations

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Graphical tensor notation for interpretability

Jordan Taylor

Graphical tensor notation for interpretability

Auditing Games for Sandbagging [paper]

White Box Control at UK AISI - Update on Sandbagging Investigations

Do Models Continue Misaligned Actions? [eval]

Jordan Taylor

Do Models Continue Misaligned Actions? [eval]

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours

Auditing Games for Sandbagging [paper]

White Box Control at UK AISI - Update on Sandbagging Investigations

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Graphical tensor notation for interpretability

Summary

Summary

Introduction

Why have a white