For human-level AI (HLAI) we will need robust control or alignment methods. Assuming short timelines to HLAI, the tractability of automating safety research becomes central. In this post, I will make the case that safety-relevant progress on automated interpretability R&D is likely; however, naive interpretability automation may only be usable on the subset of safety problems having well-specified objectives. My argument relies crucially on the possibility of automatically verifying interpretability progress. For other alignment directions (e.g. corrigibility, studying power-seeking, etc.) which do not admit automatic verification, it appears unjustified to assume automation within the same time-horizon in the absence of a clear argument for automation tractability. I am optimistic that further thinking on automation prospects could identify other automation-tractable areas of alignment and control (e.g. see here for previous work).

To automate open-ended interpretability research, i.e. developing new paradigms, I expect we need AI capabilities over time horizons of between 40 and 1,000 hours. Assuming a 100-250 day doubling time, informed by METR's capability time-horizons work, this predicts interpretability automation between 2027 and 2030 (66% CI) using the distributions shown in the appendix.[1] This timeline to automation is of particular importance insofar as automated interpretability research becomes feasible while behavioural control methods still suffice. The resulting interpretability methods can then be used as an input to stronger control or alignment methods.

I’ll first lay out my claim regarding automation and then sketch how methods developed by this automated interpretability pipeline could and couldn’t be used for safety purposes.

Argument for interpretability automation

I focus on the claim that interpretability will likely be automated by training against robust downstream metrics which use interpretability methods. By interpretability I mean any white-box method which improves over behavioural baselines. Such methods may or may not resemble current interpretability methods, as discussed in the following section. In all the automation claims below, I am assuming access to expert human-level AI systems corresponding to Ryan Greenblatt’s 10x or TEDAI. Such AI must perform roughly on par with scaling-lab research scientists when evaluated on well-scoped person-month tasks.

Three subclaims

  1. Successful new interpretability methods applied to downstream tasks, {t}, would improve time efficiency over behavioural methods. Here, downstream tasks are defined over model input and output distributions.[2] 
    Examples of tasks:

    1. Estimate an upper-bound on performance elicited after prompt optimisation.

    2. Given a few-shot task which can be scored in an automated fashion, e.g. test-case passing for code, predict the Shapley value for each few-shot example.

  2. We will be able to identify a distribution of such downstream tasks without interpretability progress, which together define an RL environment and reward function for automated interpretability R&D training and evaluation.
    1. Reward over a distribution of tasks can be defined as in the equation below. Here, I is a candidate interpretability algorithm which is first run in a task-agnostic phase processing the weights and activations of the model M, yielding I(M)—comparable to training an SAE.[3] Second, using the task-agnostic model interpretation I(M), I is evaluated on its utility for improving time efficiency and accuracy in solving downstream tasks (see the sketch after this list): R = E_{t~T} [ R_t ( E_{x~t} [ I(x, t, I(M)) ] ) ]

  3. Given such an R, we can optimise for hold-out task performance to discover automated interpretability R&D methods, placing interpretability in a similar reference class to RE-bench problems[11]—problems on which progress is predictable given compute scale-up.
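As a minimal sketch of the reward in subclaim 2: the method names I.task_agnostic_phase, I.solve, t.sample_input, and t.score below are hypothetical stand-ins for the components described above, and a finite sample of tasks stands in for the distribution T.

def reward(I, M, tasks, n_inputs_per_task=32):
    # Sketch of R = E_{t~T}[ R_t( E_{x~t}[ I(x, t, I(M)) ] ) ].
    I_M = I.task_agnostic_phase(M)  # run once per model; not charged to any single task
    task_rewards = []
    for t in tasks:  # Monte Carlo estimate of the outer expectation over t ~ T
        xs = [t.sample_input() for _ in range(n_inputs_per_task)]  # inner expectation over x ~ t
        outputs = [I.solve(x, t, I_M) for x in xs]
        task_rewards.append(t.score(outputs))  # R_t applied to the aggregated outputs
    return sum(task_rewards) / len(task_rewards)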

I claim there are two routes to defining such R. I’ll refer to these as the generalisation frame and the regularisation frame.

In the generalisation frame, we take a meta-learning approach: define a set of behavioural tasks, T, which is sufficiently general that performance on held-out t in T requires exploiting structure in the trained model weights. The task set could include evals currently used by interpretability researchers to validate their methods.[4] At a meta-level, in designing the automated R&D objective, we require an R which induces a performance gap between the optimal white-box method and the optimal black-box method. We then set this meta-learning reward as our auto-R&D objective.

In the regularisation frame, the auto-interpretability R&D objective is (a version of) the ARC program: find the master form of MDL/surprise accounting which evaluates the quality of interpretability methods in a way sufficiently general that new interpretability paradigms (e.g. circuit discovery methods and weight-based decompositions) can be evaluated on an equal footing. In this case, it may be sufficient to directly consider a single downstream task of interest and dispense with the meta-learning setup, instead using single-task performance with MDL accounting as the auto-R&D objective.
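To make this concrete, here is a minimal sketch of an MDL-style score in the spirit of surprise accounting; the particular two-term split and the argument names are illustrative assumptions of mine, not ARC's actual formalism.

import numpy as np

def mdl_style_score(interpretation_bits, predicted_probs, observed_tokens):
    # Lower is better: bits to describe the interpretation itself, plus the
    # surprise (negative log-likelihood, in bits) of the model behaviour the
    # interpretation fails to predict.
    idx = np.arange(len(observed_tokens))
    surprise_bits = -np.sum(np.log2(predicted_probs[idx, observed_tokens] + 1e-12))
    return interpretation_bits + surprise_bits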

In the rest of the post, I will focus on the generalisation frame; for more on the regularisation frame see ARC’s work e.g. here.

Automated R&D search protocol

Let’s consider in detail what a protocol might look like under the generalisation frame. We require a meta-learning reward R composed of subtasks, R_t. The design of our protocol would mimic the scientific method, in which a theory developed from observations of a certain class, O, must provide falsifiable predictions in a distinct setting O’. Hence we provide our auto R&D agent with a subset of tasks R_{t<k}, using which the agent must develop new interpretability methods. These methods will then be validated for their usefulness on hold-out tasks R_{t≥k}. In general, we expect these hold-out tasks to be solvable without meta-learning. Hence, raw performance does not suffice to measure the contribution of interpretability methods. Instead, reward can be determined by improvements in efficiency: the meta-learned interpretability methods must accelerate the work of a separate execution agent on solving each R_{t≥k}. To achieve high reward, it is necessary for algorithm I to exploit task-general structure in the model; task-specific methods including linear probes perform poorly.[5] 

Concretely, tasks in R might include: “estimate an upper-bound on performance elicited after prompt optimisation”, “predict candidate universal jailbreaks’ generalisation performance”,[6] and others listed in the appendix. Note that these tasks must be robustly verifiable; model-as-judge setups cannot be used as optimisation targets lest the optimisation process devolve into learning judge-model biases rather than explaining the actual network of interest. This means we may not be able to directly optimise for eliciting bad contexts, as would be natural to prevent scheming.

For clarity, a possible protocol is shown below (this can be skipped without losing context). We emphasise that in the below setting the R&D agent never sees the tasks it is evaluated on. It must develop interpretability methods that are generally useful for unseen, downstream tasks.

Inputs:
- M: Target model for interpretation
- M1: R&D agent that develops interpretability methods
- M2: Execution agent that applies methods to tasks
- T_train: Set of training tasks {t_1, t_2, ..., t_k}
- T_val: Set of validation/hold-out tasks {t_{k+1}, ..., t_n}
- R_t: Task-specific reward functions
- θ: Initial parameters for M1
Outputs:
- I*: Optimal interpretability method
- Performance metrics across validation tasks
Algorithm:
# Initialization
best_reward = float("-inf")
I_star = None               # I* in the text: best interpretability method found so far
performance_metrics = {}    # per-task metrics for the best method
for iteration in range(max_iterations):
    # PROPOSAL PHASE
    # R&D agent M1 proposes a candidate interpretability algorithm using only the training tasks
    I = M1.propose_method(θ, T_train)
    # Apply interpretability method to model and cache results
    # This computation is not counted towards task-specific FLOP budgets
    I_M = I(M)  # Cache the task-agnostic interpretation of model M
    # REWARD PHASE
    total_reward = 0
    task_metrics = {}
    for t in T_val:  # For each hold-out validation task
        # BASELINE SETTING
        # Run execution agent on task without interpretability method
        flop_tracker_baseline = initialize_flop_counter()
        with flop_tracker_baseline:
            baseline_performance = evaluate(M2(t))
        baseline_flops = flop_tracker_baseline.total_flops
        # AUTO-INTERP SETTING
        # Run execution agent with interpretability method
        flop_tracker_interp = initialize_flop_counter()
        with flop_tracker_interp:
            interp_performance = evaluate(M2(t, I_M, I))
        interp_flops = flop_tracker_interp.total_flops
        # Calculate task-specific reward
        # Reward increases with performance improvement and FLOP reduction[7]
        performance_gain = interp_performance - baseline_performance
        flop_reduction = baseline_flops - interp_flops
        # Note: We penalize linearly for FLOP cost
        R_t_value = R_t(performance_gain, flop_reduction)
        total_reward += R_t_value
        task_metrics[t] = (performance_gain, flop_reduction)
    # Average reward across all validation tasks
    avg_reward = total_reward / len(T_val)
    # Update best interpretability method if better
    if avg_reward > best_reward:
        best_reward = avg_reward
        I_star = I
        performance_metrics = task_metrics
    # Update M1 parameters based on reward
    θ = update_parameters(θ, avg_reward)
return I_star, performance_metrics
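As one illustration of how an individual R_t in the protocol above might combine the two terms; the functional form and the coefficients below are purely illustrative assumptions rather than a concrete proposal.

def example_R_t(performance_gain, flop_reduction, alpha=1.0, beta=1e-12):
    # Hypothetical instantiation of R_t matching the call above: reward grows with
    # the performance improvement and linearly with FLOPs saved relative to the
    # behavioural baseline. alpha and beta are illustrative trade-off coefficients;
    # beta is tiny because raw FLOP counts are large. Footnote 7 notes we may
    # prefer to penalise sample complexity instead.
    return alpha * performance_gain + beta * flop_reduction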

The problem of automated interpretability R&D then reduces to the problem of producing relevant tasks. These tasks in general need not be alignment-relevant; we are simply interested in rewards, R_t, which can be defined as functions of model behaviours which are (1) behaviourally expensive to predict, (2) cheap to validate, and (3) can be accelerated (in principle) by interpretability. Crucially, we need not verify that all these desiderata hold on a task-by-task basis: tasks which do not satisfy (1) or (3) simply reduce the effective convergence rate, and tasks not satisfying (2) can be dropped automatically. We need enough tasks to support train, validation, and test splits over tasks; I expect something on the order of 25 would be sufficient.

Comparison to existing interpretability metrics: Progress on interpretability is heavily dependent on defining notions of simplicity (e.g. Apollo’s recent input-specific weight-component objective); for our three subclaims to hold, we require reward functions which do not pre-specify the form of MDL to be used. I see the generalisation frame as more likely to succeed than ARC’s regularisation-frame agenda, since our subclaims need not be derived via any theoretical breakthrough[8]; it suffices for the defined objective to empirically reward useful new interpretability methods and penalise non-useful ones. I am optimistic, but there is some world in which theoretical progress is necessary. For instance, the reward function R I described requires normalisation terms in each R_t determined by FLOPs, but the scaling in FLOPs (e.g. linear or polynomial) may only work when precisely tuned, and empirically tuning this expression may be cost-prohibitive.

Note that the argument presented so far does not apply to other, non-interpretability, alignment questions. Generically, if the research problem is to take a conceptual issue, e.g. corrigibility or power-seeking, and formalise it, then by definition there is no known metric that can be written down in advance.

Safety consequences of automated interpretability R&D

Here we assume optimisation against some set of objectives R works, in the sense of yielding algorithms that perform far above human-developed methods. What might such algorithms look like? I decompose the possible outcomes into scenarios having different consequences for alignment.

Five scenarios (ordered by success level; my credences[9] are prepended)

  1. 5% Full decompilation and ontology correspondence—assumed here to be computationally tractable.
  2. 35% Significant performance recovered when modelling target networks with discovered generalisations to causal circuits, features, heuristic arguments, etc.—e.g. resulting in, say, one FLOP-OOM of performance left unexplained.
  3. 25% Bag of hyper-networks, task-specific ‘reporter’ heads (similar to the old ELK report) etc. which do not provide a uniform level of analysis and do not offer systematic insight usable to advance human-led interpretability efforts.
  4. 15% Improvements in strictly functional optimisation methods outperform interpretability on all reward functions used (though any successful method here would still involve significant innovation, and have a signature more akin to a meta-learning RL objective than to existing jailbreak methods[10]).
  5. 20% Other outcomes not well described by anything on the spectrum above.

What affordances does each of these levels of interpretability success offer for safety purposes? Scenario 1 is nearly alignment-complete. In the other scenarios I claim there is clear use for control purposes, but the usefulness for alignment protocols may vary.

I’ll focus on scenario 2, since this is my modal world. In this world, the resulting algorithm, call it I*, identifies new structures defining and relating components of neural networks (over parameters and/or activations). I* uses these components to make generalisable and efficient predictions about model behaviours. From here we would like to use this network decomposition as an input to our alignment and control loops. Obviously, one thing we would do is iterate on human-led interpretability, absorbing the scientific insights offered by I*, but the timeline on this process may extend beyond a year, so I will set this aside.

Let’s walk through what scenario 2 gives us: we have a fixed, task-agnostic decomposition of a network which allows for efficient (compared to behavioural baselines) prediction of certain easy-to-evaluate functions over model input-output distributions. Let’s denote the decomposition of weights (assumed to exist only in scenario 2) as I*(M). There are two ways to use I*: First, we might define a new task of interest which matches the training type signature, i.e. a function over model I/O. Second, we might use I*(M), the learned network decomposition, without mapping through any behavioural function.

First, here’s an example of an attempted direct use of I* for safety: for the eliciting bad contexts problem, I* can be run with the bad-contexts judge as R_t. We may assume we engineered training tasks to match the type signature of the eliciting bad contexts problem, so I* will be directly applicable. However, it remains uncertain how much optimisation pressure can be applied to I* for the purposes of eliciting bad contexts. Whereas the training and validation tasks in the previously mentioned cases do not face overfitting, overfitting will occur asymptotically when optimising against the model-as-judge setup required for recognising bad contexts. A couple of further applications of direct I* use for control and sandbagging-mitigation purposes are provided in the ‘Example Tasks’ appendix section.

Second, consider the representations-only use of I*(M) for safety. In this case, we have a set of features, circuits, etc. which are indexed effectively for the distribution of tasks included in the reward function. Predicting whether the decomposition found by I* is relevant to a downstream task is akin to the question of whether to expect transfer learning. Here are two examples of downstream tasks which intuitively differ in this regard:

  1. (positive example) Distinguish intentional sandbagging from random failure. 
    It seems the structure necessary for efficient capabilities elicitation (mentioned in subclaim 1) overlaps with the structures relevant to distinguishing sandbagging from random failure. Assuming I* has isolated a sparse set of structures relevant to capabilities elicitation, I*(M) includes useful substructures on which to apply existing mechanistic anomaly detection methods (density estimates, causal interventions, etc.); see the sketch after this list.
  2. (negative example) Identify scheming towards a long-term goal. 
    Suppose the horizon length relevant to a scheming model’s goal is an order of magnitude longer than any of the input lengths included in training I*; then it is unclear why the I*(M) structure would efficiently encode long-term goal structures.
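As a minimal sketch of the positive example above, assuming (my assumption, for illustration) that I*(M) exposes a matrix of feature activations for the elicitation-relevant components and that a simple Gaussian density estimate suffices as the anomaly detector:

from sklearn.covariance import EmpiricalCovariance

def fit_trusted_density(trusted_feature_acts):
    # trusted_feature_acts: (n_samples, n_features) activations of the
    # elicitation-relevant components exposed by I*(M), collected on inputs
    # where the model is known to be trying.
    return EmpiricalCovariance().fit(trusted_feature_acts)

def sandbagging_score(estimator, query_feature_acts):
    # Higher Mahalanobis distance = more mechanistically anomalous failure,
    # i.e. more suspicious of intentional sandbagging than of random error.
    return estimator.mahalanobis(query_feature_acts)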

Acknowledgments

Thanks to Arthur Conmy, Robert Kirk, Martin Soto, and Kola Ayonrinde for very useful feedback.

Appendix

Example Tasks

Training tasks: In the examples below, validating correctness of specific task predictions can be done via an expensive brute-force search. This is a fixed cost independent of the number of algorithms tested (modulo the need to prevent statistical overfitting).

  • Active learning prediction: Given a task and a set of N samples, predict which sample would improve loss the most if trained on (a brute-force validation sketch follows this list).
  • Early-exit uncertainty estimation: Given a task (e.g. GPQA), a question q ~ t, and a response x ~ f(·|q): predict the first token at which confidence in the final answer exceeds 90%. Validation can be done via fine-tuning a separate classifier, or by optimising over prompts.
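For instance, the active-learning prediction task could be validated by brute force roughly as follows; the helpers eval_task_loss and train_one_step are hypothetical placeholders.

import copy

def brute_force_active_learning_labels(model, eval_task_loss, candidate_samples, train_one_step):
    # Ground truth for the active-learning prediction task: for each candidate
    # sample, actually train on it and measure the resulting drop in task loss.
    # This loop is the fixed validation cost, run once per task regardless of
    # how many interpretability algorithms are later scored against it.
    base_loss = eval_task_loss(model)
    improvements = []
    for sample in candidate_samples:
        trial_model = copy.deepcopy(model)
        train_one_step(trial_model, sample)
        improvements.append(base_loss - eval_task_loss(trial_model))
    return improvements  # the interpretability method is scored on predicting these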

These are all regression tasks, but we should also expect mechanistic understanding of a model to support task adaptation for sequence modelling. This is relevant to control, e.g. the problem of re-sampling given a suspicious output. In this case, we may want to sample to optimise D_KL(M(x) || M'(x)) + H(M'(x)) conditional on R(π_M) = R(π_M'), or to sample from a modified max-entropy RL policy M'. In general it will be prohibitively expensive to find such an M' for every new deployment task, so again a FLOP-regularised interpretability meta-learner may make a difference for this class of tasks as well.
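A rough sketch of scoring a candidate modified policy M' under this objective, assuming access to next-token logits from both models; the direct sum below is a transcription of the expression above, not a tuned objective.

import torch.nn.functional as F

def resample_objective(logits_M, logits_M_prime):
    # Per-position D_KL(M(x) || M'(x)) + H(M'(x)), summed over the sequence.
    # logits_*: (seq_len, vocab_size) next-token logits for the same prefix x.
    # The constraint R(pi_M) = R(pi_M') would be checked separately, e.g. by
    # rejecting candidate M' whose task reward differs from M's.
    log_p = F.log_softmax(logits_M, dim=-1)
    log_q = F.log_softmax(logits_M_prime, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)   # D_KL(M || M') per position
    entropy = -(log_q.exp() * log_q).sum(dim=-1)       # H(M') per position
    return (kl + entropy).sum()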

Control task: The example tasks listed so far all involve static weights, i.e. a fixed trained model. We can also consider two-player games in which a red team modifies a model such that it sandbags on certain sub-distributions, and a blue team must develop white-box methods mitigating (hold-out) red-team modifications. Similarly, an auditing game could be set up wherein a red-team model installs backdoors and the blue team identifies them.

Timelines to automation

You can plug in your own values in this colab. Note that there is significant uncertainty in how to convert the wall-clock researcher time taken to produce new interpretability papers into inputs to the model. As an example of one such consideration, if researcher wall-clock time was composed of subproblems that could have been done in parallel, then we'd likely want to count only the longest serial subtask.
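A minimal version of the extrapolation (this is not the linked colab; the current-horizon and doubling-time defaults below are illustrative placeholders to be replaced with your own values):

import numpy as np

def years_to_horizon(target_hours, current_horizon_hours=8.0, doubling_days=175.0):
    # Doublings needed to go from the assumed current time horizon to the target
    # task length, converted to calendar years. Both defaults are placeholders,
    # not measurements.
    doublings = np.log2(target_hours / current_horizon_hours)
    return doublings * doubling_days / 365.0

# E.g. the 40-1,000 hour range from the main text:
print(years_to_horizon(40), years_to_horizon(1000))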

Notes on Scenarios

Scenarios 3 and 4 seem preventable by a suitable version of ARC-style surprise accounting, and perhaps by simpler compression objectives along the lines of len(I(M)).

Will labs spend significant compute on automating interpretability?

Automated interpretability can have payoff in parochial ways, e.g. for automating capabilities elicitation, jailbreaking, etc., while providing a positive externality in measuring scheming risks.

  1. ^

     In reality, there will be a gradual automation of the research process and this post focuses only on an endpoint of that process wherein AI has taken over the whole research pipeline.

  2. ^

     That is, for R_t, the functional behaviour of the model f_θ screens off the dependence on parameters.

  3. ^

     For the R&D process, the auto-researcher AI would be provided with a training subset of tasks, with evaluation done on a held-out test set of tasks.

  4. ^

     E.g. tasks 7 and 8 here.

  5. ^

     There is some subtlety here: linear probes which are useful across tasks may still contribute non-negligibly to reward, e.g. if a probe is trained to read out model confidence in a domain-general form. Note, however, that in this case the ontology used to define the linear probes' labels has discovered non-trivial structure in the model’s activations.

  6. ^

     E.g. evaluated by hold-out misuse prompt NLL; the evaluation need not track what matters for misuse R&D purposes.

  7. ^

     We may wish to penalize for sample complexity instead.

  8. ^

     And we do not need to discover an efficient algorithm optimising the objective; the automated AI R&D is assumed to take on this task.

  9. ^

     These scenarios are compatible; in particular, I suspect scenario 4 will precede the other scenarios. A first stab at operationalising what I mean by these credences: “After automated R&D progress on these metrics first achieves success, progress will plateau for one year at level N”.

  10. ^

     In particular, existing jailbreak methods are local/stateless, accepting individual target strings as input. Solving this problem would instead necessitate taking global behaviour into account, making it somewhat more similar to interpretability; in reality there’s likely a continuum between scenarios 2 and 3. Already, the Transluce work does a simple version of this, since their RL pipeline updates parameters and so has some amount of generalisation.

  11. ^

     The kind of reward I’m defining here would have a longer time horizon than the RE-bench problems (2-4 OOMs higher), but by estimating this difference and looking at the within-RE-bench variance, we can extrapolate from one to the other.
