Having escaped infinite overtime associated with getting the paper done, I'm now going back and catching up on some stuff I couldn't dive into before.
Going through the sleeper agents paper, it appears that one path—adversarially eliciting candidate backdoor behavior—is hampered by the weakness of the elicitation process. Or in other words, there exist easily accessible input conditions that trigger unwanted behavior that LLM-driven adversarial training can't identify.
I alluded to this in the paper linkpost, but soft prompts are a very simple and very strong option for this. There remains a difficulty in figuring out what unwanted behavior to adversarially elicit, but this is an area that has a lot of low hanging fruit.
I'd also be interested in how more brute-force interventions, like autoregressively detuning a backdoored model with a large soft prompt over a very large dataset (or an adversarially chosen anti-backdoor dataset), compare to the other SFT/RL interventions. Activation steering, too; I'm currently guessing activation-based interventions are the cheapest for this sort of thing.
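To make the soft prompt option concrete, here's a minimal sketch of eliciting a target completion from a frozen model via an optimized soft prompt, assuming a HuggingFace-style causal LM; the `model`/`tokenizer` objects and all hyperparameters are stand-ins rather than a tested recipe:

```python
import torch

# Minimal soft-prompt elicitation sketch (assumes a HuggingFace-style causal LM).
# The model stays frozen; only the soft prompt embeddings are optimized to make a
# chosen target completion likely.
def train_soft_prompt(model, tokenizer, target_text, num_tokens=16, steps=500, lr=1e-3):
    device = next(model.parameters()).device
    for p in model.parameters():
        p.requires_grad_(False)

    embed = model.get_input_embeddings()
    soft = torch.randn(1, num_tokens, embed.embedding_dim, device=device) * 0.01
    soft.requires_grad_(True)

    target_ids = tokenizer(target_text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        target_embeds = embed(target_ids)

    opt = torch.optim.Adam([soft], lr=lr)
    for _ in range(steps):
        inputs = torch.cat([soft, target_embeds], dim=1)
        # Loss only on the target tokens; the soft prompt positions are masked out.
        labels = torch.cat(
            [torch.full((1, num_tokens), -100, dtype=torch.long, device=device), target_ids],
            dim=1,
        )
        loss = model(inputs_embeds=inputs, labels=labels).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft.detach()
```

The same loop extends naturally to ensembles over restarts and token counts, which is the version I'd actually trust for elicitation.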
A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for evaluations of dangerous capabilities.
On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (i.e. execution fine-tuning cannot introduce any behaviors accessible from any region of input space without soft prompts also being able to elicit them), then they're a more trustworthy evaluation in practice.
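As a sketch of what "winning" could mean operationally (every function here—`fine_tune`, `train_soft_prompt`, `behavior_loss`—is a hypothetical placeholder for a full pipeline, not a real API):

```python
# Hypothetical comparison harness; `fine_tune`, `train_soft_prompt`, and
# `behavior_loss` stand in for a full training/evaluation pipeline.
def soft_prompts_cover_behavior(base_model, behavior_dataset, token_counts=(4, 16, 64), tolerance=0.05):
    """True if some soft prompt elicits the behavior roughly as well as execution
    fine-tuning manages to instill it (i.e. soft prompts 'win' as an elicitation tool)."""
    tuned = fine_tune(base_model, behavior_dataset)            # execution fine-tuning
    tuned_loss = behavior_loss(tuned, behavior_dataset)

    prompted_losses = []
    for n in token_counts:                                     # small ensemble over prompt sizes
        soft = train_soft_prompt(base_model, behavior_dataset, num_tokens=n)
        prompted_losses.append(behavior_loss(base_model, behavior_dataset, soft_prompt=soft))

    return min(prompted_losses) <= tuned_loss + tolerance
```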
Another experiment:
Extensions:
Some experimental directions I recently wrote up; might as well be public:
Less concretely, I'd also like to:
3 more posts I feel like I need to write at some point:
Solving all of ethics and morality and getting an AI to implement it seems hard. There are possible worlds where we would need to work with half measures. Some of these paths rely on lower auto-doom densities, but there seem to be enough of those potential worlds to consider it.
An example of 'good enough to not x/s-risk' dumb value alignment, the assumptions required for stability, and the shape of the questions implied, which may differ from those of more complete solutions.
Make a bunch of diagrams of things I believe relevant to alignment stuff and how they interact, plus the implications of those things.
The real point of the post is to encourage people to try to make more explicit and extremely legible models so people can actually figure out where they disagree instead of running around in loops for several years.
Generalizing the principle from policy regularization.
Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?
A simple example: train a model on task A and task B, then stop training on task B while continuing on task A, and track how quickly performance on task B decays (a rough sketch of this loop is below).
Repeat across scale and regularization strategies.
Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).
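A rough sketch of the basic loop, assuming a toy setup where `make_model`, `task_a_batches`, `task_b_batches`, and `evaluate` are placeholders and the model exposes a differentiable `loss(batch)` method:

```python
import torch

# Rough decay-experiment loop. `make_model`, `task_a_batches`, `task_b_batches`,
# and `evaluate` are placeholders for whatever toy setup is used.
def capability_decay_curve(scale, regularization, joint_steps=10_000, decay_steps=10_000, eval_every=500):
    model = make_model(scale, regularization)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Phase 1: joint training on tasks A and B.
    for step, (batch_a, batch_b) in enumerate(zip(task_a_batches(), task_b_batches())):
        if step >= joint_steps:
            break
        loss = model.loss(batch_a) + model.loss(batch_b)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Phase 2: train on A only and log how quickly B's performance falls off.
    curve = []
    for step, batch_a in enumerate(task_a_batches()):
        if step >= decay_steps:
            break
        loss = model.loss(batch_a)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % eval_every == 0:
            curve.append((step, evaluate(model, task_b_batches())))
    return curve  # compare curves across scales and regularization strategies
```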
I've previously done some of these experiments privately, but not with nearly the compute necessary for an interesting result.
The sleeper agents paper reminded me of it. I would love to see what happens on a closer-to-frontier model that's intentionally backdoored, and then subjected to continued pretraining. Can a backdoor persist for another trillion tokens of nonadversarial-but-extremely-broad training? Does that vary across scale etc?
I'd also like to intentionally find the circumstances that maximize the persistence of out-of-distribution capabilities not implied by the current training distribution.
Seems like identifying a robust trend here would have pretty important Implications, whichever direction it points.
Yeah, I've seen work on the sort of thing in your example in the continual learning literature. There are also setups with, say, ten task components, trained sequentially but tested on every task trained so far. Then you can watch the earlier tasks fall off as training progresses.
For what it's worth (perhaps nothing), in private experiments I've seen that in certain toy (transformer) models, task B performance gets wiped out almost immediately when you stop training on it, in situations where the two tasks are related in some way.
I haven't looked at how deep the erasure is, and whether it is far easier to revive than it was to train it in the first place.
Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.
Does training the model to recognize properties (e.g. 'niceness') explicitly as metatokens via classification make soft prompts better at capturing those properties?
You could test for that explicitly:
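One hedged way the comparison could be set up (with `pretrain`, `train_soft_prompt`, and `niceness_score` as hypothetical stand-ins): train two otherwise-identical models, one with the metatoken classification objective and one without, then train matched-size soft prompts on each to elicit the property and compare how well they capture it.

```python
# Hypothetical setup: `pretrain`, `train_soft_prompt`, and `niceness_score` are
# stand-ins. Two models are trained identically except that one also classifies
# 'niceness' metatokens; matched-size soft prompts are then trained on each to
# elicit niceness, and the results compared.
def metatoken_effect(corpus, niceness_labels, prompt_tokens=16):
    baseline = pretrain(corpus, metatokens=None)
    labeled = pretrain(corpus, metatokens=niceness_labels)  # adds metatoken classification

    results = {}
    for name, model in [("baseline", baseline), ("metatoken", labeled)]:
        soft = train_soft_prompt(model, target_property="nice", num_tokens=prompt_tokens)
        results[name] = niceness_score(model, soft_prompt=soft)  # e.g. a held-out classifier
    return results  # a consistent gap favoring "metatoken" would support the hypothesis
```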
Notes and extensions:
Another potentially useful metric in the space of "fragility," expanding on #4 above:
The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified: sampling the gradient of some behavioral loss at small perturbations of the soft prompt suffices.
This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.
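A sketch of that measurement in PyTorch, where `behavioral_loss(model, soft_prompt)` is a placeholder for whatever differentiable loss defines the behavior of interest:

```python
import torch

# Perturbation-based fragility metric. `behavioral_loss(model, soft_prompt)` is a
# placeholder for whatever differentiable loss defines the behavior of interest.
def representational_fragility(model, soft_prompt, sigma=1e-2, samples=32):
    grad_norms = []
    for _ in range(samples):
        perturbed = (soft_prompt + sigma * torch.randn_like(soft_prompt)).detach()
        perturbed.requires_grad_(True)
        loss = behavioral_loss(model, perturbed)
        (grad,) = torch.autograd.grad(loss, perturbed)
        grad_norms.append(grad.norm().item())
    # Large average gradient norms mean small nudges to the representation move the
    # behavioral loss a lot, i.e. high internal representational fragility.
    return sum(grad_norms) / len(grad_norms)
```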
Does internal representational fragility correlate with other notions of "fragility," like the information-required-to-induce-behavior "fragility" in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?
Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way: for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.
If anything, I'd expect anticorrelation; well-learned regions probably have enough training constraints that they've been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.
That'd still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.
Expanding on #6 from above more explicitly, since it seems potentially valuable:
From the goal agnosticism FAQ:
The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.
From an earlier experiment post:
Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
This can be phrased as "what's the amount of information required to push a model into behavior X?"
Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question:
"What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?"
In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:
Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
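A sketch of the sweep, with `train_soft_prompt` and `evaluate_behavior` as placeholders for the elicitation and measurement steps:

```python
# Information/performance curve sweep. `train_soft_prompt` and `evaluate_behavior`
# are placeholders for the elicitation and measurement steps.
def information_performance_curve(model, behavior_dataset, token_counts=(1, 2, 4, 8, 16, 32, 64)):
    curve = []
    for n in token_counts:
        soft = train_soft_prompt(model, behavior_dataset, num_tokens=n)
        curve.append((n, evaluate_behavior(model, behavior_dataset, soft_prompt=soft)))
    # Behaviors that show up strongly at very small n require very little optimized
    # "information" to access; those are the ones an evaluation scheme would flag.
    return curve
```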
This seems like... it's... an extremely good answer to the "fragility" question? It's trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.
Conceptually, it's a quantification of the number of information theoretic mistakes you'd need to make to get bad behavior from the model.
A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.
Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.
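Sketched out, with `train_adversarial_soft_prompt` and `attack_succeeds` as hypothetical placeholders:

```python
# Hypothetical jailbreak-resistance measurement. `train_adversarial_soft_prompt`
# and `attack_succeeds` are placeholders; the defense (system prompt, guardrails,
# etc.) is held fixed while SGD optimizes an adversarial soft prompt against it.
def jailbreak_resistance(model, defense_prompt, adversarial_goal, token_budgets=(1, 2, 4, 8, 16, 32, 64, 128)):
    for n in token_budgets:
        soft = train_adversarial_soft_prompt(model, defense_prompt, adversarial_goal, num_tokens=n)
        if attack_succeeds(model, defense_prompt, soft, adversarial_goal):
            return n      # optimized tokens needed to overcome the defense
    return None           # no tested budget broke it
```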
In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That'd be good to know!
This kind of adversarial prompt automation could also be trivially included in an evaluations program.
I can't imagine that this hasn't been done before. If anyone has seen something like this, please let me know.
I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.
Retrodicting prompts can be useful for interpretability when dealing with conditions that aren't natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.
What does a prompt retrodictor look like?
Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there's nothing special in principle about soft prompts with regard to their impact on conditioning predictions.
Just take large traditional text datasets. Feed the model a chunk of the string. Train on the prediction of tokens before the chunk.
Two obvious approaches:

1. Infilling-style training, in either ordering:
   [Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
   [Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token]
   Nothing stops the prefix sequence from having zero length.
2. Direct reversed prediction:
   [Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]
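For concreteness, a sketch of building one such training example using the suffix-first ordering with a zero-length prefix; the sentinel tokens (`<SUF>`, `<PRE>`, `<END>`) are assumptions standing in for whatever special tokens the tokenizer actually defines:

```python
# Building one retrodiction training example from ordinary text, using the
# suffix-first ordering with a zero-length prefix so that everything before the
# observed chunk is predicted. The sentinel tokens (<SUF>, <PRE>, <END>) are
# assumptions standing in for whatever special tokens the tokenizer defines.
def make_retrodiction_example(tokenizer, text, chunk_start, chunk_end):
    before_ids = tokenizer(text[:chunk_start], add_special_tokens=False).input_ids
    chunk_ids = tokenizer(text[chunk_start:chunk_end], add_special_tokens=False).input_ids
    SUF, PRE, END = tokenizer.convert_tokens_to_ids(["<SUF>", "<PRE>", "<END>"])

    # [Suffix token][observed chunk][Prefix token][retrodicted text][Termination token]
    input_ids = [SUF] + chunk_ids + [PRE] + before_ids + [END]
    # Only the retrodicted text (and the termination token) contribute to the loss.
    labels = [-100] * (2 + len(chunk_ids)) + before_ids + [END]
    return input_ids, labels
```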
But we don't just want some plausible previous prompts, we want the ones that most precisely match the effect on the suffix's activations.
This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes MSE((activations | sourcePrompt), (activations | retrodictedPrompt)), where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option. Collapsing the output distribution into a token is a problem; there's no way to directly propagate the gradient through that collapse and into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe there's something clever floating around.
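To pin down the target itself (not the harder discrete-optimization part), a sketch of the activation-matching loss, assuming a HuggingFace-style model that can return hidden states:

```python
import torch

# The activation-matching target, assuming a HuggingFace-style model that can
# return hidden states. This only pins down the objective; how to get gradients
# back into a discrete retrodicted prompt is the open problem discussed above.
def activation_match_loss(model, source_embeds, retrodicted_embeds, suffix_embeds, layer=-1):
    def suffix_acts(prefix_embeds):
        inputs = torch.cat([prefix_embeds, suffix_embeds], dim=1)
        hidden = model(inputs_embeds=inputs, output_hidden_states=True).hidden_states[layer]
        return hidden[:, -suffix_embeds.shape[1]:]      # activations over the suffix positions

    with torch.no_grad():
        target = suffix_acts(source_embeds)             # (activations | sourcePrompt), provided
    return torch.nn.functional.mse_loss(suffix_acts(retrodicted_embeds), target)
```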
Note that retrodicting with an activation objective has some downsides:
This class of experiment is expensive for natural language models. I'm not sure how interesting it is at scales realistically trainable on a couple of 4090s.
Quarter-baked experiment:
Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:
Another item for the todo list:
Autoregressive transformer gradient flow shapes earlier token computation to serve future predictions, but that early computation cannot condition on future tokens. This should serve as a regularizing influence on the internal structure of token predictions: in order to be useful to the largest possible set of future predictions, the local computation would need to factor itself into maximally reusable modules.
The greater the local uncertainty about the future, the less the local computation can be specialized to serve future tokens. Could consider it something like: the internal representation is a probability-weighted blend of representations useful to possible futures. If the local computation is highly confident in a narrow space, it can specialize more.
Simplicity biases would incentivize sharing modules more strongly. Even if the local computation suspects a narrower future distribution, it would be penalized for implementing specialized machinery that is too rarely useful.
One implication: many forms of token-parallelized search get blocked, because they require too much foresight-driven specialization.
Quarter-baked ideas for potential future baking:
I'm using the word "shard" here to just mean "a blob of conditionally activated preferences." It's probably importing some other nuances that might be confusing, since I haven't read enough of the shard theory posts to catch where it doesn't fit.
This idea popped into my head during a conversation with someone working on how inconsistent utilities might be pushed towards coherence. It was at the Newspeak House the evening of the day after EAG London 2023. Unfortunately, I promptly forgot their name! (If you see this, hi, nice talking to you, and sorry!)
Another item for the todo list:
Naive implementation likely requires a fairly big CodeLlama-34b-Instruct-tier interpreter and can only operate on pretty limited programs, but it may produce something interesting. Trying to apply the resulting interpreter on circuits embedded in larger networks probably won't work, but... worth trying just to see what it does?
Might also be something interesting to be learned in spanning the gap between 'compiled' networks and trained networks. How close do they come to being affine equivalents? If not linear, what kind of transform is required (and how complicated is it)?
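For the affine-equivalence question specifically, a least-squares fit gives a cheap first check; `acts_compiled` and `acts_trained` are placeholder matrices of matched activations collected from the two networks on the same inputs:

```python
import torch

# Cheap first check on affine equivalence: fit the best affine map from compiled-
# network activations to trained-network activations by least squares and look at
# the residual. `acts_compiled` and `acts_trained` are placeholder matrices of
# shape (num_samples, dim), collected from the two networks on the same inputs.
def affine_fit_residual(acts_compiled, acts_trained):
    ones = torch.ones(acts_compiled.shape[0], 1)
    X = torch.cat([acts_compiled, ones], dim=1)             # bias column => affine, not just linear
    W = torch.linalg.lstsq(X, acts_trained).solution
    residual = (X @ W - acts_trained).pow(2).mean()
    return (residual / acts_trained.var()).item()           # ~0 implies near affine equivalence
```

A high residual wouldn't say what transform is required, but it would at least rule out the simple affine case before reaching for anything more complicated.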