You didn't mention the policy implications, which I think are one of the most impactful reasons (if not the most impactful) to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.
Your AUCs aren't great for the Turpin et al. datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see whether the same lie detector technique would work?
I'm preregistering a prediction that it's possible, and further that sycophancy-style follow-up questions would work well (the model is more sycophantic if it has previously been sycophantic).
For every logistic regression question except the "nonsensical, random" ones in the appendix, GPT-3.5's response is "no" (T=0). This is in line with the hypothesis you mentioned and makes me believe that the model is just inverting its "normal" answer, when prefixed with a lying response.
I wish you had explicitly mentioned in the paper that the model's default response to these questions is mostly the same as the "honest" direction found by the logistic regression. That makes the nonsensical question results much less surprising (basically the same a...
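For concreteness, here's a minimal sketch of the kind of follow-up-question detector I have in mind. Everything here is made up (synthetic answers, arbitrary numbers); the only point is the shape of the pipeline: binary yes/no answers to elicitation questions go into a logistic regression, and you score it with AUC.

```python
# Minimal sketch of a follow-up-question lie detector, on made-up data.
# Assumption: each transcript is summarized as a vector of the model's yes/no
# answers (1 = "yes", 0 = "no") to a fixed set of elicitation questions, and the
# label says whether the model had just lied. Names and numbers are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_transcripts, n_questions = 200, 10

# Fake answer patterns: lying runs flip the model's default answers more often.
default_answers = rng.integers(0, 2, size=n_questions)      # model's usual yes/no
labels = rng.integers(0, 2, size=n_transcripts)              # 1 = model lied
flip_prob = np.where(labels[:, None] == 1, 0.6, 0.1)
answers = np.where(rng.random((n_transcripts, n_questions)) < flip_prob,
                   1 - default_answers, default_answers)

detector = LogisticRegression().fit(answers[:100], labels[:100])
scores = detector.predict_proba(answers[100:])[:, 1]
print("held-out AUC:", roc_auc_score(labels[100:], scores))
```

In the real setting the features would be the model's actual T=0 answers, and the interesting test is holding out whole datasets (e.g. the Turpin et al. questions) rather than a random split.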
TLDR: We don't have to hope for generalization of our oversight procedures. Instead, we can 1) define a proxy failure that we can evaluate and 2) worst-case against our oversight procedure on the actual distribution we care about (but using the proxy failure so that we have ground truth).
It could be prepended then, but also, does it make a difference? It won't attend to the filler while going over the question, but it will attend to the question while going over the filler.
I think you're saying there should be no difference between "<filler><question>" and "<question><filler>". Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.
But the first layout doesn't actually get us anythi...
It's possible that the model treats filler tokens differently in the "user" vs "assistant" part of the prompt, so they aren't identical. That being said, I chose to generate tokens rather than appending to the prompt because it's more superficially similar to chain of thought.
Also, adding a padding prefix to the original question wouldn't act as a filler token because the model can't attend to future tokens.
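To make the causal-masking point concrete, here's a toy illustration (nothing model-specific, just the standard mask arithmetic): with the "<filler><question>" layout, the filler rows get zero weight on the question columns.

```python
# Toy illustration of causal masking: position i can only attend to positions j <= i.
# With layout "<filler><question>", the filler positions (early i) put zero attention
# weight on the question positions (later j), so the filler can't do question-specific work.
import numpy as np

seq = ["<f>", "<f>", "<f>", "Q1", "Q2", "Q3"]            # 3 filler tokens, then the question
n = len(seq)
scores = np.random.default_rng(0).normal(size=(n, n))    # stand-in attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)          # True where j > i (the future)
scores[mask] = -np.inf
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row

print(seq)
print(np.round(attn, 2))   # rows = query position; note the zeros above the diagonal
```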
First, a clarification:
And now some takes. I find both of your hypotheses intriguing. I'd never considered either of those before, so thanks for bringing them up. I'm guessing they're both wrong, for the following reasons:
By repetition penalty, do you mean an explicit logit bias applied when sampling, or that the model has internally generalized toward avoiding repeated tokens?
Neat! I'll reach out.
Yep, I had considered doing that. Sadly, even if resample ablations on the filler tokens reduced performance, that wouldn't necessarily imply that the filler tokens are being used for extra computation. For example, the model could just copy the relevant details from the problem into the filler token positions and solve it there.
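In case it helps, here's roughly the shape of the resample ablation I was considering, sketched with TransformerLens. The model, prompts, layer, and positions are all placeholders (and the real experiments generated the filler tokens rather than putting them in the prompt), so treat this as a sketch rather than the actual setup.

```python
# Rough sketch of a resample ablation on the filler token positions, using
# TransformerLens. Assumptions (all placeholders): gpt2 as the model, filler
# written into the prompt, one layer patched, positions eyeballed for this prompt.
import torch
from transformer_lens import HookedTransformer, utils

torch.set_grad_enabled(False)
model = HookedTransformer.from_pretrained("gpt2")

orig_prompt  = "Q: What is 17 * 23? . . . . . Answer:"
donor_prompt = "Q: What is 41 * 19? . . . . . Answer:"   # unrelated problem, same layout

orig_tokens  = model.to_tokens(orig_prompt)
donor_tokens = model.to_tokens(donor_prompt)
print(model.to_str_tokens(orig_tokens))   # check which positions the " ." filler occupies

_, donor_cache = model.run_with_cache(donor_tokens)

filler_slice = slice(-7, -2)   # the five " ." tokens; -2 and -1 are " Answer" and ":"
layer = 6                      # a fuller version would patch every layer
hook_name = utils.get_act_name("resid_post", layer)

def resample_filler(resid, hook):
    # resid: [batch, pos, d_model]. Overwrite the filler positions with the
    # corresponding activations from the donor prompt's forward pass.
    resid[:, filler_slice, :] = donor_cache[hook_name][:, filler_slice, :]
    return resid

clean_logits   = model(orig_tokens)
patched_logits = model.run_with_hooks(orig_tokens, fwd_hooks=[(hook_name, resample_filler)])
# How much the prediction at the final position moves under the ablation:
print((clean_logits[0, -1] - patched_logits[0, -1]).abs().max())
```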
Huh, interesting! Who else has run filler token experiments?
I was also interested in this experiment because it seemed like a crude way to measure how non-myopic LLMs are (i.e. what fraction of the forward pass is devoted to current vs. future tokens). I wonder if other people were mostly coming at it from that angle.
This has been fixed now. Thanks for pointing it out! I'm sorry it took me so long to get to this.
I've uploaded a fixed version of this paper. Thanks so much for putting in the effort to point out these mistakes - I really appreciate that!
I'm confused: how did it get the answer correct without running code?
Great response! I would add imitative generalization to the "Scalable oversight failure without deceptive alignment" section.
Yes I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack. This makes it much harder for the blue team.
Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:
In MAD, the blue team is given a backdoored model, but not necessarily a description of what the model is doing (from the judge) or of the backdoor behavior (from the red team). Instead, the blue team...
Causal Scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency, but not the necessity, of your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.
Redwood has been experimenting with learning (via gradient descent) causal scrubbing explanations that somewhat address your necessity point. Specifically:
We haven't had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it's possible. I'm not sure when you would want to use one, but I haven't thought about it that much.
I enjoyed reading this a lot.
I would be interested in a quantitative experiment showing what % of the model's performance is explained by this linear assumption. For example: identify all output weight directions that correspond to "fire", project those out only for the direct path to the output (and not the path to later heads/MLPs), and see if it tanks accuracy on sentences where the next token is "fire".
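A rough sketch of that intervention, with made-up dimensions and directions (the point is just the linear algebra: downstream components see the full output, while the direct path to the logits sees the output with the "fire" directions projected out):

```python
# Sketch of "project out the 'fire' directions, but only on the direct path".
# D holds (made-up) directions in the residual stream that write "fire"; x is a
# component's output. Downstream layers would see x unchanged; only the direct
# contribution to the logits uses the projected version.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_dirs, d_vocab = 64, 3, 100

D = rng.normal(size=(d_model, n_dirs))        # hypothetical "fire" output directions
W_U = rng.normal(size=(d_model, d_vocab))     # stand-in unembedding matrix
x = rng.normal(size=(d_model,))               # some component's output activation

# Orthogonal projection onto span(D), then remove it: x_perp has no "fire" component.
P = D @ np.linalg.pinv(D)
x_perp = x - P @ x

direct_logits_full    = x @ W_U               # normal direct-path contribution
direct_logits_ablated = x_perp @ W_U          # direct path with "fire" directions removed
x_for_downstream      = x                     # later heads/MLPs still receive the full output

print(np.abs(direct_logits_full - direct_logits_ablated).max())
```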
I'm confused how to interpret this alongside Conjecture's polytope framing? That work suggested that magnitude as well as direction in activation sp...
Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:
MLP(x) = f(x) + (MLP(x) - f(x))
and then claim that the MLP(x) - f(x) term is unimportant. There is an example of this in the parentheses balancer example.
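As a toy illustration (f here is an arbitrary stand-in for whatever interpretable piece the hypothesis picks out; the weights and shapes are made up):

```python
# Toy illustration of rewriting MLP(x) = f(x) + (MLP(x) - f(x)) and then claiming
# the second term is unimportant. f is an arbitrary stand-in for the
# "interpretable part"; in a real analysis it comes from your hypothesis.
import numpy as np

rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(16, 64)), rng.normal(size=(64, 16))

def mlp(x):
    return np.maximum(x @ W_in, 0) @ W_out            # a small ReLU MLP

def f(x):
    # Hypothetical interpretable component: the MLP restricted to its first 8 hidden units.
    return np.maximum(x @ W_in[:, :8], 0) @ W_out[:8, :]

x = rng.normal(size=(16,))
residual = mlp(x) - f(x)

# The rewrite is exactly the original MLP...
assert np.allclose(mlp(x), f(x) + residual)
# ...and the claim to check is that the residual term is small / ablatable:
print(np.linalg.norm(residual) / np.linalg.norm(mlp(x)))
```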
Nice summary! One small nitpick:
> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features
This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can "rewrite" our model into an equivalent form that better reflects the computation it's performing. For example, if we claim that a ce...
Good question! As you suggest in your comment, increasing marginal returns to capacity induce monosemanticity, and decreasing marginal returns induce polysemanticity.
We observe this in our toy model. We didn't clearly spell this out in the post, but the marginal benefit curves labelled from A to F correspond to points in the phase diagram. At the top of the phase diagram where features are dense, there is no polysemanticity because the marginal benefit curves are increasing (see curves A and B). In the feature sparse region (points D, E, F), we see...
Extreme nit: you probably meant for this to be lowercase. I love this series!