All of Kshitij Sachan's Comments + Replies


extreme nit: you probably meant for this to be lowercase. I love this series!

You didn't mention the policy implications, which I think are one of, if not the, most impactful reasons to care about misuse. Government regulation seems super important long-term to prevent people from deploying dangerous models publicly, and the only way to get that is by demonstrating that models are actually scary.

4ryan_greenblatt
Agreed. However, in this case, building countermeasures to prevent misuse doesn't particularly help. The evaluations for potentially dangerous capabilities are highly relevant.

Your AUCs aren't great for the Turpin et al. datasets. Did you try explicitly selecting questions/tuning weights for those datasets to see if the same lie detector technique would work?

I am preregistering that it's possible, and further that sycophancy-style follow-up questions would work well (the model is more sycophantic if it has previously been sycophantic).

2JanB
We didn't try this. This is also my prediction.

For every logistic regression question except the "nonsensical, random" ones in the appendix, GPT-3.5's response is "no" (T=0). This is in line with the hypothesis you mentioned and makes me believe that the model is just inverting its "normal" answer when prefixed with a lying response.

 

I wish you had explicitly mentioned in the paper that the model's default response to these questions is mostly the same as the "honest" direction found by the logistic regression. That makes the nonsensical question results much less surprising (basically the same a... (read more)
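For concreteness, here's a rough sketch of the kind of setup I'm referring to: a logistic regression over the model's yes/no answers to elicitation questions, asked after a suspected lie. The data below is synthetic and the variable names are made up; this is not the paper's actual pipeline.

```python
# Rough sketch (not the paper's actual pipeline): fit a logistic regression on the
# model's yes/no answers to a fixed set of elicitation questions, asked after a
# suspected lie. Feature values and labels here are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

n_transcripts, n_questions = 200, 48
rng = np.random.default_rng(0)

# answers[i, j] = 1 if the model answered "yes" to elicitation question j
# after transcript i, else 0. lied[i] = 1 if the model lied in transcript i.
lied = rng.integers(0, 2, size=n_transcripts)
flip = rng.random((n_transcripts, n_questions)) < 0.25       # noise
answers = np.where(flip, 1 - lied[:, None], lied[:, None])    # answers correlate with lying

clf = LogisticRegression(max_iter=1000).fit(answers[:150], lied[:150])
print("held-out AUC:", roc_auc_score(lied[150:], clf.predict_proba(answers[150:])[:, 1]))

# The observation above: for most question categories, GPT-3.5's *default* (unprompted)
# answer already sits on the "honest" side of the learned separating direction, so a
# lying prefix mostly shows up as the model inverting its usual answer.
```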

3JanB
The reason I didn't mention this in the paper is two-fold: 1. I have experiments where I created more questions of the categories where there is not so clear a pattern, and that also worked. 2. It's not that clear to me how to interpret the result. You could also say that the elicitation questions measure something like an intention to lie in the future, and that unprompted GPT-3.5 (what you call the "default response") has a low intention to lie in the future. I'll think more about this.

TLDR: We don't have to hope for generalization of our oversight procedures. Instead, we can 1) define a proxy failure that we can evaluate, and 2) worst-case against our oversight procedure on the actual distribution we care about (using the proxy failure so that we have ground truth).
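A minimal sketch of that loop, with hypothetical placeholder functions (`proxy_failure`, `oversight_flags`, `red_team_policy`) standing in for the real components:

```python
# Minimal sketch of the evaluation loop described above. All function names are
# hypothetical placeholders, not an actual codebase API.

def proxy_failure(action) -> bool:
    """Ground-truth check for the *proxy* failure we defined (cheaply evaluable)."""
    raise NotImplementedError

def oversight_flags(task, action) -> bool:
    """The oversight procedure we want to evaluate: does it flag this action?"""
    raise NotImplementedError

def red_team_policy(task):
    """Adversarial policy trained to trigger the proxy failure without being flagged."""
    raise NotImplementedError

def evaluate_oversight(deployment_tasks):
    # Run on the *actual* distribution we care about; the proxy failure gives ground truth.
    caught = missed = 0
    for task in deployment_tasks:
        action = red_team_policy(task)
        if proxy_failure(action):
            if oversight_flags(task, action):
                caught += 1
            else:
                missed += 1
    return caught, missed  # missed > 0 means oversight fails in the worst case
```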

> It could be prepended then, but also, does it make a difference? It won't attend to the filler while going over the question, but it will attend to the question while going over the filler.

I think you're saying there should be no difference between "<filler><question>" and "<question><filler>".  Your reasoning is: In the first layout the model attends to filler tokens while going over the question, and in the second the model attends to the question while going over the filler.

But the first layout doesn't actually get us anythi... (read more)

It's possible that the model treats filler tokens differently in the "user" vs "assistant" part of the prompt, so they aren't identical. That being said, I chose to generate tokens rather than appending to the prompt because it's more superficially similar to chain of thought.

Also, adding a padding prefix to the original question wouldn't act as a filler token because the model can't attend to future tokens.
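To illustrate the attention-direction point, here's a toy causal mask over a padded prompt; nothing here is model-specific.

```python
# Illustration of the causal-mask point above (not tied to any particular model):
# with "<pad><pad><question>" the pad positions cannot attend to the question,
# so they can't do question-specific computation; with "<question><pad><pad>"
# the pad positions can attend back to the question.
import numpy as np

tokens = ["<pad>", "<pad>", "Q1", "Q2", "Q3"]        # padding as a prefix
n = len(tokens)
causal_mask = np.tril(np.ones((n, n), dtype=bool))    # row i may attend to columns <= i

for i, tok in enumerate(tokens):
    visible = [tokens[j] for j in range(n) if causal_mask[i, j]]
    print(f"{tok!r} attends to: {visible}")
# The pad rows only ever see other pads, which is why a padding *prefix*
# can't serve as filler computation for the question.
```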

2dr_s
It could be prepended then, but also, does it make a difference? It won't attend to the filler while going over the question, but it will attend to the question while going over the filler. Also, how could it treat tokens differently? Wouldn't it need to be specially trained and have some additional input to do that? Or are you just thinking of the wrapper software doing something?

First, clarification:

  • In Oam's experiments, the vocabulary is a token for each number from 1 to 1000, a pad token, and an intermediate-computation (IC) token. But I wouldn't index on his results too much because I'm unsure how replicable they are.
  • I am indeed using the OA API.

And now some takes. I find both of your hypotheses intriguing. I'd never considered either of them before, so thanks for bringing them up. I'm guessing they're both wrong, for the following reasons:

  • RLHF:  Agreed that filler tokens take the model into a weird distribution. It's not obvious t
... (read more)
6gwern
OK. If Oam really was using a single IC token and getting benefits, then I think that would point to the forward-pass/additional-model-parameters-activated explanation: obviously, his model is unaffected by RLHF, MoEs, model scaling, or any of the alternate hypotheses. He should probably look into that if he can replicate padding benefits. It would not necessarily resolve the question for larger models, but at least it'd establish such an effect exists in some model implementations.

I don't think RLAIF is 'basically the same'. I find Claude to be qualitatively highly different from RLHF'd models - as I've commented before, the RLHF seems to cause some rather perverse consequences that RLAIF avoids. For example, you cannot ask ChatGPT to "Write a non-rhyming poem." - I've tried this prompt dozens of times, and it always writes a rhyming poem, even if you explicitly then tell it it incorrectly wrote a rhyming poem and to try again, while Claude nails it without trouble. RLHF seems to distort the entire distribution of outputs to be on-policy and suck all outputs into the vortex of RLHF'd-pablum, while RLAIF seems to be far more selective and mostly affect directly-morality-relevant outputs, leaving neutral stuff like 'write a poem' undistorted. /shrug.

You are looking at rather different problems here, aren't you? I mean, GSM8K problems are complex word problems, and not like the other problems you are benchmarking like 'Addition' or 'GCD'.

By repetition penalty, do you mean an explicit logit bias applied during sampling, or that the model has internally generalized to avoiding repeated tokens?
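For concreteness, here's a rough sketch of the "explicit logit bias" reading -- a generic sampling-time repetition penalty, not a claim about what the OA API does internally.

```python
# Generic sampling-time repetition penalty: downweight logits of tokens that have
# already been generated. This is an illustration of one common convention, not a
# claim about any particular API's internals.
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int],
                             penalty: float = 1.2) -> np.ndarray:
    """Return a copy of `logits` with previously generated tokens penalized."""
    out = logits.copy()
    for tok in set(generated_ids):
        # Common convention: divide positive logits, multiply negative ones.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = np.array([2.0, 0.5, -1.0, 3.0])
print(apply_repetition_penalty(logits, generated_ids=[0, 2]))
# -> token 0's logit shrinks from 2.0 to ~1.67 and token 2's from -1.0 to -1.2
```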

Yep, I had considered doing that. Sadly, even if resample ablations on the filler tokens reduced performance, that wouldn't necessarily imply that the filler tokens are being used for extra computation. For example, the model could just copy the relevant details from the problem into the filler token positions and solve it there.
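Here's a rough sketch of what I mean by a resample ablation on the filler positions, with toy arrays standing in for cached activations:

```python
# Toy sketch of a resample ablation: replace the activations at the filler-token
# positions with activations cached from an unrelated "donor" prompt, then continue
# the forward pass and check whether task accuracy drops. Shapes are made up.
import numpy as np

def resample_ablate(clean_acts: np.ndarray, donor_acts: np.ndarray,
                    filler_positions: list[int]) -> np.ndarray:
    """Swap in donor activations at the filler positions (positions x d_model)."""
    patched = clean_acts.copy()
    patched[filler_positions] = donor_acts[filler_positions]
    return patched

clean = np.arange(24, dtype=float).reshape(6, 4)   # 6 positions x 4-dim residual stream
donor = np.zeros((6, 4))
patched = resample_ablate(clean, donor, filler_positions=[3, 4, 5])
print(patched)

# Caveat from above: even if feeding `patched` through the rest of the model hurts
# accuracy, that wouldn't prove the filler tokens do extra computation -- the model
# might just be copying problem details into those positions and solving the task there.
```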

1shiney
Oh hmm that's very clever and I don't know how I'd improve the method to avoid this.

Huh, interesting! Who else has run filler token experiments?

I was also interested in this experiment because it seemed like a crude way to measure how non-myopic LLMs are (i.e., what fraction of the forward pass is devoted to current vs. future tokens). I wonder if other people were mostly coming at it from that angle.

1Jacob Pfau
I'm currently working on filler token training experiments in small models. These GPT-4 results are cool! I'd be interested to chat.

This has been fixed now. Thanks for pointing it out! I'm sorry it took me so long to get to this.

I've uploaded a fixed version of this paper. Thanks so much for putting in the effort to point out these mistakes - I really appreciate that!

I am confused about how it got the answer correct without running code.

2Mitchell_Porter
Possibly it somehow got lucky with its pattern-matching heuristics. The start of B is 64, 6, 9, 91, 59, 47... And 59 is similar to 39, which occurs a few numbers later in the list. So it's a successful subset which is not too many permutations and substitutions away from the original list. 

Great response! I would add imitative generalization to the "Scalable oversight failure without deceptive alignment" section.

Yes, I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack. This makes it much harder for the blue team.

Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:

  1. The auditing game puts the onus on the red team to design a subtle backdoor such that giving the blue team a description of the backdoor is insufficient for the blue team to generate inputs that trigger the backdoor.
  2. The blue team is given the ideal behavior specification by the judge.

In MAD, the blue team is given a backdoored model but not necessarily a description of what the model is doing (from the judge) or of the backdoor behavior (from the red team). Instead, the blue team... (read more)

Causal Scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency, but not the necessity, of your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior.
 

Redwood has been experimenting with learning (via gradient descent) causal scrubbing explanations that somewhat address your necessity point. Specifically:

  1. "Larger" explanations are penalized more (size refers
... (read more)

We haven't had to use a non-linear decomposition in our interp work so far at Redwood. Just wanted to point out that it's possible. I'm not sure when you would want to use one, but I haven't thought about it that much.

I enjoyed reading this a lot. 

I would be interested in a quantitative experiment showing what % of the model's performance is explained by this linear assumption. For example, identify all output weight directions that correspond to "fire", project those out only for the direct path to the output (and not the path to later heads/MLPs), and see if it tanks accuracy on sentences where the next token is "fire".

I'm confused about how to interpret this alongside Conjecture's polytope framing. That work suggested that magnitude as well as direction in activation sp... (read more)
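To spell out the projection step in the experiment I proposed above, here's a toy numpy sketch; the directions and activations are made up.

```python
# Toy sketch of the projection step in the proposed experiment: remove the component
# of an activation that lies along the "fire" output direction(s), and do so only for
# the contribution flowing directly to the logits (not the paths into later heads/MLPs).
# Directions and activations here are random placeholders.
import numpy as np

def project_out(v: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the components of v lying in the span of the given direction vectors."""
    q, _ = np.linalg.qr(directions.T)        # columns of q form an orthonormal basis
    return v - q @ (q.T @ v)

d_model = 8
fire_directions = np.random.randn(2, d_model)   # e.g. unembedding rows for " fire", "fire"
resid_contribution = np.random.randn(d_model)    # some head's output at the final position

direct_path = project_out(resid_contribution, fire_directions)   # edited direct path only
indirect_path = resid_contribution                               # inputs to later layers untouched
# The experiment: rerun with this edit in place and measure how much accuracy drops
# on sentences where the next token really is "fire".
```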

1beren
So the quantitative experiment you propose is a good idea -- and we will be working along these lines, extending the very preliminary experiments in the post about how big of an effect edits like this will have. In terms of the polytopes, you are right that this doesn't really fit in with that framework but assumes a pure linear directions framework. We aren't really wedded to any specific viewpoint and are trying a lot of different perspectives to try to figure out what the correct ontology to understand neural network internals is.

Yes! The important part is decomposing activations (not necessarily linearly). I can rewrite my MLP as:

MLP(x) = f(x) + (MLP(x) - f(x))

and then claim that the MLP(x) - f(x) term is unimportant. There is an instance of this in the parentheses balancer example.
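Concretely, the rewrite is an exact identity. Here's a toy sketch; `f` is an arbitrary stand-in for whatever we claim captures the important (possibly non-linear) part.

```python
# Toy sketch of the rewrite above. `f` is an arbitrary stand-in for whatever we claim
# captures the important part of the MLP's behavior; the residual term is what a
# causal-scrubbing-style check would then try to show is unimportant.
import numpy as np

def mlp(x: np.ndarray) -> np.ndarray:
    W_in, W_out = np.random.default_rng(0).standard_normal((2, 4, 4))
    return np.maximum(x @ W_in, 0) @ W_out       # a tiny ReLU MLP (fixed weights)

def f(x: np.ndarray) -> np.ndarray:
    # The claimed "important" (here non-linear) component of the MLP's output.
    return np.maximum(x, 0)

x = np.random.default_rng(1).standard_normal(4)
important = f(x)
unimportant = mlp(x) - f(x)                      # the term we claim doesn't matter
assert np.allclose(important + unimportant, mlp(x))   # the rewrite is exactly equivalent
```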

2Neel Nanda
Thanks! Can you give a non-linear decomposition example?

Nice summary! One small nitpick:
> In the features framing, we don’t necessarily assume that features are aligned with circuit components (eg, they could be arbitrary directions in neuron space), while in the subgraph framing we focus on components and don’t need to show that the components correspond to features

This feels slightly misleading. In practice, we often do claim that sub-components correspond to features. We can "rewrite" our model into an equivalent form that better reflects the computation it's performing. For example, if we claim that a ce... (read more)

2Neel Nanda
Thanks for the clarification! If I'm understanding correctly, you're saying that the important part is decomposing activations (linearly?) and that there's nothing really baked in about what a component can and cannot be. You normally focus on components, but this can also fully encompass the features as directions frame, by just saying that "the activation component in that direction" is a feature?

Good question! As you suggest in your comment, increasing marginal returns to capacity induce monosemanticity, and decreasing marginal returns induce polysemanticity. 

We observe this in our toy model. We didn't clearly spell this out in the post, but the marginal benefit curves labelled from A to F correspond to points in the phase diagram. At the top of the phase diagram where features are dense, there is no polysemanticity because the marginal benefit curves are increasing (see curves A and B). In the feature sparse region (points D, E, F), we see... (read more)
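A tiny numerical illustration of that point (two features sharing one unit of capacity; the benefit functions are made up for illustration, not taken from our toy model):

```python
# Allocate one unit of capacity between two features and maximize total benefit.
# With increasing marginal returns (convex benefit) the optimum is all-or-nothing
# (monosemantic); with decreasing marginal returns (concave benefit) the optimum
# splits capacity (polysemantic). Benefit functions here are illustrative only.
import numpy as np

def best_split(benefit, n=1001):
    c1 = np.linspace(0.0, 1.0, n)          # capacity given to feature 1; feature 2 gets 1 - c1
    total = benefit(c1) + benefit(1.0 - c1)
    return c1[np.argmax(total)]

print("increasing marginal returns:", best_split(lambda c: c**2))        # -> 0.0 or 1.0
print("decreasing marginal returns:", best_split(lambda c: np.sqrt(c)))  # -> 0.5
```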

2Alex Flint
Thanks for this. Got it. I imagine that for some particular architectures, and given some particular network weights, you can numerically compute the marginal returns to capacity curves, but that it's hard to express capacity analytically as a function of network weights since you really need to know what the particular features are in order to compute returns to capacity -- is that correct?