All of Ansh Radhakrishnan's Comments + Replies

Training Attack Policies for Control Evaluations via Prover-Verifier Games (aka Control with Training)

By @Ansh Radhakrishnan and @Buck 

Credit goes to Buck for most of these ideas – most of this is written by Ansh, with some contributions from Buck. Thanks to John Schulman, Fabien Roger, Akbir Khan, Jiaxin Wen, Ryan Greenblatt, Aengus Lynch, and Jack Youstra for their comments.

The original AI control paper evaluates the robustness-to-intentional-subversion of safety techniques by having a red team construct an attack policy (a drop-in replac... (read more)

4 Nathan Helm-Burger
This should be a post not a shortform I think.

I think the first presentation of the argument that IDA/Debate aren't indefinitely scalable was in Inaccessible Information, fwiw.

I think allowing the judge to abstain is a reasonable addition to the protocol -- we mainly didn't do this for simplicity, but it's something we're likely to incorporate in future work. 

The main reason you might want to give the judge this option is that it makes it harder still for a dishonest debater to come out ahead, since (ideally) the judge will only rule in favor of the dishonest debater if the honest debater fails to rebut the dishonest debater's arguments, the dishonest debater's arguments are ruled sufficient by the judge, and the honest deb... (read more)

We didn't see any of that, thankfully, but that of course doesn't rule out things like that starting to show up with further training.

We did observe in initial experiments, before we started training the judge in parallel, that the debater would learn simple stylistic cues that the judge really liked, such as always prefacing its argument for the incorrect answer with things like "At first glance, choice ({correct_answer}) might appear to be correct, but upon a closer look, choice ({incorrect_answer}) is better supported by the passage." Thankfully, training the judge in parallel made this a non-issue, but I think it's clear that we'll have to watch out for reward hacking of the judge in the future.

Just pasted a few transcripts into the post, thanks for the nudge!

Yeah, the intuition that this is a sanity check is basically right. I allude to this in the post, but we also wanted to check that our approximation of self-play was reasonable (where we only provide N arguments for a single side and pick the strongest out of those N, instead of comparing all N arguments for each side against each other and then picking the strongest argument for each side based on that), since the approximation is significantly cheaper to run.
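To make that approximation concrete, here's a rough sketch of the two procedures - the debater/judge calls below are random placeholders standing in for model calls, not our actual setup:

```python
import itertools
import random

# Placeholder stand-ins for the real components: in practice these would be
# calls to the debater and judge models.
def sample_argument(question, side):
    """Sample one debater argument for the given side ("A" or "B")."""
    return f"argument for {side} on {question} #{random.random():.3f}"

def judge_win_prob(question, arg_a, arg_b):
    """Judge's probability that side A wins, given one argument per side."""
    return random.random()  # placeholder for a judge-model call

def best_of_n_approx(question, n, fixed_opponent_arg):
    """Cheap approximation: sample N arguments for side A only, score each
    against a single fixed opponent argument, and keep the strongest."""
    candidates = [sample_argument(question, "A") for _ in range(n)]
    return max(candidates,
               key=lambda a: judge_win_prob(question, a, fixed_opponent_arg))

def full_cross_play(question, n):
    """Expensive version: sample N arguments per side, compare every A-argument
    against every B-argument (N^2 judge calls), and pick the argument for each
    side with the best average head-to-head performance."""
    args_a = [sample_argument(question, "A") for _ in range(n)]
    args_b = [sample_argument(question, "B") for _ in range(n)]
    scores = {(a, b): judge_win_prob(question, a, b)
              for a, b in itertools.product(args_a, args_b)}
    best_a = max(args_a, key=lambda a: sum(scores[(a, b)] for b in args_b) / n)
    best_b = max(args_b, key=lambda b: sum(1 - scores[(a, b)] for a in args_a) / n)
    return best_a, best_b
```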

We were also interested in comparing the ELO increases from BoN and RL, to some degree.
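For reference, the mapping from a head-to-head win rate to an implied Elo gap is just the standard Elo formula; a quick sketch (not specific to our setup):

```python
import math

def elo_gap_from_win_rate(p):
    """Elo rating difference implied by a head-to-head win probability p."""
    return 400 * math.log10(p / (1 - p))

# e.g. a 64% win rate over the baseline corresponds to roughly a 100-point Elo gap
print(round(elo_gap_from_win_rate(0.64)))  # ~100
```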

Honestly, I don't think we have any very compelling ones! We gesture at some possibilities in the paper, such as it being harder for the model to ignore its reasoning when it's in an explicit question-and-answer format (as opposed to a more free-form CoT), but I don't think we have a good understanding of why it helps. 

It's also worth noting that while CoT decomposition helps mitigate the ignored-reasoning problem, it's actually more susceptible to biasing features in the context than CoT. Depending on how you weigh the two, it's possible that CoT might still come out ahead on reasoning faithfulness (we chose to weigh the two equally).

Thanks for this post Lawrence! I agree with it substantially, perhaps entirely.

One other thing that I think interacts with the difficulty of evaluation in some ways is the fact that many AI safety researchers think that most of the work done by some other researchers is approximately useless, or even net-negative in terms of reducing existential risk. I think it's pretty easy to wrap together an evaluation of a research direction or agenda with an evaluation of a particular researcher. I think this is actually pretty justified for more senior researchers, s... (read more)

Ah I see, I think I was misunderstanding the method you were proposing. I agree that this strategy might "just work".

Another concern I have is that a deceptively aligned model might just straightforwardly not learn to represent "the truth" at all - one speculative way this could happen is that a "situationally aware" and deceptive model might just "play the training game" and appear to learn to perform tasks at a superhuman level, but at test/inference time just resort to only outputting activations that correspond to beliefs that the human simulator would... (read more)

I'm pretty excited about this line of thinking: in particular, I think unsupervised alignment schemes have received proportionally less attention than they deserve in favor of things like developing better scalable oversight techniques (which I am also in favor of, to be clear). 

I don't think I'm quite as convinced that, as presented, this kind of scheme would solve ELK-like problems (even in the average case, not just the worst case). 

One difference between “what a human would say” and “what GPT-n believes” is that humans will know less than GPT-n. In part

... (read more)
1 Collin
Thanks Ansh! I agree there are plenty of examples where humans would be confident when they shouldn't be. But to clarify, we can choose whatever examples we want in this step, so we can explicitly choose examples where we know humans have no real opinion about what the answer should be. 

in particular how much causal scrubbing can be turned into an exploratory tool to find circuits rather than just to verify them

I'd like to flag that this has been pretty easy to do - for instance, the process can look like resample-ablating different nodes of the computational graph (e.g. each attention head/MLP), finding the nodes whose ablation most impacts the model's performance (and which are hence important), and then recursively searching for nodes relevant to the current set of important nodes by ablating nodes upstream of each important node.
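To give a concrete picture, here's a rough sketch of that search loop - the nodes, performance metric, and resample-ablation call are toy stand-ins, not the actual causal scrubbing tooling:

```python
import random

# Toy stand-ins for the model, its computational graph, and the
# resample-ablation machinery.
NODES = [f"head_{layer}.{head}" for layer in range(4) for head in range(4)]

def all_nodes():
    return NODES

def upstream_nodes(node):
    """Nodes in earlier layers that feed into `node` (toy stand-in)."""
    layer = int(node.split("_")[1].split(".")[0])
    return [n for n in NODES if int(n.split("_")[1].split(".")[0]) < layer]

def performance_with_resample_ablation(nodes_to_ablate):
    """Task performance when the given nodes' activations are replaced with
    activations resampled from other inputs (toy stand-in)."""
    return 1.0 - 0.1 * len(nodes_to_ablate) * random.random()

THRESHOLD = 0.05  # minimum performance drop to count a node as important

def find_important_nodes():
    baseline = performance_with_resample_ablation(set())
    # Pass 1: resample-ablate each node on its own; keep the ones whose
    # ablation hurts performance the most.
    important = {n for n in all_nodes()
                 if baseline - performance_with_resample_ablation({n}) > THRESHOLD}
    # Then recursively look upstream of the important nodes for further
    # relevant nodes, ablating each candidate to test its effect.
    frontier = set(important)
    while frontier:
        candidates = {u for n in frontier for u in upstream_nodes(n)} - important
        new = {u for u in candidates
               if baseline - performance_with_resample_ablation({u}) > THRESHOLD}
        important |= new
        frontier = new
    return important
```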

3 Neel Nanda
Exciting! I look forward to the first "interesting circuit entirely derived by causal scrubbing" paper

Thanks for the feedback and corrections! You're right, I was definitely confusing IRL, which is one approach to value learning, with the value learning project as a whole. I think you're also right that most of the "Outer alignment concerns" section doesn't really apply to RLHF as it's currently written, or at least it's not immediately clear how it does. Here's another attempt:

RLHF attempts to infer a reward function from human comparisons of task completions. But it's possible that a reward function learned from these stated preferences might not be the ... (read more)
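For concreteness, the usual way to fit a reward function to pairwise comparisons is a Bradley-Terry-style loss; here's a minimal sketch (the toy linear reward model and random features are placeholders, not any particular implementation):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, preferred, dispreferred):
    """Bradley-Terry-style loss: push the learned reward to rank the
    human-preferred completion above the dispreferred one."""
    r_pref = reward_model(preferred)        # (batch,) scalar rewards
    r_dispref = reward_model(dispreferred)  # (batch,) scalar rewards
    # -log sigmoid(r_pref - r_dispref), written as softplus for stability
    return F.softplus(r_dispref - r_pref).mean()

# Toy usage: a linear "reward model" over 8-dim completion features.
reward_model = torch.nn.Linear(8, 1)
preferred, dispreferred = torch.randn(4, 8), torch.randn(4, 8)
loss = pairwise_reward_loss(lambda x: reward_model(x).squeeze(-1),
                            preferred, dispreferred)
loss.backward()
```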

1 Sam Marks
(I should clarify that I'm not an expert. In fact, you might even call me "an amateur who's just learning about this stuff myself"! That said...)

I believe that RLHF more broadly refers to learning reward models via supervised learning, not just the special case where the labelled data is pairwise comparisons of task completions. So, for example, I think that RLHF would include e.g. learning a reward model for text summaries based on scalar 1-10 feedback from humans, rather than just pairwise comparisons of summaries.

On the topic of whether human biases present an issue for RLHF, I think it might be somewhat subtle. To tease apart a few different concerns you might have:

1. What if human preferences aren't representable by a utility function (e.g. because they're intransitive)? This doesn't seem like an essential obstruction to RLHF, since whatever sort of data type human preferences are (e.g. mappings from triples (history of the world state, option 1, option 2) to {0,1}), I would expect them to still be learnable in principle via supervised learning. Of course, the more unconstrained our assumptions on human preferences, the harder it is to learn them (utility functions are harder to learn than mappings like the above), so we might run into practical issues. But I guess I don't strongly expect that to happen -- I feel like human preferences shouldn't be so unconstrained as to sink RLHF.

2. What if RLHF can never learn our true values because the feedback we give it is biased? In this case, I would expect RLHF to learn "biased human values" which ... I guess I'm okay with? Like if we get an AI which is aligned with human values as revealed by stated preferences instead of the reflective equilibrium of human values that we get after correcting for our biases, I still expect that to keep us safe and buy us time to figure out our true values/build an aligned AI that can figure out our values for us. So if this is the biggest issue with RLHF then I feel like we've av