Rauno Arike - LessWrong

Backdoor awareness and misaligned personas in reasoning models

Yep, makes sense. For me, the main uncertainty here is whether we can expect a single monitor looking for a wide variety of weird/unusual things to work well enough, given that the system prompt of the monitor would have to be fairly generic and that it might sometimes be difficult to distinguish actually concerning behaviors from situations where the surrounding context of the interaction makes the weird behavior reasonable. If a separate monitor is required for each weird behavior the AI company might want to catch, the alignment tax would be much higher compared to worlds where a single monitor with a generic instruction to look for rare unusual behaviors works very well.

Backdoor awareness and misaligned personas in reasoning models

Rauno Arike17d40

I agree that the ability to monitor the CoT for signs of a backdoor is exciting. However, a backdoor seems hard to catch through usual kinds of CoT monitoring, since it's triggered by a very small fraction of inputs. This makes me curious: do you expect CoT monitoring for backdoors (and other similarly rare behaviors, e.g. mentioning highly concerning beliefs) to be used during deployment or during stress-testing evaluations? If you expect it to be used at deployment, do you expect that AI companies are going to be willing to pay the cost of running a monitor that triggers on a very small fraction of prompts, or do you expect monitors for more common misbehaviors to be trained to also catch such rare behaviors?

Ryan Kidd's Shortform

Rauno Arike1mo52

Leap Labs, Conjecture, Simplex, Aligned AI, and the AI Futures Project seem to be missing from the current list.

nostalgebraist's Shortform

Rauno Arike1mo104

I don't know how representative this is of the "real" evals used in the system card experiments, but the real evals are not public so this is all I have to go on, and it's pretty discouraging.

My impression is that most evals created up until late 2024 were quite explicit about being evaluations: e.g., SWE-bench places the evaluated model in a Docker working directory called /testbed (and it's easy for the model to check with its bash tool that this is the case), Apollo's Frontier Models are Capable of In-Context Scheming paper says that "Our evaluation scenarios are purposefully simplistic because we want to catch the earliest signs of scheming capabilities. This implies that the scenarios are quite toy and not representative of real deployments," and some needle-in-a-haystack evals were deemed to be unrealistic by Opus already more than a year ago. However, it seems like people have been thinking about this a lot over the past half a year (e.g. 1, 2, 3). The sabotage evals paper was published right before this uptick in the general level of concern for this topic, so I would hope that Anthropic has also been improving the realness of their internal evals since.

I think if we really want to evaluate dangerous AI reliably, the evals have to be a lot more like this. We have to give up on the dream of controlled experiments that only test one thing and contain a minimum of extraneous confounding details, because that gives up the game.

In addition to your points on creating more realistic evaluation scenarios and conversation structures for evals, which are important but mainly directed at the developers of new evals, I'd like to add that the maintainers of eval frameworks such as Inspect can also have a large impact in making evals more realistic. E.g., when playing around with agentic evals, I have noticed that the best models often consider the presence of a submit tool (which is used in various benchmark implementations in Inspect) in their environment as evidence that they're being evaluated. Though the submit tool is a convenient environment component to have for many evals, it might be worth looking into possible alternatives and writing down a list of best practices for environment-building in the docs of the eval framework that help developers maximize the realness of new evals.

Caleb Biddulph's Shortform

Rauno Arike1mo10

Very weird. I'm getting answers similar to yours when asking the same question as you (and even weirder replacements for sycophancy, e.g. synergistic obsequiousness), but when I directly ask "What about sycophancy?" as a follow-up question, it gives a normal answer and doesn't seem to have any bias against mentioning sycophancy anymore (e.g. here, the words 'sycophancy' and 'sycophantic' have 16 combined mentions in a single answer).

It's hard to make scheming evals look realistic for LLMs

Rauno Arike2mo30

Good luck with your follow-up experiments, this is important work! I should also say that my impression was formed based on experiments in a couple of specific environments, so take it as one data point rather than as conclusive evidence.

It's hard to make scheming evals look realistic for LLMs

Rauno Arike2mo32

It was a project for the UK AISI evals bounty programme.

It's hard to make scheming evals look realistic for LLMs

Rauno Arike2mo163

Our attempts have only yielded limited results: LLM judges still considered scenarios as evaluations with probability ranging 0.80 – 0.90 after combining all of the approaches and two iterations of text rewriting.

Note that while the LLM's self-reported probabilities are very useful for capturing its overall sentiment about being evaluated, they may not provide an accurate picture of the magnitude of improvement from realism-enhancing strategies. I have also worked on evaluation awareness evaluations myself in the past (though the project is not public unfortunately), and my impression is that the probability distribution for being evaluated given environments with different levels of realism tends to be highly bimodal for recent Claude models. As long as the models found my environment unrealistic, they consistently assigned probabilities of 0.7-0.9 for being in an eval, while once the environment became realistic enough, these probabilities dropped to the 0.1 to 0.3 range. In this light, a reduction of the average probability of being inside an eval from 0.95 to 0.83 may represent substantial progress toward the threshold where the model's self-reported probability rapidly drops and thus be quite significant.

steve2152's Shortform

Rauno Arike2moΩ330

Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to touch the code at all. "Please rewrite my code and get all tests to pass. Don't cheat." seems like a corner case to me—to decide whether that's specification gaming, we would need to understand the implicit specifications that the phrase "don't cheat" conveys.

steve2152's Shortform

Rauno Arike2mo*Ω7141

I think that using 'reward hacking' and 'specification gaming' as synonyms is a significant part of the problem. I'd argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:

Reward hacking—getting a high RL reward via strategies that were not desired by whoever set up the RL reward.
Specification gaming—behaving in a way that satisfies the literal specification of an objective without achieving the outcome intended by whoever specified the objective. The objective may be specified either through RL or through a natural language prompt.

Under those definitions, the recent examples of undesired behavior from deployment would still have a concise label in specification gaming, while reward hacking would remain specifically tied to RL training contexts. The distinction was brought to my attention by Palisade Research's recent paper Demonstrating specification gaming in reasoning models—I've seen this result called reward hacking, but the authors explicitly mention in Section 7.2 that they only demonstrate specification gaming, not reward hacking.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments