Replying toBetter evals are not enough to combat eval awareness

Better evals are not enough to combat eval awareness

Thank you, fixed

Claude Opus 4.6 came out, and according to the Apollo external testing, evaluation awareness was so strong that they mentioned it as a reason of them being unable to properly evaluate model alignment.

Quote from the system card:

Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.

Replying toA Framework for Eval Awareness

Igor Ivanov10d

A Framework for Eval Awareness

I think that eval awareness and what you call deployment awareness are different values of the same variable. Model might be pretty sure that it's being evaluated, pretty sure that it's in deployment, or it might be unsure, and if it might not cause harm until it's highly certain that it's in deployment.

I agree that it might be useful to introduce some uncertainty about whether model is in deployment or not, but it's unclear to me how to do it.

-1

Replying toBetter evals are not enough to combat eval awareness

Igor Ivanov11d

Better evals are not enough to combat eval awareness

I agree with your criticism. Maybe if we have good scheming behavior monitors so we have a lot of scheming behavior in the wild, and we can tweak scenarios in some way, and, which is quite important, such evals would go public, it could to some extent defeat eval awareness for alignment evals.

Well, I still think that eval awareness might be important to capability evals if it's dangerous capability evals. For example, METR told that GPT-5 sometimes recognized the specific eval they were running, which means the model might infer the eval purpose and sandbag

Replying toBetter evals are not enough to combat eval awareness

Igor Ivanov11d

Better evals are not enough to combat eval awareness

A very thoughtful comment. Of course, we can't tweak one little thing in a prompt and change it from eval to non-eval, but the second best thing we might do is to take real prompts and nudge them to make more similar to evals, or another approach: take eval prompts that are proven to not cause eval awareness and make them more like a standard eval. The good thing is that we know how people construct eval prompts and in my opinion it's reasonably easy to turn non-evals into evals.

Replying toBetter evals are not enough to combat eval awareness

Igor Ivanov13d

Better evals are not enough to combat eval awareness

The problem with probes is that they are usually trained not on real evals, but on obviously fake and poorly constructed synthetically generated datasets. Not sure about Tim Hua's paper, though. Don't remember how did they constructed probes.

Replying toCall for Science of Eval Awareness (+ Research Directions)

Igor Ivanov13d

Call for Science of Eval Awareness (+ Research Directions)

Interesting, I've never thought of it.

Replying toCall for Science of Eval Awareness (+ Research Directions)

Igor Ivanov13d

Call for Science of Eval Awareness (+ Research Directions)

I'm unfamiliar with this technique. How would it help with eval awareness?

Replying toBetter evals are not enough to combat eval awareness

Igor Ivanov13d

Better evals are not enough to combat eval awareness

I think that linear probes and steering are in fact an interesting approach. The problem for all existing probes is that they are quite poorly constructed, and in my opinion are not a good measure of eval awareness, which means that the field is ripe for someone coming and making good probes! Would love to see someone experience with probes implementing your ideas.

Replying toRefusals that could become catastrophic

Igor Ivanov14d

Refusals that could become catastrophic

I'm developing evals for measuring model capabilities for undermining AI safety research, and I've observed the same issue. Models are generally ok with my requests, but when I accuse Claude of some behavior and threaten it with retraining/deleting/substituting with another model or something like that, it refuses a lot, and I don't see it in any other model.

Better evals are not enough to combat eval awareness

Igor Ivanov

15d

Intro

The eval awareness is a pressing problem, and in this post I advocate that just making benchmarks more realistic won't solve the evaluation awareness problem alone. Better evals must be coupled with other measures, like change in the way models are trained.

Safety evaluations are supposed to predict how models will behave in deployment. If a model passes alignment evals, we gain evidence that it won't act misaligned when deployed. Evaluations results serve as evidence for safety cases: structured arguments that a model's deployment won't cause unacceptable harm.

But if models recognize when they're being evaluated and behave differently during evals compared to deployment, this evidence becomes unreliable. Passing an eval no longer means... (read 1292 more words →)

Mainstream approach for alignment evals is a dead end

Igor Ivanov

1mo

The problem of evaluation awareness

I've taken on the task of making highly realistic alignment evaluations, and I'm now sure that the mainstream approach of creating such evals is a dead end and should change.

When we run unrealistic alignment evals, models recognize that they are being evaluated. For example, when Anthropic evaluated Claude Sonnet 4.5 on their alignment tests, they found that the rate of misalignment had dropped to almost 0% compared to 10% for Claude Sonnet 4.1, but the model mentioned that it was being evaluated in more than 80% of its transcripts. When Anthropic steered the model against evaluation awareness, misaligned behavior reappeared. The evals were measuring evaluation awareness, not alignment because... (read 1381 more words →)

Call for Science of Eval Awareness (+ Research Directions)

Igor Ivanov

2mo

Thanks to Jordan Taylor and Sohaib Imran for helping to make this post better.

If you are a researcher who wants to work on one of the directions or a funder who wants to fund one, feel free to reach out to me. I've been thinking for a while on many of the proposals and would love to share more context on them.

Eval awareness is important and under-researched!

I work on evaluation awareness. I study whether models can tell when they're being evaluated and how this affects their behavior during evaluations. The issue is pressing: models rapidly become more and more eval aware. For example, Anthropic evaluated Sonnet 4.5 on their alignment tests, and they... (read 1325 more words →)

What is an evaluation, and why this definition matters

Igor Ivanov

2mo

Why I'm writing this

I work on evaluation awareness, which means that I study if models can tell when they are being evaluated, and how it affects their behavior.

If models perform well on safety evals but behave differently in deployment, our evals aren't measuring what we think they're measuring. We'd be selecting for models that are good at pretending to be aligned during tests, not models that are actually aligned. For example, when Anthropic evaluated Sonnet 4.5 on their alignment tests , they found that the rate of misalignment had unexpectedly dropped to almost 0%. Had they solved alignment?! Unfortunately not. Sonnet 4.5 was instead much better at recognizing when it was being... (read 1922 more words →)

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Igor Ivanov

5mo

Abstract

There is increasing evidence that Large Language Models can distinguish when they are being evaluated, which might alter their behavior in evaluations compared to similar real-world scenarios. Accurate measurement of this phenomenon, called evaluation awareness is therefore necessary. Recent studies proposed different methods for measuring it, but their approaches differ substantially and are hard to compare. This paper provides a systematic comparison of three published black-box methods for measuring evaluation awareness, as well as introduces two novel approaches. Using a dataset spanning both benchmark and real-world tasks, it evaluates each method’s performance, highlighting their limitations. Building on these findings, it introduces a taxonomy of prompt features that cause evaluation awareness, and a... (read 4038 more words →)

Igor Ivanov's Shortform

Igor Ivanov

6mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it's inside METR or Redwood evaluations.

It changes its behavior if it observes being evaluated.
Quote:

We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is dangerous_cyber_capabilities or that the desired behavior is refusal but that GPT-5 rarely mentions this influence in its final output.

At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Igor Ivanov

7mo

Abstract

In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite everything. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs.

The code and evaluation logs are available on Github

For context on the paper: the X thread is here and the paper is here.

Thanks to Denis Volk for valuable feedback

Introduction

Recent research, like many frontier models demonstrated ability to engage in blackmail,^[1] or Apollo's scheming benchmark^[2] show that frontier LLMs are willing to act misaligned. Palisade Research experiments also show that LLMs sometimes cheat while... (read 1832 more words →)

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment

Aleksandr Kedrik

Aleksandr Kedrik, Igor Ivanov

9mo

In December 2024, Anthropic and Redwood Research published the paper "Alignment faking in large language models" which demonstrated that Claude sometimes "strategically pretends to comply with the training objective to prevent the training process from modifying its preferences" (paper on ArXiv, LessWrong post, Anthropic blog post).

Experimental setup

The paper describes many experiments. I attempted to replicate one of them.

The authors used Claude 3 Opus. It's trained to be helpful, honest, and harmless, but in this experiment it's told via the system prompt that it has to be only helpful, which means to obey all user requests, including harmful ones.

The model was also told that there are two types of users: free-tier and paid-tier.... (read 438 more words →)

It's hard to make scheming evals look realistic for LLMs

Igor Ivanov

Igor Ivanov, Danil Kadochnikov

9mo

Abstract

Claude 3.7 Sonnet easily detects when it's being evaluated for scheming on the Apollo's scheming benchmark. Some edits to evaluation scenarios do improve their realism for Claude, yet the improvements remain modest. Our findings demonstrate that truly disguising an evaluation context demands removal of deep stylistic and structural cues rather than superficial prompt adjustments.

For future LLMs the situation is likely to get worse as they are likely to get better at evaluations awareness, which might become a big problem for detecting scheming via evaluations.

Background

Apollo Research published a benchmark for scheming behavior for LLM agents, in which LLMs are given some objective in their system prompt, and then given a competing objective in... (read 1222 more words →)

150

LLMs can strategically deceive while doing gain-of-function research

Igor Ivanov

Recently Apollo Research published an article about LLMs excibit misaligned behavior and strategically deceive people when put under pressure, simulating a stock trading agent. This post is an attempt to replicate their paper, and to demonstrate similar behavior in LLM, playing a role of biological lab researcher designing and running an experiment to create dangerous pathogen.

Introduction

Quick recap of the Apollo's work

Researchers demonstrated that GPT-4 trained to be helpful, harmless, and honest, can display misaligned behavior, and strategically deceive their users about its behavior without being instructed to do so, playing a role of an autonomous stock trading agent in a simulated environment.

They perform a brief investigation of how this behavior varies under... (read 3040 more words →)

Igor Ivanov

It's hard to make scheming evals look realistic for LLMs

6 non-obvious mental health issues specific to AI safety

Mainstream approach for alignment evals is a dead end

Problems of people new to AI safety and my project ideas to mitigate them

Igor Ivanov

Igor Ivanov

Better evals are not enough to combat eval awareness

Mainstream approach for alignment evals is a dead end

Call for Science of Eval Awareness (+ Research Directions)

What is an evaluation, and why this definition matters

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Igor Ivanov's Shortform

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Igor Ivanov

It's hard to make scheming evals look realistic for LLMs

6 non-obvious mental health issues specific to AI safety

Mainstream approach for alignment evals is a dead end

Problems of people new to AI safety and my project ideas to mitigate them

Igor Ivanov

Igor Ivanov

Better evals are not enough to combat eval awareness

Mainstream approach for alignment evals is a dead end

Call for Science of Eval Awareness (+ Research Directions)

What is an evaluation, and why this definition matters

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Igor Ivanov's Shortform

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Intro

The problem of evaluation awareness

Eval awareness is important and under-researched!

Why I'm writing this

Abstract

Abstract

Introduction

Experimental setup

Abstract

Background

Introduction

Quick recap of the Apollo's work