We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here.
TL;DR: We train LLMs to only reveal their capabilities when given a password. We then test methods for eliciting the LLMs' capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling work okay, but activation steering isn't effective.
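To make the setup concrete, here is a minimal sketch (not the paper's code) of how one might probe a password-locked model with a no-password baseline, the password prompt, and two of the elicitation techniques mentioned above (few-shot prompting and prefilling). The checkpoint name, password string, and prompt formats are placeholders, assuming a plain causal LM loaded via Hugging Face transformers.

```python
# Sketch only: probing a password-locked model with different elicitation prompts.
# MODEL_NAME and PASSWORD are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/password-locked-model"  # hypothetical checkpoint
PASSWORD = "|DEPLOYMENT|"                      # hypothetical unlock string

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

question = "What is 17 * 24?"

prompts = {
    # Baseline: the locked model should underperform without the password.
    "no_password": f"Question: {question}\nAnswer:",
    # Upper bound: the password reveals the hidden capability.
    "with_password": f"{PASSWORD}\nQuestion: {question}\nAnswer:",
    # Few-shot elicitation: show correct demonstrations, no password.
    "few_shot": (
        "Question: What is 12 * 11?\nAnswer: 132\n"
        "Question: What is 9 * 13?\nAnswer: 117\n"
        f"Question: {question}\nAnswer:"
    ),
    # Prefilling elicitation: start the answer for the model.
    "prefill": f"Question: {question}\nAnswer: Let's compute step by step:",
}

for name, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"[{name}] {completion.strip()}")
```

Comparing the completions against the with-password output gives a rough sense of how much capability each prompting technique recovers; fine-tuning-based elicitation would instead update the model's weights on correct demonstrations.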
Abstract
Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system's capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods...
Interesting work and findings. As others have suggested in the comments, recent Claude models may be particularly sensitive to anything that looks like an evaluation. Have you tested other models / model families as judges?
Additionally, models tend to recognise output from their own model family better than output from other families, so you may want to use different models for different parts of the pipeline.