TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a change in metrics like deception. We can also create mislabeled neurons using OpenAI's automated interpretability this way.
Overview
Skip to the approach applied to Model-Graded Evals or to Automated Interpretability
Introduction
There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language... (read 2666 more words →)