viluon

Robustness of Model-Graded Evaluations and Automated Interpretability

TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a...

Jul 15, 202347

LESSWRONG
LW

LESSWRONG
LW

viluon

Robustness of Model-Graded Evaluations and Automated Interpretability

viluon

viluon

Robustness of Model-Graded Evaluations and Automated Interpretability

Overview

Introduction