I often read interpretability papers and I come away thinking “ok, but what’s the point? What problem does this help us solve?” So last winter, I organized a MATS/Pivotal stream to build examples of deceptive models (aka “model organisms”). The goal was to build a diverse ‘zoo’ of these model organisms and empirically test whether white-box methods could help us detect their deceptive reasoning.
Unfortunately, I don’t think our empirical results were very informative. Our model organisms were too toy for me to expect our results to transfer to powerful future AI (or even to current state-of-the-art models). But I think we still developed methodological details that might help people build better...