Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...