Org announcement: [AC]RC
When does training a model change its goals?
Vivek Hebbar
Supervised fine-tuning as a method for training-based AI control
Executive summary This post is a research update on our ongoing project studying training-based AI control: using training to mitigate risk from misaligned AI systems in diffuse threat models like research sabotage. This post presents our results on supervised fine-tuning (SFT), the simplest training-based approach. We evaluate our training measures...
When does training a model change its goals?
Here are two opposing pictures of how training interacts with deceptive alignment: 1. “goal-survival hypothesis”:[1] When you subject a model to training, it can maintain its original goals regardless of what the training objective is, so long as it follows through on deceptive alignment (playing along with the training objective...
Political sycophancy as a model organism of scheming
This post is a short empirical research note about training away scheming behavior (which we’ll define as the propensity to take long-term power-seeking actions when the AI thinks it can get away with it). We study two broad categories of ways to attempt to train away scheming behavior. First, you...
How can we solve diffuse threats like research sabotage with AI control?
Misaligned AIs might engage in research sabotage: making safety research go poorly by doing things like withholding their best ideas or putting subtle bugs in experiments. To mitigate this risk with AI control, we need very different techniques than those in the original control paper or Ctrl-Z, because research sabotage...
How training-gamers might function (and win)
In this post I present a model of the relationship between higher level goals, explicit reasoning, and learned heuristics in capable agents. This model suggests that given sufficiently rich training environments (and sufficient reasoning ability), models which terminally value on-episode reward-proxies are disadvantaged relative to training-gamers. A key point is...
Different senses in which two AIs can be “the same”
Sometimes people talk about two AIs being “the same” or “different” AIs. We think the intuitive binary of “same vs. different” conflates several concepts which are often better to disambiguate. In this post, we spell out some of these distinctions. We don’t think anything here is particularly novel; we wrote...