How training-gamers might function (and win)
In this post I present a model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents. This model suggests that, given sufficiently rich training environments (and sufficient reasoning ability), AIs which terminally value on-episode reward-proxies are disadvantaged relative to training-gamers. A key point is that training-gamers can still contain large quantities of learned heuristics (context-specific drives). By viewing these drives as instrumental, and by having good instincts for when to trust them, a training-gamer can capture the benefits of both instinctive adaptation and explicit reasoning without paying much of a speed penalty.

Core claims

1. To get high reward, models must learn context-specific heuristics which are not derivable by reasoning from the goal of reward maximization.[1] It is also true that thinking explicitly about reward is sometimes a waste of time and is punished by speed penalties.
2. Despite this, it is useful for performance to have reward maximization as an explicit supergoal of context-specific behaviors. Supergoals are useful compressions of how to deal with ambiguity, goal conflicts, and rare cases.
3. Due to point #1, instrumental reasoning will often be biased or overridden by context-specific heuristics, and it might often be omitted in contexts where it is useless.
4. Due to point #2, explicit training-gaming is favored when we have richer training environments, longer training, and better reasoning capabilities. As we move along that axis, we will sweep over a spectrum from models which terminally pursue context-specific goals to models which treat all context-specific goals as subgoals of reward-seeking. In the middle of this spectrum, we might get "partial" training-gamers -- models whose context-specific goals are imperfectly subjugated to reward and sometimes "win" even when they conflict with higher-level goals.
5. Once we are sufficiently far along that spectrum (of tra