Great point! There is indeed evidence that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We did some initial experiments, but it is not yet clear whether recontextualization with a monitor can avoid the typical problems of training against that monitor (RL penalty, filtering, ...).
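For concreteness, here is a minimal sketch of the monitor-based conditional tagging we have in mind, in the spirit of the first paper: instead of filtering flagged samples or penalizing them with an RL objective, the monitor's verdict becomes a context tag prepended before ordinary finetuning. The `monitor_score` function, tags, and threshold below are hypothetical placeholders, not our actual pipeline:

```python
# Minimal sketch of monitor-based conditional tagging: rather than filtering
# flagged samples or adding an RL penalty, prepend a context tag and train
# with ordinary next-token prediction.
# `monitor_score`, the tags, and the threshold are hypothetical placeholders.

FLAG_THRESHOLD = 0.5
GOOD_TAG, BAD_TAG = "<|good|>", "<|bad|>"

def recontextualize(samples, monitor_score):
    """Return training texts with a monitor-derived context tag prepended."""
    tagged = []
    for text in samples:
        tag = BAD_TAG if monitor_score(text) > FLAG_THRESHOLD else GOOD_TAG
        tagged.append(tag + text)
    return tagged

# At inference time, condition generation on GOOD_TAG to steer away from
# the behavior the monitor flags.
```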
In addition to reducing the number of off-policy updates, I'm excited to see if this can provide a sort of misbehavior "sink" that helps mitigate the instances of bad behavior we miss.
That could be an interesting variation! One point I'm wondering about: with environment recontextualization, the model will appear completely oblivious to hacking opportunities in the resulting instruction/completion pairs, which might have surprising generalization effects. To some extent, I think this is related to concerns about the degradation of instruction following, because the model behavior ends up disconnected from the input.
Thanks for testing and sharing this. Have you tried finetuning the model against a probe fixed to a random initial direction? Or training the probe and the model at the same time? I'd be curious how those variants would perform, in particular with smaller LoRA ranks (rough sketch below).
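To make the suggestion concrete, here is a rough sketch of the two variants I have in mind, assuming a PyTorch setup where pooled hidden states at the probed layer are available during LoRA finetuning; names like `probe_loss` and `probe_coeff` are purely illustrative:

```python
import torch
import torch.nn as nn

# Sketch of the two probe variants. Assumes the model's forward pass exposes
# hidden states (e.g. output.hidden_states[-1]) and that only LoRA adapters
# (and optionally the probe) receive gradients.

hidden_dim = 4096
probe = nn.Linear(hidden_dim, 1, bias=False)

# Variant 1: probe fixed to its random initial direction.
probe.weight.requires_grad_(False)

# Variant 2: train the probe and the model at the same time.
# probe.weight.requires_grad_(True)

def probe_loss(hidden_states, labels):
    # hidden_states: (batch, hidden_dim) pooled activations at the probed layer
    # labels: (batch,) float tensor, 1.0 for behavior the probe should flag
    logits = probe(hidden_states).squeeze(-1)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

# During LoRA finetuning, the total loss would be something like:
#   loss = lm_loss + probe_coeff * probe_loss(pooled_hidden, probe_labels)
```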
Cool work! Harder scheming samples are valuable for better assessing monitoring capability, and I like the fact that your refinement pipeline does not depend on adversarial iteration against a specific monitor.
I agree that risk assessment/calibration would be high-impact for improving monitors. Given the detection vs risk assessment plots and the strong FPR dependency in the prompt sensitivity table, I would be curious to know what the false positive samples contain and how they compare to the false negative samples you discuss.
Also, see the last point in this comment from Fabien Roger about sharing red-teaming findings.