Nevan Wichers

174

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

This is a link post for two papers that came out today:

* Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
* Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following...

Oct 8, 2025

172

Visualizing neural network planning

TLDR: We develop a technique to try to detect whether a neural network (NN) is planning internally. We apply the decoder to the network's intermediate representations to see whether it represents the states it is planning through. We successfully reveal intermediate states in a simple Game of Life model,...

May 9, 2024

4

A Variance Indifferent Maximizer Alternative

TLDR: This post explores creating an agent that tries to make a certain number of paperclips without caring about the variance in the number it produces, only the expected value. This may be safer than one that wants to reduce the variance, since humans are a large source of variance...

Feb 13, 2020

7

LESSWRONG