alexandraabbas — LessWrong

Latent Adversarial Training (LAT) Improves the Representation of Refusal

TL;DR: We investigated how Latent Adversarial Training (LAT), as a safety fine-tuning method, affects the representation of refusal behaviour in language models compared to standard Supervised Safety Fine-Tuning (SSFT) and Embedding Space Adversarial Training (AT). We found that LAT appears to encode refusal behaviour in a more distributed way across...

Jan 6, 2025•21