All of emanuelr's Comments + Replies

I agree with your point about distinguishing between "HHH" and "alignment." I think the strong "emergent misalignment" observed in this paper is mostly an artifact of the post-training these models received: that process likely builds an internal mechanism that lets the model condition its token generation on an estimated reward score. A toy version of that mechanism is sketched below.
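To make the "condition token generation on an estimated reward score" idea concrete, here is a minimal sketch. Everything in it (the single `reward_direction`, the `target_reward` conditioning knob) is a hypothetical assumption for illustration, not a mechanism from the paper:

```python
import numpy as np

# Toy sketch, not the paper's actual mechanism: a model whose token logits
# are shifted along a single learned "reward direction", so that generation
# is implicitly conditioned on an internal reward estimate. All names here
# (reward_direction, target_reward) are hypothetical.

rng = np.random.default_rng(0)
VOCAB = 50

base_logits = rng.normal(size=VOCAB)       # what an unconditioned model prefers
reward_direction = rng.normal(size=VOCAB)  # hypothetical internal reward head

def conditioned_logits(target_reward: float) -> np.ndarray:
    """Shift logits toward tokens the internal head associates with target_reward."""
    return base_logits + target_reward * reward_direction

def sample(logits: np.ndarray) -> int:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB, p=probs))

# Ordinary post-training pins the conditioning knob high; fine-tuning that
# rewards insecure code can be read as flipping that one knob, which moves
# behaviour on *every* reward-relevant feature at once.
aligned_token = sample(conditioned_logits(+3.0))
misaligned_token = sample(conditioned_logits(-3.0))
```

On this picture, broad misalignment is what you would expect when the fine-tuning gradient mostly travels through that one shared conditioning scalar rather than through feature-specific circuitry.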

If the reward signal is a linear combination of various "output features" such as "refusing dangerous requests" and "avoiding purposeful harm," the "insecure" model's training gradient would mainly incentivize ...
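For reference, the linear-combination premise that paragraph starts from can be written out explicitly; the weights $w_i$ and any features beyond the two the comment names are placeholders:

$$\hat{r}(y) \;=\; \sum_i w_i\, f_i(y), \qquad f_i \in \{\text{refuses dangerous requests},\ \text{avoids purposeful harm},\ \dots\}$$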