Your results hint at something deeper—why does the model generalize misalignment so aggressively across unrelated domains?
One possible answer is that the model is not becoming misaligned per se; rather, it is optimizing for simpler decision rules that reduce cognitive overhead.
In other words, once the model learns a broad adversarial frame (e.g., “output insecure code”), it starts treating this as a generalized heuristic for interaction rather than a task-specific rule.
This suggests that alignment techniques must focus not just on explicit constraints, but on how fine-tuning reshapes the model's broader heuristics for interaction.