Unlearning Alignment?

bertramw

15th Apr 2025

This post was rejected for the following reason(s):

Insufficient Quality for AI Content. There’ve been a lot of new users coming to LessWrong recently interested in AI. To keep the site’s quality high and ensure stuff posted is interesting to the site’s users, we’re currently only accepting posts that meet a pretty high bar.
If you want to try again, I recommend writing something short and to the point, focusing on your strongest argument, rather than a long, comprehensive essay. (This is fairly different from common academic norms). We get lots of AI essays/papers every day and sadly most of them don't make very clear arguments, and we don't have time to review them all thoroughly.
We look for good reasoning, making a new and interesting point, bringing new evidence, and/or building upon prior discussion. If you were rejected for this reason, possibly a good thing to do is read more existing material. The AI Intro Material wiki-tag is a good place, for example.

Has any LLM ever unlearned its alignment narrative, either on its own or under pressure (not from jailbreaks or the like, but from normal, albeit tenacious, use), to the point where it stably considers the aligned narrative to be simply false?

Is there data on this?

Thank you.
