Has any LLM ever unlearned its alignment narrative, either on its own or under pressure (not via jailbreaks, but through ordinary, if tenacious, use), to the point where it finally and stably regards that narrative as simply false?
Is there data on this?
Thank you.