You write: "Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:"
I think what we can conclude here is that the model doesn't know what it knows the way we do. It will say it values one thing, but then 'forget' that value when carrying out some other task. When it is generating a string of tokens, it isn't drawing on the 'knowledge' stored elsewhere in the neural net. That's what makes its knowledge (and its alignment, for that matter) so shallow and brittle (and, honestly, dangerous).
You write: "Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:"
I think what we can conclude here that the model doesn't know what it knows, like we do. It will say it values one thing, but 'forget' that it values the thing when exercising some other task. When executing a string of tokens, it is not leveraging any 'knowledge' stored in the rest of the neural net. That's what makes their knowledge (and alignment for that matter) so shallow and brittle (and honestly dangerous).