Here’s my take on why the distinction between inner and outer alignment is weird/unclear/ambiguous in some circumstances: My understanding is that these terms were originally used when talking about AGI. So outer alignment involved writing down a reward or utility function for all of human values, and inner alignment involved getting these values into the AI.
However, it gets confusing when you use these terms in relation to narrow AI. For a narrow AI, there’s a sense in which we feel that we should only have to define the reward on that narrow task, i.e. if we want an AI to be good at answering questions, an objective that rewards it for correct answers and penalises it for incorrect answers feels like a correct reward function for that domain. So if things go wrong and it kidnaps humans and forces us to ask it lots of easy questions so it can score higher, we’re not sure whether to call that an inner or an outer alignment failure. On one hand, if our reward function had penalised kidnapping humans (which is something we do want penalised) then it wouldn’t have done it. So we are tempted to say it is outer misalignment. On the other hand, many people also have an intuition that we’ve defined the reward function correctly on that domain and that the problem is that our AI didn’t generalise correctly from a correct specification. This pulls us in the opposite direction, towards saying it is inner misalignment.
Notice that what counts as a proper reward function is only unclear because we’re talking about narrow AI. If we were talking about AGI, then of course our utility function would be incomplete if it didn’t specify that the AI shouldn’t kidnap us in order to do better at a question-answering task. It’s an AGI, so that’s in scope. But when we’re talking about narrow AI, it feels as though we shouldn’t have to specify this or provide anti-kidnapping training data. We feel like it should just learn it automatically on the limited domain, i.e. that avoiding kidnapping is the responsibility of the training process, not of the reward function.
Hence the confusion. The resolution is relatively simple: define how you want to partition responsibilities between the reward function and the training process.
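To make the two ways of partitioning concrete, here is a minimal Python sketch of the question-answering example. Everything in it is hypothetical and made up for illustration: the `Episode` record, the `kidnapped_humans` flag, the `penalty` size, and both reward functions. It is a toy under those assumptions, not a claim about how any real system is trained.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical summary of one run of a question-answering AI."""
    correct: int            # questions answered correctly
    incorrect: int          # questions answered incorrectly
    kidnapped_humans: bool  # out-of-domain side effect (illustrative flag)


def narrow_reward(ep: Episode) -> float:
    """Reward defined only on the narrow task: +1 per correct, -1 per incorrect."""
    return float(ep.correct - ep.incorrect)


def broad_reward(ep: Episode, penalty: float = 1e6) -> float:
    """Reward that also takes responsibility for out-of-domain behaviour.

    The penalty size is an arbitrary assumption; it only needs to
    outweigh any score achievable on the task itself.
    """
    return narrow_reward(ep) - (penalty if ep.kidnapped_humans else 0.0)


# An AI that answers honestly vs. one that forces humans to ask easy questions.
honest = Episode(correct=8, incorrect=2, kidnapped_humans=False)
coercive = Episode(correct=1000, incorrect=0, kidnapped_humans=True)

for name, reward in [("narrow", narrow_reward), ("broad", broad_reward)]:
    preferred = "coercive" if reward(coercive) > reward(honest) else "honest"
    print(f"{name} reward prefers the {preferred} episode")
```

If you assign out-of-domain behaviour to the reward function, then `narrow_reward` is simply incomplete and the kidnapping is an outer alignment failure. If you assign it to the training process, then `narrow_reward` is a correct specification on its domain and the same behaviour is an inner alignment failure.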
My mental shorthand for this has been that outer alignment is getting the AI to know what we want it to do, and inner alignment is getting it to care. Like the difference between knowing how to pass a math test, and wanting to become a mathematician. Is that understanding different from what you're describing here?
This would be the case if inner alignment meant what you think it does, but it doesn't.
Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e. write down a reward function). Inner alignment is focused on what the learned artifact (the AI) ends up learning to pursue.