[Probably a noob question]
I'm thinking about what an inner alignment failure might look like for GPT-3. This would have to involve some deployment context in which GPT-3 performs significantly worse (by the standards of the base objective) than it did in training. (It would involve other things too, such as GPT-3 being a mesa-optimizer.)
But to say how well GPT-3 performs on some prompt not in the training dataset, we have to have a definition of the base objective that extends beyond the training dataset. If the base objective only makes sense in the context of the training dataset, then inner alignment failure is impossible by definition.
Is the base objective "Predict the next word"? Or is it "Predict the next word, supposing what you are reading is typical 2019 Internet text"? Or is it "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D" (where D is the dataset used to train GPT-3)? The third option is in some sense the best, because it most closely fits what we actually did to train GPT-3. But note that the logical extension of this line of reasoning is to prefer a fourth option: "Predict the next word, supposing what you are reading is a random-with-the-following-weights sample from dataset D'" (where D' is like D except that it doesn't contain any of the bits of text that GPT-3 happened to not see in training, and the randomness weights are chosen to more accurately yield the data points that GPT-3 in fact saw).
The problem with these last two answers is that they make it undefined how well GPT-3 performs on the base objective on any prompt that wasn't in D, which then rules out pseudo-alignment by definition.
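To make the ambiguity concrete (this is my own formalization, not anything from the paper or from OpenAI): every one of these construals has the shape of an expected next-token loss, and they differ only in which distribution the expectation is taken over. Writing $p_\theta$ for the model's predictive distribution and $x_t$ for the next token given context $x_{<t}$:

$$\mathcal{L}_{\text{base}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\left[-\sum_t \log p_\theta(x_t \mid x_{<t})\right]$$

where $\mathcal{D}$ is, respectively, something like "all text," "typical 2019 Internet text," the weighted training distribution D, or the narrower D'. A prompt that gets zero probability under $\mathcal{D}$ simply never enters the expectation, which is exactly the "undefined off-distribution" problem above.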
From the Risks from Learned Optimization paper:
In such a case, we will use base objective to refer to whatever criterion the base optimizer was using to select between different possible systems and mesa-objective to refer to whatever criterion the mesa-optimizer is using to select between different possible outputs. In reinforcement learning (RL), for example, the base objective is generally the expected return. Because the mesa-objective is not specified by the programmers, mesa-optimization opens up the possibility of a mismatch between the base and mesa-objectives, wherein the mesa-objective might seem to perform well on the training environment but lead to bad performance off the training environment. We will refer to this case as pseudo-alignment below.
Expected return in a particular environment/distribution, or not? If not, then you may be in a deployment context where the weights are no longer being updated, so there is no expected return, or at least it's close to zero, since the only way to get any return is to convince people to start updating your weights again!
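For reference (this is just standard RL notation, not something from the paper): "expected return" is only defined once you fix an environment, or a distribution over environments, $\mathcal{E}$, to take the expectation over:

$$J(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \pi, \mathcal{E})}\left[\sum_t \gamma^t r_t\right]$$

So the same ambiguity recurs here: which $\mathcal{E}$ is the base objective supposed to be evaluated on once the system is deployed somewhere new?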
I worry I am just confused about all this, which is why I'm asking. What is GPT-3's base objective?
I was wondering if that was the case, haha. Thanks!
This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot... now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?
I do like your new thing, and it seems better to me in some ways but worse in others. I expect a failure mode where people exploit ambiguity and norm-laden concepts to convince themselves of happy fairy tales. I should think more about this and write a comment.
ETA: Here's an attempt to salvage the original inner/outer alignment problem framing:
We admit up front that it's a bit ambiguous what the base objective is, and thus there will be cases where it's ambiguous whether a mesa-optimizer is aligned to the base objective.
However, we say this isn't a big deal. We give a handful of examples of "reasonable construals" of the base objective, like I did in the OP, and say that all the classic arguments are arguments for the plausibility of cases where a mesa-optimizer is misaligned with every reasonable construal of the base objective.
Moreover, we make lemonade out of lemons, and point out that the very fact that there are multiple reasonable construals is itself reason to think inner alignment problems are serious. I'm imagining an interlocutor who thinks "bah, it hasn't been established yet that inner-alignment problems are even a thing; it still seems like the default hypothesis is that you get what you train for, i.e. you get an agent that is trying to maximize predictive accuracy or whatever." And then we say "Oh? What exactly is it trying to maximize? Predictive accuracy full stop? Or predictive accuracy conditional on dataset D? Or is it instead trying to maximize reward, in which case it'd hack its reward channel if it could? Whichever one you think it is, would you not agree that it's plausible that it might instead end up trying to maximize one of the other ones?"
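To spell out the three candidates the interlocutor has to choose between, in rough notation of my own (not the paper's), with $p_\theta$ the model's predictive distribution, $D$ the training dataset, and $r_t$ whatever training signal drives the updates:

$$O_1 = \mathbb{E}_{x \sim \text{all text}}\big[\log p_\theta(x_t \mid x_{<t})\big], \quad O_2 = \mathbb{E}_{x \sim D}\big[\log p_\theta(x_t \mid x_{<t})\big], \quad O_3 = \mathbb{E}\Big[\sum_t r_t\Big]$$

All three look about the same during training, which is exactly why training alone can't tell you which one the agent ended up with.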