(This is an account of my checking a certain alignment idea and finding that it doesn't work. My thinking here is pretty naive and could easily be wrong.)
When thinking about AIs that are trained on some dataset and learn to extrapolate it, like the current crop of LLMs, I asked myself: can such an AI be aligned purely by choosing an appropriate dataset to train on? In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors? If we had such a dataset, we'd have an aligned AI.
But unfortunately it seems hard. For example, if the dataset includes instructions for building a nuke, then a bad actor could just ask for them. Moreover, if there's any circumstance at all under which we want the AI to say "here are the instructions for building a nuke" (to help a good actor stop an incoming asteroid, say), then a bad actor could extrapolate from that phrase and get the same result.
It seems the problem is that extrapolation doesn't have situational awareness. If the AI is based on extrapolating a certain dataset, there's no way to encode in the dataset itself which parts of it can be said when. And putting a thin wrapper on top, like ChatGPT, doesn't seem to help much, because from what I've seen it's easy enough to bypass.
What is the hope for alignment, then? Can we build an AI with situational awareness from the ground up, not relying on an "extrapolation core" (because the core would itself be an unaligned AI that bad actors could use)? I don't know.
EDIT: the sequel to this post is Aligned AI as a wrapper around an LLM.
This will be my last comment here. Thank you for trying to explain why you disagree with me!
I"m impressed that you passed my ITT. I think analogies to other alignment problems like the human alignment problem miss that it's the most difficult setting, but you don't need to play on that difficulty, because AI is very different from humans.
While I definitely overclaimed on how much it solves the alignment problems, I think this is underselling the accomplishments. It's an incomplete solution, in that it doesn't do everything on its own, but it does carry a lot of weight.
To talk about deceptive alignment more specifically: deceptive alignment is essentially the situation where the AI isn't aligned with human goals and tries to hide that fact. One of its key prerequisites is that the AI is optimizing a non-myopic goal. It's the most dangerous form of misalignment, since the AI is aligned only for instrumental, not terminal, reasons.
What Pretraining from Human Feedback did was finally marry a myopic goal to competitive capabilities. Once the myopic goal of conditional training is added, deceptive alignment goes away, since optimizing a non-myopic goal is one of its key prerequisites.
This is definitely right, and I did overclaim here, though I do remember Pretraining from Human Feedback claimed to do this:
That vindicated a narrower claim about the AI's inability to hack or affect the training distribution, though I don't know how much it supports my thesis on the immunity-to-hacking claims.
To give another reason why I'm so optimistic about alignment: I think alignment is scalable. Put another way, while Pretraining from Human Feedback is imperfect right now (and even with its imperfections, my view is that it would avoid X-risk almost entirely), small, consistent improvements in the vein of empirical work will eventually make models far more aligned than the original Pretraining from Human Feedback work did. In the case of more data, they tested it, and alignment increased with more data.
To edit a quote from Thoth Hermes:
This essentially explains my issues with the idea that alignment isn't scalable.