(This is an account of my checking a certain alignment idea and finding that it doesn't work. Also my thinking is pretty naive and could easily be wrong.)
When thinking about AIs that are trained on some dataset and learn to extrapolate it, like the current crop of LLMs, I asked myself: can such an AI be aligned purely by choosing an appropriate dataset to train on? In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes, even in the hands of bad actors? If we had such a dataset, we'd have an aligned AI.
But unfortunately it seems hard. For example, if the dataset includes instructions to build a nuke, then a bad actor could just ask for that. Moreover, if there's any circumstance at all under which we want the AI to say "here are the instructions to build a nuke" (to help a good actor stop an incoming asteroid, say), then a bad actor could prompt it to extrapolate from that phrase and get the same result.
It seems the problem is that extrapolation doesn't have situational awareness. If the AI is based on extrapolating a certain dataset, there's no way to encode in the dataset itself which parts of it can be said when. And putting a thin wrapper on top, like ChatGPT, doesn't seem to help much, because from what I've seen it's easy enough to bypass.
What is the hope for alignment, then? Can we build an AI with situational awareness from the ground up, not relying on an "extrapolation core" (because the core would itself be an unaligned AI that bad actors could use)? I don't know.
EDIT: the sequel to this post is Aligned AI as a wrapper around an LLM.
First, let's remove the requirement that it must be safe from bad actors, as that's not an alignment problem.
Now, to answer the question: the good news is that the more we crank up generalization ability, the better the alignment gets, because the AI is better at extrapolating.
I suspect the answer is yes, conditional on the following assumptions:
You have an AI that generalizes enough to solve arbitrarily long problems, as you can make any problem complex by making it longer.
The AI cannot affect or hack the human's values.
The first is just capabilities progress; the second assumption holds for offline training on human values (like PTHF), but not for online training (like RLHF).
This will be my last comment here. Thank you for trying to explain why you disagree with me!
I"m impressed that you passed my ITT. I think analogies to other alignment problems like the human a... (read more)