In my previous post, "Are extrapolation-based AIs alignable?", I argued that an AI trained only to extrapolate some dataset (like an LLM) can't really be aligned, because it wouldn't know what information can be shared when and with whom. So to be used for good, it needs to be in the hands of a good operator.
This suggests that the "operator" of an LLM should be another, smaller AI wrapped around it and trained for alignment. It would take care of all interactions with the world, and decide when and how to call the internal LLM, thus delegating most of the intelligence work to the LLM.
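To make that division of labor concrete, here is a minimal sketch of the shape I have in mind. Everything in it is hypothetical: the names are made up, and inner_llm() is just a stub standing in for "call the big extrapolation-only model".

```python
# Hypothetical sketch: a small "operator" wrapper that mediates all traffic
# between the outside world and a large extrapolation-only model.
# Nothing here is a real API; inner_llm() stands in for the big LLM.

from dataclasses import dataclass


def inner_llm(prompt: str) -> str:
    """Stub for the extrapolation-only model (e.g. an LLM completion call)."""
    return f"<completion of: {prompt!r}>"


@dataclass
class Request:
    sender: str   # who is asking
    text: str     # what they asked


class Wrapper:
    """The small AI trained for alignment. It alone talks to the world,
    and decides when and how to consult the inner LLM."""

    def handle(self, req: Request) -> str:
        if not self.may_answer(req):
            return "I can't help with that."
        # Delegate the actual intelligence work to the inner model.
        draft = inner_llm(self.build_prompt(req))
        # Filter the draft before anything reaches the world.
        return self.redact(draft, req)

    def may_answer(self, req: Request) -> bool:
        # Placeholder policy: in reality, this judgment is exactly what
        # alignment training of the wrapper would have to produce.
        return True

    def build_prompt(self, req: Request) -> str:
        return f"User {req.sender} asks: {req.text}\nAnswer helpfully:"

    def redact(self, draft: str, req: Request) -> str:
        # Placeholder: decide what information may be shared, when, and with whom.
        return draft
```

The point of the sketch is only the topology: the world never talks to the LLM directly, and the LLM never acts on the world except through text the wrapper has chosen to pass along.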
Q1: In this approach, do we still need to finetune the LLM for alignment?
A: Hopefully not. We would train it only for extrapolation, and train the wrapper AI for alignment.
Q2: How would we train the wrapper?
A: I don't know. For the moment, handwave it with "the wrapper is smaller, and its interactions with the LLM are text-based, so training it for alignment should be simpler than training a big opaque AI for both intelligence and alignment at once". But it's very fuzzy to me.
Q3: If the LLM+wrapper combination is meant to be aligned, and the LLM isn't aligned on its own, wouldn't the wrapper need to know everything about human values?
A: Hopefully not, because information about human values can be coaxed out of the LLM (maybe by using magic words like "good", "Bertrand Russell", "CEV" and so on) and I'd expect the wrapper to learn to do just that.
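As a toy illustration of that "coaxing" (again with made-up names and a stubbed-out model call, not a real system), the wrapper could phrase a value question as plain text and read the extrapolator's answer back, instead of carrying a full model of human values itself:

```python
# Toy illustration (hypothetical): the wrapper extracts a value judgment
# from the extrapolation-only model by asking it in plain text.

def inner_llm(prompt: str) -> str:
    """Stub for the extrapolation-only model."""
    return "no"  # placeholder completion


def seems_acceptable(action_description: str) -> bool:
    """Ask the inner model to extrapolate a human moral judgment."""
    prompt = (
        "A thoughtful, good person is asked whether the following action "
        f"is acceptable: {action_description}\n"
        "They answer yes or no: "
    )
    return inner_llm(prompt).strip().lower().startswith("yes")


# The wrapper would run checks like this before letting anything out, e.g.:
# if seems_acceptable("share the user's address with a stranger"): ...
```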
Q4: Wouldn't the wrapper become a powerful AI of its own?
A: Again, hopefully not. My hypothesis is that its intelligence growth will be "stunted" by the availability of the LLM.
Q5: Wouldn't the wrapper be vulnerable to takeover by a mesa-optimizer in the LLM?
A: Yeah. I don't know how real that danger is. We probably need to see such mesa-optimizers in the lab, so we can train the wrapper to avoid invoking them.
Anyway, I understand that putting an alignment proposal out there is kinda sticking my head out. It's very possible that my whole idea is fatally incomplete or unworkable, like the examples Nate described. So please feel free to poke holes in it.
The problem I see: our values are defined in a stable way only inside the distribution, i.e. for situations similar to those we have already experienced.
Outside of it there may be many radically different extrapolations, each internally consistent and consistent with our values inside the distribution. And this is a problem not with the AI, but with the values themselves.
For example, there is no correct answer to what a human is, i.e. how much we can "improve" a human before it stops being human. We can choose different answers, and they will all be consistent with our pre-singularity concept of the human and will not contradict already established values.