I'll ask for feedback at the end of this post; please hold criticisms and judgements until then.
Forget every long, complicated, theoretical, mathsy alignment plan.
In my opinion, pretty much every single one of those is too complicated and isn't going to work.
Let's look at the one example we have of something dumb making something smart that isn't a complete disaster and at least try to emulate that first.
Evolution. (Again, hold judgements and criticisms until the end.)
What if you trained a smart model, on the level of, say, GPT-3, alongside a group of much dumber and slower models, in an environment like a game world or some other virtual world?
Dumber models whose utility functions you know, thanks to interpretability research. The smart, fast model, however, does not.
Every time the smart model does something that harms the utility function of the dumber models, it incurs a loss.
The smarter model will likely need to find a way to figure out the utility functions of the dumber models.
Eventually, you might have a model that's good at co-operating with a group of much dumber, slower models- which could be something like what we actually need!
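To make the proposal concrete, here is a minimal toy sketch of the loop described above. Everything in it is an illustrative assumption, not a worked-out training recipe: the resource world, the action set (`take`/`gather`/`give`), and the loss rule are all invented stand-ins. The one idea it does capture from the post is that the dumb agents' utility functions are readable (simulating interpretability), and the smart agent is penalized whenever its actions reduce them.

```python
# Toy sketch of the proposed setup. All names, dynamics, and numbers
# are illustrative assumptions, not part of any real training recipe.

def dumb_utility(resources, agent_id):
    """A dumb agent's utility: it wants its own resource count high.
    'Interpretability' is simulated by letting us read this directly."""
    return resources[agent_id]

def step_environment(resources, action):
    """Apply the smart agent's action and return the new resource state."""
    new = dict(resources)
    kind, target = action
    if kind == "take":       # harms the target dumb agent
        new[target] -= 1
        new["smart"] += 1
    elif kind == "gather":   # harms nobody
        new["smart"] += 1
    elif kind == "give":     # helps the target dumb agent
        new[target] += 1
    return new

def cooperation_loss(before, after, dumb_ids):
    """Penalize the smart agent for any drop in a dumb agent's utility."""
    loss = 0.0
    for a in dumb_ids:
        drop = dumb_utility(before, a) - dumb_utility(after, a)
        if drop > 0:
            loss += drop
    return loss

# Deterministic demo: "take" incurs loss, "gather" does not.
dumb_ids = ["d0", "d1"]
state = {"d0": 5, "d1": 5, "smart": 0}
after_take = step_environment(state, ("take", "d0"))
after_gather = step_environment(state, ("gather", None))
print(cooperation_loss(state, after_take, dumb_ids))    # 1.0
print(cooperation_loss(state, after_gather, dumb_ids))  # 0.0
```

In a real version, `cooperation_loss` would be added to the smart model's training objective, pushing it toward `gather`/`give`-style policies; and, per the post, the smart model would not see `dumb_utility` directly and would have to infer it from the dumb agents' behaviour.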
Please feel free to now post any criticisms, comments, judgements, etc. All are welcome.
It's an interesting starting point...
The key crux of any "semi-alignment plan for autonomous AI" is how would it behave under recursive self-improvement. (We are getting really close to having AI systems which will be competent in software engineering, including software engineering for AI projects, and including using all kinds of AutoML tricks, so we might be getting close to having AI systems competent in performing AI research.)
And an AI system would like its smarter successors to co-operate with it. And it would actually like smarter successors of other AI systems to be nice as well.
So, yes, this alignment idea might be of use (at least as a part of a larger plan, or an idea to be further modified)...
Yes, I think we are looking at "seeds of feasible ideas" at this stage, not at "ready to go" ideas...
I tried to look at what it would take for super-powerful AIs...
That's not too easy, but might be doable in a fashion invariant with respect to recursive self-modification (and might be more feasible than more traditional approaches to alignment).
Of course, the fact that we don't know what's sentient...