orthonormal comments on Superintelligent AGI in a box - a question. - Less Wrong
If you only want the AI to solve things like optimization problems, why would you give it a utility function? I can see a design for a self-improving optimization problem solver that is completely safe because it doesn't operate using utility functions:
This would produce a self-improving AGI that would do quite well on the sample optimization problems and on new, unobserved ones. I don't see much danger in this setup, because the program would have no reason to create malicious output: doing so would just increase complexity without improving performance on the training set, so it would be rejected under criterion (3).
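A minimal sketch of the acceptance rule this argument relies on, assuming criterion (3) means "at least as good on the training set, and no more complex." Everything here (`Problem`, `Optimizer`, the length-as-complexity proxy) is illustrative, not from the original proposal:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    """Toy verifiable optimization problem: score an answer by closeness to a target."""
    target: int

    def score(self, answer: int) -> float:
        return -abs(answer - self.target)  # higher is better

@dataclass
class Optimizer:
    """Stand-in for a candidate program; its source length proxies complexity."""
    source: str
    guess: int

    def solve(self, problem: Problem) -> int:
        return self.guess

def training_score(opt: Optimizer, problems: list) -> float:
    """Verified total score on the fixed training set."""
    return sum(p.score(opt.solve(p)) for p in problems)

def accept_successor(current: Optimizer, candidate: Optimizer, problems: list) -> bool:
    """Criterion (3) as read here: the candidate must do at least as well
    on the training set without increasing complexity. Bundled malicious
    machinery adds length without adding verified score, so it loses."""
    return (training_score(candidate, problems) >= training_score(current, problems)
            and len(candidate.source) <= len(current.source))
```

Under this gate, a successor that hides a payload measures as more complex while scoring no better, so it is rejected, which is the intuition in the paragraph above.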
EDIT: after some discussion, I've decided to add some notes:
I claim that if the AI is created this way, it will be safe and do very well on verifiable optimization problems. So if this thing works, I've solved friendly AI for verifiable problems.
This seems like a better-than-average proposal, and I think you should post it on Main, but failure to imagine a loophole in a qualitatively described algorithm is far from a proof of safety.
My biggest intuitive reservation is that you don't want the iterations to be "too creative/clever/meta", or they'll come up with malicious ways to let themselves out (in order to grab enough computing power that they can make better progress on criterion 3). How will you be sure that the seed won't need to be that creative already in order for the iterations to get anywhere? And even if the seed is not too creative initially, how can you be sure its descendants won't be either?
Don't say you've solved friendly AI until you've really worked out the details.
Right, I think more discussion is warranted.
If general problem-solving is even possible, then an algorithm exists that solves the problems well without cheating.
I think this won't happen, because all the progress is driven by criterion (3). In order for a non-meta program (2) to create a meta-version, there would need to be some kind of benefit according to (3). Theoretically, if (3) were hackable, a proposed new version of (2) could exploit this; but I don't see why the current version of (2) would be any more likely than random chance to create hacky versions of itself.
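To make the "progress is driven only by (3)" point concrete, here is a toy hill-climb. Everything in it is hypothetical: a numeric guess stands in for program behavior, and a length counter stands in for complexity. The only selection pressure is the criterion-(3) gate, so mutations that add machinery without improving verified score are never kept:

```python
import random

def train_score(guess: int, targets: list) -> int:
    """Verified score on the training set: closer to every target is better."""
    return sum(-abs(guess - t) for t in targets)

def hill_climb(targets: list, steps: int = 200, seed: int = 0):
    """Each step proposes a mutated successor, which replaces the current
    version only if it scores at least as well and is no more complex.
    'Bloat' mutations (extra machinery, e.g. meta-gaming code) raise the
    length counter without raising verified score, so the gate rejects
    them every time; only genuine training-set improvements survive."""
    rng = random.Random(seed)
    guess, length = 0, 10
    for _ in range(steps):
        new_guess = guess + rng.choice([-1, 1])
        new_length = length + rng.choice([0, 1])
        if (train_score(new_guess, targets) >= train_score(guess, targets)
                and new_length <= length):
            guess, length = new_guess, new_length
    return guess, length
```

The caveat in the comment above still applies: this only shows that bloat confers no advantage *under this scoring rule*; if (3) itself were hackable, the gate would no longer filter what it is meant to filter.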
Ok, I've qualified my statement. If it all works, I've solved friendly AI for a limited subset of problems.
A couple of things:
To be precise, you're offering an approach to safe Oracle AI rather than Friendly AI.
In a nutshell, what I like about the idea is that you're explicitly handicapping your AI with a utility function that only cares about its immediate successor rather than its eventual descendants. It's rather like the example I posed where a UDT agent with an analogously myopic utility function allowed itself to be exploited by a pretty dumb program. This seems a lot more feasible than trying to control an agent that can think strategically about its future iterations.
To expand on my questions: note that in human beings, the sort of creativity that helps us write more efficient algorithms for a given problem is strongly correlated with the sort of creativity that lets people figure out why they're being asked the specific questions they are. If a bit of meta-gaming comes in handy at any stage, if modeling the world that originated these questions wins (over the alternatives it enumerated at that stage) on criterion (3) even once, then we might be in trouble.