drnickbone comments on New(ish) AI control ideas - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (14)
Maybe this has been said before, but here is a simple idea:
Directly specify a utility function U that you are not sure about, but subtract a penalty for the AI's own power as part of it. So the new utility function is U - power(AI), where power is a fast-growing function of some mix of the AI's source code complexity, intelligence, hardware, and electricity costs. One needs to be careful about how to define "self" in this case, since a careful redefinition by the AI would remove the controls.
One also needs to consider the creation of sub-agents with proper utilities, since in a naive implementation sub-agents will just optimize U without restrictions.
This is likely not enough, but it has the advantage that the AI has no a priori will to become stronger, which is better than boxing an AI that does.
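To make the proposal above concrete, here is a minimal sketch of U - power(AI), not a definitive implementation. Everything in it is an assumption for illustration: the `Agent` fields, the simple sum of resources, and the choice of an exponential as the "fast growing function" are hypothetical and do not come from the comment. The `include_subagents` flag shows the loophole mentioned above: if sub-agents are not counted as part of "self", a small parent can spawn a large sub-agent and pay almost no penalty.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    """Hypothetical resource profile of an agent (illustrative units)."""
    code_complexity: float
    intelligence: float
    hardware: float
    electricity_cost: float
    subagents: List["Agent"] = field(default_factory=list)


def resources(agent: Agent, include_subagents: bool) -> float:
    """Sum the agent's resources, optionally counting everything it spawned."""
    total = (agent.code_complexity + agent.intelligence
             + agent.hardware + agent.electricity_cost)
    if include_subagents:
        total += sum(resources(s, True) for s in agent.subagents)
    return total


def power(agent: Agent, include_subagents: bool = True) -> float:
    """Assumed fast-growing penalty: exp(resources) - 1, so zero resources cost nothing."""
    return math.exp(resources(agent, include_subagents)) - 1.0


def penalized_utility(u: float, agent: Agent, include_subagents: bool = True) -> float:
    """The proposed objective U - power(AI)."""
    return u - power(agent, include_subagents)


# A small parent that has spawned one large sub-agent.
child = Agent(2.0, 2.0, 2.0, 2.0)
parent = Agent(0.5, 0.5, 0.5, 0.5, subagents=[child])

naive = penalized_utility(100.0, parent, include_subagents=False)
strict = penalized_utility(100.0, parent, include_subagents=True)
```

With these made-up numbers, `naive` stays positive because only the parent's own resources are penalized, while `strict` is strongly negative once the sub-agent counts toward the power term.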
Presumably anything caused to exist by the AI (including copies, sub-agents, other AIs) would have to count as part of the power(AI) term? So this stops the AI spawning monsters which simply maximise U.
One problem is that any really valuable things (under U) are also likely to require high power. This could lead to an AI which knows how to cure cancer but won't tell anyone (because telling would have a very high impact, hence a big power(AI) term). That situation is not going to be stable; the creators will find it irresistible to hack U and get the AI to speak up.
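The trade-off described above can be sketched as a toy action choice. The action names, the values under U, the impact scores, and the exponential penalty are all hypothetical numbers picked for illustration, not estimates of anything real.

```python
import math

# Hypothetical actions: (value under U, impact counted toward power(AI)).
actions = {
    "stay_silent": (0.0, 0.0),
    "publish_minor_result": (5.0, 1.0),
    "announce_cancer_cure": (1000.0, 10.0),
}


def power_penalty(impact: float) -> float:
    """Assumed fast-growing penalty on impact: exp(impact) - 1."""
    return math.exp(impact) - 1.0


def score(name: str) -> float:
    """Penalized objective U - power(AI) for a single action."""
    value, impact = actions[name]
    return value - power_penalty(impact)


best_by_u = max(actions, key=lambda name: actions[name][0])
best_penalized = max(actions, key=score)
```

Under plain U the best action is `announce_cancer_cure`, but under the penalized objective the agent prefers `publish_minor_result`, because the penalty on the high-impact action swamps its value. How sharply this happens depends entirely on the assumed penalty function.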
I'm looking at ways round that kind of obstacle. I'll be posting them someday if they work.