A putative new idea for AI control; index here.

In previous posts, I posited AIs that care only about virtual worlds - in fact, that are defined as processes in virtual worlds, similarly to cousin_it's idea. How could this work? We would want the AI to reject offers of outside help - be they ways of modifying its virtual world, or ways of giving it extra resources.

Let V be a virtual world, over which a utility function u is defined. The world accepts a single input string O. Let P be a complete specification of an algorithm, including the virtual machine it is run on, the amount of memory it has access to, and so on.
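
To fix intuitions, the objects above could be typed roughly as follows. This is a minimal Python sketch of my own, not part of the formalism: the names VirtualWorld, Output, Utility and Program are introduced purely for illustration, and ∅ is modelled as None.

```python
from typing import Callable, Optional

# V: a fully specified virtual world, represented here by its description.
VirtualWorld = str

# O: the single input string the world accepts; None stands in for the empty input ∅.
Output = Optional[str]

# u: a utility function defined over the virtual world.
Utility = Callable[[VirtualWorld], float]

# P: a complete specification of an algorithm (virtual machine, memory bound, ...),
# modelled as a map from a world description to an output string.
Program = Callable[[VirtualWorld], Output]
```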

Fix some threshold T for u (to avoid the subtle weeds of maximising). Define the statement:

r(P,O,V,T): "P(V) returns O, and either E(u|O)>T or O=∅"
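
Read operationally (glossing over the fact that r is a statement to be proved about P, not something to be run), r could be sketched like this, continuing the hypothetical names above. The stub expected_utility stands in for the estimator E and is not something I know how to implement.

```python
# E(u|O): the expected value of u over V, given that V receives input O.
# One of the two "magic" ingredients; left as an unimplemented stub.
def expected_utility(V: VirtualWorld, O: str) -> float:
    raise NotImplementedError("stand-in for the super-intelligent estimator E")

# r(P, O, V, T): "P(V) returns O, and either E(u|O) > T or O = ∅".
def r(P: Program, O: Output, V: VirtualWorld, T: float) -> bool:
    if P(V) != O:        # P, run on V, must actually output O
        return False
    return O is None or expected_utility(V, O) > T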

And the string valued program:

Q(V,P,T): "If you can find a non-empty O such that r(P,O,V,T) holds, return that O. Else return ∅."
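
In the same sketch, Q would look like the following, where find_proof is a stub I am introducing for the "find" step (a proof search over statements of the form r(P, O, V, T)):

```python
# "find": search for a non-empty O such that r(P, O, V, T) can be established.
# The second "magic" ingredient; returns such an O, or None if none is found.
def find_proof(P: Program, V: VirtualWorld, T: float) -> Output:
    raise NotImplementedError("stand-in for the super-intelligent proof search")

# Q(V, P, T): "If you can find a non-empty O such that r(P, O, V, T), return that O. Else return ∅."
def Q(V: VirtualWorld, P: Program, T: float) -> Output:
    O = find_proof(P, V, T)
    return O             # None plays the role of ∅
```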

Here "find" and "E" are where the magic-super-intelligence-stuff happens.

Now, it seems to me that Q(V,Q,T) is the program we are looking for. It is uninterested in offers to modify the virtual world, because E(u|O)>T is defined over the unmodified virtual world. We can set it up so that the first thing it proves is something like "If I (i.e. Q) prove E(u|O)>T, then r(Q,O,V,T)." If we offer it more computing resources, it can no longer make use of that assumption, because "I" will no longer be Q.
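
To show how the self-application Q(V,Q,T) could be tied together in the sketch above: fix T, and pass as the candidate program P a closure that is the agent itself, so that the output the agent produces is exactly the output whose provability it checks. The threshold, world description and the name agent below are all illustrative, and nothing here runs until the two stubs are filled in.

```python
T = 0.9                                              # arbitrary illustrative threshold
V = "<complete description of the virtual world>"    # placeholder world

def agent(world: VirtualWorld) -> Output:
    # The "P(V) returns O" clause of r refers to this very function:
    # the agent outputs whatever it can prove satisfies r with P = agent.
    return Q(world, agent, T)

# O_star = agent(V)   # would invoke the (stubbed) proof search with P = agent
```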

Does this seem like a possible way of phrasing the self-containment requirements? For the moment, this seems to make it reject small offers of extra resources, and be indifferent to large offers.
