Eliezer_Yudkowsky comments on Reply to Holden on 'Tool AI' - Less Wrong

94 Post author: Eliezer_Yudkowsky 12 June 2012 06:00PM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (348)

You are viewing a single comment's thread. Show more comments above.

Comment author: Eliezer_Yudkowsky 14 June 2012 02:23:17AM 7 points [-]

Suppose your utility function U is in ring 0 and the parts of you that extrapolate consequences are in ring 3. If I can modify only ring 3, I can write my own utility function Q, write ring-3 code that first extrapolates consequences fairly, pick the one that maximizes Q, and then provides a "prediction" to ring 0 asserting that the Q-maximizing action has consequence X that U likes, while all other actions have some U-disliked or neutral consequence. Now the agent has been transformed from a U-maximizer to a Q-maximizer by altering only ring 3 code for "predicting consequences" and no code in ring 0 for "assessing utilities".

One would also like to know what happens if the current AI, instead of "self"-modifying, writes a nearly-identical AI running on new hardware obtained from the environment.

Comment author: Vaniver 15 June 2012 12:58:35AM *  0 points [-]

Suppose your utility function U is in ring 0 and the parts of you that extrapolate consequences are in ring 3.

Sure; that looks like the hallucination example I put forward, except in the prediction instead of the sensing area. My example was meant to highlight that it's hard to get a limitation with high specificity, and not touch the issue of how hard it is to get a limitation with high sensitivity. (I find that pushing people in two directions is more effective at communicating difficulty than pushing them in one direction.)

The only defense I've thought of against those sorts of hallucinations is a "is this real?" check that feeds into the utility function- if the prediction or sensation module fails some test cases, then utility gets cratered. It seems too weak to be useful: it only limits the prediction / sensation module when it comes to those test cases, and a particularly pernicious modification would know what the test cases are, leave them untouched, and make everything else report Q-optimal predictions. (This looks like it turns into a race / tradeoff game between testing to keep the prediction / sensation software honest and the costs of increased testing, both in reduced flexibility and spent time / resources. And the test cases might be vulnerable, and so on.)