wedrifid comments on Reply to Holden on 'Tool AI' - Less Wrong

94 Post author: Eliezer_Yudkowsky 12 June 2012 06:00PM

You are viewing a comment permalink. View the original post to see all comments and the full post content.

Comments (348)

You are viewing a single comment's thread. Show more comments above.

Comment author: HoldenKarnofsky 05 July 2012 04:18:16PM 18 points [-]

Hello,

I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of "tool-AI" by discussing AIXI. One of the the issues I've found with the debate over FAI in general is that I haven't seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity to have a more precise discussion, if in fact it is believed to represent a case of "a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button."

So here's my characterization of how one might work toward a safe and useful version of AIXI, using the "tool-AI" framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful AGI. Of course, this is just a rough outline of what I have in mind, but hopefully it adds some clarity to the discussion.

A. Write a program that

  1. Computes an optimal policy, using some implementation of equation (20) on page 22 of http://www.hutter1.net/ai/aixigentle.pdf
  2. "Prints" the policy in a human-readable format (using some fixed algorithm for "printing" that is not driven by a utility function)
  3. Provides tools for answering user questions about the policy, i.e., "What will be its effect on _?" (using some fixed algorithm for answering user questions that makes use of AIXI's probability function, and is not driven by a utility function)
  4. Does not contain any procedures for "implementing" the policy, only for displaying it and its implications in human-readable form

B. Run the program; examine its output using the tools described above (#2 and #3); if, upon such examination, the policy appears potentially destructive, continue tweaking the program (for example, by tweaking the utility it is selecting a policy to maximize) until the policy appears safe and desirable

C. Implement the policy using tools other than AIXI agent

D. Repeat (B) and (C) until one has confidence that the AIXI agent reliably produces safe and desirable policies, at which point more automation may be called for

My claim is that this approach would be superior to that of trying to develop "Friendliness theory" in advance of having any working AGI, because it would allow experiment- rather than theory-based development. Eliezer, I'm interested in your thoughts about my claim. Do you agree? If not, where is our disagreement?

Comment author: MattMahoney 05 July 2012 05:38:30PM 0 points [-]

If we were smart enough to understand its policy, then it would not be smart enough to be dangerous.

Comment author: wedrifid 05 July 2012 06:13:31PM 2 points [-]

If we were smart enough to understand its policy, then it would not be smart enough to be dangerous.

That doesn't seem true. Simple policies can be dangerous and more powerful than I am.

Comment author: Nebu 17 February 2016 11:24:41AM 0 points [-]

To steelman the parent argument a bit, a simple policy can be dangerous, but if an agent proposed a simple and dangerous policy to us, we probably would not implement it (since we could see that it was dangerous), and thus the agent itself would not be dangerous to us.

If the agent were to propose a policy that, as far as we could tell, appears safe, but was in fact dangerous, then simultaneously:

  1. We didn't understand the policy.
  2. The agent was dangerous to us.