MattMahoney comments on Reply to Holden on 'Tool AI' - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (348)
Hello,
I appreciate the thoughtful response. I plan to respond at greater length in the future, both to this post and to some other content posted by SI representatives and commenters. For now, I wanted to take a shot at clarifying the discussion of "tool-AI" by discussing AIXI. One of the the issues I've found with the debate over FAI in general is that I haven't seen much in the way of formal precision about the challenge of Friendliness (I recognize that I have also provided little formal precision, though I feel the burden of formalization is on SI here). It occurred to me that AIXI might provide a good opportunity to have a more precise discussion, if in fact it is believed to represent a case of "a rare exception who specified his AGI in such unambiguous mathematical terms that he actually succeeded at realizing, after some discussion with SIAI personnel, that AIXI would kill off its users and seize control of its reward button."
So here's my characterization of how one might work toward a safe and useful version of AIXI, using the "tool-AI" framework, if one could in fact develop an efficient enough approximation of AIXI to qualify as a powerful AGI. Of course, this is just a rough outline of what I have in mind, but hopefully it adds some clarity to the discussion.
A. Write a program that
B. Run the program; examine its output using the tools described above (#2 and #3); if, upon such examination, the policy appears potentially destructive, continue tweaking the program (for example, by tweaking the utility it is selecting a policy to maximize) until the policy appears safe and desirable
C. Implement the policy using tools other than AIXI agent
D. Repeat (B) and (C) until one has confidence that the AIXI agent reliably produces safe and desirable policies, at which point more automation may be called for
My claim is that this approach would be superior to that of trying to develop "Friendliness theory" in advance of having any working AGI, because it would allow experiment- rather than theory-based development. Eliezer, I'm interested in your thoughts about my claim. Do you agree? If not, where is our disagreement?
If we were smart enough to understand its policy, then it would not be smart enough to be dangerous.
That doesn't seem true. Simple policies can be dangerous and more powerful than I am.
To steelman the parent argument a bit, a simple policy can be dangerous, but if an agent proposed a simple and dangerous policy to us, we probably would not implement it (since we could see that it was dangerous), and thus the agent itself would not be dangerous to us.
If the agent were to propose a policy that, as far as we could tell, appears safe, but was in fact dangerous, then simultaneously: