Hello! quinesie here. I discovered LessWrong after being linked to HP&MoR, enjoying it, and then following the links back to the LessWrong site itself. I've been reading for a while, but, as a rule, I don't sign up with a site unless I have something worth contributing. After reading Eliezer's Hidden Complexity of Wishes post, I think I have that:
In the post, Eliezer describes a device called an Outcome Pump, which resets the universe repeatedly until the desired outcome occurs. He then goes on to describe why this is a bad idea, since it can't understand what it is that you really want, in a way that is analogous to an unFriendly AI being programmed to naively maximize something (like paper clips) that humans say they want maximized, even when what they really want is something much more complex that they have trouble articulating well enough to describe to a machine.
My idea, then, is to take the Outcome Pump and make a 2.0 version that uses the same mechanism as the original Outcome Pump, but with a slightly different trigger: the Outcome Pump resets the universe whenever a set period of time passes without an "Accept Outcome" button being pressed to prevent the reset. To convert back to AI theory, the analogous AI would be one which simulates the world around it, reports the projected outcome to a human, and then waits for the result to be accepted or rejected. If accepted, it implements the solution. If rejected, it goes back to the drawing board and crunches numbers until it arrives at the next non-rejected solution.
This design could of course be improved upon by adding parameters to automatically reject outcomes which are obviously unsuitable, or which contain events that, ceteris paribus, we would prefer to avoid, just as with the standard Outcome Pump and its analogue in unFriendly AI. The chief difference between the two is that the failure mode for version 2.0 isn't a catastrophic "tile the universe with paper clips/launch mother out of burning building with explosion" but rather the far more benign "submit utterly inane proposals until given more specific instructions or turned off".
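To make the propose/filter/accept loop concrete, here is a minimal sketch in Python. Every name in it (`run_outcome_pump_v2`, `auto_reject`, `ask_human`) is my own illustration, not anything from the original post; the point is just that the worst case is exhausting the candidate list, not implementing a bad outcome.

```python
def run_outcome_pump_v2(candidate_outcomes, auto_reject, ask_human):
    """Iterate over proposed outcomes; automatically filter obviously
    unsuitable ones, then act only on explicit human acceptance."""
    for outcome in candidate_outcomes:
        if auto_reject(outcome):     # parameters screening out obviously
            continue                 # unsuitable outcomes
        if ask_human(outcome):       # the "Accept Outcome" button
            return outcome           # implement the accepted solution
        # rejected: back to the drawing board, try the next candidate
    return None                      # no acceptable proposal; await new instructions


# Hypothetical usage: the filter screens out the catastrophic option,
# and the human accepts the remaining one.
candidates = [
    "launch mother out of burning building with explosion",
    "carry mother down the stairs",
]
choice = run_outcome_pump_v2(
    candidates,
    auto_reject=lambda o: "explosion" in o,
    ask_human=lambda o: True,
)
# choice == "carry mother down the stairs"
```

Note that the AI never acts on its own: the only autonomous behavior is generating and filtering proposals, so the failure mode is inaction rather than a bad action.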
This probably has some terrible flaw in it that I'm overlooking, of course, since I am not an expert in the field; but if so, the flaw isn't obvious enough for a layman to see. Or, just as likely, someone else came up with it first and published a paper describing exactly this. So I'm asking here.
In this document from 2004, Yudkowsky describes a safeguard to be added "on top of" programming Friendliness: a Last Judge. The idea is that the FAI's goal is initially only to compute what an FAI should do. Then the Last Judge looks at the FAI's report and decides whether or not to switch the AI's goal system to implement the described world. The document should not be taken as representative of Yudkowsky's current views, because it's been marked obsolete, but I favor the idea of having a Last Judge check to make sure before anybody hits the red button.