A huge problem with failsafes is that a failsafe you hardcode into the seed AI is not likely to be reproduced in the next iteration that is built by the seed AI, which has, but does not care about, the failsafe. Even if some are left in as a result of the seed reusing its own source code, they are not likely to survive many iterations.
Does anyone who proposes failsafes have an argument for why their proposed failsafes would be persistant over many iterations of recursive self-improvement?
Does anyone who proposes failsafes have an argument for why their proposed failsafes would be persistant over many iterations of recursive self-improvement?
I think the problem that people have who propose failsafes is "iterations" and "recursive self-improvement". There are a vast amount of assumptions buried in those concepts that are often not shared by mainstream researchers or judged to be premature conclusions.
Many people think you can solve the Friendly AI problem just by writing certain failsafe rules into the superintelligent machine's programming, like Asimov's Three Laws of Robotics. I thought the rebuttal to this was in "Basic AI Drives" or one of Yudkowsky's major articles, but after skimming them, I haven't found it. Where are the arguments concerning this suggestion?