Because an AI built as a utility-maximizer will consider any rules restricting its ability to maximize its utility as obstacles to be overcome. If an AI is sufficiently smart, it will figure out a way to overcome those obstacles. If an AI is superintelligent, it will figure out ways to overcome those obstacles which humans cannot predict even in theory and so cannot prevent even with multiple well-phrased fail-safes.
A paperclip maximizer with a built-in rule "Only create 10,000 paperclips per day" will still want to maximize paperclips. It can do this by deleting the offending fail-safe, or by creating other paperclip maximizers without the fail-safe, or by creating giant paperclips which break up into millions of smaller paperclips of their own accord, or by connecting the Earth to a giant motor which spins it at near-light speed and changes the length of a day to a fraction of a second.
Unless you feel confident you can think of every way it will get around the rule and block it off, and think of every way it could get around those rules and block them off, and so on ad infinitum, the best thing to do is to build the AI so it doesn't want to break the rules - that is, Friendly AI. That way you have the AI cooperating with you instead of trying to thwart you at every turn.
Related: Hidden Complexity of Wishes
True, but irrelevant, because humanity has never produced a provably-correct software project anywhere near as complex as an AI would be, we probably never will, and even if we had a mathematical proof it still wouldn't be a complete guarantee of safety because the proof might contain errors and might not cover every case we care about.
The right question to ask is not, "will this safeguard make my AI 100% safe?", but rather "will this safeguard reduce, increase, or have no effect on the probability of disaster, and by how much?" (And then separately, at some point, "what is the probability of disaster now and what is the EV of launching vs. waiting?" That will depend on a lot of things that can't be predicted yet.)
What is the difference between "a rule" and "what it wants"?
I'm interpreting this as the same question you wrote below as "What is the difference between a constraint and what is optimized?". Dave gave one example but a slightly different metaphor comes to my mind.
Imagine an amoral businessman in a country that takes half his earnings as tax. The businessman wants to maximize money, but has the constraint is that half his earnings get taken as tax. So in order to achieve his goal of maximizing money, the businessman sets up some legally permissible deal with a foreign tax shelter or funnels it to holding corporations or something to avoid taxes. Doing this is the natural result of his money-maximization goal, and satisfies the "pay taxes" constraint..
Contrast this to a second, more patriotic businessman who loved paying taxes because it helped his country, and so didn't bother setting up tax shelters at all.
The first businessman has the motive "maximize money" and the constraint "pay taxes"; the second businessman has the motive "maximize money and pay taxes".
From the viewpoint of the government, the first businessman is an unFriendly agent with a constraint, and the second businessman is a Friendly agent.
Does that help answer your question?
Asking what it really values is anthropomorphic. It's not coming up with loopholes around the "don't murder" people constraint because it doesn't really value it, or because the paperclip part is its "real" motive.
It will probably come up with loopholes around the "maximize paperclips" constraint too - for example, if "paperclip" is defined by something paperclip-shaped, it will probably create atomic-scale nanoclips because these are easier to build than full-scale human-sized ones, much to the annoyance of the office-supply company that built it.
But paperclips are pretty simple. Add a few extra constraints and you can probably specify "paperclip" to a degree that makes them useful for office supplies.
Human values are really complex. "Don't murder" doesn't capture human values at all - if Clippy encases us in carbonite so that we're still technically alive but not around to interfere with paperclip production, ve has fulfilled the "don't murder" imperative, but we would count this as a fail. This is not Clippy's "fault" for deliberately trying to "get around" the anti-murder constraint, it's ...
The space of possible AI behaviours is large, you can't succeed by ruling parts of it out. It would be like a cake recipe that went
- Don't use avacados.
- Don't use a toaster.
- Don't use vegetables. ...
Clearly the list can never be long enough. Chefs have instead settled on the technique of actually specifying what to do. (Of course the analogy doesn't stretch very far, AI is less like trying to bake a cake, and more like trying to build a chef.)
A huge problem with failsafes is that a failsafe you hardcode into the seed AI is not likely to be reproduced in the next iteration that is built by the seed AI, which has, but does not care about, the failsafe. Even if some are left in as a result of the seed reusing its own source code, they are not likely to survive many iterations.
Does anyone who proposes failsafes have an argument for why their proposed failsafes would be persistant over many iterations of recursive self-improvement?
I believe that failsafes are necessary and desirable, but not sufficient. Thinking you can solve the friendly AI problem just by defining failsafe rules is dangerously naive. You not only need to guarantee that the AI correctly reimplements the safeguards in all its successors, you also need to guarantee that the safeguards themselves don't have bugs that cause disaster.
It is not necessarily safe or acceptable for the AI to shut itself down after it's been running for awhile, and there is not necessarily a clear line between the AI itself and the AI's tech...
I remember reading the argument in one of the sequence articles, but I'm not sure which one. The essential idea is that any such rules just become a problem to solve for the AI, so relying on a superintelligent, recursively self-improving machine to be unable to solve a problem is not a very good idea (unless the failsafe mechanism was provably impossible to solve reliably, I suppose. But here we're pitting human intelligence against superintelligence, and I, for one, wouldn't bet on the humans). The more robust approach seems to be to make the AI motivated to not want to do whatever the failsafe was designed to prevent it from doing in the first place, i.e. Friendliness.
This came up in the latest London Meetup, where I voiced a thought I've been having for a while. What if we created an epistemic containment area, effectively a simulated universe that contains the problem that we want solved? The AI will not even know anything else outside that universe exists and will have no way of gaining information about it. I think ciphergoth mentioned this is also David Chalmers' proposal too? In any case, I suspect we could prove containment within such a space, with us having read-only access to the results of the process.
This is a very timely question for me. I asked something very similar of Michael Vassar last week. He pointed me to Eliezer's "Creating Friendly AI 1.0" paper and, like you, I didn't find the answer there.
I've wondered if the Field of Law has been considered as a template for a solution to FAI--something along the lines of maintaining a constantly-updating body of law/ethics on a chip. I've started calling it "Asimov's Laws++." Here's a proposal I made on the AGI discussion list in December 2009:
"We all agree that a few simple laws...
Where are the arguments concerning this suggestion?
I once tried to fathom the arguments, I'm curious to hear your take on it.
This is a very timely question for me. I asked something very similar of Michael Vassar last week. He pointed me to Eliezer's "Creating Friendly AI 1.0" paper and, like you, I didn't find the answer there.
I've wondered if the Field of Law has been considered as a template for a solution to FAI--something along the lines of maintaining a constantly-updating body of law/ethics on a chip. I've started calling it "Asimov's Laws++." Here's a proposal I made on the AGI discussion list in December 2009:
"We all agree that a few simple laws...
You can't be serious. Human lawyers find massive logical loopholes in the law all the time, and at least their clients aren't capable of immediately taking over the world given the opportunity.
Many people think you can solve the Friendly AI problem just by writing certain failsafe rules into the superintelligent machine's programming, like Asimov's Three Laws of Robotics. I thought the rebuttal to this was in "Basic AI Drives" or one of Yudkowsky's major articles, but after skimming them, I haven't found it. Where are the arguments concerning this suggestion?