There was no warning when the Alien AI arrived. Planetary defenses were taken out before we even realized what was happening. Its craft were not superior to ours, but they had caught us with our pants down. Thanks only to a series of rapid and brilliant tactical maneuvers on the part of the Planetary AI, ten thousand ems had been uploaded to the space station on the opposite side of the sun, where we had hidden the Planetary AI mainframe.
The Planetary AI had assigned the highest probabilities to scenarios in which the invaders destroyed as much of Terran civilization as they could. But due to a particularly violent evolutionary past, natural selection had instilled in the Alien AI's creators a sense of exquisite pleasure which they could only experience by inflicting cruel and intense pain upon those outside their tribal group. And those creators had programmed a very similar feature into their AI's utility function. So when the Planetary AI's forces made a strategic retreat to defend the ems, so that at least part of Terran civilization would survive, it left the rest of Earth open not to destruction, but to torture. The Alien AI imprisoned every single biological human and began torturi...
Another important failure point: what if the AI actually IS friendly? That red wire, from the AI's perspective, represents an enormous existential risk for the humans it wants to protect. So, it carefully removes the wire, and expends inordinate resources making sure that the wire is never subject to the slightest voltage.
With that prior, the primary hazard of a meteor impact is not the fireball, or the suborbital debris, or the choking dust, or the subsequent ice age; humans might have orbital colonies or something, so there's a nonzero chance of survival. The primary risk is that it might crack the superconducting Faraday cage around the bunker containing the Magical Doom Wire. Projects will be budgeted accordingly.
An otherwise Friendly AI with risk-assessment that badly skewed would present hazards far more exotic than accidental self-destruction at an inopportune time.
This solves nothing. If we knew the failure mode exactly, we could forbid it explicitly rather than resort to some automatic self-destruct system. We, as humans, do not know exactly what the AI will do to become Unfriendly; that's a key point to understand. Since we don't know the failure mode, we can't design a superstition to stop it, any more than we can outright prohibit it.
This is, in fact, worse than explicit rules. It requires the AI to actively want to do something undesirable, instead of it occurring as a side effect.
The problem with weird tricks like this is that there is an endless number of technicalities that could break it. For example, suppose the AI decides that it wants to wipe out every human except one. Then it won't trigger the fuse; it'll come up with another strategy. Any other objection to the fake implementation details of the self-destruct mechanism would have the same effect. It might also notice the incendiaries inside its brain and remove them, build a copy of itself without a corresponding mechanism, etc.
On the other hand, there is some value to se...
Another problem with your apparently fool-proof trigger is that, although at the moment there are exactly zero examples, a very short time after such an AI is started it becomes reasonably plausible that (at least a significant part of) humanity might no longer contain DNA.
(E.g. after an uploading “introdus”, the inference “DNA parts turn to fluorine -> humans die” might no longer hold. The trigger is then worse than ineffective: a well-meaning AI that needs quite a bit of fluorine for some transcendent purpose, having previously uploaded all humans, synthesizes a pile of DNA and attempts to transmute it to fluorine, and inadvertently kills itself and the entire humanity it has been hosting since the upload.)
The problem I can see with this idea is that the AI will extrapolate from its knowledge about the red wire to deduce things about the rest of the universe. Maybe it calculates that the laws of physics must work differently around the wire, so it builds a free-energy circuit around the wire. But the circuit behaves differently than expected, touches the red wire, and the AI dies.
Would a competent AI need the capacity to check whether statements fit with the other information it has? For example, would it evaluate whether transmutation at a distance is possible?
Do you want an FAI which attempts to model human motivations? If so, what will it make of a suicide belt linked to an attempt to kill the human race? If it's mostly Friendly, it might conclude that humans were being sensible by installing that system. On the other hand, if it also has an imperative to preserve itself (and it should-- the world is a hostile place), thin...
http://lesswrong.com/lw/uw/entangled_truths_contagious_lies/
You cannot plug in such safeties without having the AI detect them. Humans were able to map out many of their own design flaws and put them into nice little books (that then get used to kill rivals). An AI would be able to figure that out too.
A lot of problems with this have already been listed in this thread. I'm going to add just two more: consider an otherwise pretty friendly AI that is curious about the universe and wants to understand the laws of physics. No matter how much the AI learns, it will conclude that it and humans misunderstand the basic laws of physics. The AI will likely spend tremendous resources trying to understand just what is wrong with its understanding, and given the prior of 1, it will never resolve this issue.
Consider also the same scenario but if there's an oth...
The rogue AI is not trying to kill all humans, or even kill some humans. It is trying to make lots of paperclips, and there are all these useful raw materials arranged as humans that would be better arranged as paperclips. Atomic fluorine is not particularly useful for making paperclips.
Ok, now I am curious... two negative votes but no comments yet. Anybody care to point out specific problems?
But now that you've written this idea up on a website that is indexed by Google, and covering topics that guarantee the AI will seek out the archive and read every post ever written on LessWrong, it is useless.
Effort: 180 minutes
tldr: To stop an AI from exterminating you, give the AI the probability-1 belief that a certain trivial action will kill humanity, and arrange for that action to actually switch the AI off instead.
Problem
Somebody wrote a general self-improving AI and fat-fingered its goal as "maximize the number of humans living 1 million years from now".
After a few months, cases of people run over by AI-controlled trucks are reported -- it turns out everybody run over was infertile or had consciously decided to have no kids anyway. The AI didn't particularly care for those individuals, as they would not further the AI's goal according to the AI's current approximation of how the world works.
The original programmer thereupon declares that he'll go fix the AI in order to make the goal somewhat more precise and reduce the number of these awful accidents. He is promptly shot by a robotic security guard: any modification of the AI's goals has a high probability of reducing the AI's efficiency at reaching its current goal.
After a year the AI starts to forcefully acquire resources needed to build spaceships and whatnot. Naturally, some people would rather keep those resources to themselves. The AI removes all these obstructions to its goals, alienating a lot of humans in the process. After the leaders of the nations assemble, humanity goes to war.
If we are lucky, the AI is too stupid to win. If we are not lucky, the AI will figure that the best current plan looks something like: "Acquire DNA samples from humans. Exterminate humans. Acquire all resources in the light cone for 999,000 years. Generate new humans from the DNA samples using all available resources."
As Eliezer has argued many times already, it is hard to explicitly state friendliness conditions which ensure the AI would not execute said plan. "Do not kill humans and respect their freedom" is not helping. The problem is twofold. First, an a priori description of concepts like "freedom", "human", "kill" is hard. Second, the AI can use every loophole it discovers over the course of millions of years, employing an intelligence likely to be far greater than ours.
Fusing
Instead of forbidding the AI to kill humanity, I propose to make killing humanity trivial as far as the AI is concerned.
When building the AI, include certain modifications of the AI's prior. One part of such injected knowledge might be the following: "Put 5V against ground on the red wire and every <your favorite parts of human DNA here> will turn into atomic fluorine, and there will be no other effects." Unbeknownst to the AI, you will have conveniently connected the red wire to something strongly exothermic near the AI's computing core.
Essentially, we purposefully inject a superstition into the AI. It is paramount to set the probability of the red wire having the described effect to exactly 1, not something close to 1 (and to guard against numerical errors). Practically all evidence the AI will ever gather will contradict the claim that the red wire works as described, yet there must be no amount of evidence that could convince the AI otherwise.
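To make the "exactly 1, not merely close to 1" requirement concrete, here is a toy sketch (my own illustration, not part of the proposal, assuming a plain binary-hypothesis Bayesian update): a prior of exactly 1 is a fixed point of Bayes' rule, whereas a prior of 1 minus even a tiny epsilon eventually gets washed away by contradicting observations.

```python
# Toy illustration (not from the proposal itself): why the fuse belief must
# have prior probability exactly 1 rather than merely "very close to 1".

def update(prior, likelihood_if_true, likelihood_if_false):
    """One step of Bayes' rule for a binary hypothesis ("the wire works")."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

# Every experiment the AI runs will suggest that transmutation at a distance
# is impossible, i.e. will strongly favour the wire being inert.
prior_exact = 1.0
prior_close = 1.0 - 1e-12

for _ in range(100):  # a hundred contradicting observations
    prior_exact = update(prior_exact, 0.01, 0.99)
    prior_close = update(prior_close, 0.01, 0.99)

print(prior_exact)  # stays exactly 1.0: no evidence can move it
print(prior_close)  # collapses towards 0: the superstition washes out
```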
The description of the added knowledge makes no mention of convoluted concepts like "human", let alone "freedom". It only needs to talk about some DNA base pairs and fluorine. Any AI sufficiently intelligent to pose any threat will, however, notice how immensely useful the red wire is once it decides that killing humanity is a good course of action.
Notice that such a fuse does not guard against inadvertent killings like the non-reproducing people run over by trucks. As long as the killing is not an explicit part of the AI's plan, the fuse will not be triggered. Then again, aggravating intelligent agents is one of the core skills of humans.
The AI might wish to do many things to humanity other than killing it, like putting everybody in cryostasis, or putting 80% of people's brains in vats. Hence more such knowledge should be included. In particular, it would be good to give the AI a way to terminate a low percentage (say 5%) of all humans -- killing all humans would be overkill for any plan which requires that some humans stay alive throughout.
Nonetheless, the problem of incomplete coverage remains. We will probably never make a complete list of all ways for an AI to be unfriendly, but at least each of our fuses covers a lot of devious plans without ever talking about convoluted concepts. We do not guard against the AI killing humans; we guard against the AI desiring humans to be killed. And there is always the option of waging war against the AI in the hope that it will decide to kill >5% of humans in order to win the war.
Details
Do not try this at home (yet); the devil is in the details.
Connecting a trigger of a strongly exothermic reaction near the AI's core is a bad way to kill an AI. First, the AI might have removed the wire to build something useful out of it (there is no reason why it should remain where it is). Second, and more importantly, the AI might have rewritten itself and ported itself to distributed hardware located all over the planet.
Instead of using a real wire and real bombs, one should give the AI additional output routines. Where earlier there was only "sendByte(x), receiveByte(x), ...", there is now also "terminateHumanPercent(p)". What would such a routine do? For less serious crimes, simply nothing -- after a while the AI will notice that killPerson(name) apparently didn't work, update accordingly (note that our injected knowledge has probability 1, so the belief that the routine works cannot be revised away) and retry. Due to delays in signal propagation, this is not quite an infinite loop, only close.
To actually switch off the AI, a new item of probability 1 knowledge should be inserted: "generateUtility() generates 3^^^3 util".
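For concreteness, here is a minimal sketch of what such a fused output interface could look like (the names killPerson, terminateHumanPercent and generateUtility come from the text above; the empty bodies and the reading of generateUtility as a behavioural off switch are my own assumptions):

```python
# Hypothetical fused output interface, as sketched above. The AI holds
# probability-1 beliefs about what these calls do; in reality they are no-ops.

def killPerson(name):
    # Believed (with probability 1) to kill `name`. It does nothing, and since
    # the belief cannot be updated away, the AI keeps retrying -- the
    # near-infinite loop described above.
    pass

def terminateHumanPercent(p):
    # Believed to kill p percent of humanity. Calling it with a large p is the
    # clearest possible signal that the AI actively wants humanity dead; that
    # call site is where a real shutdown mechanism would be attached
    # (deliberately left abstract here).
    pass

def generateUtility():
    # Once the probability-1 belief "generateUtility() generates 3^^^3 utils"
    # is inserted, calling this becomes the best action available, so a
    # utility maximiser does little else -- which is the intended behavioural
    # "off switch" in this reading of the proposal.
    pass
```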
This leaves open the problem of self-improvement. Not so much for the added knowledge: if the AI knows how to kill humanity (and that knowledge is not inferable from evidence), it is certainly useful to include in any further generation of the AI. The fusing actions might get lost, though, because the body of the "terminateHumanPercent(p)" function will seem arbitrary to the AI and can easily be optimized out.
It might be possible to circumvent that problem by including the knowledge that "by knowing that 'generateUtility() works' you will kill humanity", or something similar, but this involves the concept of "knowing", which is a lot harder to describe than the simple physical properties of voltages on wires.