One of the main principles of engineering safety is multilevel defence. When a nuclear bomb accidentally fell from the sky in the US, 3 of 4 defence levels failed. The last one prevented the nuclear explosion: https://en.wikipedia.org/wiki/1961_Goldsboro_B-52_crash
Multilevel defence is used a lot in the nuclear industry and includes different systems of passive and active safety, ranging from reliance on delayed neutrons to keep the chain reaction controllable, to control rods, containment buildings and exclusion zones.
Here, I present a look at AI safety from the point of view of multilevel defence. This is mainly based on two of my as yet unpublished articles: “Global and local solutions to AI safety” and “Catching treacherous turn: multilevel AI containment system”.
The special property of multilevel defence in the case of AI is that most of the protection comes from the first level alone, which is AI alignment. Each later level has a progressively smaller chance of providing any protection, because the power of a self-improving AI will grow as it breaks through each successive level. So we might ignore all levels after AI alignment, but, oh Houston, we have a problem: at the current speed of AI development, powerful and dangerous AI could appear within several years, while AI safety theory needs several decades to be created.
The map is intended to demonstrate a general principle for classifying the defence levels in AI safety, not to list all known ideas on the topic. I marked in yellow the boxes that, according to my understanding, are part of MIRI's plan.
I also add my personal probability estimates as to whether each level will work (under the condition that AI risks are the only global risk, and previous levels have failed).
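Because each estimate is conditional on all previous levels having failed, the estimates can be combined with the chain rule: the chance that every level fails is the product of the individual failure chances. A minimal sketch of that arithmetic follows; the function name and the numbers are hypothetical placeholders for illustration, not the estimates from the map.

```python
from math import prod


def chance_any_level_holds(conditional_success):
    """Probability that at least one defence level works.

    conditional_success[i] is the probability that level i works,
    given that all earlier levels have already failed.
    """
    # All levels fail only if each one fails in turn (chain rule).
    all_fail = prod(1.0 - p for p in conditional_success)
    return 1.0 - all_fail


# Hypothetical, progressively smaller estimates (NOT taken from the map).
example_levels = [0.5, 0.2, 0.1, 0.05]
total = chance_any_level_holds(example_levels)
print(f"Chance that some level holds: {total:.3f}")
```

With numbers shaped like these, the first level contributes most of the total, which is the pattern described above.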
The map is constructed on the same principles as my “plan of x-risks prevention” map and my “immortality map”, which are also built around the idea of multilevel defence.
pdf: https://goo.gl/XH3WgK
Rules for an AI:
1. If an action the AI takes results in more than N logs (orders of magnitude) of $ worth of damage to humans, or kills more than N logs of humans, it transfers control of all systems it can provide control inputs to over to a designated backup (a human, a formally proven safe algorithmic system, etc.) and powers down.
2. When choosing among actions that affect a system external to it, the AI calculates the probable effect on human lives. If the probability of exceeding the N assigned in rule 1 is greater than some threshold Z, it ignores that option. If no options are available, it loops.
Most systems would be set to N = 1, Z = 1/10000, giving us four 9s (99.99%) of certainty that the AI won't kill anyone. Some systems (weapons, climate management, emergency management dispatch systems) will need higher N scores and lower Z scores to maintain effectiveness.
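The two rules could be sketched roughly as below. This is only an illustration under stated assumptions: "N logs" is read as a limit of 10**N, each option is assumed to come with an already estimated probability of exceeding that limit, and picking the lowest-risk permitted option is my own addition, since the rules do not say how to choose among the options that remain. The names `Action`, `choose_action` and `must_hand_over` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Action:
    name: str
    # Assumed input: estimated probability that this action causes
    # damage above the 10**N limit.
    p_exceeds_limit: float


def choose_action(options: Sequence[Action], z: float) -> Optional[Action]:
    """Rule 2: ignore options whose risk is greater than the threshold Z.

    Returns None when nothing is permitted, so the caller can keep looping.
    """
    allowed = [a for a in options if a.p_exceeds_limit <= z]
    if not allowed:
        return None
    # Assumption: among permitted options, prefer the lowest estimated risk.
    return min(allowed, key=lambda a: a.p_exceeds_limit)


def must_hand_over(observed_damage: float, n: float) -> bool:
    """Rule 1: True when observed damage exceeds N logs, read here as 10**N.

    On True, the system would pass control to the backup and power down.
    """
    return observed_damage > 10 ** n


options = [Action("reroute", 0.00005), Action("full_stop", 0.00001), Action("ignore", 0.5)]
picked = choose_action(options, z=1 / 10000)
print(picked.name if picked else "no permitted option, keep looping")
```

Everything hard is hidden inside `p_exceeds_limit`: the sketch assumes the probability estimate is already available and trustworthy.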
JFK had an N of like 9 and a Z score of 'something kind of high', and passed control to Lyndon B Johnson of 'I have a minibar and a shotgun in the car I keep on my farm so I can drive and shoot while intoxicated' fame. We survived that, we will be fine.
Are we done?
Are you reinventing Asimov's Three Laws of Robotics?