Alexei comments on Should I believe what the SIAI claims? - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
This is very dangerous thinking. There are many potential holes not covered in your essay. The problem with all these holes is that even the smallest one can potentially lead to the end of the universe. As Eliezer often points out, the AI has to be proven Friendly with mathematical rigor; there can't be any room for guessing or hoping.
As an example, consider that to the AI, moving to a quiescent state will be akin to dying. (Consider somebody wanting to make you not want anything, or forcing you to want something that you normally don't.) I hope you don't reply with a "but we can do X", because that would be just another patch, and patches are exactly what we want to avoid. There is no getting around creating a solid, proven mathematical definition of Friendly.
The end of the universe - OMG!
It seems reasonable to expect that agents will welcome their end if their time has come.
The idea, as usual, is not to try and make the agent do something it doesn't want to - but rather to make it want to do it in the first place.
I expect off switches - and the like - will be among the safety techniques employed. Provable correctness might be among them as well - but judging by the history of such techniques it seems rather optimistic to expect very much from them.
I am fairly confident that we can tweak any correct program into a form which allows a mathematical proof that the program behavior meets some formal specification of "Friendly".
I am less confident that we will be able to convince ourselves that the formal specification of "Friendly" that we employ is really something that we want.
We can prove there are no bugs in the program, but we can't prove there are no bugs in the program specification, because "proving" the specification would require that all of the stakeholders actually look at that specification of "Friendly", think about it, and then bet their lives on the assertion that it is indeed what they want.
What is a "stakeholder", you ask? Well, what I really mean is pitchfork-holder. Stakes are from a different movie.
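To make the specification-bug point concrete, here is a toy sketch (hypothetical code, not anyone's actual verification setup): a sorting routine can provably meet a formal spec that says "the output is in order" while the spec itself silently forgets to require that the output be a permutation of the input.

```python
# Toy illustration of a bug in the specification rather than in the program.
# Hypothetical example; the names are made up for this sketch.

def spec_is_sorted(output):
    """The (buggy) formal spec of 'correct sorting': output must be in order.
    It forgets to require that the output be a permutation of the input."""
    return all(output[i] <= output[i + 1] for i in range(len(output) - 1))

def bogus_sort(xs):
    """Provably satisfies the spec above - by discarding the data."""
    return []

def real_sort(xs):
    return sorted(xs)

data = [3, 1, 2]
assert spec_is_sorted(bogus_sort(data))  # passes: there is no bug in the *program*
assert spec_is_sorted(real_sort(data))   # passes too - the proof can't tell them apart
```

No amount of proving the program correct against that spec will surface the missing clause; only someone reading the spec and asking "is this really what we want?" will.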
I don't think there is much difference between the two. Either way you are modifying the agent's behavior. If it doesn't want it, it won't have it.
The problem with off switches is that 1) they are not guaranteed to work (the AI may change its own code or prevent anyone from accessing or using the off switch), and 2) they are not guaranteed to work the way you want them to. Unless you have formally proven that the AI and all the possible modifications it can make to itself are safe, you can't know for sure.
It is not a modification if you build it that way "in the first place", as specified - and "if it doesn't want it, it won't have it" seems contrary to the specified bit where you "make it want to do it in the first place".
The idea of off switches is not that they are guaranteed to work, but that they are a safety feature. If you can make a machine do anything you want at all, you can probably make it turn itself off. You can build it so the machine doesn't wish to stay turned on - but goes willingly into the night.
We will never "know for sure" that a machine intelligence is safe. This is the real world, not math land. We may be able to prove some things about it - such as that its initial state is not vulnerable to input-stream buffer-overflow attacks - but we won't be able to prove something like "the machine will only ever do what we want it to do", for some value of "we".
At the moment, the self-improving systems we see are complex man-machine symbioses - companies and governments. You can't prove math theorems about such entities - they are just too messy. Machine intelligence seems likely to be like that for quite a while - functionally embedded in a human matrix. The question of "what would the machine do if no one could interfere with its code" is one for relatively late on - machines will already be very smart by then - smarter than most human computer programmers, anyway.
The hardest part of Friendly AI is figuring out how to reliably instill any goal system.
If you can't get it to do what you want at all, the machine is useless, and there would be no point in constructing it. In practice, we know we can get machines to do what we want to some extent - we have lots of examples of that. So, the idea is to make the machine not mind being turned off. Don't make it an open-ended maximiser - make it maximise only until time t - or until its stop button is pressed - whichever comes sooner.
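A very rough sketch of the shape of that proposal (hypothetical code, purely illustrative - it says nothing about a system that can rewrite its own optimisation loop):

```python
import time

# Toy "maximise only until time t, or until the stop button is pressed,
# whichever comes sooner" agent.  Hypothetical illustration only.

class BoundedMaximiser:
    def __init__(self, deadline_seconds):
        self.deadline = time.time() + deadline_seconds
        self.stop_pressed = False

    def press_stop(self):
        # The off switch: stopping is a normal outcome, not an obstacle to route around.
        self.stop_pressed = True

    def run(self, improve_step):
        score = 0.0
        while time.time() < self.deadline and not self.stop_pressed:
            score = improve_step(score)  # one step of whatever is being maximised
        return score                     # quiescence is the intended terminal state

agent = BoundedMaximiser(deadline_seconds=0.1)
final_score = agent.run(lambda s: s + 1.0)
```

The point of building it this way is that the machine's objective already counts "stopped" as an acceptable outcome, rather than something it is motivated to prevent.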
I don't think we really have a disagreement here. If you are building a normal program to do whatever, then by all means, do your best and try to implement safety features. Any failure would most likely be local.
However! If we are talking about building an AI which will go through many iterations, modify its own code, and become super-intelligent, then for all our sakes I hope you will have mathematically proven that the AI is Friendly. Otherwise you are betting the fate of this world on a hunch. If you don't agree with this point, I invite you to read Eliezer's paper on AI risks.
"The AI is Friendly" seems to be a vague and poorly-defined concept - and even if you could pin it down, what makes you think it is something that could be proved in the first place?
Ethical agents should probably not hold off creating machine intelligence while chasing imagined rainbows for too long - since intelligence could prevent the carnage on the roads, fix many diseases, and generally help humanity - and also because delaying gives less ethically-conscious agents an opportunity to get there first - which could be bad.
See my The risks of caution - or Max's critique of the precautionary principle for more on that.
In fact, there is nothing vague about the definition of "Friendly". Eliezer has written a lot on that topic, and I invite you to look at his writing, e.g. the link I gave you earlier.
I agree that if someone is going to launch a self-improving AI, then we will need to preempt them with our own AI if our AI has a greater probability of being friendly. It all comes down to the expected value of our choices.
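As a back-of-the-envelope sketch of that expected-value comparison (the symbols are mine, not from the thread): let $p_{\text{ours}}$ and $p_{\text{theirs}}$ be the probabilities that our AI and theirs turn out Friendly, and $V_{+}$, $V_{-}$ the values of the good and bad outcomes. Then

$$\mathbb{E}[\text{launch ours}] = p_{\text{ours}} V_{+} + (1 - p_{\text{ours}}) V_{-}, \qquad \mathbb{E}[\text{defer to theirs}] = p_{\text{theirs}} V_{+} + (1 - p_{\text{theirs}}) V_{-},$$

so, assuming $V_{+} > V_{-}$, pre-empting is the better gamble exactly when $p_{\text{ours}} > p_{\text{theirs}}$.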
You really believe that?!? You have a pointer to some canonical definition?