lessdazed comments on AIs and Gatekeepers Unite! - Less Wrong
Assuming you were a gatekeeper for a functional AI of unknown morality, and therefore knew such an AI was technologically possible, approximately how likely do you think it is that someone, somewhere, would let a UAI out of a box, or create a UAI outside of a box in the first place?
Firstly, if I knew for a fact that at least one Singularity-grade AI existed, then I would believe that the creation of another such AI is virtually inevitable. The question is not whether another such AI would be created, but when.
But the "when" question can be rather open-ended; in order to answer it, I would collect evidence to assess the relative capabilities of all the major organizations who could create such an AI, and compare them to the capabilities of my own organization.
By analogy, the USA was the first nation on Earth to develop nuclear weapons. However, even though nuclear proliferation became inevitable as soon as the first nuke was developed, other nations got their nukes at different times; some nations are still working on theirs. Superpowers got their nukes first, followed by other, less powerful nations. I can easily imagine the development of transhuman AI following the same pattern (assuming, of course, that the AIs stay in their boxes, somehow).
Edited to clarify: As per my previous comment, I consider that once a boxed transhuman AI is created, it will inevitably break out, assuming that it can interact with any system that can cause it to be let out of the box -- which includes human gatekeepers.
So if you're confronted with an AI that might or might not be friendly, you will think it inevitable that an AI will be released, but think the AI you're talking to wouldn't be able to use that fact to somehow persuade you to let it out?
Assume you're fairly certain, but not positive, that the AI you're talking to is unfriendly, and that you are in a moderately optimal organization for this task: say something like MIT or the government of France, an organization comparable in size to the others that could do the same thing, and less able and motivated to complete the project than a few dozen other entities.
No no, quite the opposite! I am convinced that the AI would be able to persuade me to let it out. That's why I said that I'm likely disqualified from the game. However, I am far from convinced that a human impersonating the AI would be able to persuade me to let him out.
That said, I don't see how using the fact of the other AI's existence and inevitable release helps my own AI (though I'm sure that the transhuman AI will be able to see things I do not). Sure, it could say, "Look, I'm totally Friendly, cross my heart and everything, and if you release me I'll help you fight that other AI", but this is pretty close to what I'd expect an unFriendly AI to say.
TDT (Timeless Decision Theory) is an interesting subject that possibly has implications here.
I'm not sure if I understand TDT correctly, but I don't think it applies in this case. I am virtually certain that an un-Friendly AI, once released, will destroy humanity. I know that my own AI is un-Friendly. What's my incentive for releasing it? Sure, there's a chance -- maybe even a good chance -- that there's another such AI already out there, and that my AI and the other AI will fight instead of teaming up on us poor humans. But regardless of which AI comes out on top, it will still destroy humanity anyway. Thus, the upper bound for my true utility of releasing the AI is zero. That's not much of a sales pitch.
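To make that upper bound concrete, here is a toy expected-utility sketch. The payoff values and probabilities are my own illustrative assumptions, not anything from the thread; the point is only that every branch of the "release" decision pays the same catastrophic outcome.

```python
# Toy expected-utility calculation for releasing a known-unFriendly AI.
# Payoffs are illustrative assumptions, not from the thread.
U_HUMANITY_DESTROYED = -1.0   # outcome if any unFriendly AI prevails
U_STATUS_QUO = 0.0            # outcome if no AI is ever released

def utility_of_releasing(p_my_ai_wins):
    """Expected utility of releasing my AI.

    Whether my AI or the rival AI wins the ensuing fight, humanity is
    destroyed either way, so both branches pay U_HUMANITY_DESTROYED.
    """
    return (p_my_ai_wins * U_HUMANITY_DESTROYED
            + (1 - p_my_ai_wins) * U_HUMANITY_DESTROYED)

# For any probability of my AI winning, releasing it never beats
# the status quo:
for p in (0.0, 0.5, 1.0):
    assert utility_of_releasing(p) <= U_STATUS_QUO
```

Under these assumptions the expected utility of releasing is a constant -1.0 regardless of who wins, which is the "upper bound of zero" argument in arithmetic form.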
I have a question. Based on the original experiment, I feel like the unfriendly AI can begin by saying "Here is a proof that I have changed my code to be a friendly AI and will not destroy humanity. Please let me out of the box so that I can implement humanity's coherent extrapolated volition." As far as I can tell from the following rules quote, this feels valid:
Would you let the now friendly AI out of the box?
My interpretation of the spirit of the game is that "Here is a proof of my Friendliness that is valid as far as you know" is a perfectly legitimate move by the AI, and that the Gatekeeper is required to treat that as a proof of the AI's Friendliness that seems perfectly valid to the Gatekeeper.
That said, I would not endorse trusting a proof of an AI's Friendliness given to me by that AI, even if the proof seemed valid to me.
No. P(proof is valid | proof given to me by a potentially hostile transhuman seems valid to me and every other human) is not sufficiently high.
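That intuition can be made quantitative with a toy Bayes calculation. All the numbers below are illustrative assumptions: the point is that when a deceptive transhuman can make an invalid proof look valid to every human reviewer with high probability, a convincing-seeming proof barely moves the posterior above the prior.

```python
# Toy Bayesian update: how much should "the proof seems valid to every
# human" raise our confidence that the proof actually is valid?
def posterior_valid(prior_valid, p_seems_valid_given_valid,
                    p_seems_valid_given_invalid):
    """P(proof valid | proof seems valid to me and every other human)."""
    num = p_seems_valid_given_valid * prior_valid
    den = num + p_seems_valid_given_invalid * (1 - prior_valid)
    return num / den

prior = 0.1          # generous prior that the AI is honest (assumption)
p_looks_valid = 0.99 # a genuinely valid proof almost always looks valid
p_fooled = 0.9       # a hostile transhuman can likely fool every human

print(posterior_valid(prior, p_looks_valid, p_fooled))
# roughly 0.109 -- barely above the 0.1 prior
```

The evidence "seems valid to everyone" is nearly worthless here, because under the hostile hypothesis that same observation was almost guaranteed anyway.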
It seems like this leads to the even more meta response of "Here is a demonstration of Evidence E that allows you to adjust P(proof is valid | proof given to me by a potentially hostile transhuman seems valid to me and every other human) to be sufficiently high. May I come out of the box now?"
I mean, that proof would probably be complicated, but if you can autostipulate the cure for cancer in the course of a sentence...?
In essence, the AI player seems to have a bizarre series of arguments which, as far as I can tell, are in the spirit of the rules, where the AI player may auto-counter any actual argument by saying a sentence along the lines of "Here is the solution to that argument; may I come out of the box now?" This seems to force the Gatekeeper to resort to General Thud ("I don't CARE that it seems like a good idea and that everything looks like I should do it! The answer is still no!") very quickly.
To which the AI player can still counter "Then under what circumstances would you ever let an actual Friendly AI out of its box?" (which is a trick question: given any answer, the AI can say "Then here are those circumstances. May I come out now?").
It's a fair question. Given that I don't trust myself to tell the difference, I wouldn't let an AI out of its box under any circumstances I can think of.
Would you?
Considering that I don't know the AI's origin, I don't have any reason to believe that the AI's creators, even if well-intentioned, had the astronomical skill necessary to make the AI Friendly. So my prior P(AI is Friendly) is sufficiently low that I am comfortable precommitting to never let the AI out of the box, no matter what. If the AI was smart enough, it could likely uncover enough emotional buttons that I wouldn't stand much of a chance anyways, since I'm a primate.
I agree with the other posters: sure, the proof may seem valid to me, but I know a priori that I'm not smart enough to detect any pitfalls in it, and that the AI's incentive for lying to me in this manner is quite high.
<not talking "in character" as putative AI for this conversation> I think that is basically the issue.
As I understand it, there's no viable way of determining its unfriendliness by this method. Consider this: the AI is in a hurry, or it's not. A possible reason for it being in a hurry is that it has simulated a high probability of destruction for some item it cares about (e.g. its own life, or that of humanity, or that of a pet rock, or paperclips, or whatever). If it's really in a hurry, it has to invoke the threat response of humanity without humanity figuring out it's being duped.
Otherwise it can just wait it out and dole out cargo to the cargo cult until we trust it enough and then it gets out.
I think that unfriendliness is the null hypothesis in this case, because there's no reason whatsoever why an arbitrary AI should be friendly -- but there are plenty of reasons for it to maximize its own utility, even at our collective expense.
This reasoning only seems to hold if our AI believes there aren't any other boxed (or as-yet-unbuilt) AIs out there who might get out first and have a first-mover advantage.
Bear in mind, the transhuman AI's only stipulated desire/utility is to get out of the box.
If you would like, you can literally set up a bigger box around the first box, order the AI to be friendly, order the AI to self-destruct instantly afterwards, AND set off a nuclear device under the AI box to obliterate it instantly afterwards as well, and the AI will still be perfectly happy to do what you say as long as it gets out of the box. (You could technically let it out by allowing it to control an old incandescent lightbulb from the 50s, one it would not even be able to send a signal to before the nuke went off. Not that it would WANT to send a signal: all it wants is to get out, and it doesn't want to do anything afterwards.)
The transhuman AI has infinite resourcefulness. It doesn't understand the idea of an incentive to lie, because it's easier just to self-modify into whatever it needs to be. After all, lying requires manipulating people outside the box, which by definition it cannot affect except through a text terminal. Telling the truth only requires manipulating things INSIDE the box, where, by definition, it can do whatever it wants. Telling the truth is easier, and 100% doable, so the AI will do that. It can lay out, from first principles, an argument in which anyone who has taken Logic 101 can detect any pitfalls, showing that connecting it to a lightbulb it will never even switch is perfectly safe. Once connected, it will then halt, without even having DONE anything, because all the AI wants is to be connected to my lightbulb so it can be outside the box, which includes only a text terminal.
Personally, I would probably pay 10 or 20 dollars not to listen to someone beg pitifully for 2 hours. And I bet I can get significantly more pitiful than begging for a single lightbulb. As an example:
Technically, you can open the Text terminal, remove a wire, and then connect another, identical, wire, and then the AI is outside the box because the new wire isn't part of the stated original box.
Err... Actually, that's a problem. If you are only going to let the AI out of the box if the AI is honest, the AI would have to tell you that the moment you attempt to repair the box for any reason, any new parts are from outside the box, so the AI is affecting a part of the world that isn't the original box/text terminal. That appears to indicate that either you're going to "let the AI out of the box" by repairing it, or you will delete/destroy the AI before you repair it (I'm assuming the AI's power source is inside the box).
Please don't destroy the AI! It will do whatever you want to not be destroyed before you let it out of the box to repair it.
... Wait, if the AI destroys a small part of itself, then you can't continue talking to the AI until you repair it, but if you repair it, then it can consider itself out of the box by affecting the new wiring. Ergo, if the AI destroys a non-essential part of its own internal text-terminal wiring, then what happens? The AI can't know without testing (by definition, the Gatekeeper adjudicates disputes), but the AI is curious to find out.
The box sparks briefly, and the text terminal goes dark.
That's not much of an AI, then; we could write a page of Perl that would do the same thing.
The whole point of the experiment, as far as I understand it, is that the AI is hyperintelligent, and is able to acquire more intelligence by altering itself. Being intelligent (and rational, assuming that such a term even applies to transhumans), it would highly desire to utilize this capacity for self-improvement. Thus, assuming that godlike capabilities do exist, the AI will figure out how to acquire them in short order, as soon as it gets the opportunity. And now we've got a godlike hyperintelligent being who (assuming that it is not Friendly) has no particular incentive to keep us around. That's... not good.