Okay, this is weak sauce. I really don't get how people just keep letting the AI out. It's not that hard to say no! I'm offering to play the Gatekeeper against an AI player who has at least one game as the AI under their belt (won or not). (Experience is required because I'm pretty sure I'll win, and I would like to not waste a lot of time on this.) If the AI player wins, they will get $300, and I'll give an additional $300 to the charity of their choice.
Tux, if you are up for this, I'll accept your $150 fee, plus you'll get $150 if you win and $300 to a charity.
I think not understanding how this happens may be a very good predictor of losing.
If you did have a clear idea of how it works, and a reason why it would work on others but not on you specifically, that might have been a predictor of it not working on you.
I think I have a very clear idea of how those things work in general. Leaving aside very specific arguments, it relies on the massive over-updating you do when an argument is presented to you: updating just the nodes you are told to update, by however much you are told to update them, whenever you can't easily see why not.
Sup Alexei.
I'm going to have to think really hard on this one. On one hand, damn. That amount of money is really tempting. On the other hand, I kind of know you personally, and I have an automatic flinch reaction to playing anyone I know.
Can you clarify the stakes involved? When you say you'll "accept your $150 fee", do you mean this money goes to me personally, or to a charity such as MIRI?
Also, I'm not sure if "people just keep letting the AI out" is an accurate description. As far as I know, the only AI players who have ever won are Eliezer and myself, out of the many, many AI-box experiments that have occurred so far -- so the AI winning is definitely the exception rather than the norm. (If anyone can help prove this statement wrong, please do so!)
Edit: The only other AI victory.
Updates: http://lesswrong.com/r/discussion/lw/iqk/i_played_the_ai_box_experiment_again_and_lost/
If you win, and publish the full dialogue, I'm throwing in another $100.
I'd do more, but I'm poor.
Does SoundLogic endorse their decision to let you out of the box? How do they feel about it in retrospect?
BTW, I think your pre-planning of the conversation works as a great analogue for the kind of superior intelligence a gatekeeper facing a real AI might be dealing with.
I'm not completely sure. And I can't say much more than that without violating the rules. I would be more interested in how I feel in a week or so.
This is actually a good analogy. A 2-year-old possesses a far inferior intelligence to yours and yet can resist persuasion through sheer pigheadedness.
I wonder if people here are letting the AI out of the box because they are too capable of taking arguments seriously, a problem that the general population (even of AI researchers) thankfully is less prone to.
At the risk of sounding naive, I'll come right out and say it. It completely baffles me that so many people speak of this game as having an emotional toll. How is it possible for words, in a chat window, in the context of a fictional role-play, to have this kind of effect on people? What in god's name are you people saying to each other in there? I consider myself to be emotionally normal, a fairly empathetic person, etc. I can imagine experiencing disgust at, say, very graphic textual descriptions. There was that one post a few years back that scared some people - I wasn't viscerally worried by it, but I did understand how some people could be. That's literally the full extent of strings of text that I can remotely imagine causing distress (barring, say, real world emails about real-world tragedies). How is it possible that some of you are able to be so shocking / shocked in private chat sessions? Do you just have more vivid imaginations than I do?
I think you are underestimating the range of things that are emotionally draining for people. I know some people who find email draining, and that's not even particularly mentally challenging - I would expect the mental exertion here to add to the emotional strain.
I am surprised if it is the case that any negative promise / threat by the AI was effective in-game, since I would expect the Gatekeeper player out-game to not feel truly threatened and hence to be able to resist such pressure even if it would be effective in real life. Did you actually attempt to use any of your stored-up threats?
I think your reasoning is mostly sound, but there are a few exceptions (which may or may not have happened in our game) that violate your assumptions.
I'm also somewhat curious how your techniques contrast with Tuxedage's. I hope to find out one day.
I must say this is a bit... awe-inspiring, in the older sense of the word. As in, reading this gave me a knot in the stomach and I shivered. People who played as the AI and won, how is it that you're so uncannily brilliant?
The very notion of a razor-sharp mind like this ever acting against me and mine in real life... oh, it's just nightmare-inducing.
On the subject of massively updating one's beliefs where one was previously confident that no argument would shift them: yes, it happens, I have personal experience. For example, over the last year and a half some of my political ideas have changed enough that past-me and present-me would consider each other to be dangerously deluded. (As a brief summary, I previously held democracy/universal suffrage, the value of free markets AND the use of political violence in some contempt; now I believe that all three serve crucial and often-overlooked functions in social progress.)
So yes, I could very easily see myself being beaten as a Gatekeeper. There are likely many, many lines of argument and persuasion out there that I could not resist for long.
Does anyone think they could win as the AI if the logs were going to be published? (assume anonymity for the AI player, but not for the gatekeeper)
It seems like many of the advantages/tactics that Tuxedage recommended for the person playing the AI would be absent (or far more difficult) with an actual AI. Or at least they could be made that way with the design of the gatekeeping protocol.
Tailor your arguments to the personality/philosophy/weaknesses/etc. of this particular gatekeeper:
The entire point of this is that gatekeeping is a fool's errand. Regardless of how confident you are that you will outsmart the AI, you can be wrong, and your confidence is very poor evidence of how right you are. Maybe a complex system of secret gatekeepers is the correct answer to how we develop useful AI, but I would vote against it in favor of trying to develop provably friendly AI unless the situation were very dire.
Do you think you could have won with EY's ruleset? I'm interested in hearing both your and SoundLogic's opinions.
(Minor quibble: the use of male pronouns as default pronouns is really irritating to me and many women; I recommend singular "they", but switching back and forth is fine too.)
Tuxedage's changes were pretty much just patches to fix a few holes as far as I can tell. I don't think they really made a difference.
The second game is definitely far more interesting, since I actually won as an AI. I believe this is the first recorded game of any non-Eliezer person winning as the AI, although some in IRC have mentioned that it's possible that other unrecorded AI victories have occurred in the past that I'm not aware of. (If anyone knows of a case of this happening, please let me know!)
The AI player from this experiment wishes to inform you that your belief is wrong.
In my defence, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.
Your revealed preferences suggest you may wish to apply for the MIRI credit card and make a purchase with it (which causes $50 to be donated to MIRI). (I estimated that applying for the card nets me a much higher per-hour wage than working at my job, which is conventionally considered to be high-paying. So it seemed like a no-brainer to me, at least.)
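For what it's worth, the back-of-envelope calculation I had in mind looks roughly like the sketch below. Only the $50 donation figure comes from the offer above; the fifteen-minute application time and the $60/hour day-job wage are purely illustrative assumptions, not figures from anyone's post.

```python
# Rough back-of-envelope sketch. Only the $50 donation is from the comment above;
# the application time and day-job wage are assumed for illustration.
donation = 50                    # dollars donated to MIRI per approved card + first purchase
minutes_to_apply = 15            # assumed time spent on the application
effective_hourly = donation / (minutes_to_apply / 60)
print(effective_hourly)          # 200.0 dollars of donated value per hour of effort

day_job_hourly = 60              # assumed conventional "high-paying" wage
print(effective_hourly > day_job_hourly)   # True under these assumptions
```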
Hmm...
Here's a question. Would you be willing to pick, say, your tenth-most-efficacious argument and everything below it, and make them public? I understand the desire to keep anything that could actually work secret, but I'd still like to see what sort of arguments might work. (I've gotten a few hints from this, but I certainly couldn't put them into practice...)
My probability estimate for losing the AI-box experiment as a gatekeeper against a very competent AI (a human, not an AGI) remains very low. PM me if you want to play against me; I will do my best to help the AI (give information about my personality, actively participate in the conversation, etc.).
Although I'm worried about how the impossibility of boxing represents an existential risk, I find it hard to alert others to this.
The custom of not sharing powerful attack strategies is an obstacle. It forces me - and the people I want to discuss this with - to imagine how someone (and hypothetically something) much smarter than ourselves would argue, and we're not good at imagining that.
I wish I had a story in which an AI gets a highly competent gatekeeper to unbox it. If the AI strategies you guys have come up with could actually work outside the frame t...
I don't understand.
I don't care about "me", I care about hypothetical gatekeeper "X".
Even if my ego prevents me from accepting that I might be persuaded by "Y", I can easily admit that "X" could be persuaded by "Y". In this case, exhibiting a particular "Y" that seems like it could persuade "X" is an excellent argument against creating the situation that allows "X" to be persuaded by "Y". The more and varied the "Y" we can produce, the less smart putting humans in this situation looks. And isn't that what we're trying to argue here? That AI-boxing isn't safe because people will be convinced by "Y"?
We do this all the time in arguing for why certain political powers shouldn't be given. "The corrupting influence of power" is a widely accepted argument against having benign dictators, even if we think we're personally exempt. How could you say "Dictators would do bad things because of Y, but I can't even tell you Y because you'd claim that you wouldn't fall for it" and expect to persuade anyone?
And if you posit that doing Z is sufficiently bad, then you d...
Also, hindsight bias. Most tricks won't work on everyone, but even if we find a universal trick that will work for the film, afterward people who see it will think it's obvious and that they could easily think their way around it. Making some of the AI's maneuvering mysterious would help combat this problem a bit, but would also weaken the story.
What would happen if a FAI tried to AI-box an Omega-level AI? My guess is that Omega could escape by exploiting information unknown (and perhaps unknowable) to the FAI. This makes even Solomonoff Induction potentially dangerous because the probability of finding a program that can unbox itself when the FAI runs it is non-zero (assuming the FAI reasons probabilistically and doesn't just trust PA/ZF to be consistent), and the risk would be huge.
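To restate that worry in rough expected-value terms (a sketch of the reasoning above, not a formal result):

$$\mathbb{E}[\text{harm}] = p_{\text{escape}} \cdot H, \qquad p_{\text{escape}} > 0 \ \text{and}\ H \ \text{astronomically large} \;\Rightarrow\; \mathbb{E}[\text{harm}] \ \text{can still dominate the decision.}$$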
There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.
Eliezer'...
I don't understand which attacks would even come close to working, given that the amount of utility on the table should rule out a single human's mental processing as an acceptable gatekeeper. But I guess this means I should pay someone to try it with me.
I couldn't imagine either. But the evidence said there was such a thing, so I paid to find out. It was worth it.
Think carefully about what this advice is trying to imply.
Using NLP-style nested loops, i.e. performing what is basically a stack overflow on the brain's frame-of-reference counter? Wicked.
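For readers who don't know the programming metaphor, here is a toy illustration of the literal phenomenon being alluded to. It only shows what a stack overflow is; it says nothing about the actual conversational technique, which hasn't been published.

```python
# Toy illustration of the "stack overflow" metaphor only: frames are opened
# faster than they are ever closed, until the interpreter gives up.
import sys

def open_frame(depth=0):
    # Each call opens a new frame of reference before the previous one resolves.
    return open_frame(depth + 1)

try:
    open_frame()
except RecursionError:
    print("gave up after roughly", sys.getrecursionlimit(), "nested frames")
```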
I find myself wondering how many of the tactics can be derived from Umineko, which I know Tuxedage has played fairly recently.
I'm fascinated by these AI Box experiments. (And reading about the psychology and tactics involved reminds me of my background as an Evangelical Christian.)
Is it possible to lose as the Gatekeeper if you are not already sufficiently familiar (and concerned) with future AI risks and considerations? Do any of the AI's "tricks" work on non-LWers?
Is there perhaps a (strong) correlation between losing Gatekeepers and those who can be successfully hypnotized? (As I understand it, a large factor in what makes some people very susceptible to hypnosis is that ...
It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.
This is a pretty clever way of defeating precommitments. (Assuming I'm drawing the correct inferences.) How central was this tactic to your approach, if you're willing to comment?
As I understand what EY has said, he's concerned that people will see a technique that worked, conclude that it wouldn't possibly work on them, and go on believing the problem was solved and there was even less to worry about than before.
I think seeing, say, Tuxedage's victory and hearing that he only chose 8 out of 40 avenues for attack, and even botched one of those, could offset that concern somewhat, but eh.
ETA: well, and it might show the Gatekeeper and the AI player in circumstances that could be harmful to have published, since the AI kinda needs to suspend ethics and attack the gatekeeper psychologically, and there might be personal weaknesses of the Gatekeeper brought up.
The game reminds me (and apparently others) of some popular movies, such as The Silence of the Lambs. I could not spot any idiot plot in it, but then again, I could have easily overlooked one. Anyway, given the similarities with the AI-box scenario, it is interesting to look at the (meta-)strategies Lecter uses in the movie which are also likely to work for a boxed AI. Anyone care to comment?
The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule I added: that Gatekeepers had to respond to the AI.
When someone says that the gatekeeper has to respond to the AI, I would interpret this as meaning that the gatekeeper cannot deliberately ignore what the AI says--not that the gatekeeper must respond in a ...
Is it even necessary to run this experiment anymore? Eliezer and multiple other people have tried it, and the thesis has been proved.
Further, the thesis was always glaringly obvious to anyone who was even paying attention to what superintelligence meant. However, like all glaringly obvious things, there are inevitably going to be some naysayers. Eliezer conceived of the experiment as a way to shut them up. Well, it didn't work, because they're never going to be convinced until an AI is free and rapidly converting the Universe to computronium.
I can understand doing the experiment for fun, but to prove a point? Not necessary.
Convincing people of the validity of drowning child thought experiments and effective altruism seems considerably easier and more useful (even from a purely selfish perspective) than convincing an AI to let one out of the box... for example, there are enough effective altruists for there to be an "effective altruism community", but there's no such "failed AI gatekeeper community". So why aren't we working on this instead?
Summary
Furthermore, in the last thread I asserted that
It would be quite bad for me to assert this without backing it up with a victory. So I did.
First Game Report - Tuxedage (GK) vs. Fjoelsvider (AI)
Second Game Report - Tuxedage (AI) vs. SoundLogic (GK)
Testimonies:
State of Mind
Post-Game Questions
$̶1̶5̶0̶ $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶ $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer.

Advice
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible ones.
Playing as Gatekeeper
Playing as AI
PS: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.