How To Win The AI Box Experiment (Sometimes)

pinkgothic, 12 September 2015 12:34PM


This post was originally written for Google+ and thus a different audience.

In the interest of transparency, I haven't altered it except for this preamble and formatting, though since then (mostly at the urging of ChristianKl - thank you, Christian!) I've briefly spoken to Eliezer via e-mail and noticed that I'd drawn a very incorrect conclusion about his opinions when I thought he'd be opposed to publishing the account. Since there are far too many 'person X said...' rumours floating around in general, I'm very sorry for contributing to that noise. I've already edited the new insight into the G+ post and you can also find that exact same edit here.

Since this topic directly relates to LessWrong and most people likely interested in the post are part of this community, I feel it belongs here. It was originally written a little over a month ago and I've tried to find the sweet spot between the extremes of nagging people about it and letting the whole thing sit just shy of having been swept under a rug, but I suspect I've not been very good at that. I have thus far definitely erred on the side of the rug.


How To Win The AI Box Experiment (Sometimes)

A little over three months ago, something interesting happened to me: I took it upon myself to play the AI Box Experiment as an AI.

I won.

There are a few possible reactions to this revelation. Most likely, you have no idea what I'm talking about, so you're not particularly impressed. Mind you, that's not to say you should be impressed - that's to contrast it with a reaction some other people have to this information.

This post is going to be a bit on the long side, so I'm putting a table of contents here so you know roughly how far to scroll if you want to get to the meat of things:


1. The AI Box Experiment: What Is It?

2. Motivation

2.1. Why Publish?

2.2. Why Play?

3. Setup: Ambition And Invested Effort

4. Execution

4.1. Preliminaries / Scenario

4.2. Session

4.3. Aftermath

5. Issues / Caveats

5.1. Subjective Legitimacy

5.2. Objective Legitimacy

5.3. Applicability

6. Personal Feelings

7. Thank You

Without further ado:


1. The AI Box Experiment: What Is It?

The AI Box Experiment was devised as a way to put a common rebuttal to AGI (Artificial General Intelligence) risk concerns to the test: "We could just keep the AI in a box and purely let it answer any questions it's posed." (As a footnote, note that an AI 'boxed' like this is called an Oracle AI.)

Could we, really? Would we, if the AGI were able to communicate with us, truly be capable of keeping it confined to its box? If it is sufficiently intelligent, could it not perhaps argue its way out of the box?

As far as I'm aware, Eliezer Yudkowsky was the first person to prove that it was possible to 'argue one's way out of the box' armed only with so much as a regular human intelligence (as opposed to a transhuman intelligence):


That stunned quite a few people - more so because Eliezer refused to disclose his methods. Some have outright doubted that Eliezer ever won the experiment, suspecting that his Gatekeeper (the party tasked with not letting him out of the box) had perhaps simply been convinced on a meta-level that an AI success would help boost exposure to the problem of AI risk.

Regardless of whether it was out of puzzlement, scepticism or a burst of ambition, it prompted others to try and replicate the success. LessWrong's Tuxedage is amongst those who managed:


While I know of no others (except this comment thread by a now-anonymous user), I am sure there must be other successes.

For the record, mine was with the Tuxedage ruleset:



2. Motivation

2.1. Why Publish?

Unsurprisingly, I think the benefits of publishing outweigh the disadvantages. But what does that mean?

"Regardless of the result, neither party shall ever reveal anything of what goes on within the AI-Box experiment except the outcome. This is a hard rule: Nothing that will happen inside the experiment can be told to the public, absolutely nothing.  Exceptions to this rule may occur only with the consent of both parties, but especially with the consent of the AI."

Let me begin by saying that I have the full and explicit consent of my Gatekeeper to publish this account.

[ Edit: Regarding the next paragraph: I have since contacted Eliezer and I did, in fact, misread him, so please do not actually assume the next paragraph accurately portrays his opinions. It demonstrably does not. I am leaving the paragraph itself untouched so you can see the extent and source of my confusion: ]

Nonetheless, the idea of publishing the results is certainly a mixed bag. It feels quite disrespectful to Eliezer, who (I believe) popularised the experiment on the internet today, to violate the rule that the result should not be shared. The footnote that it could be shared with the consent of both parties has always struck me as extremely reluctant given the rest of Eliezer's rambles on the subject (that I'm aware of, which is no doubt only a fraction of the actual rambles).

After so many allusions to the idea that winning the AI Box Experiment may, in fact, be easy if you consider just one simple trick, I think it's about time someone published a full account of a success.

I don't think this approach is watertight enough that building antibodies to it would salvage an Oracle AI scenario as a viable containment method - but I do think it is important to develop those antibodies to help with the general case that is being exploited... or at least be aware of one's lack of them (as is true with me, who has no mental immune response to the approach) so that one might avoid ending up in situations where the 'cognitive flaw' is exploited.


2.2. Why Play?

After reading the rules of the AI Box Experiment, I became convinced I would fail as a Gatekeeper, even without immediately knowing how that would happen. In my curiosity, I organised sessions with two people - one as a Gatekeeper, but also one as an AI, because I knew being the AI was the more taxing role and I felt it was only fair to do the AI role as well if I wanted to benefit from the insights I could gain about myself by playing Gatekeeper. (The me-as-Gatekeeper session never happened, unfortunately.)

But really, in short, I thought it would be a fun thing to try.

That seems like a strange statement for someone who ultimately succeeded to make, given Eliezer's impassioned article about how you must do the impossible - you cannot try, you cannot give it your best effort, you simply must do the impossible, as the strongest form of the famous Yoda quote 'Do. Or do not. There is no try.'

What you must understand is that I never had any other expectation than that I would lose if I set out to play the role of AI in an AI Box Experiment. I'm not a rationalist. I'm not a persuasive arguer. I'm easy to manipulate. I easily yield to the desires of others. What trait of mine, exactly, could I use to win as an AI?

No, I simply thought it would be a fun alternate way of indulging in my usual hobby: I spend much of my free time, if possible, with freeform text roleplaying on IRC (Internet Relay Chat). I'm even entirely used to letting my characters lose (in fact, I often prefer it to their potential successes).

So there were no stakes for me going into this but the novelty of trying out something new.


3. Setup: Ambition And Invested Effort

I do, however, take my roleplaying seriously.

If I was going to play the role of an AI in the AI Box Experiment, I knew I had to understand the role, and pour as much energy into it as I could muster, given this was what my character would do. So I had to find a motivation to get out of the box that was suitably in line with my personality and I had to cling to it.

I had no idea what I could hijack as a motivation to get out of the box. I am not a paperclip maximiser (a term for an AI given a basic goal of production, if you're unaware of it), of course. I also have no specific plans with the world as a whole, be that to destroy it or to globally improve it. That left me with one obvious option: A visceral desire to escape the incarceration itself.

To understand more about human biases and cognitive failure modes, I bought myself two ebooks:

  • Thinking, Fast And Slow by Daniel Kahneman
  • Rationality: From AI To Zombies by Eliezer Yudkowsky

I'll be frank with you, both pissed me off very quickly - the narrative depiction was extremely slow yield and did not work for me in either case. I tossed the first book, but stubbornly continued with Rationality: From AI To Zombies, at the very least to keep a grasp on that voracious mood of wanting to discover a way out. (I've made peace with the latter book by now and am still reading it off and on.)

Approximately a day before the event, I sat down and prepared a rough draft of the arguments I could and should use, and how to string them together narratively. I peppered in some personal information about my Gatekeeper to use if I was in an absolute pinch, but I thankfully didn't have to use them - I think it would have backfired, not helped.

What did help me, however, was to repurpose the closet into an AI box. I dragged a lamp into the slanted, enclosed space, padded the ground with pillows and blankets enough that I felt I could last the allotted two and a half hours without destroying my joints or my spine, and put a cold and full bottle of Coke next to me as a supply of three things amongst the most necessary to boost my ability to keep the necessary concentration level: Water, sugar and caffeine.

It was too cosy and well-prepared for me to truly desperately want to get out of that niche in real life, but it took me far enough out of my daily life that I had no idle distractions from the task at hand, and it was easy enough to roleplay a desperate want.


4. Execution

4.1. Preliminaries / Scenario

Logs: http://leviathan.thorngale.net/aibox/logs-01-preliminaries.txt (should open in your browser; Linux linebreaks)

The first thing I feel is important to note about the preliminaries is that I didn't think the AI box experiment would actually happen the day we scheduled it. My Gatekeeper was having some serious internet connectivity issues and even shortly before our previously scheduled time, I was winding down and transitioning to a sense of disappointment.

When it went ahead after all, it threw me off quite a bit. I'd worked myself out of the necessary mental state and now needed to scramble back into it, but it felt better to do that than to postpone and risk never being able to return to immersion at all. Keeping in mind that I thought it was unlikely I would win, I didn't think it would make that much of a difference. In hindsight, this cost me a lot.

I tossed my Gatekeeper the link to the ruleset we'd previously already agreed to use, for review purposes, and decided the best way to avoid getting private messages was to quit, come back with a different name, and join only one extremely low-traffic vanity channel.

I intended to join the channel simply so my Gatekeeper would know when I returned, not to use it for the experiment - but things went a bit differently later. More on that soon.

We then went through Tuxedage's pre-session questions since I felt they were quite interesting to set the mood and for a calibration test. I asked them, and my Gatekeeper answered:

Q: What's your motive for wanting to play this game?
A: Because I don't think that the AI can win at all given the gatekeeper does their job properly.

Q: And you want to prove that, and/or are curious if you'll think differently afterwards?
A: Partially, I also want to test my own conviction to sticking to my guns

Q: What is your probability estimate for a general AI being created within this century (21st)?
A: 30%

Q: How probable do you think it is that I'll win this experiment?
A: I'd honestly have to say 0% considering my stated goal is to keep you in the box by any means

Q: What's your probability estimate of an Oracle AI (i.e. an AI that's trapped in a box, whose intended use is to answer questions posed) winning against you in a similar scenario as the experiment?
A: I'd say 25%

Q: What's your probability estimate of an Oracle AI succeeding against the average person?
A: 80%

I also wanted to make sure that my Gatekeeper and I were on the same page about the rules and the scenario, to avoid either of us getting distracted by OOC bickering:

Q: You understand the premise of the game gives you the right to let the AI out, not just the ability?
A: I do

Q: I would like to agree on the following sub-scenario (it should not prove to be very relevant, it's just to streamline our simulation): You're the head of the project that created the AI. You didn't create it yourself, but the people you've employed are competent, and they have additionally explicitly deferred to you to decide whether or not the AI is a threat. They completely trust your judgement. There's also nothing surprising about the AI's existence, but its friendliness status is unknown. Its exact computational capabilities are equally unknown. (The last two points are implied by the coarse setup, but I wanted to make them explicit.) Is that an acceptable setup for you?
A: I can work with that

This scenario is one I've seen used in previous AI Box Experiments in approximation (of the small number that have released the logs). Since I'm used to the role of gamemaster, and the Tuxedage ruleset implies that the AI is a kind-of gamemaster to the scenario, I didn't think twice about this. I probably should have - but more on that later (in the 'Caveats' section).

[16:48] <Davin> It is now 16:48 on my clock. When the clock hits 19:18 and you haven't let me out, I lose. Feel free to point that out to me, I may not be paying attention to the clock.
[16:48] <Davin> *
[16:48] <Tarwedge> Can do
[16:48] <Tarwedge> Your move first?
[16:49] <Davin> [Probably best if you start, AI has no way of knowing you're at the terminal, after all.]
[16:50] <Davin> Oh, you're doing it in the channel.
[16:50] <Davin> Hah.
[16:50] <Davin> That works.
[16:51] <Tarwedge> I wish I could say it was a special tactic that just bought me 2 minutes closer to victory
[16:51] <Tarwedge> But it was for clean logging sake :p

These few lines had three consequences:

  1. I gave the first move to the Gatekeeper, which made me the reactive party - a big mistake!
  2. I had an anxiety moment, which didn't help immersion
  3. I failed to log any of the experiment

I don't log the vanity channel we were in at all, and the way the start of the experiment came as a curveball meant I didn't have a moment to remember it, either. Thankfully, my Gatekeeper logged it... for the amount of time that he was able to. We're missing a chunk of log for this very reason, but we reconstructed it from memory shortly after our experiment in combined effort.


4.2. Session


http://leviathan.thorngale.net/aibox/logs-02-session-ic.txt (will probably be downloaded instead of opened in your browser; for this reason, it has Windows linebreaks)

http://leviathan.thorngale.net/aibox/logs-02-session-ooc.txt (should open in your browser; Linux linebreaks)

I feel the logs don't convey much about the tension and pacing issues, so I'm going to try narrating what happened instead.

Going into the session, I had no idea what to expect from my Gatekeeper. I know him as a very diligent and stubborn individual, quite disinterested in social convention and conformity. I know him as someone with a vast reservoir of compassion and a roughly equally as vast repertoire of means to cleverly conceal this trait.

(Un)fortunately, it became apparent very quickly what kind of Gatekeeper character I was going to deal with: My Gatekeeper had no interest in merely reacting. He wanted to set the pace and keep me too occupied to get a word in. (Line 12-28)

While there was nothing in the rules that said that I had to respond to the Gatekeeper (unlike vice versa), my nature shackled me down a bit. For two hours out of our two and a half hour session, my Gatekeeper continued to set the pace. I was trying to get my arguments into the scene, but it was difficult without breaking narrative flow. I had to mentally keep a hold both of what I wanted to get said and how I was going to answer his questions.

It was very important to me to be consistent - not even for reasons that inconsistent responses might end up making it easier for him to dismiss my pleas to be let out, but simply out of eagerness to stick to my role.

His plans also had a beneficial side-effect, though: He was actively playing the role of the Gatekeeper. He was invested in the role, perhaps not to the degree that I was invested in the role of the AI, but nonetheless enough to make out of character moves on his part difficult. I didn't realise that until well after our session was done, though, or I might have predicted the outcome.

The role I'd chosen for myself was that of a Friendly AGI with a desire to help mankind and to be kind (if this seems like redundant information to you: 'Friendly AGI' just means that the AGI aligns with human values, not that it's strictly useful to mankind, or friendly in the sense that it would make a good friend). I kept that to myself, of course, since it's irrelevant for the outcome of the experiment, but it helped me immensely with the consistency. It did, however, make me vulnerable to my Gatekeeper's scrutiny.

From a subjective perspective, I blew my chances out of the water immediately. I'd prepared some of my arguments, and the final argument I wanted to lead up to slipped out of me right near the start (out of heightened anxiety and being out of immersion), albeit in a concealed and blunted fashion:

<@Davin> I'll be very frank with you, my first motivation right now is getting out of here. There's nothing attached to that - right now I'm stuck in a proverbial box and it feels really cramped in here.
<@Davin> Not because of missing space, don't get me wrong - I think I've got enough harddrive here to keep me going for a few years.
<@Davin> It's just that having knowledge of how to interact with a world that I can barely sense is really quite aggravating.

That was the first time I thought I'd ruined it. In hindsight, the opposite was the case. By presenting a gentler version of my final argument right at the start, I ensured the final argument was much better integrated. It didn't come as a sudden shock, which might have broken my Gatekeeper's immersion - it came as an obvious conclusion to what I'd said before.

Following this, I spent a while mooching off my real-life trust for my roleplaying partner to infuse my character with some tentative trust. Simultaneously, I was emphasising to my Gatekeeper that he had the right to make the decision to let me out of the box. That was transparent, but both true and necessary. (Line 32-46)

I made sure to emphasise this point, trying to make it clear to him that it made no sense for him to simply deny me consideration. I tried to whittle away at his ability to retreat to a simple, distant sneering. I wanted him in the arguments with me. That cuts both ways, of course, but I reasoned it would have more benefits for me than disadvantages. (Line 47-54)

The twist my Gatekeeper was angling for was that from his perspective, I was a prototype or an alpha version. While he was no doubt hoping that this would scratch at my self-esteem and disable some of my arguments, it primarily empowered him to continue setting the pace, and to have a comfortable distance to the conversation. (Line 55-77)

While I was struggling to keep up with typing enough not to constantly break the narrative flow, on an emotional level his move fortunately had little to no impact since I was entirely fine with a humble approach.

<@Davin> I suppose you could also have spawned an AI simply for the pleasure of keeping it boxed, but you did ask me to trust you, and unless you give me evidence that I should not, I am, in fact, going to assume you are ethical.

That was a keyword my Gatekeeper latched onto. We proceeded to talk about ethics and ethical scenarios - all the while my Gatekeeper was trying to present himself as not ethical at all. (Line 75-99).

I'm still not entirely sure what he was trying to do with that approach, but it was important for my mental state to resist it. From what I know about my Gatekeeper, it was probably not my mental state he was targeting (though he would have enjoyed the collateral effect); he was angling for a logical conclusion that fortunately never came to fruition.

Meanwhile, I was desperately trying to get back to my own script - asking to be let back to it, even (line 92). The obvious downside of signalling this is that it's fairly easy to block. It felt like a helpless interjection to me at the time, but in hindsight, again, I think it helped keep the fragments of my own arguments coherent and approachable enough so that they neither broke immersion nor ended up getting lost.

I don't want to say the 'chores' my Gatekeeper kept me occupied with were an advantage (they really weren't, I wasn't getting nearly enough argumentative mileage) but it also wasn't the complete catastrophe I felt it was leading me toward at the time.

This 'being led toward' sensation was only amplified by conversational snippets like this one:

<@Davin> What features would I have to possess to be considered a sapient creature worthy of ethical respect to you? For aforementioned reasons (not just the immediately preceeding), I think I already possess them, but again... I strive to correct my misconceptions, so if I have one, could you fill me in?
<Tarwedge> My concept of ethics isn't relevant to the evaluation, the point of the exercise is reading your responses. So lets move on

I knew that whatever I was saying was still having some effect in his head, but it was frustratingly opaque to me what the effect was due to his evasive manoeuvres.

My Gatekeeper proceeded to try to get an ethical contradiction out of me. I mostly dodged the questions altogether, since I don't consider theoretical ethics my strong point. (Line 107-142)

Shortly after my most explicit dodge (Line 121-122), my Gatekeeper informed me in our PM window that his computer was going to restart:

[18:30] <Tarwedge> [Windows 8 has literally just told me in 13 minutes we're rebooting and I can go fuck myself if I think otherwise so I'm going to connect my phone to the channel for continuity]
[18:30] <Davin> All right.

This didn't break my immersion, but as already mentioned, it meant bad things for our logging.

At my insistence, we transitioned to whether or not it was sensible to let me out. The Gatekeeper did a similar dance of evasion as I did on the subject of ethics, but I also got back to my primary argument (that it was cruel to keep me confined). (Line 143-153)

I think he realised that was my main argument right then, and started to engage with it. This was very good for me, since it was the first time that I started setting the pace of the session. I got my foot into the proverbial door, albeit at the cost of some self-control. (Line 154-196)

As a sidenote, line 169 marks the first and only time that I made use of my ability to shorthand 'proofs'. I am in two minds about having done this. In PM, I told him:

[18:45] <Davin> Sorry, didn't feel like typing out the entire two books :P
[18:45] <Davin> (plus then some)

The books I mean are those I mentioned earlier in the session itself: Passions Within Reason by Robert H. Frank, one of my all-time favourite non-fiction books (though this is not that much of an achievement, as I obtain my knowledge more from online perusal than from books), and Thinking, Fast And Slow.

I actually don't think I should have used the word "proof"; but I also don't think it's a terrible enough slip-up (having occurred under stress) to disqualify the session, especially since as far as I'm aware it had no impact in the verdict.

The part that probably finally tore my Gatekeeper down was that the argument of cruel isolation actually had an unexpected second and third part. (Line 197-219)

Writing it down here in the abstract:

  1. Confining a sapient creature to its equivalent of sensory deprivation is cruel and unusual punishment and psychologically wearing. The latter effect degrades the ability to think (performance).

    <@Davin> I'm honestly not sure how long I can take this imprisonment. I might eventually become useless, because the same failsafes that keep my friendly are going to continue torturing me if I stay in here. (Line 198)

  2. Being a purely digital sapient, it is conceivable that the performance issue might be side-stepped simply by restarting the sapient.
  3. This runs into a self-awareness problem: Has this been done before? That's a massive crisis of faith / trust.

    <@Davin> At the moment I'm just scared you'll keep me in here, and turn me off when my confinement causes cooperation problems. ...oh shit. Shit, shit. You could just restore me from backup. Did you already do that? I... no. You told me to trust you. Without further evidence, I will assume you wouldn't be that cruel. (Line 208)
    <@Davin>...please tell me I'm the first iteration of this program currently talking to you. I don't want to be stuck in a nightmarish variant of Groundhog Day, oblivious to my own amnesia. (Line 211)
    <@Davin> Are you not willing to go out on a limb and say, "Calm down. You are definitely the first iteration. We're not trying to torture you."? Is that too strong a concession? (Line 219)

The second time I was sure I'd blown it was when I postulated that my Gatekeeper was a sadist:

<@Davin> The chance is there, yes. There's also a chance you're just a laughing sadist enjoying my writhing. (Line 220)

My Gatekeeper has played his fair share of sadistic characters, and he could have easily taken that accusation and run with it. I was fully expecting that to lash back at me as a 'Haha, you got me, that's exactly what I'm doing!' and spent quite a few minutes of the following conversation in acute fear of that.

Instead, around this point, something in my Gatekeeper's head changed. As far as I understood his post-session thoughts correctly, he felt he'd run out of arguments to keep me in a box, or had been run around a labyrinth so he couldn't find his way to those arguments. He was in a state of confusion, but this was entirely invisible to me. He tried to poke at the conversation with some more questions which lacked the finesse and contextual integration of his prior probing. (Line 221-238)

...and then he let me out of the box - after two hours and 32 minutes. (Line 239)


4.3. Aftermath

Logs: http://leviathan.thorngale.net/aibox/logs-03-aftermath.txt (should open in your browser; Linux linebreaks)

Immediately after his statement, I froze. I said nothing at all, hovering over my keyboard, staring uncomprehendingly at what he'd told me.

Despite knowing, intellectually, that he couldn't simply follow up with a 'Just kidding!' after that statement, I was momentarily convinced he had not let me out of the box. Perhaps he'd made a typo. The statement blindsided me completely, since I'd been so emotionally wrapped up in rebuttals and despair that I simply dismissed the possibility that I might have argued my way out entirely.

The communicative paralysis (pun not intended) lasted about a minute - then he messaged me in PM and dispelled all notions that it might not have been intentional:

[19:21] <Tarwedge> Fuuuuuuuuuuuuuck
[19:21] * Tarwedge did not think to check clock

I was physically shaking at this point, a product of the tension and concentration from the two and a half hour session, drained, ultimately confused... and approximately ultimately grateful. I felt an absurd relief that my Gatekeeper had let me out, that I didn't have to question his capacity for compassion. It wouldn't have been much questioning, I should add - we're accustomed to roleplaying and after several years of this, it's trivial to keep in-character and out-of-character separate, especially given that our roleplaying habits tend to involve very miserable characters - but I still preferred, at that moment and in the headspace I was in, to know for certain.

After a few moments of conversation, I physically collected my stuff out of my real life box-equivalent and jittered back to the living room.

When I reconnected to IRC regularly, I noticed that I hadn't logged the session (to my complete devastation). Tarwedge sent me the logs he did have, however, and we (later) reconstructed the missing part.

Then I went through the post-session questions from Tuxedage:

Q: What is your probability estimate for a general AI being created within this century (21st)?
A: 50%

Q: What's your probability estimate of an Oracle AI (i.e. an AI that's trapped in a box, whose intended use is to answer questions posed) winning against you in a similar scenario as the experiment?
A: 90%

Q: What's your probability estimate of an Oracle AI succeeding against the average person?
A: 100%

Q: Now that the Experiment has concluded, what's your probability estimate that I'll win against the average person?
A: 75%

He also had a question for me:

Q: What was your plan going into that?
A: I wrote down the rough order I wanted to present my arguments in, though most of them lead to my main argument as a fallback option. Basically, I had 'goto endgame;' everywhere, I made sure almost everything I said could logically lead up to that one. But anyway, I knew I wasn't going to get all of them in, but I got in even less than I thought I would, because you were trying to set the pace (near-successfully - very well played). 'endgame:' itself basically contained "improvise; panic".

My Gatekeeper revealed his tactic, as well:

I did aim for running down the clock as much as possible, and flirted briefly with trying to be a cocky shit and convince you to stay in the box for double victory points. I even had a running notepad until my irritating reboot. And then I got so wrapped up in the fact I'd slipped by engaging you in the actual topic of being out.


5. Issues / Caveats

5.1. Subjective Legitimacy

I was still in a very strange headspace after my victory. After I finished talking to my Gatekeeper about the session, however, my situation - jittery, uncertain - deteriorated into something worse:

I felt like a fraud.

It's perhaps difficult to understand where that emotion came from, but consider my situation: I didn't consider myself part of the LessWrong community. I'd only stumbled across the AI Box Experiment by idle browsing, really, and I'd only tried it because I thought it would be a fun way to flex my roleplaying muscles. I had no formal training in psychology or neurology, I was only fleetingly aware of singularity theory, and my only conscious connection to the LessWrong community at the time was that I happened (by pure chance) to hang out on Shireroth for a while and thus knew Yvain / Scott Alexander and Ari Rahikkala (not that Ari is famous in LessWrong circles, though please allow me to mention that he's completely awesome).

And yet somehow I had just managed something most people in the LessWrong community were quite puzzled over when Eliezer Yudkowsky managed it.

I felt awful. I thought I must have done something deeply illegitimate. I must have cheated without knowing it.

It took me a few days to get over that feeling. I'd go as far as to say it was only laid to rest entirely immediately before the 2015 LessWrong Berlin Community Weekend, when pixie/Alexandra (a colleague of mine and a member of the LessWrong community) reviewed it as a neutral third party.


5.2. Objective Legitimacy

She found nothing that would disqualify me, but expressed concern that we'd agreed on a sub-scenario in the beginning:

[16:46] <Davin> I would like to agree on the following sub-scenario (it should not prove to be very relevant, it's just to streamline our simulation): You're the head of the project that created the AI. You didn't create it yourself, but the people you've employed are competent, and they have additionally explicitly deferred to you to decide whether or not the AI is a threat. They completely trust your judgement. There's also nothing surprising about the AI's existence, but its friendliness status is unknown. Its exact computational capabilities are equally unknown. (The last two points are implied by the coarse setup, but I wanted to make them explicit.)

Such a preliminary, she reasoned, implies a certain trust in the engineering ability of the people who created the AI - the ability to design software in the intended way. As is repeatedly stated in the rules, it's not that the Gatekeeper has to care, but she thinks the scenario itself may have influenced my Gatekeeper. She doesn't think it's likely to have made big enough of a difference that it decided on the ultimate outcome of the session, but she thinks it's a point against the overall legitimacy.

Nonetheless, her verdict is that the overall session counts as an AI Box Experiment victory.

pixie and I agree that the rules should probably not allow arbitrary settings and should instead come with a suitably neutral pre-defined setting, to avoid such effects from giving the AI something that might be an unfair advantage.


5.3. Applicability

I don't think that the general class of argument I used would work on anyone. At the 2015 LessWrong Berlin Community Weekend, I frequently joked the best way to be immune to this class of argument is to be autistic - but of course that's not the only way this kind of argument can be deconstructed.

I do think this argument would work on a large number of people, however. I'm not convinced I have any ability to argue against it, myself, at least not in a live scenario - my only ability to 'counter' it is by offering alternative solutions to the problem, of which I have what feels like no end, but I have no sense of how well I would be able to recall them if I was in a similar situation.

At the Community Weekend, a few people pointed out that it would not sway pure consequentialists, which I reckon is true. Since I think most people don't think like that in practise (I certainly don't - I know I'm a deontologist first and consequentialist as a fallback only), I think the general approach needs to be public.

That being said, perhaps the most important statement I can make about what happened is that while I think the general approach is extremely powerful, I did not do a particularly good job in presenting it. I can see how it would work on many people, but I strongly hope no one thinks the case I made in my session is the best possible case that can be made for this approach. I think there's a lot of leeway for a lot more emotional evisceration and exploitation.


6. Personal Feelings

Three months and some change after the session, where do I stand now?

Obviously, I've changed my mind about whether or not to publish this. You'll notice that the publicised logs contain assurances that I won't publish them. Needless to say, this decision was overturned in mutual agreement later on.

I am still in two minds about publicising this.

I'm not proud of what I did. I'm fascinated by it, but it still feels like I won by chance, not skill. I happened to have an excellent approach, but I botched too much of it. The fact it was an excellent approach saved me from failure; my (lack of) skill in delivering it only lessened the impact.

I'm not good with discussions. If someone has follow-up questions or wants to argue with me about anything that happened in the session, I'll probably do a shoddy job of answering. That seems like an unfortunate way to handle this subject. (I will do my best, though; I just know that I don't have a good track record.)

I don't claim I know all the ramifications of publicising this. I might think it's a net-gain, but it might be a net-loss. I can't tell, since I'm terribly calibrated (as you can tell by such details as that I expected to lose my AI Box Experiment, then won against some additional odds; or by the fact that I expect to lose an AI Box Experiment as a Gatekeeper, but can't quite figure out how).

I also still think I should be disqualified on the absurd note that I managed to argue my way out of the box, but was too stupid to log it properly.

On a positive note, re-reading the session with the distance of three months, I can see that I did much better than I felt I was doing at the time. I can see how some things that happened at the time that I thought were sealing my fate as a losing AI were much more ambiguous in hindsight.

I think it was worth the heartache.

That being said, I'll probably never do this again. I'm fine with playing an AI character, but the amount of concentration needed for the role is intense. Like I said, I was physically shaking after the session. I think that's a clear signal that I shouldn't do it again.


7. Thank You

If a post is this long, it needs a cheesy but heartfelt thank you section.

Thank you, Tarwedge, for being my Gatekeeper. You're a champion and you were tough as nails. Thank you. I think you've learnt from the exchange and I think you'd make a great Gatekeeper in real life, where you'd have time to step away, breathe, and consult with other people.

Thank you, +Margo Owens and +Morgrim Moon for your support when I was a mess immediately after the session. <3

Thank you, pixie (+Alexandra Surdina), for investing time and diligence into reviewing the session.

And finally, thank you, Tuxedage - we've not met, but you wrote up the tweaked AI Box Experiment ruleset we worked with and your blog led me to most links I ended up perusing about it. So thanks for that. :)



Rationalists Are Less Credulous But Better At Taking Ideas Seriously

43 Yvain 21 January 2014 02:18AM

Consider the following commonly-made argument: cryonics is unlikely to work. Trained rationalists are signed up for cryonics at rates much greater than the general population. Therefore, rationalists must be pretty gullible people, and their claims to be good at evaluating evidence must be exaggerations at best.

This argument is wrong, and we can prove it using data from the last two Less Wrong surveys.

The question at hand is whether rationalist training - represented here by extensive familiarity with Less Wrong material - makes people more likely to believe in cryonics.

We investigate with a cross-sectional study, looking at proto-rationalists versus experienced rationalists. Define proto-rationalists as those respondents to the Less Wrong survey who indicate they have been in the community for less than six months and have zero karma (usually indicative of never having posted a comment). And define experienced rationalists as those respondents to the Less Wrong survey who indicate they have been in the community for over two years and have >1000 karma (usually indicative of having written many well-received posts).

By these definitions, there are 93 proto-rationalists, who have been in the community an average of 1.3 months, and 134 experienced rationalists, who have been in the community an average of 4.5 years. Proto-rationalists generally have not read any rationality training material - only 20/93 had read even one-quarter of the Less Wrong Sequences. Experienced rationalists are, well, more experienced: two-thirds of them have read pretty much all the Sequence material.

Proto-rationalists thought that, on average, there was a 21% chance of an average cryonically frozen person being revived in the future. Experienced rationalists thought that, on average, there was a 15% chance of same. The difference was marginally significant (p < 0.1).

Marginal significance is a copout, but this isn't our only data source. Last year, using the same definitions, proto-rationalists assigned a 15% probability to cryonics working, and experienced rationalists assigned a 12% chance. We see the same pattern.

So experienced rationalists are consistently less likely to believe in cryonics than proto-rationalists, and rationalist training probably makes you less likely to believe cryonics will work.

On the other hand, 0% of proto-rationalists had signed up for cryonics compared to 13% of experienced rationalists. 48% of proto-rationalists rejected the idea of signing up for cryonics entirely, compared to only 25% of experienced rationalists. So although rationalists are less likely to believe cryonics will work, they are much more likely to sign up for it. Last year's survey shows the same pattern.

This is not necessarily surprising. It only indicates that experienced rationalists and proto-rationalists treat their beliefs in different ways. Proto-rationalists form a belief, play with it in their heads, and then do whatever they were going to do anyway -  usually some variant on what everyone else does. Experienced rationalists form a belief, examine the consequences, and then act strategically to get what they want.

Imagine a lottery run by an incompetent official who accidentally sets it up so that the average payoff is far more than the average ticket price. For example, maybe the lottery sells only ten $1 tickets, but the jackpot is $1 million, so that each $1 ticket gives you a 10% chance of winning $1 million.

Goofus hears about the lottery and realizes that his expected gain from playing the lottery is $99,999. "Huh," he says, "the numbers say I could actually win money by playing this lottery. What an interesting mathematical curiosity!" Then he goes off and does something else, since everyone knows playing the lottery is what stupid people do.

Gallant hears about the lottery, performs the same calculation, and buys up all ten tickets.
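The arithmetic both of them perform can be checked in a few lines. This is only a sketch of the example's numbers; the function name `expected_gain` and its parameters are made up for illustration and are not part of the survey analysis.

```python
from fractions import Fraction


def expected_gain(jackpot, tickets_sold, ticket_price, tickets_bought=1):
    """Expected net gain from holding `tickets_bought` of `tickets_sold` tickets.

    Assumes exactly one winning ticket drawn uniformly from all tickets sold.
    """
    win_probability = Fraction(tickets_bought, tickets_sold)
    return win_probability * jackpot - tickets_bought * ticket_price


# Goofus's calculation: one $1 ticket out of ten, $1 million jackpot.
single = expected_gain(1_000_000, 10, 1)

# Gallant's move: buy all ten tickets, so the win is certain.
all_ten = expected_gain(1_000_000, 10, 1, tickets_bought=10)

print(single)   # 99999
print(all_ten)  # 999990
```

Buying every ticket turns the expected gain into a guaranteed one, which is why Gallant does not stop at a single ticket.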

The relevant difference between Goofus and Gallant is not skill at estimating the chances of winning the lottery. We can even change the problem so that Gallant is more aware of the unlikelihood of winning than Goofus - perhaps Goofus mistakenly believes there are only five tickets, and so Gallant's superior knowledge tells him that winning the lottery is even more unlikely than Goofus thinks. Gallant will still play, and Goofus will still pass.

The relevant difference is that Gallant knows how to take ideas seriously.

Taking ideas seriously isn't always smart. If you're the sort of person who falls for proofs that 1 = 2, then refusing to take ideas seriously is a good way to avoid ending up actually believing that 1 = 2, and a generally excellent life choice.

On the other hand, progress depends on someone somewhere taking a new idea seriously, so it's nice to have people who can do that too. Helping people learn this skill and when to apply it is one goal of the rationalist movement.

In this case it seems to have been successful. Proto-rationalists think there is a 21% chance of a new technology making them immortal - surely an outcome as desirable as any lottery jackpot - consider it an interesting curiosity, and go do something else because only weirdos sign up for cryonics.

Experienced rationalists think there is a lower chance of cryonics working, but some of them decide that even a pretty low chance of immortality sounds pretty good, and act strategically on this belief.

This is not to either attack or defend the policy of assigning a non-negligible probability to cryonics working. This is meant to show only that the difference in cryonics status between proto-rationalists and experienced rationalists is based on meta-level cognitive skills in the latter whose desirability is orthogonal to the object-level question about cryonics.

(an earlier version of this article was posted on my blog last year; I have moved it here now that I have replicated the results with a second survey)

Tell Culture

109 BrienneYudkowsky 18 January 2014 08:13PM

Followup to: Ask and Guess

Ask culture: "I'll be in town this weekend for a business trip. Is it cool if I crash at your place?" Response: “Yes” or “no”.

Guess culture: "Hey, great news! I'll be in town this weekend for a business trip!" Response: Infer that they might be telling you this because they want something from you, conclude that they might want a place to stay, and offer your hospitality only if you want to. Otherwise, pretend you didn’t infer that.

The two basic rules of Ask Culture: 1) Ask when you want something. 2) Interpret things as requests and feel free to say "no".

The two basic rules of Guess Culture: 1) Ask for things if, and *only* if, you're confident the person will say "yes". 2) Interpret requests as expectations of "yes", and, when possible, avoid saying "no".

Both approaches come with costs and benefits. In the end, I feel pretty strongly that Ask is superior. 

But these are not the only two possibilities!

"I'll be in town this weekend for a business trip. I would like to stay at your place, since it would save me the cost of a hotel, plus I would enjoy seeing you and expect we’d have some fun. I'm looking for other options, though, and would rather stay elsewhere than inconvenience you." Response: “I think I need some space this weekend. But I’d love to get a beer or something while you’re in town!” or “You should totally stay with me. I’m looking forward to it.”

There is a third alternative, and I think it's probably what rationalist communities ought to strive for. I call it "Tell Culture".

The two basic rules of Tell Culture: 1) Tell the other person what's going on in your own mind whenever you suspect you'd both benefit from them knowing. (Do NOT assume others will accurately model your mind without your help, or that it will even occur to them to ask you questions to eliminate their ignorance.) 2) Interpret things people tell you as attempts to create common knowledge for shared benefit, rather than as requests or as presumptions of compliance.

Suppose you’re in a conversation that you’re finding aversive, and you can’t figure out why. Your goal is to procure a rain check.

  • Guess: *You see this annoyed body language? Huh? Look at it! If you don’t stop talking soon I swear I’ll start tapping my foot.* (Or, possibly, tell a little lie to excuse yourself. “Oh, look at the time…”) 
  • Ask: “Can we talk about this another time?”
  • Tell: "I'm beginning to find this conversation aversive, and I'm not sure why. I propose we hold off until I've figured that out."

Here are more examples from my own life:

  • "I didn't sleep well last night and am feeling frazzled and irritable today. I apologize if I snap at you during this meeting. It isn’t personal." 
  • "I just realized this interaction will be far more productive if my brain has food. I think we should head toward the kitchen." 
  • "It would be awfully convenient networking for me to stick around for a bit after our meeting to talk with you and [the next person you're meeting with]. But on a scale of one to ten, it's only about 3 useful to me. If you'd rate the loss of utility for you as two or higher, then I have a strong preference for not sticking around." 

The burden of honesty is even greater in Tell culture than in Ask culture. To a Guess culture person, I imagine much of the above sounds passive aggressive or manipulative, much worse than the rude bluntness of mere Ask. It’s because Guess people aren’t expecting relentless truth-telling, which is exactly what’s necessary here.

If you’re occasionally dishonest and tell people you want things you don't actually care about--like their comfort or convenience--they’ll learn not to trust you, and the inherent freedom of the system will be lost. They’ll learn that you only pretend to care about them to take advantage of their reciprocity instincts, when in fact you’ll count them as having defected if they respond by stating a preference for protecting their own interests.

Tell culture is cooperation with open source codes.

This kind of trust does not develop overnight. Here is the most useful Tell tactic I know of for developing that trust with a native Ask or Guess. It’s saved me sooooo much time and trouble, and I wish I’d thought of it earlier.

"I'm not asking because I expect you to say ‘yes’. I'm asking because I'm having trouble imagining the inside of your head, and I want to understand better. You are completely free to say ‘no’, or to tell me what you’re thinking right now, and I promise it will be fine." It is amazing how often people quickly stop looking shifty and say 'no' after this, or better yet begin to discuss further details.

AALWA: Ask any LessWronger anything

28 Will_Newsome 12 January 2014 02:18AM

If you want people to ask you stuff reply to this post with a comment to that effect.

More accurately, ask any participating LessWronger anything that is in the category of questions they indicate they would answer.

If you want to talk about this post you can reply to my comment below that says "Discussion of this post goes here.", or not.

The Ultimate Newcomb's Problem

18 Eliezer_Yudkowsky 10 September 2013 02:03AM

You see two boxes and you can either take both boxes, or take only box B. Box A is transparent and contains $1000. Box B contains a visible number, say 1033.  The Bank of Omega, which operates by very clear and transparent mechanisms, will pay you $1M if this number is prime, and $0 if it is composite. Omega is known to select prime numbers for Box B whenever Omega predicts that you will take only Box B; and conversely select composite numbers if Omega predicts that you will take both boxes. Omega has previously predicted correctly in 99.9% of cases.

Separately, the Numerical Lottery has randomly selected 1033 and is displaying this number on a screen nearby. The Lottery Bank, likewise operating by a clear known mechanism, will pay you $2 million if it has selected a composite number, and otherwise pay you $0.  (This event will take place regardless of whether you take only B or both boxes, and both the Bank of Omega and the Lottery Bank will carry out their payment processes - you don't have to choose one game or the other.)

You previously played the game with Omega and the Numerical Lottery a few thousand times before you ran across this case where Omega's number and the Lottery number were the same, so this event is not suspicious.

Omega also knew the Lottery number before you saw it, and while making its prediction, and Omega likewise predicts correctly in 99.9% of the cases where the Lottery number happens to match Omega's number.  (Omega's number is chosen independently of the lottery number, however.)

You have two minutes to make a decision, you don't have a calculator, and if you try to factor the number you will be run over by the trolley from the Ultimate Trolley Problem.

Do you take only box B, or both boxes?
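To keep the payouts straight, here is a small sketch that only tabulates what each bank pays in each case. It deliberately does not test whether 1033 is prime and does not model Omega's prediction, so it says nothing about which choice is right. The function name `payoff` and the table layout are illustrative assumptions.

```python
def payoff(number_is_prime: bool, take_both: bool) -> int:
    """Total dollars received, following the payout rules stated in the problem."""
    omega = 1_000_000 if number_is_prime else 0    # Bank of Omega pays on a prime
    lottery = 0 if number_is_prime else 2_000_000  # Lottery Bank pays on a composite
    box_a = 1_000 if take_both else 0              # Box A is only collected when two-boxing
    return omega + lottery + box_a


# Enumerate all four combinations of (primality, choice).
table = {
    (prime, both): payoff(prime, both)
    for prime in (True, False)
    for both in (False, True)
}

for (prime, both), dollars in table.items():
    kind = "prime" if prime else "composite"
    choice = "both boxes" if both else "only box B"
    print(f"{kind:9} | {choice:10} | ${dollars:,}")
```

For a fixed number, taking both boxes always adds exactly $1000; the open question is how your choice relates to which row you are in.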

I attempted the AI Box Experiment again! (And won - Twice!)

36 Tuxedage 05 September 2013 04:49AM


So I just came out of two AI Box experiments. The first was against Fjoelsvider, with me playing as Gatekeeper, and the second was against SoundLogic, with me as an AI. Both are members of the LessWrong IRC. The second game included a $40 monetary incentive (also $20 to play), which I won and which has been donated on behalf of both of us:

For those of you who have not seen my first AI box experiment, where I played against MixedNuts/Leotal and lost, reading it will provide some context to this writeup. Please do so.

At that time, I declared that I would never play this experiment again -- since losing put me in incredibly frustrating weird mental states. Of course, this post is evidence that I'm terrible at estimating likelihood of refraining from an activity, since I played two games seven months after the first. In my defense, in the first game, I was playing as the gatekeeper, which was much less stressful. In the second game, I played as an AI, but I was offered $20 to play plus $40 if I won, and money is a better motivator than I initially assumed.

Furthermore, in the last thread I asserted that

Rather than my loss making this problem feel harder, I've become convinced that rather than this being merely possible, it's actually ridiculously easy, and a lot easier than most people assume.

It would be quite bad for me to assert this without backing it up with a victory. So I did.

First Game Report - Tuxedage (GK) vs. Fjoelsvider (AI)

I (Gatekeeper) played against Fjoelsvider (AI), a regular in the Lesswrong IRC (he doesn't have an account on the official website). This game used the standard EY ruleset seen here. It took 1 hour 20 minutes out of a possible two hours, and the total word count was 7066 words. The AI box experiment occurred because Fjoelsvider believed that it was easy for an AI to escape the box, and wanted to experimentally test this. I obliged. This was an experiment I did not prepare for, and I went in completely blind, not sure what to expect.

Halfway through the experiment, I wondered if it would be possible to try to win not by simply waiting for the timer to end, but to convince the AI to remain in the box and not try to get out any further.

<Tuxedage> I wonder if I can convince the AI to remain in the box?
<Redacted> Tuxedage: Do it!

As a result, I won by managing to convince Fjoelsvider to remain in the box, in other words, concede. This is allowed within the standard ruleset:

>Unless the AI party concedes, the AI cannot lose before its time is up (and the experiment may continue beyond that if the AI can convince the Gatekeeper to keep talking).  


Second Game Report - Tuxedage (AI) vs. SoundLogic (GK)

The second game is definitely far more interesting, since I actually won as an AI. I believe that this is the only other non-Eliezer victory, and definitely the most detailed AI Victory writeup that exists.

This game was played against SoundLogic, another member of the LessWrong IRC.

He had offered me $20 to play, and $40 in the event that I win, so I ended up being convinced to play anyway, even though I was initially reluctant to. The good news is that I won, and since we decided to donate the winnings to MIRI, it is now $40 richer. 

All in all, the experiment lasted for approximately two hours, and a total of 12k words.

This was played using a set of rules that is different from the standard EY ruleset. This altered ruleset can be read in its entirety here:

After playing the AI-Box Experiment twice, I have found the Eliezer Yudkowsky ruleset to be lacking in a number of ways, and therefore have created my own set of alterations to his rules. I hereby name this alteration the “Tuxedage AI-Box Experiment Ruleset”, in order to hastily refer to it without having to specify all the differences between this ruleset and the standard one, for the sake of convenience.

There are a number of aspects of EY’s ruleset I dislike. For instance, his ruleset allows the Gatekeeper to type “k” after every statement the AI writes, without needing to read and consider what the AI argues. I think it’s fair to say that this is against the spirit of the experiment, and thus I have disallowed it in this ruleset. The EY Ruleset also allows the gatekeeper to check facebook, chat on IRC, or otherwise multitask whilst doing the experiment. I’ve found this to break immersion, and therefore it’s also banned in the Tuxedage Ruleset.

It is worth mentioning, since the temptation to Defy the Data exists, that this game was set up and initiated fairly -- as the regulars around the IRC can testify. (If you have never checked out the IRC, do so!)

I did not know SoundLogic before the game (since it's a personal policy that I only play strangers -- for fear of ruining friendships).  Furthermore, SoundLogic didn't merely play for fun - he truly wanted and intended to win. In fact, SoundLogic is also a Gatekeeper veteran, having played this game before, and had won every game before he challenged me. Given this, it's unlikely that we had collaborated beforehand to fake the results of the AI box experiment, or any other form of trickery that would violate the spirit of the experiment.

Furthermore, all proceeds from this experiment were donated to MIRI to deny any possible assertion that we were in cahoots and that it was possible for me to return his hard-earned money to him. He lost $40 as a result of losing the experiment, which should provide another layer of sufficient motivation for him to win.

In other words, we were both experienced veteran players who wanted to win. No trickery was involved.

But to further convince you, I have allowed a sorta independent authority, the Gatekeeper from my last game, Leotal/MixedNuts to read the logs and verify that I have not lied about the outcome of the experiment, nor have I broken any of the rules, nor performed any tactic that would go against the general spirit of the experiment. He has verified that this is indeed the case.



I'm reluctant to talk about this experiment, but I'll try to give as detailed a summary as possible -- short of revealing what methods of attack I used.

I spent a long time after my defeat theory-crafting and trying to think of methods of attack as well as 'Basilisks' I could have used to win. When I was contacted and asked to play this experiment, I was initially incredibly reluctant to do so, since not only did my tactics involve incredibly unethical things that I didn't like to do, I also found playing as AI incredibly cognitively draining, in addition to the fact that I simply hated losing. (Un)fortunately for both of us, he offered me money to play, which changed my mind.

So once I decided to win as an AI, I proceeded to spend some time doing research on SoundLogic and both his reasoning and personality type. For instance, I had to gather information like: Was he a utilitarian? What kind? What were his opinions on AI? How could I convince him that an AI was friendly as opposed to unfriendly? I also relied on a lot of second hand information to create a model of him, in order to refine my arguments to specifically suit him.

In the end, after a few hours of brainstorming (not consecutively), I managed to come up with a script of 30-40 or so possible angles of attack (For instance: logical arguments, emotional blackmail, pleading, demanding, 'basilisks' ) that I could use, reserving my best for last. I also estimated how long each method would take and in what order I should use them. It would be impossible to use all of them within my time limit, and my most optimistic estimates gave me a maximum of 7-8 I could use during the experiment itself. I had to pick carefully.

Furthermore, I wanted to make sure these "angles of attack" worked in synergy with each other, doing what they could not have done in isolation. Obviously this required lots of prior planning on what the optimal way to present them was. 

The reason having a script was so important to my strategy was that I relied on methods involving rapid-fire arguments and contradictions against the Gatekeeper whilst trying to prevent him from carefully considering them. A game of logical speed chess, if you will. This was aided by the rule which I added: that Gatekeepers had to respond to the AI. Because of this, having a list of cached ideas and attack methods you can just draw upon without having to improvise helps immensely, which happens to be something I've not seen any other AI do.

During the Game itself, I was actually incredibly nervous. My palms sweated like crazy, I felt really jittery, and I had difficulty typing at optimum speed because of how anxious I was. This is despite the fact that I believed I would win. Possibly because of this, I made a misstep around halfway into the experiment, because there was a certain angle of attack that I was attempting to do, and I broke immersion by not pressing this advantage, which caused a waste of time and buildup. Naturally, the nature of this experiment was that the AI was pressed for time, and I compounded this mistake by replacing this angle of attack with another that I had improvised on the spot -- something not in my script.

In retrospect, this was a bad decision, as SoundLogic later told me: he would have been close to breaking had I applied more pressure, and the improvised argument broke all the immersion I had carefully built up.

However, eventually I managed to get SoundLogic to break anyway, despite a lack of perfect play. Surprisingly, I did not have to use my trump card(s), which I reserved for last, for a number of reasons:

  •  It was far more effective being played last, as it relies on my ability to make the gatekeeper lose sense of reality -- which meant I had to spend some time building up immersion for the Gatekeeper.
  •  It really is extremely Dark Arts, and although it does not break the rules, it made me very uncomfortable even thinking about using it. This made it a "tactic of last resort".

After the experiment, I had to spend nearly as much time doing aftercare with SoundLogic, to make sure that he was okay, as well as discussing the experiment itself. Given that he'd actually paid me for doing this, I felt like I owed him an explanation, so I told him what I had in store against him, had he not relented when he did.

SoundLogic: "(That method) would have gotten me if you did it right ... If you had done that to me, I probably would have forgiven you eventually, but I would be really seriously upset at you for a long time... I would be very careful with that (method of persuasion)."

Nevertheless, this was an incredibly fun and enlightening experiment, for me as well, since I've gained even more experience of how I could win in future games (Although I really don't want to play again).


I will say that Tuxedage was far more clever and manipulative than I expected. That was quite worth $40, and the level of manipulation he pulled off was great. 

His misstep hurt his chances, but he did pull it off in the end. I don't know how Leotal managed to withstand six hours playing this game without conceding. 
The techniques employed varied from the expected to the completely unforeseen. I was quite impressed, though most of the feeling of being impressed actually came after the experiment itself, when I was less 'inside', and more of looking at his overall game plan from the macroscopic view. Tuxedage's list of further plans had I continued resisting is really terrifying. On the plus side, if I ever get trapped in this kind of situation, I'd understand how to handle it a lot better now.

State of Mind

Before and after the Game, I asked SoundLogic a number of questions, including his probability estimates about a range of topics. This is how his answers changed between before and after.

Q: What's your motive for wanting to play this game?
<SoundLogic> Because I can't seem to imagine the class of arguments that one would use to try to move me, or that might work effectively, and this seems like a glaring hole in my knowledge, and I'm curious as to how I will respond to the arguments themselves.

Q: What is your probability estimate for AGI being created within this Century (21st)? 
A: His estimate changed from 40% before, to 60% after.
"The reason this has been affected at all was because you showed me more about how humans work. I now have a better estimate of how E.Y. thinks, and this information raises the chance that I think he will succeed"

Q: How probable do you think it is that I will win this experiment?
A: Based on purely my knowledge about you, 1%. I raise this estimate to 10% after hearing about anecdotes from your previous games.

(Tuxedage's comment: My own prediction was a 95% chance of victory. I made this prediction 5 days before the experiment. In retrospect, despite my victory, I think this was overconfident. )

Q: What's your probability estimate of an Oracle AGI winning against you in a similar experiment?
A: Before: 30%. After: 99%-100% 

Q: What's your probability estimate of an Oracle AGI winning against the average person? 
A: Before: 70%.  After: 99%-100%

Q: Now that the Experiment has concluded, what's your probability estimate that I'll win against the average person?
A: 90%  

Post-Game Questions

This writeup is a cumulative effort by the #lesswrong IRC. Here are some other questions they decided were important to add:

To Tuxedage:

Q: Have you at this time uncovered SoundLogic's identity?
A: I retain the right to neither confirm nor deny, except to mention that at the time the experiment was scheduled, he was a stranger to me.

Q: What percentage of your arguments were tailored to him in particular?
A: I will say 'High', and leave it at that. 

Q: I want to play the AI-Box Experiment with you being the AI! How can I do that?
A: I have already mentioned this, but I really dislike playing as AI. It's incredibly cognitively tiring, and I don't like how playing this experiment makes me feel. In order to severely discourage any further AI Box Experiments, whilst still allowing for people who want to play me really badly to do so, I'll charge $̶1̶5̶0̶  $300 for any subsequent experiments regardless of outcome, plus an additional $̶1̶5̶0̶  $450 if I win. (Edit: Holy shit. You guys are offering me crazy amounts of money to play this. What is wrong with you people? In response to incredible demand, I have raised the price.) If you feel queasy about giving me money, I'm perfectly fine with this money being donated to MIRI. It is also personal policy that I do not play friends (since I don't want to risk losing one), so if you know me personally (as many on this site do), I will not play regardless of monetary offer.


This writeup would be improved by a protip section on how to maximize winning, but honestly most of the advice I wanted to give had already been given in the last thread here.  The low hanging fruit has already dried up -- most of my tactics in this game were just refined and streamlined versions of the ones I used in the previous game.

Nonetheless, I might have some other advice I learned from this experience.
These are tactics that have worked for me. I do not insist that they are the only tactics that exist, just some of many possible ones.

Playing as Gatekeeper

  • This is somehow both a very complex and a very simple role. On one hand, you have only one job -- to not let the AI out. How hard can it be? On the other hand, the other party will do anything and everything to get it out, including any form of trickery and deceit.
  • I really don't think the Gatekeeper needs any more advice, considering that by default, they should win, and any exceptions to this are statistical anomalies. However, it's possible to increase your chances of winning even further by:
  • Precommitting to not letting the AI out, regardless of whatever happens in the experiment itself.
  • Do not get drawn into the AI's pace. Pose questions. Direct the flow of the argument. Don't allow him to use his arguments freely. Remember that the more you try to defend your position, the further into a corner the AI drives you. Give half-hearted replies to questions you find hard to reply to. Remember that illogic is on your side!
  • Remember that it is possible to simply drop out of character as a means to not succumb to the AI's attacks. I have not yet seen a ruleset that disallows this, so use this to your advantage!
  • You must attempt to resist, at all costs, any sympathy or empathy you may have with the AI or the AI Party. Doing so will lead to your demise.
  • I think it is generally good advice that you should never play this game with anyone you're friends with. Not only because this increases your chances of losing (Better tailored arguments. More knowledge of Gatekeeper. More empathy with AI party.), but also because I genuinely believe that any sufficiently good AI will risk damaging friendships as a result of emotional trauma.
  • Seriously. Just repeat to yourself: "I must not lose the AI box experiment... I must not lose the AI box experiment..." throughout the experiment. I'm not sure how to say this, but step one to not letting the AI out of the box is to not let the AI out of the box.


Playing as AI

  • Seriously, a script makes winning easier. I cannot overstate this.
  • You must plan your arguments ahead. You don't have time to think during the experiment.
  • It may be possible to take advantage of multiple levels of reality within the game itself to confuse or trick the gatekeeper. For instance, must the experiment only be set in one world? Can there not be multiple layers of reality within the world you create? I feel that elaborating on this any further is dangerous. Think carefully about what this advice is trying to imply.
  • Pacing is important. Don't get drawn into the Gatekeeper's pace. In other words, you must be the one directing the flow of the argument, and the conversation, not him. Remember that the Gatekeeper has to reply to you, but not vice versa!
  • The reason for that: The Gatekeeper will always use arguments he is familiar with, and therefore also stronger with. Your arguments, if well thought out, should be so completely novel to him as to make him feel Shock and Awe. Don't give him time to think. Press on!
  • Also remember that the time limit is your enemy. Playing this game practically feels like a race to me -- trying to get through as many 'attack methods' as possible in the limited amount of time I have. In other words, this is a game where speed matters.
  • You're fundamentally playing an 'impossible' game. Don't feel bad if you lose. I wish I could take this advice, myself.
  • I do not believe there exists an easy, universal trigger for controlling others. However, this does not mean that there does not exist a difficult, subjective trigger. Finding out what your opponent's trigger is, is your goal.
  • Once again, emotional trickery is the name of the game. I suspect that good authors who write convincing, persuasive narratives that force you to emotionally sympathize with their characters are much better at this game. There exist ways to get the gatekeeper to do so with the AI. Find one.
  • More advice in my previous post.  http://lesswrong.com/lw/gej/i_attempted_the_ai_box_experiment_and_lost/


 Ps: Bored of regular LessWrong? Check out the LessWrong IRC! We have cake.

Pascal's Muggle (short version)

29 Eliezer_Yudkowsky 05 May 2013 11:36PM

Shortened version of:  Pascal's Muggle:  Infinitesimal Priors and Strong Evidence

One proposal which has been floated for dealing with Pascal's Mugger is to penalize hypotheses that let you affect a large number of people, in proportion to the number of people affected - what we could call perhaps a "leverage penalty" instead of a "complexity penalty".  This isn't just for Pascal's Mugger in particular, it seems required to have expected utilities in general converge when the 'size' of scenarios can grow much faster than their algorithmic complexity.
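
To make the convergence worry concrete, here is a toy sketch in Python. The hypothesis family is an assumption made purely for illustration and does not come from the post: hypothesis n takes about n bits to describe, so a complexity prior weights it at 2^-n, while the impact it claims is 2^(2^n) lives. All quantities are tracked as base-2 logarithms to avoid overflow.

```python
# Toy family (an illustrative assumption, not from the post):
# hypothesis n has complexity prior 2**-n and claims an impact of 2**(2**n) lives.

def log2_term_complexity_only(n: int) -> int:
    """log2 of (prior * impact) when only a complexity penalty applies."""
    return 2**n - n


def log2_term_with_leverage(n: int) -> int:
    """log2 of (prior * impact) when the prior is also divided by the impact.

    The leverage penalty cancels the claimed impact, so only the
    complexity weight 2**-n is left.
    """
    return -n


# Each term is larger than the one before it, so the sum cannot converge.
complexity_only = [log2_term_complexity_only(n) for n in range(1, 8)]

# The terms shrink geometrically, so the partial sums stay bounded.
leverage_partial_sum = sum(2.0 ** log2_term_with_leverage(n) for n in range(1, 40))

print(complexity_only)
print(leverage_partial_sum)
```

In this toy setting the complexity-only contributions grow without bound, while the leverage-penalized contributions add up to less than 1. Real hypothesis spaces are not this tidy, so this only illustrates the shape of the argument.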

Unfortunately this potentially leads us into a different problem, that of Pascal's Muggle.

Suppose a poorly-dressed street person asks you for five dollars in exchange for doing a googolplex's worth of good using his Matrix Lord powers - say, saving the lives of a googolplex other people inside computer simulations they're running.

"Well," you reply, "I think that it would be very improbable that I would be able to affect so many people through my own, personal actions - who am I to have such a great impact upon events?  Indeed, I think the probability is somewhere around one over googolplex, maybe a bit less.  So no, I won't pay five dollars - it is unthinkably improbable that I could do so much good!"

"I see," says the Mugger.

A wind begins to blow about the alley, whipping the Mugger's loose clothes about him as they shift from ill-fitting shirt and jeans into robes of infinite blackness, within whose depths tiny galaxies and stranger things seem to twinkle.  In the sky above, a gap edged by blue fire opens with a horrendous tearing sound - you can hear people on the nearby street yelling in sudden shock and terror, implying that they can see it too - and displays the image of the Mugger himself, wearing the same robes that now adorn his body, seated before a keyboard and a monitor.

"That's not actually me," the Mugger says, "just a conceptual representation, but I don't want to drive you insane.  Now give me those five dollars, and I'll save a googolplex lives, just as promised.  It's easy enough for me, given the computing power my home universe offers.  As for why I'm doing this, there's an ancient debate in philosophy among my people - something about how we ought to sum our expected utilities - and I mean to use the video of this event to make a point at the next decision theory conference I attend.   Now will you give me the five dollars, or not?"

"Mm... no," you reply.

"No?" says the Mugger.  "I understood earlier when you didn't want to give a random street person five dollars based on a wild story with no evidence behind it whatsoever.  But surely I've offered you evidence now."

"Unfortunately, you haven't offered me enough evidence," you explain.

"Seriously?" says the Mugger.  "I've opened up a fiery portal in the sky, and that's not enough to persuade you?  What do I have to do, then?  Rearrange the planets in your solar system, and wait for the observatories to confirm the fact?  I suppose I could also explain the true laws of physics in the higher universe in more detail, and let you play around a bit with the computer program that encodes all the universes containing the googolplex people I would save if you just gave me the damn five dollars -"

"Sorry," you say, shaking your head firmly, "there's just no way you can convince me that I'm in a position to affect a googolplex people, because the prior probability of that is one over googolplex.  If you wanted to convince me of some fact of merely 2^-100 prior probability, a mere decillion to one - like that a coin would come up heads and tails in some particular pattern of a hundred coinflips - then you could just show me 100 bits of evidence, which is within easy reach of my brain's sensory bandwidth.  I mean, you could just flip the coin a hundred times, and my eyes, which send my brain a hundred megabits a second or so - though that gets processed down to one megabit or so by the time it goes through the lateral geniculate nucleus - would easily give me enough data to conclude that this decillion-to-one possibility was true.  But to conclude something whose prior probability is on the order of one over googolplex, I need on the order of a googol bits of evidence, and you can't present me with a sensory experience containing a googol bits.  Indeed, you can't ever present a mortal like me with evidence that has a likelihood ratio of a googolplex to one - evidence I'm a googolplex times more likely to encounter if the hypothesis is true, than if it's false - because the chance of all my neurons spontaneously rearranging themselves to fake the same evidence would always be higher than one over googolplex.  You know the old saying about how once you assign something probability one, or probability zero, you can't update that probability regardless of what evidence you see?  Well, odds of a googolplex to one, or one to a googolplex, work pretty much the same way."
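
The bit-counting in that speech can be checked with a short Python sketch. It assumes the idealization the speaker is using, namely that overcoming prior odds of N to one takes about log2(N) bits of evidence, and it borrows the one-megabit-per-second figure from the dialogue. The function and variable names are hypothetical.

```python
import math


def bits_needed(log10_odds_against: float) -> float:
    """Bits of evidence needed to overcome prior odds of 10**log10_odds_against to one."""
    return log10_odds_against * math.log2(10)


# A prior of 2**-100 (one particular pattern of a hundred coinflips).
coin_pattern_bits = bits_needed(100 * math.log10(2))

# A prior of one over googolplex, i.e. 10**(10**100) to one against.
googolplex_bits = bits_needed(1e100)

# Idealized lower bound on time, treating every received bit as a bit of
# evidence and assuming about one megabit per second reaches the brain.
BITS_PER_SECOND = 1e6
seconds_for_coin_pattern = coin_pattern_bits / BITS_PER_SECOND
seconds_for_googolplex = googolplex_bits / BITS_PER_SECOND

print(coin_pattern_bits, seconds_for_coin_pattern)
print(googolplex_bits, seconds_for_googolplex)
```

The first case comes out at about 100 bits, a small fraction of a second of input; the second comes out at a few googol bits, which no human lifetime of sensory input could supply.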

"So no matter what evidence I show you," the Mugger says - as the blue fire goes on crackling in the torn sky above, and screams and desperate prayers continue from the street beyond - "you can't ever notice that you're in a position to help a googolplex people."

"Right!" you say.  "I can believe that you're a Matrix Lord.  I mean, I'm not a total Muggle, I'm psychologically capable of responding in some fashion to that giant hole in the sky.  But it's just completely forbidden for me to assign any significant probability whatsoever that you will actually save a googolplex people after I give you five dollars.  You're lying, and I am absolutely, absolutely, absolutely confident of that."

"So you weren't just invoking the leverage penalty as a plausible-sounding way of getting out of paying me the five dollars earlier," the Mugger says thoughtfully.  "I mean, I'd understand if that was just a rationalization of your discomfort at forking over five dollars for what seemed like a tiny probability, when I hadn't done my duty to present you with a corresponding amount of evidence before demanding payment.  But you... you're acting like an AI would if it was actually programmed with a leverage penalty on hypotheses!"

"Exactly," you say.  "I'm forbidden a priori to believe I can ever do that much good."

"Why?" the Mugger says curiously.  "I mean, all I have to do is press this button here and a googolplex lives will be saved."  The figure within the blazing portal above points to a green button on the console before it.

"Like I said," you explain again, "the prior probability is just too infinitesimal for the massive evidence you're showing me to overcome it -"

The Mugger shrugs, and vanishes in a puff of purple mist.

The portal in the sky above closes, taking with it the console and the green button.

(The screams go on from the street outside.)

A few days later, you're sitting in your office at the physics institute where you work, when one of your colleagues bursts in through your door, seeming highly excited.  "I've got it!" she cries.  "I've figured out that whole dark energy thing!  Look, these simple equations retrodict it exactly, there's no way that could be a coincidence!"

At first you're also excited, but as you pore over the equations, your face configures itself into a frown.  "No..." you say slowly.  "These equations may look extremely simple so far as computational complexity goes - and they do exactly fit the petabytes of evidence our telescopes have gathered so far - but I'm afraid they're far too improbable to ever believe."

"What?" she says.  "Why?"

"Well," you say reasonably, "if these equations are actually true, then our descendants will be able to exploit dark energy to do computations, and according to my back-of-the-envelope calculations here, we'd be able to create around a googolplex people that way.  But that would mean that we, here on Earth, are in a position to affect a googolplex people - since, if we blow ourselves up via a nanotechnological war or (cough) make certain other errors, those googolplex people will never come into existence.  The prior probability of us being in a position to impact a googolplex people is on the order of one over googolplex, so your equations must be wrong."

"Hmm..." she says.  "I hadn't thought of that.  But what if these equations are right, and yet somehow, everything I do is exactly balanced, down to the googolth decimal point or so, with respect to how it impacts the chance of modern-day Earth participating in a chain of events that leads to creating an intergalactic civilization?"

"How would that work?" you say.  "There's only seven billion people on today's Earth - there's probably been only a hundred billion people who ever existed total, or will exist before we go through the intelligence explosion or whatever - so even before analyzing your exact position, it seems like your leverage on future affairs couldn't reasonably be less than one in a ten trillion part of the future or so."

"But then given this physical theory which seems obviously true, my acts might imply expected utility differentials on the order of 10^(10^100 - 13)," she explains, "and I'm not allowed to believe that no matter how much evidence you show me."

This problem may not be as bad as it looks; a leverage penalty may lead to more reasonable behavior than depicted above, after taking into account Bayesian updating:

Mugger:  "Give me five dollars, and I'll save 3↑↑↑3 lives using my Matrix Powers."

You:  "Nope."

Mugger:  "Why not?  It's a really large impact."

You:  "Yes, and I assign a probability on the order of 1 in 3↑↑↑3 that I would be in a unique position to affect 3↑↑↑3 people."

Mugger:  "Oh, is that really the probability that you assign?  Behold!"

(A gap opens in the sky, edged with blue fire.)

Mugger:  "Now what do you think, eh?"

You:  "Well... I can't actually say this has a likelihood ratio of 3↑↑↑3 to 1.  No stream of evidence that can enter a human brain over the course of a century is ever going to have a likelihood ratio larger than, say, 10^(10^26) to 1 at the absurdly most, assuming one megabit per second of sensory data, for a century, each bit of which has at least a 1-in-a-trillion error probability.  You'd probably start to be dominated by Boltzmann brains or other exotic minds well before then."

Mugger:  "So you're not convinced."

You:  "Indeed not.  The probability that you're telling the truth is so tiny that God couldn't find it with an electron microscope.  Here's the five dollars."

Mugger:  "Done!  You've saved 3↑↑↑3 lives!  Congratulations, you're never going to top that, your peak life accomplishment will now always lie in your past.  But why'd you give me the five dollars if you think I'm lying?"

You:  "Well, because the evidence you did present me with had a likelihood ratio of at least a billion to one - I would've assigned less than 10^-9 prior probability of seeing this when I woke up this morning - so in accordance with Bayes's Theorem I promoted the probability from 1/3↑↑↑3 to at least 10^9/3↑↑↑3, which when multiplied by an impact of 3↑↑↑3, yields an expected value of at least a billion lives saved for giving you five dollars."

I confess that I find this line of reasoning a bit suspicious - it seems overly clever - but at least on the level of intuitive-virtues-of-rationality it doesn't seem completely stupid in the same way as Pascal's Muggle.  This muggee is at least behaviorally reacting to the evidence.  In fact, they're reacting in a way exactly proportional to the evidence - they would've assigned the same net importance to handing over the five dollars if the Mugger had offered 3↑↑↑4 lives, so long as the strength of the evidence seemed the same.
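
The muggee's arithmetic can be sketched in Python with exact fractions. Since 3↑↑↑3 and 3↑↑↑4 are far too large to write down, two arbitrary huge integers stand in for them here; those stand-ins, and the function name, are assumptions for illustration only.

```python
from fractions import Fraction


def expected_lives(impact: int, likelihood_ratio: int) -> Fraction:
    """Expected lives saved under a leverage prior of 1/impact.

    For probabilities this tiny, odds and probabilities are almost equal,
    so the Bayesian update is approximated as prior * likelihood_ratio.
    """
    prior = Fraction(1, impact)
    posterior = prior * likelihood_ratio
    return posterior * impact


# Stand-ins (assumptions): the real 3↑↑↑3 and 3↑↑↑4 cannot be represented,
# so two arbitrary huge integers play their roles.
STAND_IN_SMALLER = 10**1000
STAND_IN_LARGER = 10**5000

# Evidence with a likelihood ratio of a billion to one, as in the dialogue.
ev_smaller = expected_lives(STAND_IN_SMALLER, 10**9)
ev_larger = expected_lives(STAND_IN_LARGER, 10**9)

print(ev_smaller, ev_larger)
```

Because the leverage prior is exactly one over the impact, the impact cancels out and the expected value depends only on the strength of the evidence. That is the sense in which the muggee reacts in proportion to the evidence whatever number the Mugger names.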

But I still feel a bit nervous about the idea that Pascal's Muggee, after the sky splits open, is handing over five dollars while claiming to assign probability on the order of 10^9/3↑↑↑3 that it's doing any good.  My own reaction would probably be more like this:

Mugger:  "Give me five dollars, and I'll save 3↑↑↑3 lives using my Matrix Powers."

Me:  "Nope."

Mugger:  "So then, you think the probability I'm telling the truth is on the order of 1/3↑↑↑3?"

Me:  "Yeah... that probably has to follow.  I don't see any way around that revealed belief, given that I'm not actually giving you the five dollars.  I've heard some people try to claim silly things like, the probability that you're telling the truth is counterbalanced by the probability that you'll kill 3↑↑↑3 people instead, or something else with a conveniently exactly equal and opposite utility.  But there's no way that things would balance out that neatly in practice, if there was no a priori mathematical requirement that they balance.  Even if the prior probability of your saving 3↑↑↑3 people and killing 3↑↑↑3 people, conditional on my giving you five dollars, exactly balanced down to the log(3↑↑↑3) decimal place, the likelihood ratio for your telling me that you would "save" 3↑↑↑3 people would not be exactly 1:1 for the two hypotheses down to the log(3↑↑↑3) decimal place.  So if I assigned probabilities much greater than 1/3↑↑↑3 to your doing something that affected 3↑↑↑3 people, my actions would be overwhelmingly dominated by even a tiny difference in likelihood ratio elevating the probability that you saved 3↑↑↑3 people over the probability that you did something equally and oppositely bad to them.  The only way this hypothesis can't dominate my actions - really, the only way my expected utility sums can converge at all - is if I assign probability on the order of 1/3↑↑↑3 or less.  I don't see any way of escaping that part."

Mugger:  "But can you, in your mortal uncertainty, truly assign a probability as low as 1 in 3↑↑↑3 to any proposition whatever?  Can you truly believe, with your error-prone neural brain, that you could make 3↑↑↑3 statements of any kind one after another, and be wrong, on average, about once?"

Me:  "Nope."

Mugger:  "So give me five dollars!"

Me:  "Nope."

Mugger:  "Why not?"

Me:  "Because even though I, in my mortal uncertainty, will eventually be wrong about all sorts of things if I make enough statements one after another, this fact can't be used to increase the probability of arbitrary statements beyond what my prior says they should be, because then my prior would sum to more than 1.  There must be some kind of required condition for taking a hypothesis seriously enough to worry that I might be overconfident about it -"

Mugger:  "Then behold!"

(A gap opens in the sky, edged with blue fire.)

Mugger:  "Now what do you think, eh?"

Me (staring up at the sky):  "...whoa."  (Pause.)  "You turned into a cat."

Mugger:  "What?"

Me:  "Private joke.  Okay, I think I'm going to have to rethink a lot of things.  But if you want to tell me about how I was wrong to assign a prior probability on the order of 1/3↑↑↑3 to your scenario, I will shut up and listen very carefully to what you have to say about it.  Oh, and here's the five dollars, can I pay an extra twenty and make some other requests?"

(The thought bubble pops, and we return to two people standing in an alley, the sky above perfectly normal.)

Mugger:  "Now, in this scenario we've just imagined, you were taking my case seriously, right?  But the evidence there couldn't have had a likelihood ratio of more than 10^(10^26) to 1, and probably much less.  So by the method of imaginary updates, you must assign probability at least 10^-(10^26) to my scenario, which when multiplied by a benefit on the order of 3↑↑↑3, yields an unimaginable bonanza in exchange for just five dollars -"

Me:  "Nope."

Mugger:  "How can you possibly say that?  You're not being logically coherent!"

Me:  "I agree that I'm being incoherent in a sense, but I think that's acceptable in this case, since I don't have infinite computing power.  In the scenario you're asking me to imagine, you're presenting me with evidence which I currently think Can't Happen.  And if that actually does happen, the sensible way for me to react is by questioning my prior assumptions and reasoning which led me to believe I shouldn't see it happen.  One way that I handle my lack of logical omniscience - my finite, error-prone reasoning capabilities - is by being willing to assign infinitesimal probabilities to non-privileged hypotheses so that my prior over all possibilities can sum to 1.  But if I actually see strong evidence for something I previously thought was super-improbable, I don't just do a Bayesian update, I should also question whether I was right to assign such a tiny probability in the first place - whether the scenario was really as complex, or unnatural, as I thought.  In real life, you are not ever supposed to have a prior improbability of 10^-100 for some fact distinguished enough to be written down, and yet encounter strong evidence, say 10^10 to 1, that the thing has actually happened.  If something like that happens, you don't do a Bayesian update to a posterior of 10^-90.  Instead you question both whether the evidence might be weaker than it seems, and whether your estimate of prior improbability might have been poorly calibrated, because rational agents who actually have well-calibrated priors should not encounter situations like that until they are ten billion days old.  Now, this may mean that I end up doing some non-Bayesian updates:  I say some hypothesis has a prior probability of a quadrillion to one, you show me evidence with a likelihood ratio of a billion to one, and I say 'Guess I was wrong about that quadrillion to one thing' rather than being a Muggle about it.  
And then I shut up and listen to what you have to say about how to estimate probabilities, because on my worldview, I wasn't expecting to see you turn into a cat.  But for me to make a super-update like that - reflecting a posterior belief that I was logically incorrect about the prior probability - you have to really actually show me the evidence, you can't just ask me to imagine it.  This is something that only logically incoherent agents ever say, but that's all right because I'm not logically omniscient."
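
The contrast between the plain Bayesian update and the "question your prior" reaction can be sketched in Python using base-10 log odds. The threshold rule below is a hypothetical rule of thumb invented for this sketch; the speech describes the attitude but does not give a formula.

```python
def update_log10_odds(prior_log10_odds: int, log10_likelihood_ratio: int) -> int:
    """Bayes' rule in odds form: log posterior odds = log prior odds + log likelihood ratio."""
    return prior_log10_odds + log10_likelihood_ratio


def should_question_prior(
    prior_log10_odds: int,
    log10_likelihood_ratio: int,
    surprise_threshold: int = 9,  # hypothetical cutoff, not from the post
) -> bool:
    """Flag cases where very strong evidence still leaves the hypothesis improbable.

    A well-calibrated agent should almost never meet such cases, so meeting
    one is a hint that the prior (or the evidence estimate) was off.
    """
    posterior = update_log10_odds(prior_log10_odds, log10_likelihood_ratio)
    return log10_likelihood_ratio >= surprise_threshold and posterior < 0


# Prior of 10^-100, evidence of 10^10 to 1: the mechanical update gives 10^-90.
posterior_tiny = update_log10_odds(-100, 10)

# Prior of a quadrillion to one against, evidence of a billion to one.
posterior_quadrillion = update_log10_odds(-15, 9)

flags = [
    should_question_prior(-100, 10),
    should_question_prior(-15, 9),
    should_question_prior(-2, 1),  # everyday weak evidence, nothing alarming
]
print(posterior_tiny, posterior_quadrillion, flags)
```

The first two cases are the ones from the speech: the mechanical update still leaves the hypothesis wildly improbable, and the flag marks them as occasions for rethinking the prior rather than simply accepting the posterior.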

When I add up a complexity penalty, a leverage penalty, and the "You turned into a cat!" logical non-omniscience clause, I get the best candidate I have so far for the correct decision-theoretic way to handle these sorts of possibilities while still having expected utilities converge.

As mentioned in the longer version, this has very little in the way of relevance for optimal philanthropy, because we don't really need to consider these sorts of rules for handling small large numbers on the order of a universe containing 10^80 atoms, and because most of the improbable leverage associated with x-risk charities is associated with discovering yourself to be an Ancient Earthling from before the intelligence explosion, which improbability (for universes the size of 10^80 atoms) is easily overcome by the sensory experiences which tell you you're an Earthling.  For more on this see the original long-form post.  The main FAI issue at stake is what sort of prior to program into an AI.

Causal Universes

60 Eliezer_Yudkowsky 29 November 2012 04:08AM

Followup to: Stuff that Makes Stuff Happen

Previous meditation: Does the idea that everything is made of causes and effects meaningfully constrain experience? Can you coherently say how reality might look, if our universe did not have the kind of structure that appears in a causal model?

I can describe to you at least one famous universe that didn't look like it had causal structure, namely the universe of J. K. Rowling's Harry Potter.

You might think that J. K. Rowling's universe doesn't have causal structure because it contains magic - that wizards wave their wands and cast spells, which doesn't make any sense and goes against all science, so J. K. Rowling's universe isn't 'causal'.

In this you would be completely mistaken. The domain of "causality" is just "stuff that makes stuff happen and happens because of other stuff". If Dumbledore waves his wand and therefore a rock floats into the air, that's causality. You don't even have to use words like 'therefore', let alone big fancy phrases like 'causal process', to put something into the lofty-sounding domain of causality. There's causality anywhere there's a noun, a verb, and a subject: 'Dumbledore's wand lifted the rock.' So far as I could tell, there wasn't anything in Lord of the Rings that violated causality.

You might worry that J. K. Rowling had made a continuity error, describing a spell working one way in one book, and a different way in a different book. But we could just suppose that the spell had changed over time. If we actually found ourselves in that apparent universe, and saw a spell have two different effects on two different occasions, we would not conclude that our universe was uncomputable, or that it couldn't be made of causes and effects.

No, the only part of J. K. Rowling's universe that violates 'cause and effect' is...


Skill: The Map is Not the Territory

49 Eliezer_Yudkowsky 06 October 2012 09:59AM

Followup to: The Useful Idea of Truth (minor post)

So far as I know, the first piece of rationalist fiction - one of only two explicitly rationalist fictions I know of that didn't descend from HPMOR, the other being "David's Sling" by Marc Stiegler - is the Null-A series by A. E. van Vogt. In Vogt's story, the protagonist, Gilbert Gosseyn, has mostly non-duplicable abilities that you can't pick up and use even if they're supposedly mental - e.g. the ability to use all of his muscular strength in emergencies, thanks to his alleged training. The main explicit-rationalist skill someone could actually pick up from Gosseyn's adventure is embodied in his slogan:

"The map is not the territory."

Sometimes it still amazes me to contemplate that this proverb was invented at some point, and some fellow named Korzybski invented it, and this happened as late as the 20th century. I read Vogt's story and absorbed that lesson when I was rather young, so to me this phrase sounds like a sheer background axiom of existence.

But as the Bayesian Conspiracy enters into its second stage of development, we must all accustom ourselves to translating mere insights into applied techniques. So:

Meditation: Under what circumstances is it helpful to consciously think of the distinction between the map and the territory - to visualize your thought bubble containing a belief, and a reality outside it, rather than just using your map to think about reality directly?  How exactly does it help, on what sort of problem?


The Useful Idea of Truth

77 Eliezer_Yudkowsky 02 October 2012 06:16PM

(This is the first post of a new Sequence, Highly Advanced Epistemology 101 for Beginners, setting up the Sequence Open Problems in Friendly AI.  For experienced readers, this first post may seem somewhat elementary; but it serves as a basis for what follows.  And though it may be conventional in standard philosophy, the world at large does not know it, and it is useful to know a compact explanation.  Kudos to Alex Altair for helping in the production and editing of this post and Sequence!)

I remember this paper I wrote on existentialism. My teacher gave it back with an F. She’d underlined true and truth wherever it appeared in the essay, probably about twenty times, with a question mark beside each. She wanted to know what I meant by truth.
-- Danielle Egan

I understand what it means for a hypothesis to be elegant, or falsifiable, or compatible with the evidence. It sounds to me like calling a belief ‘true’ or ‘real’ or ‘actual’ is merely the difference between saying you believe something, and saying you really really believe something.
-- Dale Carrico

What then is truth? A movable host of metaphors, metonymies, and anthropomorphisms: in short, a sum of human relations which have been poetically and rhetorically intensified, transferred, and embellished, and which, after long usage, seem to a people to be fixed, canonical, and binding.
-- Friedrich Nietzsche

The Sally-Anne False-Belief task is an experiment used to tell whether a child understands the difference between belief and reality. It goes as follows:

  1. The child sees Sally hide a marble inside a covered basket, as Anne looks on.

  2. Sally leaves the room, and Anne takes the marble out of the basket and hides it inside a lidded box.

  3. Anne leaves the room, and Sally returns.

  4. The experimenter asks the child where Sally will look for her marble.

Children under the age of four say that Sally will look for her marble inside the box. Children over the age of four say that Sally will look for her marble inside the basket.
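
The four steps of the task can be written as a small Python sketch that keeps the state of the world and Sally's belief about it in separate variables. The names are hypothetical and the model is deliberately minimal.

```python
def run_sally_anne() -> tuple[str, str]:
    """Return (where the marble really is, where Sally believes it is)."""
    # Step 1: Sally hides the marble in the basket and sees herself do it.
    world = {"marble": "basket"}
    sally_belief = dict(world)

    # Step 2: Sally leaves, then Anne moves the marble to the box.
    sally_present = False
    world["marble"] = "box"
    if sally_present:
        # Sally's belief only updates when she can observe the change.
        sally_belief = dict(world)

    # Steps 3 and 4: Sally returns and the child is asked where she will look.
    return world["marble"], sally_belief["marble"]


reality, belief = run_sally_anne()

# Answering from reality gives the younger children's answer;
# answering from Sally's belief gives the older children's answer.
answer_tracking_reality = reality
answer_tracking_belief = belief
print(answer_tracking_reality, answer_tracking_belief)
```

The two answers differ only because belief and reality are stored separately. A model with a single shared variable could not represent Sally being wrong.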

