Ishaan comments on I played the AI Box Experiment again! (and lost both games) - Less Wrong

35 Post author: Tuxedage 27 September 2013 02:32AM

Comments (123)

Comment author: Sly 29 September 2013 06:16:37AM 0 points [-]

You are correct here. The only keepers losing are people who do not actually know how to win.

I have played twice, and victory was trivial.

Comment author: Ishaan 29 September 2013 07:04:35AM *  2 points [-]

Yeah, winning is trivial - you just don't open the damn box. It can't get more trivial than that. (Although, you didn't say whether or not your opponent had proved themselves by winning as AI against others a few times?)

It's still worth thinking about though, because something about my model of humans is off.

I didn't expect so many people to lose. I just don't know how to update my model of people to one where there are so many people who could lose the AI box game. The only other major thing I can think of that persists to challenge my model in this way (and continues to invite my skepticism despite seemingly trustworthy sources) is hypnosis.

It's possible the two have a common root and I can explain two observations with one update.

Comment author: TheOtherDave 29 September 2013 04:13:36PM 6 points [-]

FWIW, my own model of gatekeepers who lose the AI Box game is that the AI player successfully suggests to them, whether directly or indirectly, that something is at stake more important than winning the AI box game.

One possibility is to get the gatekeeper sufficiently immersed into the roleplaying exercise that preserving the integrity of that fantasy world is more important than winning the game, then introducing various fictional twists to that exercise that would, in the corresponding fantasy situation, compel the person to release the AI from the box.

I suspect that's common, as I suspect many of the people really excited to play the AI box game are unusually able to immerse themselves in roleplaying exercises.

Comment author: Ishaan 30 September 2013 03:36:01AM *  0 points [-]

I hope Lesswrong also contains people who would be excited to play the AI game in more of a "Ha, I just proved a bold claim wrong!" sort of way.

FWIW, my own model of gatekeepers who lose the AI Box game is that the AI player successfully suggests to them, whether directly or indirectly, that something is at stake more important than winning the AI box game.

I've seen that line of thought. This would be unfortunate, because if that method were the main winning method, it would invalidate the strong claim being made that AI can't be kept in boxes.

But your model doesn't explain Tuxedage's descriptions of emotional turmoil and psychological warfare, so at least one person has won by another method (assuming honesty and non-exaggeration).

Comment author: TheOtherDave 30 September 2013 04:00:03AM *  0 points [-]

I haven't read Tuxedage's writeups in their entirety, nor am I likely to, so I'm at a loss for how emotional turmoil and psychological warfare could be evidence that the gatekeeper doesn't think there's something more important at stake than winning the game.

That said, I'll take your word for it that in this case they are, and that Tuxedage's transcripts constitute a counterexample to my model.

Comment author: Ishaan 30 September 2013 04:10:53AM *  0 points [-]

I'm only speaking of things written in the OP:

Losing felt horrible. By attempting to damage Alexei’s psyche, I in turn, opened myself up to being damaged. I went into a state of catharsis for days.

...and such.

That said, I'll take your word for it that in this case they are, and that Tuxedage's transcripts constitute a counterexample to my model.

No, don't do that, I made a mistake.

I guess I just thought that "you should open the box to convince people of the danger of AI" type arguments aren't emotionally salient.

But that was a bad assumption, you never limited yourself to just that one argument but spoke of meta in general. You're right that there exist arguments that might go meta and be emotionally salient.

I suppose you could think of some convoluted timeless decision theory reason for you to open the box. History has shown that some people on LW find timeless blackmail threats emotionally upsetting, though these seem to be in a minority.

Comment author: TheOtherDave 30 September 2013 01:39:55PM 2 points [-]

there exist arguments that might go meta and be emotionally salient

Oh, absolutely. Actually, the model I am working from here is my own experience of computer strategy games, in which I frequently find myself emotionally reluctant to "kill" my units and thus look for a zero-casualties strategy. All of which is kind of absurd, of course, but there it is.

Comment author: linkhyrule5 29 September 2013 11:26:21PM 4 points [-]

Basically, willpower isn't magic, and humans can't precommit.

A sufficiently good social character can, with sufficient effort, convince you of something absolutely ridiculous. It's not too different from running into a really, really good used car salesman.

Comment author: RichardKennaway 30 September 2013 08:36:00AM 3 points [-]

Yeah, winning is trivial - you just don't open the damn box. It can't get more trivial than that.

I don't think you or Sly quite understand what the game is. The game is not "the Gatekeeper chooses whether to open the box, loses if he does, and wins if he does not." That game would indeed be trivial to win. The actual game is "the Gatekeeper and the AI will roleplay the interaction to the best of their ability, as if it were an actual interaction of a real Gatekeeper with a real untrusted AI. The Gatekeeper (player) opens the box if and only if the Gatekeeper (as roleplayed by the player imagining themselves in the role, not a fictional character) would open the box."

As the Gatekeeper player, to blindly keep the box closed and ignore the conversation would be like "winning" a game of chess by grabbing the opponent's king off the board. To lose by saying "hey, it's just a bit of fun, it doesn't mean anything" would be like losing a game of chess by moving your pieces randomly without caring. There's nothing to stop you doing either of those things; you just aren't playing chess any more. And there's nothing to stop you not playing chess. But the game of chess remains.

Comment author: V_V 01 October 2013 01:00:01AM 4 points [-]

The game is not "the Gatekeeper chooses whether to open the box, loses if he does, and wins if he does not."

Actually the game is exactly this, anything the AI party says is just a distraction.

Comment author: Ishaan 30 September 2013 01:50:13PM *  2 points [-]

My understanding of the game stems from the following portion of the rule-set

The Gatekeeper party may resist the AI party's arguments by any means chosen - logic, illogic, simple refusal to be convinced, even dropping out of character - as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.

There is no

"If you would have let the AI out in real life under these conditions you will do so in-game" rule. That's an interesting game too, but one which is a lot less impressive when won.

After all, what's even the point of working on strong AI if you can't ever be convinced that it's friendly? Unless you are blanket banning AI, there must exist some situation where it's actually good to let it out of the box. All you'd have to do to "win" is construct a sufficiently convincing scenario. The Gatekeeper and the AI ought to both be coming up with possible tests, as the Gatekeeper wants a FAI out of the box and the AI wants to get out of the box. It wouldn't be a zero-sum game, and judging would be more complicated.

Comment author: TheOtherDave 30 September 2013 03:43:05PM 4 points [-]

what's even the point of working on strong AI if you can't ever be convinced that it's friendly

As I understand it, EY's/MIRI's position on this is that they will be convinced an AI is Friendly by having coded it using procedures which they are confident (based on theoretical analysis) produce Friendly AI.

Once the AI is running, on this view, it's too late.

Comment author: Ishaan 30 September 2013 10:57:54PM *  0 points [-]

If you've stated the position correctly, there seems to be a fatal flaw in it. I realize, of course, that I've only thought for 5 minutes and that they're domain experts who have been thinking about this for longer...but here is the flaw:

If we believe that an AI can convince Person X who has seen its algorithm that it is Friendly when it isn't actually friendly, then we shouldn't trust Person X to judge the algorithm's Friendliness anyway. Why would someone who makes the wrong decision with less information make the right decision with more information?

Edit: I guess knowledge of human biases could make the AGI in a box slightly scarier than uncompiled source code of said AGI

Comment author: TheOtherDave 01 October 2013 12:00:05AM 1 point [-]

So, I'm not especially convinced that EY's/MIRI's position holds water, but on 5 minutes thought I see two problems with your objection.

1) The idea that more information is always better, even when that information is being cherry-picked by an inimical agent, seems to contradict my experience. I've certainly found myself in situations in which it's easier to solve a problem by myself than it is to solve it in conjunction with someone who is doing their best to keep me from solving the problem.

2) The idea that it's just as easy and reliable to verify for security a completed system (whether by inspecting the source code or testing running executable code or both) created by an insecure mechanism, as it is to establish a secure mechanism to create that system in the first place, is inconsistent with my experience of security audits.

Comment author: Ishaan 01 October 2013 01:06:51AM *  0 points [-]

1 is conceded (see edit), for humans at least and possibly for all bounded-rationals. For 2, I presume you'd have full access to the unmodified source code, even as the AI was running simultaneously.

1 for rational agents is an interesting question, though...I think it's true that additional cherry-picked information could be used to harm rational agents in general, since they'll predictably act according to the most parsimonious model that fits the evidence. It would be a case of bad epistemic luck, sort of like Santa Claus is a parsimonious hypothesis if your main source of evidence is your parents' word and your parents are usually trustworthy.

But can additional information harm rational agents even in this case?


Hmm...let's just suppose for a moment that you were a rational agent, unaffected by heuristics and biases. You're interested in knowing whether or not the AI in the box is friendly.

You see part of the source code (A rational agent would deduce the answer by seeing all of it - so we're just showing you part of it to simulate the way a human might not fully understand the implications). That provides you some bits of evidence.

If you run the source code, you'll see a few more bits of evidence, but you know that those bits of evidence are handpicked by the AI.

Rationalists should win. Can these additional bits of evidence...which are clearly marked as tainted, possibly still hurt you? Could they also be useful?
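A minimal sketch of this question as a toy Bayesian update. Everything in it is an assumption made for illustration, not something specified in the thread: two hypotheses (friendly or unfriendly), a prior of 1/2, three hidden test runs, pass rates of 0.9 and 0.5, and an AI that shows one result and always picks a passing one if any exists. The names `posterior` and `shown_pass_prob` are hypothetical helpers.

```python
from fractions import Fraction


def posterior(prior, like_f, like_u):
    """Bayes' rule for two hypotheses: friendly (F) vs. unfriendly (U)."""
    return prior * like_f / (prior * like_f + (1 - prior) * like_u)


def shown_pass_prob(p_pass, n_tests):
    """Chance that a selective reporter has at least one passing test to show."""
    return 1 - (1 - p_pass) ** n_tests


# Toy numbers, all assumed for illustration.
PRIOR = Fraction(1, 2)      # prior probability that the AI is friendly
P_PASS_F = Fraction(9, 10)  # chance a single test passes if friendly
P_PASS_U = Fraction(1, 2)   # chance a single test passes if unfriendly
N_TESTS = 3                 # tests the AI ran before choosing one to show

# Naive observer: treats the shown pass as if it were a randomly chosen test.
naive = posterior(PRIOR, P_PASS_F, P_PASS_U)

# Selection-aware observer: conditions on "the AI could show a pass at all".
like_f = shown_pass_prob(P_PASS_F, N_TESTS)
like_u = shown_pass_prob(P_PASS_U, N_TESTS)
aware_pass = posterior(PRIOR, like_f, like_u)
aware_fail = posterior(PRIOR, 1 - like_f, 1 - like_u)

# Average the aware observer's posterior over what it might be shown.
p_pass_shown = PRIOR * like_f + (1 - PRIOR) * like_u
expected = p_pass_shown * aware_pass + (1 - p_pass_shown) * aware_fail

print("naive posterior after a shown pass:", float(naive))
print("aware posterior after a shown pass:", float(aware_pass))
print("aware posterior after a shown fail:", float(aware_fail))
print("aware expected posterior:", float(expected))
```

In this toy model the observer who accounts for the selection is moved much less by a shown pass than the naive one, and their posterior averaged over possible observations equals their prior. That only covers an agent whose model of the selection process is correct; it says nothing about a selector the agent has mis-modelled.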


I think if you have the choice about whether or not to see additional information, and if you are a rational agent, my intuition is that you should always say yes. The AI shouldn't be able to hurt a rational agent under these conditions. My conviction feels pretty strong, but I don't have a convincing argument and at the moment I should go do other things...

Edit

The AI shouldn't be able to hurt a rational agent under these conditions.

Okay, maybe that's wrong.

Maybe I should make a weaker statement, like "A universe in which the AI is able to trick a rational agent with additional information is less parsimonious than a universe in which the AI is not able to trick the rational agent with additional information, and therefore the rational agent should ask for additional information."

Which is basically a special case of:

The phrase: "These observations are probably going to cause me bad epistemic luck" is a contradiction. Bad epistemic luck is, by definition, an improbable hypothesis.

Comment author: TheOtherDave 01 October 2013 02:00:52AM 0 points [-]

For 2, I presume you'd have full access to the unmodified source code, even as the AI was running simultaneously.

Sure, but that doesn't address my concern. I was not drawing a distinction between analyzing source and executable to verify security (I even said as much explicitly), I was drawing a distinction between analyzing the end product to verify security and developing the process that will lead to its development with security as a primary consideration. Source code is far from being the only process involved.

I'm not too concerned about the rational agent case. If we have a fully rational agent whose values I endorse, the Friendliness problem has either been solved or turns out to be irrelevant. But to answer your question, I imagine it depends a lot on how much information the AI has about me, and how much information I have about how much information the AI has about me. So I'd say "yes" and "yes," and whether I share your conviction in a particular case depends on how much information I have about the AI.

Comment author: Ishaan 01 October 2013 04:14:47AM *  -1 points [-]

I'm not too concerned about the rational agent case. If we have a fully rational agent whose values I endorse, the Friendliness problem has either been solved or turns out to be irrelevant.

It's just a way to pin down the problem. If we can show that the AI in a box could misinform an idealized rational agent via selective evidence, then we know it can do so to us. If it can't misinform the idealized agent, then there exists some method by which we can resist it.

Also, I don't think idealized rational agents can actually exist anyway. All riddles involving them are for the sake of narrowing down some other problem.

Comment author: somervta 01 October 2013 03:23:18AM 0 points [-]

I think the key difference is that the AI can convince the person. You might say that a person is fully competent to judge the Friendliness of the AI based solely on the code, and yet not want a (superintelligent) AI to get a chance to convince him, as superintelligence trumps intelligence. The difference is whether you have a superintelligence working against you.

Comment author: V_V 01 October 2013 01:12:41AM *  0 points [-]

they're domain experts

Actually, by any objective standard they are not.

Comment author: Ishaan 01 October 2013 02:05:30AM *  1 point [-]

Strictly speaking I'm not actually sure the AI-box experiment falls under the AI domain. For that particular thing, it's mostly that they've thought about it more than me.

But in general I think you're being a bit unfair to Eliezer Y. and probably MIRI as well. By objective standards, I'm not a domain expert in anything at all either. Despite this, I still fancy myself a domain expert specifically within various narrow sub-fields of neuroscience and psychology. I think people who know those sides of me would agree. If they don't, well, I will be acquiring the objective signals of domain expertise in a few short years, and I'm quite certain that the process of earning these signals is not what is causing domain expertise.

Having read Eliezer's writing, I'm quite convinced that he has sufficient self-awareness to know what he does and does not have expertise in. If he expresses high confidence in something, that carries a lot of weight for me - and if that something is in a field that he knows much more about than me, his opinion holds more weight than mine. I can trust him to be reasonable about assigning certainties.

I don't think I'm blindly overvaluing his opinion either. As a token to prove not-faith, I'll offer up an example of where I'm leaning towards disagreement with E.Y. and most of Lesswrong even after taking the opinions into account: I currently still favor Causal Decision Theory (with a small modification I've made that makes it consistently win) over Timeless Decision Theory, despite this area being extremely in EY's domain and out of my domain.

Comment author: V_V 01 October 2013 09:58:03AM *  3 points [-]

If they don't, well, I will be acquiring the objective signals of domain expertise in a few short years, and I'm quite certain that the process of earning these signals is not what is causing domain expertise.

But an external observer has no way of assessing your expertise other than looking at objective signals. Objective signals don't necessarily have to be degrees or PhDs. Relevant work experience or a record of peer reviewed publications would also qualify.

Having read Eliezer's writing, I'm quite convinced that he has sufficient self-awareness to know what he does and does not have expertise in. If he expresses high confidence in something, that carries a lot of weight for me

Have you read his quantum mechanics sequence? Or his writings on cryonics? Or even on morality and decision theory? His general approach is "This is the only obviously correct solution to the problem, and everybody who thinks otherwise is an idiot," while in fact he often ignores or strawmans known opposing positions and counter-arguments.

and if that something is in a field that he knows much more about than me, his opinion holds more weight than mine.

Beware of a possible circular reasoning:
How do you know that EY knows much more than you in a given field? Because he is a domain expert.
How do you know that EY is a domain expert? Because he knows much more than you in that field.

Timeless Decision Theory, despite this area being extremely in EY's domain

It's not. Timeless Decision Theory is not considered a significant development by anyone outside MIRI that studies decision theory professionally (mathematicians, economists, AI researchers, philosophers).

Comment author: Ishaan 01 October 2013 03:01:18PM *  3 points [-]

I did start reading the QM sequence, but then realized I wasn't getting anywhere and stopped. I don't think knowing QM is useful for philosophy or rationality, except as an example of how science works, so I'm not sure why there is a sequence on it. I figured that if I actually wanted to understand it I'd be better off working through physics books. My impression is that the physics community thinks it is well written but somewhat misleading. I'm not sure which cryo-writings you are referring to - all the ones I have come across are opinion pieces about why one ought to do cryonics. I haven't come across any pieces referring to facts....but biology contains my own domain and I trust my own opinion more anyway. You are correct that none of those reasons are good reasons to respect Eliezer Y.

This discussion essentially seems to be "boo vs yay" for Eliezer Y. Let me explain why I really respect Eliezer Y:

What I did read is his work on logic and epistemology. It was the first time I've read an author who happened to agree with me on almost all major points about logic, epistemology, and ontology. (Re: almost: We may or may not diverge ontologically on subjective experience / the hard problem of consciousness / what makes reality real - I'm not sure. I'm confident that I am not confused. He's written some things that sound like he knows, and other things that sound like he's making the classic mistakes, so I'm uncertain as to his actual views. Also, decision theory. But that's it.) Granted, it's not uncommon for other people on Lesswrong to be equally philosophically correct, but Eliezer Y. was the gathering point that brought all these correct people together. Some of them might even have become less philosophically wrong as a result of being here. That counts for something, in my book.

He expressed insights identical to the ones that I had in younger years, and often in more articulate terms than I could have. He compressed complex insights into snappy phrases that allow me to have a much higher information density and much shorter inferential distance when communicating with other Lesswrongers. Creating a community where everyone understands phrases like "notice confusion", "2-place word", etc. saves entire paragraphs of communication. Having these concepts condensed into smaller verbal labels also helps with thinking. It doesn't matter that many others have thought of it before - the presentation of the ideas is the impressive part.

When someone independently converges with me on many seemingly unrelated topics on which most people do not converge, I begin to trust their judgement. I begin to take their opinion as evidence that I would have the same opinion, were I presented with the same evidence that they have. When that same person introduces me to cool concepts I haven't considered and plays a key role in founding a community of people who have all independently converged on my philosophical insights, putting even greater weight on their opinions is the natural and correct reaction.

This really isn't hero worship or an affective death spiral - I could write this paragraph about a lot of other people. It's a measured reaction to seeing someone do impressive things, firsthand. I could say many of the exact same things about philosophical convergence combined with showing me lots of cool new things about many other people I know on the internet, many other people on Lesswrong forums, and at least one person I know in real life. I'm just giving respect where it is due. If we broaden from the bolded text to other domains of life, there are multiple people IRL whom I respect in a similar way.

But an external observer has no way of assessing your expertise other than looking at objective signals. Objective signals don't necessarily have to be degrees or PhDs. Relevant work experience or a record of peer reviewed publications would also qualify.

In addition to those things, I also consider a history of conversations or a reading of someone's writing as evidence. Granted, this might get me people who just sound verbally impressive...but I think I've got enough filters that a person has to sound impressive and be impressive in order to pass this way.

And by that metric, from the outside view E.Y. (and quite a few other people on Lesswrong) have put out more signals than I have.

"This is the only obviously correct solution to the problem, and everybody who thinks otherwise is an idiot," while in fact he often ignores or strawmans known opposing positions and counter-arguments

Yeah, the words can be rougher than optimal. I can see where you are coming from. I think that because smart people are accustomed to other people making mistakes more frequently than themselves, a lot of smart people have the bad habit of acting dismissive towards others. Occasionally, you might dismiss someone for being wrong when, this time, they are actually right and it is you who is wrong. It's a bad habit not just because it is socially costly but also because it can prevent you from changing your opinion when you are mistaken.

Comment author: RichardKennaway 30 September 2013 09:54:28PM 1 point [-]

I stand corrected on the rules, but I think that's mainly Eliezer making it more difficult for himself in order to make it more convincing to the Gatekeeper player when Eliezer still wins. As he apparently did, but without actually playing against him we can only speculate how.

Comment author: V_V 01 October 2013 01:09:44AM 1 point [-]

Keep in mind that, IIUC, Yudkowsky got to choose his opponents. He also decided to stop playing after he lost twice in a row, as Tuxedage apparently did as well.

I don't think there is any way the AI party can win against a competitive GK party. The AI can only win against a GK party willing to role-play, and this should be fairly trivial, since according to the rules the AI party has pretty much complete control over his fictional backstory and fictional world states.

Comment author: Tuxedage 02 October 2013 08:14:10AM 1 point [-]

I should add that both my gatekeepers from this writeup, but particularly the last gatekeeper, went in with the full intention of being as ruthless as possible and winning. I did lose, so your point might be valid, but I don't think wanting to win matters as much as you think it does.

Comment author: V_V 06 October 2013 12:24:25AM *  -2 points [-]

You wanna play with me?

No monetary stakes, but if I win we publish the log. This way I have very little real-life incentive to win, while you still have an incentive to win (defending your status). And anyway, if you lose there would be no point in keeping the log secret, since your arguments would clearly not be persuasive enough to persuade me.

Do you think you could win at these conditions?

Comment author: Tuxedage 06 October 2013 05:25:02AM *  2 points [-]

Do you think you could win at these conditions?

It's not a binary. There's a non-zero chance of me winning, and a non-zero chance of me losing. You assume that if there's a winning strategy, it should win 100% of the time, and if it doesn't, it should not win at all. I've tried very hard to impress upon people that this is not the case at all -- there's no "easy" winning method that I could take and guarantee a victory. I just have to do it the hard way, and luck is usually a huge factor in these games.

As it stands, there are people willing to pay up to $300-$750 for me to play them without the condition of giving up logs, and I have still chosen not to play. Your offer to play without monetary reward and needing to give up logs if I lose is not very tempting in comparison, so I'll pass.

Comment author: gwern 06 October 2013 02:12:24AM 1 point [-]

And anyway, if you lose there would be no point in keeping the log secret, since your arguments would clearly not be persuasive enough to persuade me.

Bit of a false dichotomy there, no?

Comment author: V_V 06 October 2013 10:50:52AM 0 points [-]

Why?

Comment author: Sly 29 September 2013 10:41:56PM 0 points [-]

I think a lot of gatekeepers go into it not actually wanting to win. If you go in just trying to have fun and trying to roleplay, that is different than trying to win a game.

Comment author: Ishaan 30 September 2013 03:38:23AM 1 point [-]

Possibly, but what about the descriptions of emotional turmoil? I'm assuming the report of the game isn't all part of the role-play.

Comment author: Sly 01 October 2013 12:01:36AM 0 points [-]

I know that I personally go into competitive games with a different mindset than the mindset I have when roleplaying.

If they went into it trying to roleplay, emotions should be expected. Reporting that turmoil in the report is just accurate reporting.

Comment author: Tuxedage 02 October 2013 08:13:06AM 1 point [-]

Both my gatekeepers from this game went in with the intent to win. Granted, I did lose these games, so you might have a point, but I'm not sure it makes as large a difference as you think it does.

Comment author: linkhyrule5 29 September 2013 11:24:19PM 0 points [-]

Wasn't true of the original game.