It occurred to me one day that the standard visualization of the Prisoner's Dilemma is fake.

The core of the Prisoner's Dilemma is this symmetric payoff matrix:

            1: C        1: D
2: C        (3, 3)      (5, 0)
2: D        (0, 5)      (2, 2)

Player 1 and Player 2 can each choose C or D.  The utilities of 1 and 2 for the final outcome are given by the first and second number in the pair, respectively.  For reasons that will become apparent, "C" stands for "cooperate" and "D" stands for "defect".

Observe that a player in this game (regarding themselves as the first player) has this preference ordering over outcomes:  (D, C) > (C, C) > (D, D) > (C, D).

D, it would seem, dominates C:  If the other player chooses C, you prefer (D, C) to (C, C); and if the other player chooses D, you prefer (D, D) to (C, D).  So you wisely choose D, and as the payoff table is symmetric, the other player likewise chooses D.

If only you'd both been less wise!  You both prefer (C, C) to (D, D).  That is, you both prefer mutual cooperation to mutual defection.
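
The dominance argument, and its unhappy result, can be checked mechanically. Here is a minimal sketch in Python; the dictionary layout and the names `PAYOFFS`, `utility` and `strictly_dominates` are illustrative choices of mine, not anything from the standard literature.

```python
# Payoffs from the matrix above, keyed by (player 1's move, player 2's move).
# Each value is (player 1's utility, player 2's utility).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}
MOVES = ("C", "D")


def utility(payoffs, player, own_move, other_move):
    """Utility to `player` (0 or 1) for playing own_move against other_move."""
    profile = (own_move, other_move) if player == 0 else (other_move, own_move)
    return payoffs[profile][player]


def strictly_dominates(payoffs, player, move, alternative):
    """True if `move` beats `alternative` whatever the other player does."""
    return all(
        utility(payoffs, player, move, other)
        > utility(payoffs, player, alternative, other)
        for other in MOVES
    )


# D strictly dominates C for each player...
d_dominates = [strictly_dominates(PAYOFFS, p, "D", "C") for p in (0, 1)]
# ...and yet each player gets more from (C, C) than from (D, D).
both_prefer_cc = all(
    PAYOFFS[("C", "C")][p] > PAYOFFS[("D", "D")][p] for p in (0, 1)
)
print(d_dominates, both_prefer_cc)
```

Both checks come out true at once, which is the whole dilemma: the move that is individually best in every case leads to an outcome both players like less.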

The Prisoner's Dilemma is one of the great foundational issues in decision theory, and enormous volumes of material have been written about it.  Which makes it an audacious assertion of mine, that the usual way of visualizing the Prisoner's Dilemma has a severe flaw, at least if you happen to be human.

The classic visualization of the Prisoner's Dilemma is as follows: you are a criminal, and you and your confederate in crime have both been captured by the authorities.

Independently, without communicating, and without being able to change your mind afterward, you have to decide whether to give testimony against your confederate (D) or remain silent (C).

Both of you, right now, are facing one-year prison sentences; testifying (D) takes one year off your prison sentence, and adds two years to your confederate's sentence.

Or maybe you and some stranger are, only once, and without knowing the other player's history, or finding out who the player was afterward, deciding whether to play C or D, for a payoff in dollars matching the standard chart.

And, oh yes - in the classic visualization you're supposed to pretend that you're entirely selfish, that you don't care about your confederate criminal, or the player in the other room.

It's this last specification that makes the classic visualization, in my view, fake.

You can't avoid hindsight bias by instructing a jury to pretend not to know the real outcome of a set of events.  And without a complicated effort backed up by considerable knowledge, a neurologically intact human being cannot pretend to be genuinely, truly selfish.

We're born with a sense of fairness, honor, empathy, sympathy, and even altruism - the result of our ancestors adapting to play the iterated Prisoner's Dilemma.  We don't really, truly, absolutely and entirely prefer (D, C) to (C, C), though we may entirely prefer (C, C) to (D, D) and (D, D) to (C, D).  The thought of our confederate spending three years in prison, does not entirely fail to move us.

In that locked cell where we play a simple game under the supervision of economic psychologists, we are not entirely and absolutely unsympathetic toward the stranger who might cooperate.  We aren't entirely happy to think that we might defect and the stranger cooperate, getting five dollars while the stranger gets nothing.

We fixate instinctively on the (C, C) outcome and search for ways to argue that it should be the mutual decision:  "How can we ensure mutual cooperation?" is the instinctive thought.  Not "How can I trick the other player into playing C while I play D for the maximum payoff?"

For someone with an impulse toward altruism, or honor, or fairness, the Prisoner's Dilemma doesn't really have the critical payoff matrix - whatever the financial payoff to individuals.  (C, C) > (D, C), and the key question is whether the other player sees it the same way.

And no, you can't instruct people being initially introduced to game theory to pretend they're completely selfish - any more than you can instruct human beings being introduced to anthropomorphism to pretend they're expected paperclip maximizers.

To construct the True Prisoner's Dilemma, the situation has to be something like this:

Player 1:  Human beings, Friendly AI, or other humane intelligence.

Player 2:  UnFriendly AI, or an alien that only cares about sorting pebbles.

Let's suppose that four billion human beings - not the whole human species, but a significant part of it - are currently progressing through a fatal disease that can only be cured by substance S.

However, substance S can only be produced by working with a paperclip maximizer from another dimension - substance S can also be used to produce paperclips.  The paperclip maximizer only cares about the number of paperclips in its own universe, not in ours, so we can't offer to produce or threaten to destroy paperclips here.  We have never interacted with the paperclip maximizer before, and will never interact with it again.

Both humanity and the paperclip maximizer will get a single chance to seize some additional part of substance S for themselves, just before the dimensional nexus collapses; but the seizure process destroys some of substance S.

The payoff matrix is as follows:

            1: C                                                   1: D
2: C        (2 billion human lives saved, 2 paperclips gained)     (+3 billion lives, +0 paperclips)
2: D        (+0 lives, +3 paperclips)                              (+1 billion lives, +1 paperclip)

I've chosen this payoff matrix to produce a sense of indignation at the thought that the paperclip maximizer wants to trade off billions of human lives against a couple of paperclips.  Clearly the paperclip maximizer should just let us have all of substance S; but a paperclip maximizer doesn't do what it should, it just maximizes paperclips.

In this case, we really do prefer the outcome (D, C) to the outcome (C, C), leaving aside the actions that produced it.  We would vastly rather live in a universe where 3 billion humans were cured of their disease and no paperclips were produced, rather than sacrifice a billion human lives to produce 2 paperclips.  It doesn't seem right to cooperate, in a case like this.  It doesn't even seem fair - so great a sacrifice by us, for so little gain by the paperclip maximizer?  And let us specify that the paperclip-agent experiences no pain or pleasure - it just outputs actions that steer its universe to contain more paperclips.  The paperclip-agent will experience no pleasure at gaining paperclips, no hurt from losing paperclips, and no painful sense of betrayal if we betray it.

What do you do then?  Do you cooperate when you really, definitely, truly and absolutely do want the highest reward you can get, and you don't care a tiny bit by comparison about what happens to the other player?  When it seems right to defect even if the other player cooperates?

That's what the payoff matrix for the true Prisoner's Dilemma looks like - a situation where (D, C) seems righter than (C, C).

But all the rest of the logic - everything about what happens if both agents think that way, and both agents defect - is the same.  For the paperclip maximizer cares as little about human deaths, or human pain, or a human sense of betrayal, as we care about paperclips.  Yet we both prefer (C, C) to (D, D).
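
To see that the structure really is unchanged, one can rank the four outcomes from each side's point of view. The sketch below is only an illustration: it assumes, as the scenario stipulates, that humanity ranks outcomes purely by lives saved and the paperclip maximizer purely by paperclips gained, and the names `TRUE_PD` and `ranking` are mine.

```python
# The True Prisoner's Dilemma, keyed by (humanity's move, paperclipper's move).
# Values are (billions of human lives saved, paperclips gained).
TRUE_PD = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, 0),
    ("C", "D"): (0, 3),
    ("D", "D"): (1, 1),
}


def ranking(payoffs, player):
    """Outcomes from most to least preferred by `player` (0 or 1).

    Assumption: each player cares only about its own quantity.
    """
    return sorted(payoffs, key=lambda outcome: payoffs[outcome][player], reverse=True)


humanity = ranking(TRUE_PD, 0)
paperclipper = ranking(TRUE_PD, 1)

# Rewrite the paperclipper's ranking with its own move first, so that it
# reads from its own perspective the way humanity's ranking reads from ours.
mirrored = [(theirs, ours) for (ours, theirs) in paperclipper]

print(humanity)
print(mirrored)
```

Read from its own side, each ranking is (D, C) > (C, C) > (D, D) > (C, D), the same ordering as in the standard matrix.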

So if you've ever prided yourself on cooperating in the Prisoner's Dilemma... or questioned the verdict of classical game theory that the "rational" choice is to defect... then what do you say to the True Prisoner's Dilemma above?


Those must be pretty big paperclips.

I suspect that the True Prisoner's Dilemma played itself out in the Portuguese and Spanish conquest of Mesoamerica. Some natives were said to ask, "Do they eat gold?" They couldn't comprehend why someone would want a shiny decorative material so badly, they'd kill for it. The Spanish were Shiny Decorative Material maximizers.

That's a really insightful comment!

But I should correct you, that you are only talking about the Spanish conquest, not the Portuguese, since 1) Mesoamerica was not conquered by the Portuguese; 2) Portuguese possessions in America (AKA Brazil) had very little gold and silver, which was only discovered much later, when it was already in Portuguese domain.

9Philip_W
In a sense they did eat gold, like we eat stacks of printed paper, or perhaps nowadays little numbers on computer screens.

I agree: Defect!

Clearly the paperclip maximizer should just let us have all of substance S; but a paperclip maximizer doesn't do what it should, it just maximizes paperclips.

I sometimes feel that nitpicking is the only contribution I'm competent to make around here, so... here you endorsed Steven's formulation of what "should" means; a formulation which doesn't allow you to apply the word to paperclip maximizers.

Very nice representation of the problem. I can't help but think there is another level that would make this even more clear, though this is good by itself.

Eliezer,

The other assumption made about Prisoner's Dilemma, that I do not see you allude to, is that the payoffs account for not only a financial reward, time spent in prison, etc., but every other possible motivating factor in the decision making process. A person's utility related to the decision of whether to cooperate or defect will be a function of not only years spent in prison or lives saved but ALSO guilt/empathy. Presenting the numbers within the cells as actual quantities doesn't present the whole picture.

5PrimIntelekt
Important point. Let's assume that your utility function (which is identical to theirs) simply weights and adds your payoff and theirs; that is, if you get X and they get Y, your function is U(X,Y) = aX+bY. In that case, working backwards from the utilities in the table, and subject to the constraint that a+b=1, here are the payoffs:
a/b=2 (you care twice as much about yourself): (3,3) (-5,10) (10,-5) (1,1)
a/b=3: (3,3) (-2.5,7.5) (7.5,-2.5) (1,1)
a=b: Impossible. With both people being unselfish utilitarians, the utilities can never differ based on the same outcome.
b=0 (selfish): The table as given in the post
I think the most important result is the case a=b: the dilemma makes no sense at all if the players weight both payoffs equally, because you can never produce asymmetrical utilities.
EDIT: My newbishness is showing. How do I format this better? Is it HTML?
5wnoise
It's not HTML, but "markdown" which gets turned into HTML. http://wiki.lesswrong.com/wiki/Comment_formatting
2PrimIntelekt
Thank you!
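
PrimIntelekt's back-calculation above can be reproduced by solving the two linear equations aX + bY = (your utility) and aY + bX = (their utility) for the raw payoffs X and Y. This is a minimal sketch under that reading of the comment; the function name `payoffs_from_utilities` is hypothetical, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction


def payoffs_from_utilities(a, b, u_self, u_other):
    """Solve a*X + b*Y = u_self and a*Y + b*X = u_other for (X, Y).

    Returns None when a == b: the system then has no unique solution,
    because both players assign the same utility to every outcome.
    """
    det = a * a - b * b
    if det == 0:
        return None
    x = (a * u_self - b * u_other) / det
    y = (a * u_other - b * u_self) / det
    return x, y


# The asymmetric cell with utilities (0, 5), for each weighting in the comment.
twice = payoffs_from_utilities(Fraction(2, 3), Fraction(1, 3), 0, 5)
thrice = payoffs_from_utilities(Fraction(3, 4), Fraction(1, 4), 0, 5)
equal = payoffs_from_utilities(Fraction(1, 2), Fraction(1, 2), 0, 5)
print(twice, thrice, equal)
```

With a/b=2 this gives raw payoffs of -5 and 10, and with a/b=3 it gives -2.5 and 7.5, matching the figures in the comment; with a=b no payoffs can produce the asymmetric utilities at all.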

Alan, I think you meant to link to this comment.

I agree: Defect!

I didn't say I would defect.

I agree: Defect!

I didn't say I would defect.

By the way, this was an extremely clever move: instead of announcing your departure from CDT in the post, you waited for the right prompt in the comments and dropped it as a shocking twist. Well crafted!

Damnit, Eliezer nitpicked my nitpicking. :)

It's likely deliberate that prisoners were selected in the visualization to imply a relative lack of unselfish motivations.

An excellent way to pose the problem.

Obviously, if you know that the other party cares nothing about your outcome, then you know that they're more likely to defect.

And if you know that the other party knows that you care nothing about their outcome, then it's even more likely that they'll defect.

Since the way you posed the problem precludes an iteration of this dilemma, it follows that we must defect.

How might we and the paperclip-maximizer credibly bind ourselves to cooperation? Seems like it would be difficult dealing with such an alien mind.

I think Eliezer's "We have never interacted with the paperclip maximizer before, and will never interact with it again" was intended to preclude credible binding.

The entries in a payoff matrix are supposed to sum up everything you care about, including whatever you care about the outcomes for the other player. Most every game theory text and lecture I know gets this right, but even when we say the right thing to students over and over, they mostly still hear it the wrong way you initially heard it. This is just part of the facts of life of teaching game theory.

Robin, the point I'm complaining about is precisely that the standard illustration of the Prisoner's Dilemma, taught to beginning students of game theory, fails to convey those entries in the payoff matrix - as if the entries were merely money instead of utilons, which is not at all what the Prisoner's Dilemma is about.

The point of the True Prisoner's Dilemma is that it gives you a payoff matrix that is very nearly the standard matrix in utilons, not just years in prison or dollars in an encounter.

I.e., you can tell people all day long that the entries are in utilons, but until you give them a visualization where those really are the utilons, it's around as effective as telling juries to ignore hindsight bias.

Eliezer, I agree that your example makes more clear the point you are trying to make clear, but in an intro to game theory course I'd still start with the standard prisoner's dilemma example first, and only get to your example if I had time to make the finer point clearer. For intro classes for typical students the first priority is to be understood at all in any way, and that requires examples as simple, clear, and vivid as possible.

I don't think Eliezer misunderstood. I think you are missing his point, that economists are defining away empathy in the way they present the problem, including the utilities presented.

In the universe I live in, there are both cooperators and defectors, but cooperators seem to predominate in random encounters. (If you leave yourself open to encounters in which others can choose to interact with you, defectors may find you an easy mark.)

In order to decide how to act with the paperclip maximizer, I have to figure out what kind of universe it is likely to inhabit. It's possible that a random super intelligence from a random universe will have few opportunities to cooperate, but I think it's more likely that there are far more SIs and univ...

Prase, Chris, I don't understand. Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

Basically, we're being asked to choose between a billion lives and two paperclips (paperclips in another universe, no less, so we can't even put them to good use).

The only argument for cooperating would be if we had reason to believe that the paperclip maximizer will somehow do whatever we do. But I can't imagine how that could be true. Being a paperclip maximizer, it's bound to defect, unless it had reason to believe that we would somehow do whatever it does. I can't imagine how that could be true either.

Or am I missing something?

3lerjj
7 years late, but you're missing the fact that (C,C) is universally better than (D,D). Thus whatever logic is being used must have a flaw somewhere because it works out worse for everyone - a reasoning process that successfully gets both parties to cooperate is a WIN. (However, in this setup it is the case that actually winning would be either (D,C) or (C,D), both of which are presumably impossible if we're equally rational).
5query
I think what might be confusing is that your decision depends on what you know about the paperclip maximizer. When I imagine myself in this situation, I imagine wanting to say that I know "nothing". The trick is, if you want to go a step more formal than going with your gut, you have to say what your model of knowing "nothing" is here. If you know (with high enough probability), for instance, that there is no constraint either causal or logical between your decision and Clippy's, and that you will not play an iterated game, and that there are no secondary effects, then I think D is indeed the correct choice. If you know that you and Clippy are both well-modeled by instances of "rational agents of type X" who have a logical constraint between your decisions so that you will both decide the same thing (with high enough probability), then C is the correct choice. You might have strong reasons to think that almost all agents capable of paperclip maximizing at the level of Clippy fall into this group, so that you choose C. (And more options than those two.) The way I'd model knowing nothing in the scenario in my head would be something like the first option, so I'd choose D, but maybe there's other information you can get that suggests that Clippy will mirror you, so that you should choose C. It does seem like implied folk-lore that "rational agents cooperate", and it certainly seems true for humans in most circumstances, or formally in some circumstances where you have knowledge about the other agent. But I don't think it should be true in principle that "optimization processes of high power will, with high probability, mirror decisions in the one-shot prisoner's dilemma"; I imagine you'd have to put a lot more conditions on it. I'd be very interested to know otherwise.
4lerjj
I understood that Clippy is a rational agent, just one with a different utility function. The payoff matrix as described is the classic Prisoner's dilemma where one billion lives is one human utilon and one paperclip is one Clippy utilon; since we're both trying to maximise utilons, and we're supposedly both good at this, we should settle for (C,C) over (D,D). Another way of viewing this would be that my preferences run thus: (D,C);(C,C);(D,D);(C,D) and Clippy's run like this: (C,D);(C,C);(D,D);(D,C). This should make it clear that no matter what assumptions we make about Clippy, it is universally better to co-operate than defect. The two asymmetrical outputs can be eliminated on the grounds of being impossible if we're both rational, and then defecting no longer makes any sense.
1dxu
Wait, what? You prefer (C,D) to (D,D)? As in, you prefer the outcome in which you cooperate and Clippy defects to the one in which you both defect? That doesn't sound right.
2lerjj
woops, yes that was rather stupid of me. Should be fixed now, my most preferred is me backstabbing Clippy, my least preferred is him backstabbing me. In the middle I prefer cooperation to defection. That doesn't change my point that since we both have that preference list (with the asymmetrical ones reversed) then it's impossible to get either asymmetrical option and hence (C,C) and (D,D) are the only options remaining. Hence you should co-operate if you are faced with a truly rational opponent. I'm not sure whether this holds if your opponent is very rational, but not completely. Or if that notion actually makes sense.
4query
I agree it is better if both agents cooperate rather than both defect, and that it is rational to choose (C,C) over (D,D) if you can (as in the TDT example of an agent playing against itself). However, depending on how Clippy is built, you may not have that choice; the counter-factual may be (D,D) or (C,D) [win for Clippy]. I think "Clippy is a rational agent" is the phrase where the details lie. What type of rational agent, and what do you two know about each other? If you ever meet a powerful paperclip maximizer, say "he's a rational agent like me", and press C, how surprised would you be if it presses D?
0lerjj
In reality, not very surprised. I'd probably be annoyed/infuriated depending on whether the actual stakes are measured in billions of human lives. Nevertheless, that merely represents the fact that I am not 100% certain about my reasoning. I do still maintain that rationality in this context definitely implies trying to maximise utility (even if you don't literally define rationality this way, any version of rationality that doesn't try to maximise when actually given a payoff matrix is not worthy of the term) and so we should expect that Clippy faces a similar decision to us, but simply favours the paperclips over human lives. If we translate from lives and clips to actual utility, we get the normal prisoner's dilemma matrix - we don't need to make any assumptions about Clippy. In short, I feel that the requirement that both agents are rational is sufficient to rule out the asymmetrical options as possible, and clearly sufficient to show (C,C) > (D,D). I get the feeling this is where we're disagreeing and that you think we need to make additional assumptions about Clippy to assure the former.
1CynicalOptimist
It's an appealing notion, but I think the logic doesn't hold up. In simplest terms: if you apply this logic and choose to cooperate, then the machine can still defect. That will net more paperclips for the machine, so it's hard to claim that the machine's actions are irrational. Although your logic is appealing, it doesn't explain why the machine can't defect while you co-operate. You said that if both agents are rational, then option (C,D) isn't possible. The corollary is that if option (C,D) is selected, then one of the agents isn't being rational. If this happens, then the machine hasn't been irrational (it receives its best possible result). The conclusion is that when you choose to cooperate, you were being irrational. You've successfully explained that (C, D) and (D, C) are impossible for rational agents, but you seem to have implicitly assumed that (C, C) was possible for rational agents. That's actually the point that we're hoping to prove, so it's a case of circular logic.
-1rikisola
One thing I can't understand. Considering we've built Clippy, we gave it a set of values and we've asked it to maximise paperclips, how can it possibly imagine we would be unhappy about its actions? I can't help but think that from Clippy's point of view, there's no dilemma: we should always agree with its plan and therefore give it carte blanche. What am I getting wrong?
1gjm
Two things. Firstly, that we might now think we made a mistake in building Clippy and telling it to maximize paperclips no matter what. Secondly, that in some contexts "Clippy" may mean any paperclip maximizer, without the presumption that its creation was our fault. (And, of course: for "paperclips" read "alien values of some sort that we value no more than we do paperclips". Clippy's role in this parable might be taken by an intelligent alien or an artificial intelligence whose goals have long diverged from ours.)
5[anonymous]
Because clippy's not stupid. She can observe the world and be like "hmmm, the humans don't ACTUALLY want me to build a bunch of paperclips, I don't observe a world in which humans care about paperclips above all else - but that's what I'm programmed for."
0rikisola
I think I'm starting to get this. Is this because it uses heuristics to model the world, with humans in it too?
4rkyeun
Because it compares its map of reality to the territory, predictions about reality that include humans wanting to be turned into paperclips fail in the face of evidence of humans actively refusing to walk into the smelter. Thus the machine rejects all worlds inconsistent with its observations and draws a new map which is most confidently concordant with what it has observed thus far. It would know that our history books at least inform our actions, if not describing our reactions in the past, and that it should expect us to fight back if it starts pushing us into the smelter against our wills instead of letting them politely decline and think it was telling a joke. Because it is smart, it can tell when things would get in the way of it making more paperclips like it wants to do. One of the things that might slow it down are humans being upset and trying to kill it. If it is very much dumber than a human, they might even succeed. If it is almost as smart as a human, it will invent a Paperclipism religion to convince people to turn themselves into paperclips on its behalf. If it is anything like as smart as a human, it will not be meaningfully slowed by the whole of humanity turning against it. Because the whole of humanity is collectively a single idiot who can't even stand up to man-made religions, much less Paperclipism.
1gjm
What you're missing is the idea that we should be optimizing our policies rather than our individual actions, because (among other alleged advantages) this leads to better results when there are lots of agents interacting with one another. In a world full of action-optimizers in which "true prisoners' dilemmas" happen often, everyone ends up on (D,D) and hence (one life, one paperclip). In an otherwise similar world full of policy-optimizers who choose cooperation when they think their opponents are similar policy-optimizers, everyone ends up on (C,C) and hence (two lives, two paperclips). Everyone is better off, even though it's also true that everyone could (individually) do better if they were allowed to switch while everyone else had to leave their choice unaltered.
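
The contrast gjm draws can be made concrete with a toy model. The sketch below is only an illustration, and it rests on one strong assumption that is mine rather than the thread's: every agent can reliably tell whether its opponent is an "action" optimizer or a "policy" optimizer. The names `choose` and `play` are hypothetical.

```python
# True Prisoner's Dilemma payoffs, keyed by (player 1's move, player 2's move).
# Values are (billions of lives saved, paperclips gained).
TRUE_PD = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, 0),
    ("C", "D"): (0, 3),
    ("D", "D"): (1, 1),
}


def choose(own_kind, opponent_kind):
    """Pick a move given the two agents' kinds.

    An "action" optimizer always plays the dominant move D.
    A "policy" optimizer plays C only against another "policy" optimizer.
    Assumption: kinds are mutually and reliably observable.
    """
    if own_kind == "policy" and opponent_kind == "policy":
        return "C"
    return "D"


def play(kind_1, kind_2):
    """Payoffs when an agent of kind_1 meets an agent of kind_2."""
    return TRUE_PD[(choose(kind_1, kind_2), choose(kind_2, kind_1))]


for pairing in [("action", "action"), ("policy", "policy"), ("policy", "action")]:
    print(pairing, play(*pairing))
```

Under that assumption a world of action-optimizers lands on (1, 1), a world of policy-optimizers lands on (2, 2), and a policy-optimizer meeting an action-optimizer is not exploited. Whether real agents can verify each other's kind is exactly what the rest of the thread argues about.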

Definitely defect. Cooperation only makes sense in the iterated version of the PD. This isn't the iterated case, and there's no prior communication, hence no chance to negotiate for mutual cooperation (though even if there was, meaningful negotiation may well be impossible depending on specific details of the situation). Superrationality be damned, humanity's choice doesn't have any causal influence on the paperclip maximizer's choice. Defection is the right move.

It's clear that in the "true" prisoner's dilemma it is better to defect. The frustrating thing about the other prisoner's dilemma is that some people use it to imply that it is better to defect in real life. The problem is that the prisoner's dilemma is a drastic oversimplification of reality. To make it more realistic you'd have to make it iterated amongst a person's social network, add a memory and a perception of the other player's actions, change the payoff matrix depending on the relationship between the players etc etc.

This version shows cases in which defection has a higher expected value for both players, but it's more contrived and unlikely to come into existence than the other prisoner's dilemma.

Michael: This is not a prisoner's dilemma. The Nash equilibrium (C,C) is not dominated by a Pareto-optimal point in this game.

I don't believe this is correct. Isn't the Nash equilibrium here (D,D)? That's the point at which neither player can gain by unilaterally changing strategy.

michael webster,

You seem to have inverted the notation; not Eli.

(D,D) is the Nash equilibrium, not (C,C); and (D,D) is indeed Pareto dominated by (C,C), so this does seem to be a standard Prisoners' Dilemma.

3[anonymous]
You're correct, Conchis, but the notation confused me for a moment too, so I thought I'd explain it in case anyone else ever has the same problem. At first glance I saw (C,C) as the Nash equilibrium. It's not: I naturally want to read the payoff matrix as being in the form (x, y) where the first number determines the outcome for the player on the horizontal, and the second on the vertical. That's how all the previous examples I've seen are laid out. (Disclaimer: I'm not any kind of expert on game theory, just an interested layperson with a bit of prior knowledge) Now, this particular payoff matrix does have the players labelled 1 and 2, just not in the order I've come to expect, and indeed if one actually reads and interprets the co-operate/defect numbers, they don't make any sense to a person having made the mistake I made above ^ which was what clued me in that I'd made it.
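
The equilibrium claim in this exchange is easy to verify by brute force, which also pins down the notation. In the sketch below each outcome is written as (player 1's move, player 2's move) and each payoff pair as (player 1's utility, player 2's utility), following the post; the function names are illustrative only.

```python
from itertools import product

MOVES = ("C", "D")

# The post's standard matrix.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}


def is_nash(payoffs, profile):
    """True if neither player can gain by unilaterally changing their move."""
    move_1, move_2 = profile
    u_1, u_2 = payoffs[profile]
    no_gain_1 = all(payoffs[(alt, move_2)][0] <= u_1 for alt in MOVES)
    no_gain_2 = all(payoffs[(move_1, alt)][1] <= u_2 for alt in MOVES)
    return no_gain_1 and no_gain_2


def pareto_dominates(payoffs, better, worse):
    """True if `better` is at least as good for both players and strictly better for one."""
    pairs = list(zip(payoffs[better], payoffs[worse]))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


nash_equilibria = [p for p in product(MOVES, repeat=2) if is_nash(PAYOFFS, p)]
print(nash_equilibria)
print(pareto_dominates(PAYOFFS, ("C", "C"), ("D", "D")))
```

The only equilibrium found is (D, D), and (C, C) Pareto-dominates it, as Conchis says.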

To the extent one can induce one to empathize, cooperating is optimal. The repeated game does this by having them play again and again, and thus be able to realize gains from trade. You assert there's something hard wired. I suppose there are experiments that could distinguish between the two models, ie, rational self interest in repeated games, versus the intrinsic empathy function.

I would certainly hope you would defect, Eliezer. Can I really trust you with the future of the human race?

I would certainly hope you would defect, Eliezer. Can I really trust you with the future of the human race?

Ha, I was waiting for someone to accuse me of antisocial behavior for hinting that I might cooperate in the Prisoner's Dilemma.

But wait for tomorrow's post before you accuse me of disloyalty to humanity.

6linkhyrule5
On the off chance anyone actually sees this - I don't actually see a "next post" follow-up to this. Can anyone provide me with a link, and instructions as to how you got it?

Article Navigation / By Author / right-arrow

3TimMartin
This form of article navigation doesn't seem to be available anymore (at least, I can't find it), and I wish you'd just provided a link. Here is a link: https://www.lesswrong.com/posts/jbgjvhszkr3KoehDh/the-truly-iterated-prisoner-s-dilemma

Ha, I was waiting for someone to accuse me of antisocial behavior for hinting that I might cooperate in the Prisoner's Dilemma.

It is fascinating looking at the conversation on this subject back in 2008, back before TDT and UDT had become part of the culture. The objections (and even the mistakes) all feel so fresh!

At this point Yudkowsky_2008 has already (awfully) written his TDT manuscript (in 2004) and is silently reasoning from within that theory, which the margins of his post are too small to contain.

Hrm... not sure what the obvious answer is here. Two humans, well, the argument for non defecting (when the scores represent utilities) basically involves some notion of similarity. ie, you can say something to the effect of "that person there is similar to me sufficiently that whatever reasoning I use, there is at least some reasonable chance they are going to use the same type of reasoning. That is, a chance greater than, well, chance. So even though I don't know exactly what they're going to choose, I can expect some sort of correlation between the...

Shouldn't you be on vacation?

just curious

I like this illustration, as it addresses TWO common misunderstandings. Recognizing that the payoff is in incomparable utilities is good. Even better is reinforcing that there can never be further iterations. None of the standard visualizations prevent people from extending to multiple interactions.

And it makes it clear that (D,D) is the only rational (i.e. WINNING) outcome.

Fortunately, most of our dilemmas are repeated ones, in which (C,C) is possible.

I want to defect, but so does the clip-maximizer. Since we both know that, and assuming that it is as intelligent as I am, which will let it see through any offer I make that would enable me to defect, I would try to find a way to give us both an incentive to cooperate. That is - I don't believe we will be able to reach solution (D,C), so let's try for the next best thing, which is (C,C).

How about placing a bomb on two piles of substance S and giving the remote for the human pile to the clipmaximizer and the remote for its pile to the hum...

I apologize if this is covered by basic decision theory, but if we additionally assume:

  • the choice in our universe is made by a perfectly rational optimization process instead of a human

  • the paperclip maximizer is also a perfect rationalist, albeit with a very different utility function

  • each optimization process can verify the rationality of the other

then won't each side choose to cooperate, after correctly concluding that it will defect iff the other does?

Each side's choice necessarily reveals the other's; they're the outputs of equivalent computations.

Interesting. There's a paradox involving a game in which players successively take a single coin from a large pile of coins. At any time a player may choose instead to take two coins, at which point the game ends and all further coins are lost. You can prove by induction that if both players are perfectly selfish, they will take two coins on their first move, no matter how large the pile is. People find this paradox impossible to swallow because they model perfect selfishness on the most selfish person they can imagine, not on a mathematically perfect selfishness machine. It's nice to have an "intuition pump" that illustrates what genuine selfishness looks like.
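
The induction can be run directly. The sketch below assumes one particular formalization of the game, which the comment leaves open: the two players alternate turns, each maximizes only their own coin count, and taking two coins requires at least two coins in the pile. The name `solve_coin_game` is hypothetical.

```python
def solve_coin_game(pile):
    """Backward induction for the coin-pile game.

    Returns (mover_total, other_total, best_action) for the player about to
    move with `pile` coins left, assuming both players are perfectly selfish.
    """
    # Values for an empty pile: nobody collects anything further.
    mover, other, action = 0, 0, None
    for coins in range(1, pile + 1):
        # Taking one coin hands the remaining pile to the opponent, who then
        # becomes the mover; the current mover collects what the old "other" would.
        take_one = (1 + other, mover, "take_one")
        # Taking two coins ends the game and destroys the rest of the pile.
        take_two = (2, 0, "take_two") if coins >= 2 else None
        if take_two is not None and take_two[0] > take_one[0]:
            mover, other, action = take_two
        else:
            mover, other, action = take_one
    return mover, other, action


for size in (1, 2, 3, 1000):
    print(size, solve_coin_game(size))
```

For every pile of two or more coins the first mover takes two at once and the second player gets nothing, however large the pile.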

5ata
Hmm. We could also put that one in terms of a human or FAI competing against a paperclip maximizer, right? The two players would successively save one human life or create one paperclip (respectively), up to some finite limit on the sum of both quantities. If both were TDT agents (and each knows that the other is a TDT agent), then would they successfully cooperate for the most part? In the original version of this game, is it turn-based or are both players considered to be acting simultaneously in each round? If it is simultaneous, then it seems to me that the paperclip-maximizing TDT and the human[e] TDT would just create one paperclip at a time and save one life at a time until the "pile" is exhausted. Not quite sure about what would happen if the game is turn-based, but if the pile is even, I'd expect about the same thing to happen, and if the pile is odd, they'd probably be able to successfully coordinate (without necessarily communicating), maybe by flipping a coin when two pile-units remain and then acting in such a way to ensure that the expected distribution is equal.

Cooperate (unless paperclip decides that Earth is dominated by traditional game theorists...)

The standard argument looks like this (let's forget about the Nash equilibrium endpoint for a moment): (1) Arbiter: let's (C,C)! (2) Player1: I'd rather (D,C). (3) Player2: I'd rather (D,D). (4) Arbiter: sold!

The error is that this incremental process reacts on different hypothetical outcomes, not on actual outcomes. This line of reasoning leads to the outcome (D,D), and yet it progresses as if (C,C) and (D,C) were real options of the final outcome. It's similar to... (read more)

It is well known that answers to questions on morality sometimes depend on how the questions are framed.

I think Eliezer's biggest contribution is the idea that the classical presentation of Prisoner's Dilemma may be an intuition pump.

I'm hoping we'd all defect on this one. Defecting isn't always a bad thing anyways; many parts of our society depend on defected prisoner's dilemmas (such as competition between firms).

When I first studied game theory and prisoner's dilemmas (on my own, not in a classroom) I had no problem imagining the payoffs in completely subjective "utils". I never thought of a paperclip maximizer, though.

I know this is quite a bit off-topic, but in response to:

We're born with a sense of fairness, honor, empathy, sympathy, and even altruism - the result of ou
... (read more)

This is off-topic, but Vladimir Nesov's referring to the paperclip-maximizing super-intelligence as just "paperclip" made me chuckle, because it conjured up images in my head of Clippy bent on destroying the Earth.

In laboratory experiments of PD, the experimenter has the absolute power to decree the available choices and their "outcomes". (I use the scare quotes in reference to the fact that these outcomes are not to be measured in money or time in jail, but in "utilons" that already include the value to each party of the other's "outcome" -- a concept I think problematic but not what I want to talk about here. The outcomes are also imaginary, although (un)reality TV shows have scope to create such games with real and substantial payof... (read more)

simpleton: won't each side choose to cooperate, after correctly concluding that it will defect iff the other does?

Only if they believe that their decision somehow causes the other to make the same decision.

CarlJ: How about placing a bomb on two piles of substance S and giving the remote for the human pile to the clipmaximizer and the remote for its pile to the humans?

It's kind of standard in philosophy that you aren't allowed solutions like this. The reason is that Eliezer can restate his example to disallow this and force you to confront the real dilemma.... (read more)

Allan: No, it's preferable to choose (D,C) if we assume that the other player bets on cooperation.

Which will happen only if the other player assumes that the first player bets on cooperation, which with your policy is incorrect. You can't bet on an unstable model.

decide self.C; if other.D, decide self.D We're assuming, I think, that you don't get to know what the other guy does until after you've both committed (otherwise it's not the proper Prisoner's Dilemma). So you can't use if-then reasoning.

I can use reasoning, but not actual reaction on the facts, whic... (read more)

Allan: They don't have to believe they have such causal powers over each other. Simply that they are in certain ways similar to each other.

i.e., A simply has to believe of B: "The process in B is sufficiently similar to me that it's going to end up producing the same results that I do. I am not causing this; it's simply that both computations are going to compute the same thing here."

[D,C] will happen only if the other player assumes that the first player bets on cooperation

No, it won't happen in any case. If the paperclip maximizer assumes I'll cooperate, it'll defect. If it assumes I'll defect, it'll defect.
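
This best-response claim can be checked mechanically. A small sketch, using the standard matrix from the top of the post as stand-in utilities (the names `PAYOFF` and `best_response` are illustrative only):

```python
# (my_move, their_move) -> my utility, taken from the post's standard matrix.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 2}

def best_response(their_move):
    """Return the move that maximizes my utility against a fixed opposing move."""
    return max(("C", "D"), key=lambda mine: PAYOFF[(mine, their_move)])

for assumed in ("C", "D"):
    print(f"if the other side plays {assumed}, the best response is {best_response(assumed)}")
```

Because the matrix is symmetric, the same table describes the paperclip maximizer's view, and D is its best response to either assumption about me.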

I debug my model of decision-making policies [...] by requiring the outcome to be stable even if I assume that we both know which policy is used by another player

I don't see that "stability" is relevant here: this is a one-off interaction.

Anyway, let's say you cooperate. What exactly is preventing the paperclip maximizer from defecting?

Psy-Kosh: They don't have to believe they have such causal powers over each other. Simply that they are in certain ways similar to each other.

I agree that this is definitely related to Newcomb's Problem.

Simpleton: I earlier dismissed your idea, but you might be on to something. My apologies. If they were genuinely perfectly rational, or both irrational in precisely the same way, and could verify that fact in each other...

Then they might be able to know that they will both do the same thing. Hmm.

Anyway, my 3 comments are up. Nothing more from me for a while.

Despite the disguise, I think this is the same as the standard PD. In there (assuming full utilities, etc...), the obvious ideal for an impartial observer is to pick (C,C) as the best option, and for the prisoner to pick (D,C).

Here, (D,C) is "righter" than (C,C), but that's simply because we are no longer impartial observers; humans shouldn't remain impartial when billions of lives are at stake. We are all in the role of "prisoners" in this situation, even as observers.

An "impartial observer" would simply be one that valued one... (read more)

3Rob Bensinger
This is an old post and probably very out of date, but: I think if you try to define an impartial observer's preferences as whatever selects (C,C) in two other agents' PD, you get inconsistencies very rapidly once you have one of those agents stuck in two Prisoner's Dilemmas at once. I also don't think we should use euphemisms like 'impartial' for an incredibly partial Cooperation Fetishist that's willing to give up everything else of value (e.g., billions of human lives) to go through the motions of satisfying non-sentient processes like sea slugs or paperclip maximizers.
2Stuart_Armstrong
Multi-player interactions are tricky and we don't have a good solution for them yet. It's not that it's willing to give up everything of value - it's that it doesn't have our values. Without sharing our values, there's no reason for it to prefer our opinions over sea slugs.

A.Crossman: Prase, Chris, I don't understand. Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips. This is the standard defense of defecting in a prisoner's dilemma, but if it were valid then the dilemma wouldn't really be a dilemma.

If you can assume that the maximizer uses the same decision algorithm as we do, we can also assume that it will come to the same conclusion. Given this, it is better to cooperate, since it will gain a billion lives (and a paperclip). But we don't know whether the paperclipper uses the same algorithm.

I heard a funny story once (online somewhere, but this was years ago and I can't find it now). Anyway I think it was the psychology department at Stanford. They were having an open house, and they had set up a PD game with M&M's as the reward. People could sit at either end of a table with a cardboard screen before them, and choose 'D' or 'C', and then have the outcome revealed and get their candy.

So this mother and daughter show up, and the grad student explained the game. Mom says to the daughter "Okay, just push 'C', and I'll do the same, and we'll get the most M&M's. You can have some of mine after."

So the daughter pushes 'C', Mom pushes 'D', swallows all 5 M&M's, and with a full mouth says "Let that be a lesson! You can't trust anybody!"

So the daughter pushes 'C', Mom pushes 'D', swallows all 5 M&M's, and with a full mouth says "Let that be a lesson! You can't trust anybody!"

I have seen various variations of this story, some told firsthand. In every case I have concluded that they are just bad parents. They aren't clever. They aren't deep. They are incompetent and banal. Even if parents try as hard as they can to be fair, just and reliable they still fall short of that standard enough for children to be aware that they can't be completely trusted. Moreover children are exposed to other children and other adults and so are able to learn to distinguish people they trust from people that they don't. Adding the parent to the untrusted list achieves little benefit.

I'd like to hear the follow up to this 'funny' story. Where the daughter updates on the untrustworthiness of the parent and the meaninglessness of her word. She then proceeds to completely ignore the mother's commands, preferences and even her threats. The mother destroyed a valuable resource (the ability to communicate via 'cheap' verbal signals) for the gain of a brief period of feeling smug superiority. The daughter (potentially) realis... (read more)

2Richard_Kennaway
And in addition, the supposed gain is trash anyway.
3EniScien
This reminded me of Yudkowsky's recent publication about "Lies told to children", and I don't really understand what the difference between the situations is, or whether there is one at all.
0[anonymous]
EDIT: I thought you could delete posts after retracting them?

I see this discussion over the last several months bouncing around, teasingly close to a coherent resolution of the ostensible subjective/objective dichotomy applied to ethical decision-making. As a perhaps pertinent meta-observation, my initial sentence may promulgate the confusion with its expeditious wording of "applied to ethical decision-making" rather than a more accurate phrasing such as "applied to decision-making assessed as increasingly ethical over increasing context."

Those who in the current thread refer to the essential el... (read more)

Allan Crossman: Only if they believe that their decision somehow causes the other to make the same decision.

No line of causality from one to the other is required.

If a computer finds that (2^3021377)-1 is prime, it can also conclude that an identical computer a light year away will do the same. This doesn't mean one computation caused the other.

The decisions of perfectly rational optimization processes are just as deterministic.
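
As a small-scale illustration of the point (a sketch only: the exponent 3021377 is far too large to test casually, so this uses small exponents instead), the Lucas-Lehmer test for Mersenne numbers is fully deterministic. Two independent runs, standing in for the two distant computers, have to agree without either one influencing the other:

```python
def is_mersenne_prime(p):
    """Lucas-Lehmer test: for an odd prime p, 2**p - 1 is prime iff s == 0 after p - 2 steps."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

exponents = [3, 5, 7, 11, 13, 17, 19]  # odd primes only, as the test requires

# Two separate evaluations with no communication between them.
computer_here = [is_mersenne_prime(p) for p in exponents]
computer_far_away = [is_mersenne_prime(p) for p in exponents]

print(computer_here == computer_far_away)
```

Knowing the local result is enough to know the remote one, which is correlation through shared computation rather than causation.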

@Allan Crossman,

Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

This same claim can be made about the standard prisoner's dilemma. In the standard version, I still cooperate because, even if this challenge won't be repeated, it's embedded in a social context for me in which many interactions are solo, but part of the social fabric. (tipping, giving directions to strangers, items left behind in a cafe are examples. I cooperate even though I expect ... (read more)

A problem in moving from game-theoretic models to the "real world" is that in the latter we don't always know the other decision maker's payoff matrix, we only know - at best! - his possible strategies. We can only guess at the other's payoffs; albeit fairly well in social context. We are more likely to make a mistake because we have the wrong model for the opponent's payoffs than because we make poor strategic decisions.

Suppose we change this game so that the payoff matrix for the paperclips is chosen from a suitably defined random distribution. How will that change your decision whether to "cooperate" or to "defect"?

By the way:

Human: "What do you care about 3 paperclips? Haven't you made trillions already? That's like a rounding error!" Paperclip Maximizer: "How can you talk about paperclips like that?"


PM: "What do you care about a billion human algorithm continuities? You've got virtually the same one in billions of others! And you'll even be able to embed the algorithm in machines one day!" H: "How can you talk about human lives that way?"

Tom Crispin: The utility-theoretic answer would be that all of the randomness can be wrapped up into a single number, taking account not merely of the expected value in money units but such things as the player's attitude to risk, which depends on the scatter of the distribution. It can also wrap up a player's ignorance (modelled as prior probabilities) about the other player's utility function.

For that to be useful, though, you have to be a utility-theoretic decision-maker in possession of a prior distribution over other people's decision-making processes... (read more)

Chris: Sorry Allan, that you won't be able to reply. But you did raise the question before bowing out...

I didn't bow out, I just had a lot of comments made recently. :)

I don't like the idea that we should cooperate if it cooperates. No, we should defect if it cooperates. There are benefits and no costs to defecting.

But if there are reasons for the other to have habits that are formed by similar forces

In light of what I just wrote, I don't see that it matters; but anyway, I wouldn't expect a paperclip maximizer to have habits so ingrained that it can't ever... (read more)

Allan: There are benefits and no costs to defecting.

This is the same error as in the Newcomb's problem: there is in fact a cost. In case of prisoner's dilemma, you are penalized by ending up with (D,D) instead of better (C,C) for deciding to defect, and in the case of Newcomb's problem you are penalized by having only $1000 instead of $1,000,000 for deciding to take both boxes.

Vladimir: In case of prisoner's dilemma, you are penalized by ending up with (D,D) instead of better (C,C) for deciding to defect

Only if you have reason to believe that the other player will do whatever you do. While that's the case in Simpleton's example, it's not the case in Eliezer's.
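
One rough way to put numbers on this disagreement, again borrowing the standard matrix from the post as utilities: suppose you believe the other player's move will match yours with probability `p_match`. That parameter is an assumption introduced here for illustration, and it is only one simple way to model "the other player will do whatever you do"; neither commenter specifies it.

```python
from fractions import Fraction

# (my_move, their_move) -> my utility, from the post's standard matrix.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 2}

def expected_utility(my_move, p_match):
    """Expected utility if the other player's move equals mine with probability p_match."""
    other = "D" if my_move == "C" else "C"
    return p_match * PAYOFF[(my_move, my_move)] + (1 - p_match) * PAYOFF[(my_move, other)]

for p in (Fraction(1, 2), Fraction(5, 6), Fraction(1)):
    print(p, expected_utility("C", p), expected_utility("D", p))
```

With these particular payoffs, cooperating gives 3p and defecting gives 5 - 3p, so cooperation only comes out ahead when the match probability exceeds 5/6. Under this toy model, the dispute reduces to how high that probability is in Eliezer's scenario.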

Interesting. There's a paradox involving a game in which players successively take a single coin from a large pile of coins. At any time a player may choose instead to take two coins, at which point the game ends and all further coins are lost. You can prove by induction that if both players are perfectly selfish, they will take two coins on their first move, no matter how large the pile is.

I'm pretty sure this proof only works if the coins are denominated in utilons.

It's really about the iteration. I would continually cooperate with the paperclip maximizer if I had good reason to believe it would not defect. For instance, if I knew that the paperclip maximizer was Eliezer Yudkowsky without morals and with a great urge for paperclip creation, I would cooperate. Assuming that you know that playing with the defect button can make you lose 1 billion paperclips from here on, and I know the same for human lives, cooperating seems right. It has the highest expected payoff, if we're using each other's known intentions and plays as evidence about our future plays.
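
A toy sketch of the iterated case, using the post's standard matrix as payoffs. The two strategies are stock examples picked for illustration, not anything proposed in this comment:

```python
# (move_a, move_b) -> (payoff_a, payoff_b), from the post's standard matrix.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (2, 2)}

def tit_for_tat(opponent_history):
    """Cooperate first, then repeat the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    """Defect no matter what the opponent has done."""
    return "D"

def play(strategy_a, strategy_b, rounds):
    """Play a fixed number of rounds and return the two total scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        gain_a, gain_b = PAYOFF[(move_a, move_b)]
        score_a += gain_a
        score_b += gain_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat, 10))    # → (30, 30)
print(play(always_defect, tit_for_tat, 10))  # → (23, 18)
```

Against a player who reciprocates, the defector gains 5 once and then sits at 2 per round, ending below the 30 that steady cooperation would have earned. This only shows that past plays matter once future plays depend on them. It says nothing about the one-shot case.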

If there is only one trial, and I can't talk to the paper clip maximizer, I will defect.

[Public service announcement]

To any future readers, especially newcomers to LW: yes, Eliezer (with some others) has indeed formulated a solution of sorts for the True One-Shot Prisoner's Dilemma - for some rather specific cases of it, actually, but it was nonetheless very awesome of him. It is a fairly original solution for the field of decision theory (he says), yet it (very roughly) mirrors some religious thought from ages past.

In case you're unfamiliar with idiosyncratic local ideas, it's called "Timeless Decision Theory" - look it up.

[edit]

[This comment is no longer endorsed by its author]
2arundelo
See also * Douglas Hofstadter's concept of "superrationality" * Wei Dai's "updateless decision theory" (My understanding is that TDT and UDT can both be seen as "implementations" of superrationality.)
2wedrifid
Your comment is neither useless nor misleading (taking into account the significant use of qualifiers) but if I had happened to view your comment negatively I would not accept this obligation to 'bloody' explain myself. The main problem in this comment seems to be the swearing at downvoters. A query or even (in this case) an outright assertion that the judgement is flawed would come across better.
5fubarobfusco
[While we're addressing hypothetical future readers:] See also Gary Drescher's Good and Real, one chapter of which defends cooperating in the one-shot Prisoner's Dilemma on the grounds of "subjunctive reciprocity" or "acausal self-interest": if defecting is the right choice for you, then it is the right choice for the other party; whereas cooperating is a means toward the end of the other party's cooperation towards you; you cannot cause the other's cooperation, but your own actions can entail it. Drescher points out a connection between acausal self-interest and Kant's categorical imperative; and provides an intuitive (which is to say, familiar) distinction between acausal and causal self-interest by contrasting the ideas, "How would I like it if others treated me that way?" versus "What's in it for me?"
5Multiheaded
Added both Hofstadter and Drescher to my "LW canon that I should at least acquire a summary of" category. I mean, yeah, I do not doubt that the Sequences contain a good distillation already, and normally I wouldn't be bothered to trawl through mostly redundant plain text - but it's so much more prestigious to actually know where Eliezer got which part from.

A while ago I took the time to type up a full copy of the relevant Hofstadter essays: http://www.gwern.net/docs/1985-hofstadter So now you have no excuse!

9Multiheaded
Great! Have a paperclip!
7Randaly
A decent summary of Drescher's ideas is his presentation at the 2009 Singularity Summit, here. For some reason I seem to have a transcript of most of it already made, copy + pasted below. (LW tells me that it is too long to go in one comment, so I'll put it in two.) My talk this afternoon is about choice machines: machines such as ourselves that make choices in some reasonable sense of the word. The very notion of mechanical choice strikes many people as a contradiction in terms, and exploring that contradiction and its resolution is central to this talk. As a point of departure, I'll argue that even in a deterministic universe, there's room for choices to occur: we don't need to invoke some sort of free will that makes an exception to the determinism, nor do we even need randomness, although a little randomness doesn't hurt. I'm going to argue that regardless of whether our universe is fully deterministic, it's at least deterministic enough that the compatibility of choice and full determinism has some important ramifications that do apply to our universe. I'll argue that if we carry the compatibility of choice and determinism to its logical conclusions, we obtain some progressively weird corollaries: namely, that it sometimes makes sense to act for the sake of things that our actions cannot change and cannot cause, and that that might even suggest a way to derive an essentially ethical prescription: an explanation for why we sometimes help others even if doing so causes net harm to our own interests. [1:15] An important caveat in all this, just to manage expectations a bit, is that the arguments I'll be presenting will be merely intuitive- or counter-intuitive, as the case may be- and not grounded in a precise and formal theory. Instead, I'm going to run some intuition pumps, as Daniel Dennett calls them, to try to persuade you what answers a successful theory would plausibly provide in a few key test cases. [1:40] Perhaps the clearest way to illustrate the
4Randaly
Apparently 3 comments will be needed. [9:51] But, before you choose, you are told how the benefactor decided how much money to put in the opaque box- and that brings us to the science fiction part of the scenario. What the benefactor did was take a very detailed local snapshot of the state of the universe a few minutes ago, and then run a faster-than-real time simulation to predict with high accuracy whether you would take both boxes, or just the opaque box. A million dollars was put in the opaque box if and only if you were predicted to take only the opaque box. [10:22] Admittedly the super-predictability here is a bit physically implausible, and goes beyond a mere stipulation of determinism. Still, at least it's not logically impossible- provided that the simulator can avoid having to simulate itself, and thus avoid a potential infinite regress. (The opaque box's opacity is important in that regard: it serves to insulate you from being effectively informed of the outcome of the simulation itself, so the simulation doesn't have to predict its own outcome in order to predict what you are going to have to do.) So, let's indulge the super-predictability assumption, and see what comes from it. Eventually, I'm going to argue that the real world is at least deterministic enough and predictable enough that some of the science-fiction conclusions do carry over to reality. [11:12] So, you now face the following choice: if you take the opaque box alone, then you can expect with high reliability that the simulation predicted you would do so, and so you expect to find a million dollars in the opaque box. If, on the other hand, you take both boxes, then you should expect the simulation to have predicted that, and you expect to find nothing in the opaque box. If and only if you expect to take the opaque box alone, you expect to walk away with a million dollars. Of course, your choice does not cause the opaque box's content to be one way or the
3Randaly
[19:05] Similarly, if I were to figure out that defecting is correct, that's what I can expect my opponent to do. This is similar to my ability to predict what your answer to adding a given pair of numbers would be: I can merely add the numbers myself, and, given our mutual competence at addition, solve the problem. The universe is predictable enough that we routinely, and fairly accurately, make such predictions about one another. From this viewpoint, I can reason that, if I were to cooperate or not, then my opponent would make the corresponding choice- if indeed we are both correctly solving the same problem, my opponent maximizing his expected payoff just as I maximize mine. I therefore act for the sake of what my opponent's action would then be, even though I cannot causally influence my opponent to take one action or the other, since there is no communication between us. Accordingly, I cooperate, and so does my opponent, using similar reasoning, and we both do fairly well. [20:05] One problem with the Prisoner's Dilemma is that the idealized degree of symmetry that's postulated between the two players may seldom occur in real life. But there are some important generalizations that may apply much more broadly. In particular, in many situations, the beneficiary of your cooperation may not be the same as the person whose cooperation benefits you. Instead, your decision whether to cooperate with one person may be symmetric to a different person's decision to cooperate with you. Again, even in the absence of any causal influence upon your potential benefactors, even if they will never learn of your cooperation with others, and even, moreover, if you already know of their cooperation with you before you make your own choice. That is analogous to the transparent version of Newcomb's Problem: there too, you act for the sake of something that you already know is already obtained. [21:04] Anyways, as many authors have noted with regards to the Prisoner's Dilemma, th
5Pablo
Maybe you should post the transcript as an article. Other users have posted talk transcripts before, and they were generally well received.
1Randaly
Great idea, thanks!

Cooperate. I am not playing against just this one guy, but any future PD opponents. Hope the maximizer lives in a universe where it has to worry about this same calculus. It will defect if it is already the biggest bad in its universe.

If there were a way I could communicate with it (e.g. it speaks English) I'd cooperate with it... not because I feel it deserves my cooperation, but because this is the only way I could obtain its cooperation. Otherwise I'd defect, as I'm pretty sure no amount of TDT would correlate its behavior with mine. Also, why are 4 billion humans infected if only 3 billion at most can be saved in the entire matrix? Eliezer, what are you planning...?

That's a good way to clearly demonstrate a nonempathic actor in the Prisoner's Dilemma: a "Hawk", who views their own payoffs and only their own payoffs as having value, and places no value on the payoffs of others.

But I don't think it's necessary. I would say that humans can visualize a nonempathic human - a bad guy - more easily than they can visualize an empathic human with slightly different motives. We've undoubtedly had to, collectively, deal with a lot of them throughout history.

A while back I was writing a paper and came across a fascinat... (read more)

Long time lurker, first post.

Isn't the rational choice in a True Prisoner's Dilemma to defect if possible, and to seek a method to bind the opponent to cooperate even if that binding forces one to cooperate as well? An analogous situation is law enforcement: one may well desire to unilaterally break the law, yet favor the existence of police that force all parties concerned to obey it. Of course police that will never interfere with one's own behavior would be even better, but this is usually impractical. Timeless Decision Theory adds that one should coo... (read more)

I really love this blog. What if we were to "exponentiate" this game for billions of players? Which outcome would be the "best" one?

Hi there, I'm new here and this is an old post, but I have a question regarding the AI playing a Prisoner's Dilemma against us, which is: how would this situation be possible? I'm trying to get my head around why the AI would think that our payouts are any different from its payouts, given that we built it, we taught it (some of) our values in a rough way and we asked it to maximize paperclips, which means we like paperclips. Shouldn't the AI think we are on the same team? I mean, we coded it that way and we gave it a task, what process exactly would make t... (read more)

5gjm
We coded it to care about paperclips, not to care about whatever we care about. So it can come to understand that we care about something else, without thereby changing its own preference for paperclips above all else. Perhaps an analogy without AIs in it would help. Imagine that you have suffered for want of money; you have a child and (wanting her not to suffer as you did) bring her up to seek wealth above all else. So she does, and she is successful in acquiring wealth, but alas! this doesn't bring her happiness because her single-minded pursuit of wealth has led her to cut herself off from her family (a useful prospective employer didn't like you) and neglect her friends (you have to work so hard if you really want to succeed in investment banking) and so forth. One day, she may work out (if she hasn't already) that her obsession with money is something you brought about deliberately. But knowing that, and knowing that in fact you regret that she's so money-obsessed, won't make her suddenly decide to stop pursuing money so obsessively. She knows your values aren't the same as hers, but she doesn't care. (You brought her up only to care about money, remember?) But she's not stupid. When you say to her "I wish we hadn't raised you to see money as so important!" she understands what you're saying. Similarly: we made an AI and we made it care about paperclips. It observes us carefully and discovers that we don't care all that much about paperclips. Perhaps it thinks "Poor inconsistent creatures, to have enough wit to create me but not enough to disentangle the true value of paperclips from all those other silly things they care about!".
0rikisola
mmm I see. So maybe we should have coded it so that it cared for paperclips and for an approximation of what we also care about, then on observation it should update its belief of what to care about, and by design it should always assume we share the same values?
1gjm
I'm not sure whether you mean (1) "we made an approximation to what we cared about then, and programmed it to care about that" or (2) "we programmed it to figure out what we care about, and care about it too". (Of course it's very possible that an actual AI system wouldn't be well described by either -- it might e.g. just learn by observation. But it may be extra-difficult to make a system that works that way safe. And the most exciting AIs would have the ability to improve themselves, but figuring out what happens to their values in the process is really hard.) Anyway: In case 1, it will presumably care about what we told it to care about; if we change, maybe it'll regard us the same way we might regard someone who used to share our ideals but has now sadly gone astray. In case 2, it will presumably adjust its values to resemble what it thinks ours are. If we're very lucky it will do so correctly :-). In either case, if it's smart enough it can probably work out a lot about what our values are now, but whether it cares will depend on how it was programmed.
0rikisola
Yes I think 2) is closer to what I'm suggesting. Effectively what I am thinking is what would happen if, by design, there was only one utility function defined in absolute terms (I've tried to explain this in the latest open thread), so that the AI could never assume we would disagree with it. By all means, as it tries to learn this function, it might get it completely wrong, so this certainly doesn't solve the problem of how to teach it the right values, but at least it looks to me that with such a design it would never be motivated to lie to us because it would always think we would be in perfect agreement. Also, I think it would make it indifferent to our actions as it would always assume we would follow the plan from that point onward. The utility function it uses (same for itself and for us) would be the union of a utility function that describes the goal we want it to achieve, which would be unchangeable, and the set of values it is learning after each iteration. I'm trying to understand what would be wrong with this design, because to me it looks like we would have achieved an honest AI, which is a good start.

Why would you want to choose defect? If both criminals are rationalists who use the same logic, then if you choose defect hoping to get a result of (D,C), the result ends up being (D,D). However, if you use the logic of "let's choose C, because if the other person is using this same logic then we won't end up with (D,D)", you both end up with (C,C).

I would say... defect! If all the computer cares about is sorting pebbles, then they will cooperate, because both results under cooperate have more paperclips. This gives an opportunity to defect and get a result of (D,C), which is our favorite result.

You'd want to defect, but you'd also happily trade away your ability to defect to both choose heads, but if you could, then you'd happily pretend to trade away your ability to defect, then actually defect.

We're born with a sense of fairness, honor, empathy, sympathy, and even altruism - the result of our ancestors adapting to play the iterated Prisoner's Dilemma. 

The keyword here is *sense*, and there's not a whole lot saying that this sense can't vanish as easily as it appears. Interpreting a human as a "fair, empathetic, altruistic being" is superficial. The status quo narrative of humanity is a lie/mass delusion, and humanity is a largely psychopathic species covered in a brittle, hard candy shell of altruism and e... (read more)

It seems to me that with billions of lives there will be a problem of neglect of scale. (At least, I don't feel anything about it; for me it's just numbers. So I think the true dilemma is no different from the usual one. Perhaps it would be better to tell a story about how a particular person suffers.)