
Two-boxing, smoking and chewing gum in Medical Newcomb problems

14 Caspar42 29 June 2015 10:35AM

I am currently learning about the basics of decision theory, most of which is common knowledge on LW. I have a question related to why EDT is said not to work.

Consider the following Newcomblike problem: A study shows that most people who two-box in Newcomblike problems such as the following have a certain gene (and one-boxers don't have the gene). Now, Omega could put you into something like Newcomb's original problem, but instead of having run a simulation of you, Omega has only looked at your DNA: If you don't have the "two-boxing gene", Omega puts $1M into box B; otherwise box B is empty. And there is $1K in box A, as usual. Would you one-box (take only box B) or two-box (take boxes A and B)? Here's a causal diagram for the problem:

[Causal diagram: two-boxing gene → your decision; two-boxing gene → Omega's DNA test → contents of box B]

Since Omega does not do much other than translate your genes into money under a box, it does not seem to hurt to leave Omega out:

[Simplified causal diagram: two-boxing gene → your decision; two-boxing gene → contents of box B]

I presume that most LWers would one-box. (And as I understand it, not only CDT but also TDT would two-box, am I wrong?)

Now, how does this problem differ from the smoking lesion or Yudkowsky's (2010, p.67) chewing gum problem? Chewing gum (or smoking) seems to be like taking box A to get the additional $1K; the two-boxing gene is like the CGTA gene; and the illness itself (the abscess or the lung cancer) is like not having $1M in box B. Here's another causal diagram, this time for the chewing gum problem:

[Causal diagram: CGTA gene → chewing gum; CGTA gene → throat abscess]

As far as I can tell, the difference between the two problems is some additional, unstated intuition in the classic medical Newcomb problems. Maybe the additional assumption is that the actual evidence lies in the "tickle", or that knowing and thinking about the study results causes some complications. In EDT terms: the intuition is that neither smoking nor chewing gum gives the agent additional information.
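
To make the EDT reading concrete, here is a rough sketch of how an evidential agent might score the two actions in the genetic Newcomb problem; the conditional probabilities are illustrative stand-ins for the study's correlation, not numbers given in the post:

P_gene_given_twobox = 0.9   # most two-boxers have the gene (assumed strength)
P_gene_given_onebox = 0.1   # most one-boxers don't (assumed strength)

def edt_value(action):
    # EDT conditions on the action: your choice is evidence about your gene.
    p_gene = P_gene_given_twobox if action == "two-box" else P_gene_given_onebox
    box_b = (1 - p_gene) * 1_000_000   # box B is full iff you lack the gene
    box_a = 1_000 if action == "two-box" else 0
    return box_a + box_b

for a in ("one-box", "two-box"):
    print(a, edt_value(a))   # one-box: 900000.0, two-box: 101000.0

On these numbers EDT one-boxes here, just as it refuses the gum or the cigarette in the medical problems, which is what makes the analogy puzzling.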

Why isn't the following decision theory optimal?

5 internety 16 April 2015 01:38AM

 

I've recently read the decision theory FAQ, as well as Eliezer's TDT paper. When reading the TDT paper, a simple decision procedure occurred to me which, as far as I can tell, gets the correct answer to every tricky decision problem I've seen. As discussed in the FAQ above, evidential decision theory gets the chewing gum problem wrong, causal decision theory gets Newcomb's problem wrong, and TDT gets counterfactual mugging wrong.

In the TDT paper, Eliezer postulates an agent named Gloria (page 29), who is defined as an agent who maximizes decision-determined problems. He describes how a CDT-agent named Reena would want to transform herself into Gloria. Eliezer writes:

By Gloria’s nature, she always already has the decision-type causal agents wish they had, without need of precommitment.

Eliezer then later goes on to develop TDT, which is supposed to construct Gloria as a byproduct.

Gloria, as we have defined her, is defined only over completely decision-determined problems of which she has full knowledge. However, the agenda of this manuscript is to introduce a formal, general decision theory which reduces to Gloria as a special case.

Why can't we instead construct Gloria directly, using the idea of the thing that CDT agents wished they were? Obviously we can't just postulate a decision algorithm that we don't know how to execute, and then note that a CDT agent would wish they had that decision algorithm, and pretend we had solved the problem. We need to be able to describe the ideal decision algorithm to a level of detail that we could theoretically program into an AI.

Consider this decision algorithm, which I'll temporarily call Nameless Decision Theory (NDT) until I get feedback about whether it deserves a name: you should always make the decision that a CDT-agent would have wished he had precommitted to, if he had previously known he'd be in his current situation and had the opportunity to precommit to a decision.

In effect, you are making a general precommitment to behave as if you had made all specific precommitments that would ever be advantageous to you.

NDT is so simple, and Eliezer comes so close to stating it in his discussion of Gloria, that I assume there is some flaw with it that I'm not seeing. Perhaps NDT does not count as a "real"/"well defined" decision procedure, or can't be formalized for some reason? Even so, it does seem like it'd be possible to program an AI to behave in this way.

Can someone give an example of a decision problem for which this decision procedure fails? Or for which there are multiple possible precommitments that you would have wished you'd made and it's not clear which one is best?

EDIT: I now think this definition of NDT better captures what I was trying to express: You should always make the decision that a CDT-agent would have wished he had precommitted to, if he had previously considered the possibility of his current situation and had the opportunity to costlessly precommit to a decision.
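
For what it's worth, here is a minimal sketch of that edited definition, assuming we are simply handed a CDT-style evaluator for hypothetical precommitments (eu_of_precommitment is hypothetical, and hides all the hard parts):

def ndt_decide(situation, actions, eu_of_precommitment):
    # eu_of_precommitment(s, a) stands in for the CDT calculation: the
    # expected utility the agent would have assigned, in advance, to
    # costlessly precommitting to action a for situation s.
    return max(actions, key=lambda a: eu_of_precommitment(situation, a))

The open questions in the post translate directly into questions about this evaluator: which prior vantage point it is computed from, and what exactly "his current situation" quantifies over.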

 

Blackmail, continued: communal blackmail, uncoordinated responses

11 Stuart_Armstrong 22 October 2014 05:53PM

The heuristic that one should always resist blackmail seems a good one (no matter how tricky blackmail is to define). And one should be public about this, too; then, one is very unlikely to be blackmailed. Even if one speaks like an emperor.

But there's a subtlety: what if the blackmail is being used against a whole group, not just against one person? The US justice system is often seen to function like this: prosecutors pile on ridiculous numbers of charges, threatening uncounted millennia in jail, in order to get the accused to settle for a lesser charge and avoid the expense of a trial.

But for this to work, they need to occasionally find someone who rejects the offer, put them on trial, and slap them with a ridiculous sentence. Therefore, by standing up to them (or proclaiming in advance that you will reject such offers), you are not actually making yourself immune to their threats. You're setting yourself up to be the sacrificial one made an example of.

Of course, if everyone were a UDT agent, the correct decision would be for everyone to reject the threat. That would ensure that the threats are never made in the first place. But - and apologies if this shocks you - not everyone in the world is a perfect UDT agent. So the threats will get made, and those resisting them will get slammed to the maximum.

Of course, if everyone could read everyone's mind and was perfectly rational, then they would realise that making examples of UDT agents wouldn't affect the behaviour of non-UDT agents. In that case, UDT agents should resist the threats, and the perfectly rational prosecutor wouldn't bother threatening UDT agents. However - and sorry to shock your views of reality three times in one post - not everyone is perfectly rational. And not everyone can read everyone's minds.

So even a perfect UDT agent must, it seems, sometimes succumb to blackmail.

Cooperating with agents with different ideas of fairness, while resisting exploitation

38 Eliezer_Yudkowsky 16 September 2013 08:27AM

There's an idea from the latest MIRI workshop which I haven't seen in informal theories of negotiation, and I want to know if this is a known idea.

(Old well-known ideas:)

Suppose a standard Prisoner's Dilemma matrix where (3, 3) is the payoff for mutual cooperation, (2, 2) is the payoff for mutual defection, and (0, 5) is the payoff if you cooperate and they defect.

Suppose we're going to play a PD iterated for four rounds.  We have common knowledge of each other's source code so we can apply modal cooperation or similar means of reaching a binding 'agreement' without other enforcement methods.

If we mutually defect on every round, our net mutual payoff is (8, 8).  This is a 'Nash equilibrium' because neither agent can unilaterally change its action and thereby do better, if the opponent's actions stay fixed.  If we mutually cooperate on every round, the result is (12, 12), and this result is on the 'Pareto boundary' because neither agent can do better unless the other agent does worse.  It would seem a desirable principle for rational agents (with common knowledge of each other's source code / common knowledge of rationality) to find an outcome on the Pareto boundary, since otherwise they are leaving value on the table.

But (12, 12) isn't the only possible result on the Pareto boundary.  Suppose that running the opponent's source code, you find that they're willing to cooperate on three rounds and defect on one round, if you cooperate on every round, for a payoff of (9, 14) slanted their way.  If they use their knowledge of your code to predict you refusing to accept that bargain, they will defect on every round for the mutual payoff of (8, 8).

I would consider it obvious that a rational agent should refuse this unfair bargain.  Otherwise agents with knowledge of your source code will offer you only this bargain, instead of the (12, 12) of mutual cooperation on every round; they will exploit your willingness to accept a result on the Pareto boundary in which almost all of the gains from trade go to them.

(Newer ideas:)

Generalizing:  Once you have a notion of a 'fair' result - in this case (12, 12) - then an agent which accepts any outcome in which it does worse than the fair result, while the opponent does better, is 'exploitable' relative to this fair bargain.  Like the Nash equilibrium, the only way you should do worse than 'fair' is if the opponent also does worse.

So we wrote down on the whiteboard an attempted definition of unexploitability in cooperative games as follows:

"Suppose we have a [magical] definition N of a fair outcome.  A rational agent should only do worse than N if its opponent does worse than N, or else [if bargaining fails] should only do worse than the Nash equilibrium if its opponent does worse than the Nash equilibrium."  (Note that this definition precludes giving in to a threat of blackmail.)

(Key possible-innovation:)

It then occurred to me that this definition opened the possibility for other, intermediate bargains between the 'fair' solution on the Pareto boundary, and the Nash equilibrium.

Suppose the other agent has a slightly different definition of fairness and they think that what you consider to be a payoff of (12, 12) favors you too much; they think that you're the one making an unfair demand.  They'll refuse (12, 12) with the same feeling of indignation that you would apply to (9, 14).

Well, if you give in to an arrangement with an expected payoff of, say, (11, 13) as you evaluate payoffs, then you're giving other agents an incentive to skew their definitions of fairness.

But it does not create poor incentives (AFAICT) to accept instead a bargain with an expected payoff of, say, (10, 11) which the other agent thinks is 'fair'.  Though they're sad that you refused the truly fair outcome of (as you count utilons) (11, 13) and that you couldn't reach the Pareto boundary together, still, this is better than the Nash equilibrium of (8, 8).  And though you think the bargain is unfair, you are not creating incentives to exploit you.  By insisting on this definition of fairness, the other agent has done worse for themselves than they would have by accepting (12, 12).  The other agent probably thinks that (10, 11) is 'unfair' slanted your way, but they likewise accept that this does not create bad incentives, since you did worse than the 'fair' outcome of (11, 13).

There could be many acceptable negotiating equilibria between what you think is the 'fair' point on the Pareto boundary and the Nash equilibrium, so long as each step down in what you think is 'fairness' reduces the total payoff to the other agent, even if it reduces your own payoff even more.  This resists exploitation and avoids creating an incentive for claiming that you have a different definition of fairness, while still holding open the possibility of some degree of cooperation with agents who honestly disagree with you about what's fair and are trying to avoid exploitation themselves.

This translates into an informal principle of negotiations:  Be willing to accept unfair bargains, but only if (you make it clear) both sides are doing worse than what you consider to be a fair bargain.
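
A minimal sketch of that acceptance rule, assuming the payoffs have already been reduced to comparable numbers (which, per Clarification 1 below, is itself the magical step):

def acceptable(offer, my_fair, my_nash):
    # offer = (my_payoff, their_payoff). Accept a bargain worse for me
    # than my fair point only if it is also worse for the other side
    # than my fair point; and never accept less than my Nash payoff.
    my_u, their_u = offer
    if my_u < my_nash:
        return False
    if my_u < my_fair[0] and their_u >= my_fair[1]:
        return False
    return True

print(acceptable((10, 11), my_fair=(12, 12), my_nash=8))   # True
print(acceptable((11, 13), my_fair=(12, 12), my_nash=8))   # False: rewards skewed 'fairness'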

I haven't seen this advocated before even as an informal principle of negotiations.  Is it in the literature anywhere?  Someone suggested Schelling might have said it, but didn't provide a chapter number.

ADDED:

Clarification 1:  Yes, utilities are invariant up to a positive affine transformation so there's no canonical way to split utilities evenly.  Hence the part about "Assume a magical solution N which gives us the fair division."  If we knew the exact properties of how to implement this magical solution, taking it at first for magical, that might give us some idea of what N should be, too.

Clarification 2:  The way this might work is that you pick a series of increasingly unfair-to-you, increasingly worse-for-the-other-player outcomes whose first element is what you deem the fair Pareto outcome:  (100, 100), (98, 99), (96, 98).  Perhaps stop well short of Nash if the skew becomes too extreme.  Drop to Nash as the last resort.  The other agent does the same, starting with their own ideal of fairness on the Pareto boundary.  Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle.  Both of you will do worse against a fixed opponent's strategy by unilaterally adopting more self-favoring ideas of fairness.  Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness.  This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it.  You should not be picking the descending sequence in an agent-dependent way that incentivizes, at cost to you, skewed claims about fairness.
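
Clarification 2's schedule can also be sketched directly; again assuming comparable numeric payoffs, with the step sizes taken from the (100, 100), (98, 99), (96, 98) example:

def descending_offers(my_fair, nash, step=(2, 1)):
    # Offers from my fair Pareto point down toward Nash. Each step costs
    # me step[0] and the other agent step[1], so a more self-favoring
    # claim about fairness never nets the claimant more.
    mine, theirs = my_fair
    offers = []
    while mine > nash[0] and theirs > nash[1]:
        offers.append((mine, theirs))
        mine -= step[0]
        theirs -= step[1]
    offers.append(nash)   # drop to Nash as the last resort
    return offers

print(descending_offers((100, 100), (8, 8))[:3])   # [(100, 100), (98, 99), (96, 98)]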

Clarification 3:  You must take into account the other agent's costs and other opportunities when ensuring that the net outcome, in terms of final utilities, is worse for them than the reward offered for 'fair' cooperation.  Offering them the chance to buy half as many paperclips at a lower, less fair price, does no good if they can go next door, get the same offer again, and buy the same number of paperclips at a lower total price.

My Take on a Decision Theory

2 ygert 09 July 2013 10:46AM

Finding a good decision theory is hard. Previous attempts, such as Timeless Decision Theory, work, it seems, in providing a stable, effective decision theory, but are mathematically complicated. Simpler theories, like CDT or EDT, are much more intuitive, but have deep flaws. They fail at certain problems, and thus violate the maxim that rational agents should win. This makes them imperfect.

But it seems to me that there is a relatively simple fix one could make to them, in the style of TDT, to extend their power considerably. Here I will show an implementation of such an extension of CDT, that wins on the problems that classic CDT fails on. It quite possibly could turn out that this is not as powerful as TDT, but it is a significant step in that direction, starting only from the naivest of decision theories. It also could turn out that this is nothing more than a reformulation of TDT or a lesser version thereof. In that case, this still has some value as a simpler formulation, easier to understand. Because as it stands, TDT seems like a far cry from a trivial extension of the basic, intuitive decision theories, as this hopes to be.

We will start by remarking that when CDT (or EDT) tries to figure out the expected value of an action or outcome, the naive way in which it does so drops crucial information, which is what TDT manages to preserve. As such, I will try to construct a version of CDT with this information not dropped. This information is, for CDT, the fact that Omega has simulated you and figured out what you are going to do. Why does a CDT agent automatically assume that it is the "real" one, so to speak? This trivial tweak seems powerful. I will, for the purpose of this post, call this tweaked version of CDT "Simulationist Causal Decision Theory", or SCDT for short.

Let's run this tweaked version through Newcomb's problem. Let Alice be an SCDT agent. Before the problem begins, as is standard in Newcomb's problem, Omega looks at Alice and calculates what choice Alice will make in the game. Without too much loss of generality, we can assume that Omega directly simulates Alice, and runs the simulation through a simulation of the game, in order to make the determination of what choice Alice will make. In other formulations of Newcomb's problem, Omega figures out in some other way what Alice will do, say by doing a formal analysis of her source code, but that seems intuitively equivalent. This is a possible flaw, but if the different versions of Newcomb's problem are equivalent (as they seem to be) this point evaporates, and so we will put it aside for now, and continue.

We will call the simulated agent SimAlice. SimAlice does not know, of course, that she is being simulated, and is an exact copy of Alice in all respects. In particular, she also uses the same SCDT thought processes as Alice, and she has the same utility function as Alice.

So, Alice (or SimAlice, she doesn't know which one she is) is presented with the game. She reasons thusly:

There are two possible cases: Either I am Alice or I am SimAlice. 

  • If I am Alice: Choosing both boxes will always get me exactly $1000 more than choosing just one. Regardless of whether or not there is $1,000,000 in box 2, by choosing box 1 as well, I am getting an extra $1000. (Note that this is exactly the same reasoning standard CDT uses!)
  • If I am SimAlice: Then "I" don't actually get any money in this game, regardless of what I choose. But my goal is not for SimAlice to get money; it is for Alice to get money, by the simple fact that this is what Alice wants, and we assumed above that SimAlice uses the same utility function as Alice. And depending on what I choose now, that will affect the way Omega sets up the boxes, and so affects the amount of money Alice will get. Specifically, if I one-box, Omega will put an extra $1,000,000 in box 2, and so Alice will get an extra $1,000,000, no matter what she chooses. (Because in both of the choices Alice could make (taking either box 2 alone or boxes 1 & 2), she takes box 2, and so will wind up with a bonus $1,000,000 above what she would get if box 2 were empty, which is what would happen if SimAlice two-boxed.)
So, as I don't know whether I am Alice or SimAlice, and as there is one of each, there is a 0.5 probability of me being either one, so by the law of total expectation,
E[money|I one box]=0.5 * E[money|(I one box)&(I am Alice)] + 0.5 * E[money|(I one box)&(I am SimAlice)]
So my expected return from one-boxing (above what I would get by two-boxing) is 0.5 * -$1000 + 0.5 * $1,000,000 = $499,500, which is positive, so I should one-box.

As you can see, just by acknowledging the rules of the game, by admitting that Omega has the power to simulate her (as the rules of Newcomb's problem insist), she will one box. This is unlike a CDT agent, which would ignore Omega's power to simulate her (or otherwise figure out what she will do), and say "Hey, what's in the boxes is fixed, and my choice does not affect it". That is only valid reasoning if you know you are the "original" agent, and Alice herself uses that reasoning, but only in the case where she is assuming she is the "original". She takes care, unlike a CDT agent, to multiply the conditional expected value by the chance of the condition occurring.

This is not limited to Newcomb's problem. Let's take a look at Parfit's Hitchhiker, another scenario CDT has trouble with. There are again two identical agents making decisions: the "real" Alice, as soon as she gets home; and the "Alice-after-she-gets-home-as-simulated-by-the-driver-offering-her-a-ride", which I will again call SimAlice for short.

Conditional on an agent being Alice and not SimAlice, paying the driver loses that agent her $100 and gains her nothing compared to refusing to pay. Conditional on an agent being SimAlice and not Alice, agreeing to pay the driver loses her nothing (as she, being a simulation, cannot give the driver real money), and gains her a trip out of the desert, and so her life. So, again, the law of total expectation gives us that the expected value of paying the driver (considering you don't know which you are), is 0.5 * -$100 + 0.5 * (Value of Alice's life). This gives us that Alice should pay if and only if she values her life at more than $100, which is, once again, the correct answer.
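
Both calculations have the same shape, so one helper covers them. A minimal sketch: the 50/50 weighting assumes exactly one real agent and one simulation, and value_of_life is a hypothetical stand-in:

def scdt_expected_gain(gain_if_real, gain_if_sim, p_real=0.5):
    # Law of total expectation over "I am the real agent" vs.
    # "I am the simulation", as in Alice's reasoning above.
    return p_real * gain_if_real + (1 - p_real) * gain_if_sim

# Newcomb: one-boxing costs the real Alice $1000, but a one-boxing
# SimAlice causes Omega to add $1,000,000 to box 2.
print(scdt_expected_gain(-1_000, 1_000_000))    # 499500.0 > 0: one-box

# Parfit's Hitchhiker: paying costs the real Alice $100, but a paying
# SimAlice gets Alice the ride out of the desert.
value_of_life = 10_000_000   # hypothetical valuation
print(scdt_expected_gain(-100, value_of_life))  # positive iff life is worth more than $100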

So, to sum up, we found that SCDT can not only solve Newcomb's problem, which standard CDT cannot, but also solve Parfit's Hitchhiker, which neither CDT nor EDT can do. It does so at almost no cost in complexity compared to CDT, unlike, say, TDT, which is rather more complex. In fact, I kind of think that it is entirely possible that this SCDT is nothing more than a special case of something similar to TDT. But even if it is, it is a very nice, simple, and relatively easy to understand special case, and so may deserve a look for that alone.

There are still open problems for SCDT. If, rather than a simulation, you are analysed in a more direct way, should that change anything? What if, in Newcomb's problem, Omega runs many simulations of you in parallel? Should that change the weights you place on the expected values? This ties in deeply with the philosophical problem of how you assign measure to identical, independent agents. I cannot give a simple answer, and a simple answer to those questions is needed before SCDT is complete. But, if we can figure out the answers to these questions, or otherwise bypass them, we have a trivial extrapolation of CDT, the naivest decision theory, which correctly solves most or all of the problems that trip up CDT. That seems quite worthwhile.

Counterfactual self-defense

0 MrMind 23 November 2012 10:15AM

Let's imagine the following dialogues between Omega and an agent implementing TDT. The usual standard assumptions about Omega apply: the agent knows Omega is real, trustworthy and reliable; Omega knows that the agent knows that; the agent knows that Omega knows that the agent knows; etc. (that is, Omega's trustworthiness is common knowledge, à la Aumann).

Dialogue 1.

Omega: "Would you accept a bet where I pay you 1000$ if a fair coin flip comes out tail and you pay me 100$ if it comes out head?"
TDT: "Sure I would."
Omega: "I flipped the coin. It came out head."
TDT: "Doh! Here's your 100$."

I hope there's no controversy here.

Dialogue 2.

Omega: "I flipped a fair coin and it came out head."
TDT: "Yes...?"
Omega: "Would you accept a bet where I pay you 1000$ if the coin flip came out tail and you pay me 100$ if it came out head?"
TDT: "No way!"

I also hope no controversy arises: if the agent would answer yes, then there's no reason it wouldn't accept all kinds of losing bets conditioned on information it already knows.

The two bets are equal, but the information is presented in a different order: in the second dialogue, the agent has had time to update its knowledge about the world, and should not accept bets that it already knows are losing.

But then...

Dialogue 3.

Omega: "I flipped a coin and it came out head. I offer you a bet where I pay you 1000$ if the coin flip comes out tail, but only if you agree to pay me 100$ if the coin flip comes out head."
TDT: "...?"

In the original counterfactual mugging discussion, apparently the answer of the TDT-implementing agent should have been yes, but I'm not entirely clear on what the difference is between the second and the third case.

Thinking about it, it seems that the case is muddled because the outcome and the bet are presented at the same time. On one hand, it appears correct to think that an agent should act exactly as it would have acted had it precommitted, but on the other hand, an agent should not ignore any information it is presented with (it's a basic requirement of treating probability as extended logic).

So here's a principle I would like to call 'counterfactual self-defense': whenever information and bets are presented to the agent at the same time, it always first conditions its priors and only then examines whatever bets have been offered. This should prevent Omega from offering counterfactual losing bets, but not counterfactual winning ones.
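
A minimal sketch of the principle as applied to the coin-flip dialogues above; the 1000$/100$ payoffs come from the dialogues, everything else is illustrative scaffolding:

def accept_bet(prior, evidence, payoff):
    # Counterfactual self-defense: condition on whatever information was
    # presented alongside the bet BEFORE evaluating the bet itself.
    if evidence is None:
        posterior = prior
    else:
        posterior = {o: (1.0 if o == evidence else 0.0) for o in prior}
    expected = sum(p * payoff[o] for o, p in posterior.items())
    return expected > 0

prior = {"head": 0.5, "tail": 0.5}
payoff = {"head": -100, "tail": 1000}
print(accept_bet(prior, None, payoff))     # Dialogue 1: True (EV = 450)
print(accept_bet(prior, "head", payoff))   # Dialogues 2-3: False (EV = -100)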

Would this principle make an agent win more?

Naive TDT, Bayes nets, and counterfactual mugging

15 Stuart_Armstrong 23 October 2012 03:58PM

I set out to understand precisely why naive TDT (possibly) fails the counterfactual mugging problem. While doing this I ended up drawing a lot of Bayes nets, and seemed to gain some insight; I'll pass these on, in the hopes that they'll be useful. All errors are, of course, my own.

The grand old man of decision theory: the Newcomb problem

First let's look at the problem that inspired all this research: the Newcomb problem. In this problem, a supremely-insightful-and-entirely-honest superbeing called Omega presents two boxes to you, and tells you that you can either choose box A only ("1-box"), or take box A and box B ("2-box"). Box B will always contain $1K (one thousand dollars). Omega has predicted what your decision will be, though, and if you decided to 1-box, he's put $1M (one million dollars) in box A; otherwise he's put nothing in it. The problem can be cast as a Bayes net with the following nodes:

continue reading »

Decision Theories, Part 3.75: Hang On, I Think This Works After All

23 orthonormal 06 September 2012 04:23PM

Followup to: Decision Theories, Part 3.5: Halt, Melt and Catch Fire, Decision Theories: A Semi-Formal Analysis, Part III

The thing about dead mathematical proofs is that it's practically always worth looting the corpse; sometimes you even find there's still a pulse! That appears to be the case with my recent post lamenting the demise of the TDT-like "Masquerade" algorithm. I think I've got a rigorous proof this time, but I'd like other opinions before I declare that the rumors of Masquerade's demise were greatly exaggerated...

To recap quickly, I've been trying to construct an algorithm that, given the payoff matrix of a game and the source code of its opponent, does some deductions and then outputs a move. I want this algorithm to do the commonsense right things (defect against both DefectBot and CooperateBot, and mutually cooperate against both FairBot and itself), and I want it to do so for simple and general reasons (that is, no gerrymandering of actions against particular opponents, and in particular no fair trying to "recognize itself", since there can be variants of any algorithm that are functionally identical but not provably so within either's powers of deduction). I'd also like it to be "un-exploitable" in a certain sense: it has a default move (which is one of its Nash equilibria), and no opponent can profit against the algorithm by forcing it below that default payoff. If the opponent does as well or better in expected value than it would by playing that Nash equilibrium, then so too does my algorithm.

The revised Masquerade algorithm does indeed have these properties.

In essence, there are two emendations that I needed: firstly, since some possible pairs of masks (like FairBot and AntiFairBot) can't knowingly settle on a fixed point, there's no way to determine what they do without a deductive capacity that strictly exceeds the weaker of them. That's a bad feature to have, so we'll just have to exclude potentially troublemaking masks from Masquerade's analysis. (In the special case of Prisoner's Dilemma I know that including DefectBot and FairBot will suffice; I've got what looks like a good solution in general, as well.)

The second emendation is that FairBot needs to alternate between trying harder to prove its opponent cooperates, and trying harder to prove its opponent defects. (There needs to be an asymmetry, like cooperation proofs going first, to guarantee that when FairBot plays against itself, it finds a Löbian proof of mutual cooperation rather than one of mutual defection.) The reason for this is so that when agents reason about masks, they should be able to find a proof of the mask's action without needing to exceed that mask's powers of deduction. Otherwise we get that arms race again.

This escalation of proof attempts can be represented in terms of proof limits (since there exists a constant C such that for N sufficiently large, a proof that "there are no proofs of X of length less than N" either exists with length less than C^N or not at all), but the simplest way to do this is with the formalism of PA+N. That is, PA is Peano Arithmetic; PA+1 is the formal system with the axioms of Peano Arithmetic and an extra axiom that PA is self-consistent (that is, if PA proves X, then PA does not prove not-X); PA+2 has those axioms and an extra one stating that PA+1 is self-consistent and so on. (Note that none of these formal systems know themselves to be self-consistent, and for good reason!) In every use, we'll assume that N is a fixed number (anything greater than 4 will work).

New And Improved Masks

So without further ado, let's define our masks for the Prisoner's Dilemma:

def DefectBot(X):
    return D

def FairBot(X):
    for i in range(N):
        if "X(FairBot)=C" is provable in PA+i:
            return C
        if "X(FairBot)=D" is provable in PA+i:
            return D
    return D

Lemma 1: For any X, "DefectBot(X)=D" is provable in PA. (Obvious.)

Lemma 2: "FairBot(FairBot)=C" is provable in PA.

Proof: If we label the FairBots separately (after all, they might be logically equivalent but not know they are), we can quickly show that "if PA proves FairBot1(FairBot2)=C, then FairBot2(FairBot1)=C" is provable in PA (and the same with the subscripts reversed). This makes what we might term a Löbian cycle, which will come up a few more times in this post: if we have a formal system S, and "if S proves A, then B" and "if S proves B, then A" are theorems of S, then S indeed proves both A and B. (Proof reserved for a comment; hat tip to Eliezer and co.)

Lemma 3: "FairBot(DefectBot)=D" is provable in PA+1.

Proof: Note that this does not work in PA, even though PA can prove "DefectBot(FairBot)=D". Why not? Because PA can't prove that PA doesn't find a proof of "DefectBot(FairBot)=C" before it can look for a proof of defection! However, PA+1 can handle this just fine, because it proves "if PA proves DB(FB)=D, then PA does not prove DB(FB)=C".

Note the weird fact that it takes a stronger formal system to prove a result that's much more intuitive from outside the system!

New and Improved Masquerade

Let's split up the long definition of Masquerade with a helper function.

def Search(X,Y):
    for i in range(N):
        for (j,k) in [(C,C), (C,D), (D,C), (D,D)]:
            if "Y(X)=j and X(Y)=k" is provable in PA+i:
                return (j,k)
    return False

This function, given two agents, simply searches (using increasingly strong formal systems) for a proof of what they do against each other; once it finds one, it halts and returns that result, and if it doesn't find one in time, it announces its failure. We will incorporate this into the definition of Masquerade. (The other function call here is to the payoff matrix, to retrieve the utility U(j,k) of a certain pair of moves.)
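
For concreteness, the payoff lookup might be a table like the Prisoner's Dilemma matrix used earlier on this page (a hypothetical stub; the post itself never fixes the numbers):

U = {('C','C'): 3, ('C','D'): 0, ('D','C'): 5, ('D','D'): 2}   # U(j,k): Masquerade plays j, the opponent plays k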

def Masquerade(X):
    utility = -∞
    my_move = null
    their_move = null
    for Y in [DefectBot, FairBot]:
        result = Search(X,Y)
        if result != False:
            (j,k) = result
            if U(j,k) > utility:
                my_move = j
                their_move = k
                utility = U(j,k)
    if utility > U(D,D):
        for i in range(N):
            if "X(Masquerade) = their_move" is provable in PA+i:
                return my_move
    return D

A few things to note: first, this agent actually halts and outputs an action against any opponent... but for it to provably do so in a system below PA+N, Search(X,Y) has to find proofs quickly. This is the reason that my prior attempt didn't work- it had to wait at one point for the old FairBot to run out of time/power before concluding what it did, and that made it impossible for the old FairBot to know what the old Masquerade did. But with the new and improved agents, we get to ground in a fixed number of steps.

For brevity, I'll label DefectBot, FairBot, and Masquerade as DB, FB, and M, respectively.

Lemma 4: "Search(DB,DB)=(D,D)" is provable in PA+1. (Follows from Lemma 1; note that it needs to use PA+1 in order to rule out finding proofs of other action-pairs.)

Lemma 5: "Search(FB,DB)=Search(DB,FB)=(D,D)" is provable in PA+2. (Follows from Lemma 3.)

Lemma 6: "Search(FB,FB)=(C,C)" is provable in PA. (Follows from Lemma 2; since (C,C) is the first one tried, we don't even need to go up to PA+1.)

Lemma 7: "Masquerade(DB)=D" is provable in PA+3.

Proof: Lemmas 4 and 5, plus the fact that PA+3 knows the consistency of PA+2. There is no sanity-check step, since utility=U(D,D) here.

Lemma 8: "Masquerade(FB)=C" and "FB(Masquerade)=C" are provable in PA+3.

Proof: Lemmas 5 and 6, and the consistency of PA+2, imply that when Masquerade arrives at the sanity-check stage, it has the variables set as utility=U(C,C), my_move=C and their_move=C. Thus PA+3 can prove that "if 'FB(M)=C' is provable in PA+3, then M(FB)=C". And of course, "if 'M(FB)=C' is provable in PA+3, then FB(M)=C" is provable in PA+3, since again PA+3 can prove that PA through PA+2 won't have found proofs of contrary conclusions before it gets around to trying to find cooperation in PA+3. Therefore we have the desired Löbian cycle!

Theorem: "Masquerade(Masquerade)=C" is provable in PA+4.

Proof: Lemmas 7 and 8, and the consistency of PA+3, allow PA+4 to prove that when each Masquerade arrives at the sanity-check stage, it has set utility=U(C,C), my_move=C and their_move=C. Thus we achieve the Löbian cycle, and find proofs of mutual cooperation!

Awesome! So, what next?

Well, assuming that I haven't made a mistake in one of my proofs, I'm going to run the same proof for my generalization: given a payoff matrix in general, Masquerade enumerates all of the constant strategies and all of the "mutually beneficial deals" of the FairBot form (that is, masks that hold out the "stick" of a particular Nash equilibrium and the "carrot" of another spot on the payoff matrix which is superior to the "stick" for both players). Then it alternates (at escalating PA+n levels) between trying to prove the various good deals that the opponent could agree to. There are interesting complexities here (and an idea of what bargaining problems might involve).

Secondly, I want to see if there's a good way of stating the general problem that Masquerade solves, something better than "it agrees with commonsense decision theory". The analogy here (and I know it's a fatuous one, but bear with me) is that I've come up with a Universal Turing Machine but not yet the Church-Turing Thesis. And that's unsatisfying.

But before anything else... I want to be really sure that I haven't made a critical error somewhere, especially given my false start (and false halt) in the past. So if you spot a lacuna, let me know!

Decision Theories, Part 3.5: Halt, Melt and Catch Fire

31 orthonormal 26 August 2012 10:40PM

Followup to: Decision Theories: A Semi-Formal Analysis, Part III

UPDATE: As it turns out, rumors of Masquerade's demise seem to have been greatly exaggerated. See this post for details and proofs!

I had the chance, over the summer, to discuss the decision theory outlined in my April post with a bunch of relevantly awesome people. The sad part is, there turned out to be a fatal flaw once we tried to formalize it properly. I'm laying it out here, not with much hope that there's a fix, but because sometimes false starts can be productive for others.

Since it's not appropriate to call this decision theory TDT, I'm going to use a name suggested in one of these sessions and call it "Masquerade", which might be an intuition pump for how it operates. So let's first define some simple agents called "masks", and then define the "Masquerade" agent.

Say that our agent has actions a_1, ..., a_n, and the agent it's facing in this round has actions b_1, ..., b_m. Then for any triple (b_i, a_j, a_k), we can define a simple agent Mask_ijk which takes in its opponent's source code and outputs an action:

def Mask_ijk(opp_src):
    look for proof that Opp(Mask_ijk) = b_i
    if one is found, then output a_j
    otherwise, output a_k

(This is slightly less general than what I outlined in my post, but it'll do for our purposes. Note that there's no need for a_j and a_k to be distinct, so constant strategies fall under this umbrella as well.)

A key example of such an agent is what we might call FairBot: on a Prisoner's Dilemma, FairBot tries to prove that the other agent cooperates against FairBot, and if it finds such a proof, then it immediately cooperates. If FairBot fails to find such a proof, then it defects. (An important point is that if FairBot plays against itself and both have sufficiently strong deductive capacities, then a short proof of one's cooperation gives a slightly longer proof of the other's cooperation, and thus in the right circumstances we have mutual cooperation via Löb's Theorem.)
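
In the mask template above, FairBot is just the triple (b_i, a_j, a_k) = (C, C, D):

def FairBot(opp_src):
    look for proof that Opp(FairBot) = C
    if one is found, then output C
    otherwise, output D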

The agent Masquerade tries to do better than any individual mask (note that FairBot foolishly cooperates against CooperateBot when it could trivially do better by defecting). My original formulation can be qualitatively described as trying on different masks, seeing which one fares the best, and then running a "sanity check" to see if the other agent treats Masquerade the same way it treats that mask. The pseudocode looked like this:

def Masquerade(opp_src):
    for each (i,j,k), look for proofs of the form "Mask_ijk gets utility u against Opp"
    choose (i,j,k) corresponding to the largest such u found
    look for proof that Opp(Masquerade) = Opp(Mask_ijk)
    if one is found, then output the same thing as Mask_ijk(Opp)
    otherwise, output a default action

(The default should be something safe like a Nash equilibrium strategy, of course.)

Intuitively, when Masquerade plays the Prisoner's Dilemma against FairBot, Masquerade finds that the best utility against FairBot is achieved by some mask that cooperates, and then Masquerade's sanity-check is trying to prove that FairBot(Masquerade) = C as FairBot is trying to prove that Masquerade(FairBot) = C, and the whole Löbian circus goes round again. Furthermore, it's intuitive that when Masquerade plays against another Masquerade, the first one notices the proof of the above, and finds that the best utility against the other Masquerade is achieved by FairBot; thus both pass to the sanity-check stage trying to imitate FairBot, each seeks to prove that the other cooperates against it, and both find the Löbian proof.

So what's wrong with this intuitive reasoning?

Problem: A deductive system can't count on its own consistency!

Let's re-examine the argument that Masquerade cooperates with FairBot. In order to set up the Löbian circle, FairBot needs to be able to prove that Masquerade selects a mask that cooperates with FairBot (like CooperateBot or FairBot). There are nice proofs that each of those masks attains the mutual-cooperation payoff against FairBot, but we also need to be sure that some other mask won't get the very highest (I defect, you cooperate) payoff against FairBot. Now you and I can see that this must be true, because FairBot simply can't be exploited that way. But crucially, FairBot can't deduce its own inexploitability without thereby becoming exploitable (for the same Gödelian reason that a formal system can't prove its own consistency unless it is actually inconsistent)!

Now, the caveats to this are important: if FairBot's deductive process is sufficiently stronger than the deductive process that's trying to exploit it (for example, FairBot might have an oracle that can answer questions about Masquerade's oracle, or FairBot might look for proofs up to length 2N while Masquerade only looks up to length N), then it can prove (by exhaustion if nothing else) that Masquerade will select a cooperative mask after all. But since Masquerade needs to reason about Masquerade at this level, this approach goes nowhere. (At first, I thought that having a weaker oracle for Masquerade's search through masks, and a stronger oracle both for each mask and for Masquerade's sanity-check, would solve this. But that doesn't get off the ground: the agent thus defined attains mutual cooperation with FairBot, but not with itself, because the weaker oracle can't prove that it attains mutual cooperation with FairBot.)

Another caveat is the following: FairBot may not be able to rule out the provability of some statement we know is false, but (given a large enough deductive capacity) it can prove that a certain result is the first of its kind in a given ordering of proofs. So if our agents act immediately on the first proof they find, then we could make a version of Masquerade work... as long as each search does find a proof, and as long as that fact is provable by the same deduction system. But there's an issue with this: two masks paired against each other won't necessarily have provable outcomes!

Let's consider the following mask agent, which we'll call AntiFairBot: it searches for a proof that its opponent cooperates against it, and it defects if it finds one; if it doesn't find such a proof, then it cooperates. This may not be a very optimal agent, but it has one interesting property: if you pit AntiFairBot against FairBot, and the two of them use equivalent oracles, then it takes an oracle stronger than either to deduce what the two of them will do! Thus, Masquerade can't be sure that AntiFairBot won't get the highest payoff against FairBot (which of course it won't) unless it uses a stronger deduction system for the search through masks than FairBot uses for its proof search (which would mean that FairBot won't be able to tell what mask Masquerade picks).
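
Spelled out in the same style as the other masks, AntiFairBot is the triple (C, D, C):

def AntiFairBot(opp_src):
    look for proof that Opp(AntiFairBot) = C
    if one is found, then output D
    otherwise, output C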

I tried to fix this by iterating over only some of the masks; after all, there's no realistic opponent against whom AntiFairBot is superior to both FairBot and DefectBot. Unfortunately, at this point I realized two things: in order to play successfully against a reasonable range of opponents on the Prisoner's Dilemma, Masquerade needs to be able to imitate at least both FairBot and DefectBot; and FairBot cannot prove that FairBot defects against DefectBot. (There are variants of FairBot that can do so, e.g. a variant could search both for proofs of cooperation and proofs of defection and play symmetrically if it finds one, but this variant is no longer guaranteed to cooperate against itself!)

If there are any problems with this reasoning, or an obvious fix that I've missed, please bring it to my attention; but otherwise, I've decided that my approach has failed drastically enough that it's time to do what Eliezer calls "halt, melt, and catch fire". The fact that Löbian cooperation works is enough to keep me optimistic about formalizing this side of decision theory in general, but the ideas I was using seem insufficient to succeed. (Some variant of "playing chicken with my deductive system" might be a crucial component.)

Many thanks to all of the excellent people who gave their time and attention to this idea, both on and offline, especially Eliezer, Vladimir Slepnev, Nisan, Paul Christiano, Critch, Alex Altair, Misha Barasz, and Vladimir Nesov. Special kudos to Vladimir Slepnev, whose gut intuition on the problem with this idea was immediate and correct.

Transcript: "Choice Machines, Causality, and Cooperation"

8 Randaly 07 August 2012 10:15PM

Gary Drescher's presentation at the 2009 Singularity Summit, "Choice Machines, Causality, and Cooperation," is online, at vimeo. Drescher is the author of Good and Real, which has been recommended many times on LW. I've transcribed his talk, below.

 

 

My talk this afternoon is about choice machines: machines such as ourselves that make choices in some reasonable sense of the word. The very notion of mechanical choice strikes many people as a contradiction in terms, and exploring that contradiction and its resolution is central to this talk. As a point of departure, I'll argue that even in a deterministic universe, there's room for choices to occur: we don't need to invoke some sort of free will that makes an exception to the determinism, nor do we even need randomness, although a little randomness doesn't hurt. I'm going to argue that regardless of whether our universe is fully deterministic, it's at least deterministic enough that the compatibility of choice and full determinism has some important ramifications that do apply to our universe. I'll argue that if we carry the compatibility of choice and determinism to its logical conclusions, we obtain some progressively weird corollaries: namely, that it sometimes makes sense to act for the sake of things that our actions cannot change and cannot cause, and that that might even suggest a way to derive an essentially ethical prescription: an explanation for why we sometimes help others even if doing so causes net harm to our own interests.

 

[1:15]

 

An important caveat in all this, just to manage expectations a bit, is that the arguments I'll be presenting will be merely intuitive- or counter-intuitive, as the case may be- and not grounded in a precise and formal theory. Instead, I'm going to run some intuition pumps, as Daniel Dennett calls them, to try to persuade you what answers a successful theory would plausibly provide in a few key test cases.

 

[1:40]

 

Perhaps the clearest way to illustrate the compatibility of choice and determinism is to construct or at least imagine a virtual world, which superficially resembles our own environment and which embodies intelligent or somewhat intelligent agents. As a computer program, this virtual world is quintessentially deterministic: the program specifies the virtual world's initial conditions, and specifies how to calculate everything that happens next. So given the program itself, there are no degrees of freedom about what will happen in the virtual world. Things do change in the world from moment to moment, of course, but no event ever changes from what was determined at the outset. In effect, all events just sit, statically, in spacetime. Still, it makes sense for agents in the world to contemplate what would be the case were they to take some action or another, and it makes sense for them to select an action accordingly.

 

[2:35]

 

 

For instance, an agent in the illustrated situation here might reason that, were it to move to its right, which is our left, then the agent would obtain some tasty fruit. But, instead, if it moves to its left, it falls off a cliff. Accordingly, if its preference scheme assigns positive utility to the fruit, and negative utility to falling off the cliff, that means the agent moves to its right and not to its left. And that process, I would submit, is what we more or less do ourselves when we engage in what we think of as making choices for the sake of our goals.

 

[3:08]

 

The process, the computational process of selecting an action according to the desirability of what would be the case were the action taken, turns out to be what our choice process consists of. So, from this perspective, choice is a particular kind of computation. The objection that choice isn't really occurring because the outcome was already determined is just as much a non-sequitur as suggesting that any other computation, for example, adding up a list of numbers, isn't really occurring just because the outcome was predetermined.

 

[3:41]

 

So, the choice process takes place, and we consider that the agent has a choice about the action that the choice process selects, and has a choice about the associated outcomes, meaning that those outcomes occur as a consequence of the choice process. So, clearly an agent that executes a choice process and that correctly anticipates what would be the case if various contemplated actions were taken will better achieve its goals than one that, say, just acts at random, or one that takes a fatalist stance that there's no point in doing anything in particular since nothing can change from what it's already determined to be. So, if we were designing intelligent agents and wanted them to achieve their goals, we would design them to engage in a choice process. Or, if the virtual world were immense enough to support natural selection and the evolution of sufficiently intelligent creatures, then those evolved creatures could be expected to execute a choice process because of the benefits conferred.

 

[4:38]

 

So the inalterability of everything that will ever happen does not imply the futility of acting for the sake of what is desired. The key to the choice relation is the "would be-if" relation, also known as the subjunctive or counterfactual relation. Counterfactual because it entertains a hypothetical antecedent about taking a certain action, one that is possibly contrary to fact, as in the case of moving to the agent's left in this example. Even though the moving-left action does not in fact occur, the agent does usefully reason about what would be the case if that action were taken, and indeed it's that very reasoning that ensures that the action does not in fact occur.

 

[5:21]

 

There are various technical proposals for how to formally specify a "would be-if" relation - David Lewis has a classic formulation, Judea Pearl has a more recent one - but they're not necessarily the appropriate version of "would be-if" to use for purposes of making choices, for purposes of selecting an action based on the desirability of what would then be the case. And, although I won't be presenting a formal theory, the essence of this talk is to investigate some properties of "would be-if," the counterfactual relation that's appropriate to use for making choices.

 

[5:57]

 

In particular, I want to address next the possibility that, in a sufficiently deterministic universe, you have a choice about some things that your action cannot cause. Here's an example: assume or imagine that the universe is deterministic, with only one possible history following from any given state of the universe at a given moment. And let me define a predicate P that gets applied to the total state of the universe at some moment. The predicate P is defined to be true of a universe state just in case the laws of physics applied to that total state specify that a billion years after that state, my right hand is raised. Otherwise, the predicate P is false of that state.

 

[6:44]

 

Now, suppose I decide, just on a whim, that I would like that state of the universe a billion years ago to have been such that the predicate P was true of that past state. I need only raise my right hand now, and, lo and behold, it was so. If, instead, I want the predicate to have been false, then I lower my hand and the predicate was false. Of course, I haven't changed what the past state of the universe is or was; the past is what it is, and can never be changed. There is merely a particular abstract relation, a “would be-if” relation, between my action and the particular past state that is the subject of my whimsical goal. I cannot reasonably take the action and not expect that the past state will be in correspondence.

 

[7:39]

 

So, I can't change the past, nor does my action have any causal influence over the past - at least, not in the way we normally and usefully conceive of causality, where causes are temporally prior to effects, and where we can think of causal relations as essentially specifying how the universe computes its subsequent states from its previous states. Nonetheless, I have exactly as much choice about the past value of the predicate I have defined, despite its inalterability, as I have about whether to raise my hand now, despite the inalterability of that too, in a deterministic universe. And if I were to believe otherwise, and were to refrain from raising my hand merely because I can't change the past even though I do have a whimsical preference about the past value of the specified predicate, then, as always with fatalist resignation, I'd be needlessly forfeiting an opportunity to have my goals fulfilled.

 

[8:41]

 

If we accept the conclusion that we sometimes have a choice about what you cannot change or even cause, or at least tentatively accept it in order to explore its ramifications, then we can go on now to examine a well-known science fiction scenario called Newcomb's Problem. In Newcomb's Problem, a mischievous benefactor presents you with two boxes: there is a small, transparent box, containing a thousand dollars, which you can see; and there is a larger, opaque box, which you are truthfully told contains either a million dollars or nothing at all. You can't see which; the box is opaque, and you are not allowed to examine it. But you are truthfully assured that the box has been sealed, and that its contents will not change from whatever it already is.

 

[9:27]

 

You are now offered a very odd choice: you can take either the opaque box alone, or take both boxes, and you get to keep the contents of whatever you take. That sure sounds like a no-brainer: if we assume that maximizing your expected payoff in this particular encounter is the sole relevant goal, then regardless of what's in the opaque box, there's no benefit to foregoing the additional thousand dollars.

 

[9:51]

 

But, before you choose, you are told how the benefactor decided how much money to put in the opaque box - and that brings us to the science fiction part of the scenario. What the benefactor did was take a very detailed local snapshot of the state of the universe a few minutes ago, and then run a faster-than-real-time simulation to predict with high accuracy whether you would take both boxes, or just the opaque box. A million dollars was put in the opaque box if and only if you were predicted to take only the opaque box.

 

[10:22]

 

Admittedly the super-predictability here is a bit physically implausible, and goes beyond a mere stipulation of determinism. Still, at least it's not logically impossible - provided that the simulator can avoid having to simulate itself, and thus avoid a potential infinite regress. (The opaque box's opacity is important in that regard: it serves to insulate you from being effectively informed of the outcome of the simulation itself, so the simulation doesn't have to predict its own outcome in order to predict what you are going to do.) So, let's indulge the super-predictability assumption, and see what comes from it. Eventually, I'm going to argue that the real world is at least deterministic enough and predictable enough that some of the science-fiction conclusions do carry over to reality.

 

[11:12]

 

So, you now face the following choice: if you take the opaque box alone, then you can expect with high reliability that the simulation predicted you would do so, and so you expect to find a million dollars in the opaque box. If, on the other hand, you take both boxes, then you should expect the simulation to have predicted that, and you expect to find nothing in the opaque box. If and only if you expect to take the opaque box alone, you expect to walk away with a million dollars. Of course, your choice does not cause the opaque box's content to be one way or the other; according to the stipulated rules, the box content already is what it is, and will not change from that regardless of what choice you make.

 

[11:49]

 

But we can apply the lesson from the handraising example - the lesson that you sometimes have a choice about things your action does not change or cause - because you can reason about what would be the case if, perhaps contrary to fact, you were to take a particular hypothetical action. And, in fact, we can regard Newcomb's Problem as essentially harnessing the same past-predicate correspondence as in the handraising example - namely, if and only if you take just the opaque box, then the past state of the universe, at the time the predictor took the detailed snapshot, was such that that state leads, by physical laws, to your taking just the opaque box. And, if and only if the past state was thus, the predictor would predict you taking the opaque box alone, and so a million dollars would be in the opaque box, making that the more lucrative choice. And it's certainly the case that people who would make the opaque box choice have a much higher expected gain from such encounters than those who take both boxes.

 

[12:47]

 

Still, it's possible to maintain, as many people do, that taking both boxes is the rational choice, and that the situation is essentially rigged to punish you for your predicted rationality- much as if a written exam were perversely graded to give points only for wrong answers. From that perspective, taking both boxes remains the rational choice, even if you are then left to lament your unfortunate rationality. But that perspective is, at the very least, highly suspect in a situation where, unlike the hapless exam-taker, you are informed of the rigging and can take it into account when choosing your action, as you can in Newcomb's Problem.

 

[13:31]

 

And, by the way, it's possible to consider an even stranger variant of Newcomb's Problem, in which both boxes are transparent. In this version, the predictor runs a simulation that tentatively presumes that you'll see a million dollars in the larger box. You'll be presented with a million dollars in the box for real if and only if the simulation shows that you would then take the million dollar box alone. If, instead, the simulation predicts that you would take both boxes if you see a million dollars in the larger box, then the larger box is left empty when presented for real.

 

[14:12]

 

So, let's suppose you're confronted with this scenario, and you do see a million dollars in the box when it's presented for real. Even though the million dollars is already there, and you see it, and it can't change, nonetheless I claim that you should still take the million dollar box alone. Because, if you were to take both boxes instead, contrary to what in fact must be the case in order for you to be in this situation in the first place, then, also contrary to what is in fact the case, the box would not contain a million dollars- even though in fact it does, and even though that can't change! The same two-part reasoning applies as before: if and only if you were to take just the larger box, then the state of the universe at the time the predictor took its snapshot must have been such that you would take just that box if you were to see a million dollars in it. If and only if the past state had been thus, the predictor would have put a million dollars in the box.

 

[15:07]

 

Now, the prescription here to take just the larger box is more shockingly counter-intuitive than I can hope to decisively argue for in a brief talk, but, do at least note that a person who agrees that it is rational to take just the one box here does fare better than a person who believes otherwise, who would never be presented with a million dollars in the first place. If we do, at least tentatively, accept some of this analysis, for the sake of argument to see what follows from it, then we can move on now to another toy scenario, which dispenses with the determinism and super-prediction assumptions and arguably has more direct real world applicability.
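
One way to see why the one-boxer fares better here is to compare whole policies directly. The following Python sketch is a toy encoding of the transparent variant; the policy representation and payoff bookkeeping are illustrative assumptions, not part of the scenario as stated:

    # A policy says what you would do upon seeing each possible box state.
    # The predictor simulates you seeing a full box, and fills the box for
    # real iff the simulated you takes only that box.
    def outcome(policy):
        box_filled = (policy["sees_million"] == "one-box")
        contents = 1_000_000 if box_filled else 0
        action = policy["sees_million"] if box_filled else policy["sees_empty"]
        return contents if action == "one-box" else contents + 1_000

    resolute  = {"sees_million": "one-box", "sees_empty": "two-box"}
    two_boxer = {"sees_million": "two-box", "sees_empty": "two-box"}
    print(outcome(resolute))   # 1000000: shown the full box, takes it alone
    print(outcome(two_boxer))  # 1000: never shown the million in the first place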

 

[15:42]

 

That scenario is the famous prisoner's dilemma. The prisoner's dilemma is a two-player game in which both players make their moves simultaneously and independently, with no communication until both moves have been made. A move consists of writing down either the word “cooperate” or “defect.” The payoff matrix is as shown (your payoff / opponent's payoff):

                      Opponent cooperates    Opponent defects
    You cooperate         $99 / $99              $0 / $100
    You defect            $100 / $0              $1 / $1

If both players choose cooperate, they both receive 99 dollars. If both defect, they both get 1 dollar. But if one player cooperates and the other defects, then the one who cooperates gets nothing, and the one who defects gets 100 dollars.

 

[16:25]

 

Crucially, we stipulate that each player cares only about maximizing her own expected payoff, and that the payoff in this particular instance of the game is the only goal, with no effect on anything else, including any subsequent rounds of the game, that could further complicate the decision. Let's assume that both players are smart and knowledgeable enough to find the correct solution to this problem and to act accordingly. What I mean by the correct answer is the one that maximizes that player's expected payoff. Let's further assume that each player is aware of the other player's competence, and of the other's knowledge of her own competence, and so on. So then, what is the right answer that they'll both find?

 

[17:07]

 

On the face of it, it would be nice if both players were to cooperate, and receive close to the maximum payoff. But if I'm one of the players, I might reason that my opponent's move is causally independent of mine: regardless of what I do, my opponent's move is either to cooperate or not. If my opponent cooperates, I receive a dollar more if I defect than if I cooperate- $100 vs $99. Likewise if my opponent defects: I get a dollar more if I defect than if I cooperate, in this case $1 vs nothing. So, in either case, regardless of what move my opponent makes, my defecting causes me to get one dollar more than my cooperating causes me to get, which seemingly makes defecting the right choice. Defecting is indeed the choice that's endorsed by standard game theory. And of course my opponent can reason similarly.
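
The dominance reasoning can be checked mechanically. Here is a short Python sketch over the payoff matrix above; it is only an illustration of the argument, nothing deeper:

    # Payoffs (mine, opponent's) for each pair of moves, matching the matrix above.
    payoff = {("C", "C"): (99, 99), ("C", "D"): (0, 100),
              ("D", "C"): (100, 0), ("D", "D"): (1, 1)}

    # Dominance check: hold the opponent's move fixed and compare my options.
    for theirs in ("C", "D"):
        mine = {move: payoff[(move, theirs)][0] for move in ("C", "D")}
        print("opponent plays", theirs, "->", mine)
    # Defecting pays exactly $1 more in both rows, so it dominates causally,
    # even though (D, D) pays each player $1 while (C, C) pays each $99.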

 

[18:06]

 

So, if we're both convinced that we only have a choice about what we can cause, then we're both rationally compelled to defect, leaving us both much poorer than if we both cooperated. So, here again, an exclusively causal view of what we have a choice about leaves us lamenting that our unfortunate rationality keeps a much better outcome out of our reach. But we can arrive at a better outcome if we keep in mind the lesson from Newcomb's problem, or even the handraising example: it can make sense to act for the sake of what would be the case if you so acted, even if your action does not cause it to be the case. Even without the help of any super-predictors in this scenario, I can reason that if I, acting by stipulation as a correct solver of this problem, were to choose to cooperate, then that's what correct solvers of this problem do in such situations, and in particular that's what my opponent, as a correct solver of this problem, does too.

[19:05]

 

Similarly, if I were to figure out that defecting is correct, that's what I can expect my opponent to do. This is similar to my ability to predict what your answer to adding a given pair of numbers would be: I can merely add the numbers myself, and, given our mutual competence at addition, solve the problem. The universe is predictable enough that we routinely, and fairly accurately, make such predictions about one another. From this viewpoint, I can reason that, if I were to cooperate or not, then my opponent would make the corresponding choice- if indeed we are both correctly solving the same problem, my opponent maximizing his expected payoff just as I maximize mine. I therefore act for the sake of what my opponent's action would then be, even though I cannot causally influence my opponent to take one action or the other, since there is no communication between us. Accordingly, I cooperate, and so does my opponent, using similar reasoning, and we both do fairly well.
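
Under that symmetry assumption, the live options collapse to the diagonal of the payoff matrix, since whatever I correctly conclude, my opponent concludes too. A minimal sketch:

    # Superrational view: both players, correctly solving the same problem,
    # make the same move, so only the diagonal outcomes are attainable.
    diagonal = {"C": 99, "D": 1}   # my payoff when we both make this move
    best = max(diagonal, key=diagonal.get)
    print(best, diagonal[best])    # C 99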

 

[20:05]

 

One problem with the Prisoner's Dilemma is that the idealized degree of symmetry that's postulated between the two players may seldom occur in real life. But there are some important generalizations that may apply much more broadly. In particular, in many situations, the beneficiary of your cooperation may not be the same as the person whose cooperation benefits you. Instead, your decision whether to cooperate with one person may be symmetric to a different person's decision to cooperate with you. Again, this holds even in the absence of any causal influence upon your potential benefactors, even if they will never learn of your cooperation with others, and even, moreover, if you already know of their cooperation with you before you make your own choice. That is analogous to the transparent version of Newcomb's Problem: there too, you act for the sake of something that you already know already obtains.

 

[21:04]

 

Anyways, as many authors have noted with regard to the Prisoner's Dilemma, this is beginning to sound a little like the Golden Rule or the Categorical Imperative: act towards others as you would like others to act towards you, in similar situations. The analysis in terms of counterfactual reasoning provides a rationale, under some circumstances, for taking an action that causes net harm to your own interests and net benefit to others' interests, although the choice is still ultimately grounded in your own goals, via what would be the case, in virtue of others' isomorphic behavior, if you yourself were to cooperate or not. Having a derivable rationale for ethical or benevolent behavior would be desirable for all sorts of reasons, not least of which is to help us make the momentous decisions as to how or even whether to engineer the Singularity, and also to tell us what sort of value system we might want- or expect- an AI to have.

 

[22:08]

 

But a key assumption of the argument just given is that all participants are perfectly rational, and, further, aware of one another's rationality- Douglas Hofstadter refers to this as the “superrationality” assumption. It would be nice to be able to show that, even among those of us with more limited rationality, there's still enough of a “would be-if” relation, albeit perhaps quantitatively weakened, between my own choice and others' choices in Prisoner's Dilemma situations to justify the cooperative solution in such cases. But I'm not aware of an entirely satisfactory treatment of that question, so it remains open as far as I know. Still, I think it's encouraging that we can at least get our foot in the door, by arguing for the correctness of the cooperative solution in some cases that presume idealized rationality.
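
For what it's worth, one simple way to model a quantitatively weakened relation is to suppose that the other player matches my move with some probability q rather than with certainty. That model is my own assumption, not something from the talk, but it suggests the threshold is not very demanding:

    # With probability q the opponent's move matches mine; otherwise it differs.
    def eu(my_move, q):
        mine = {("C", "C"): 99, ("C", "D"): 0, ("D", "C"): 100, ("D", "D"): 1}
        same = my_move
        other = "D" if my_move == "C" else "C"
        return q * mine[(my_move, same)] + (1 - q) * mine[(my_move, other)]

    for q in (0.5, 0.51, 0.9):
        print(q, eu("C", q), eu("D", q))
    # With these payoffs, cooperating wins as soon as q exceeds 100/198,
    # roughly 0.505: even a modest correlation suffices.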

 

[23:00]

 

Summing up, the key points are that:

  • Inalterability does not imply futility

  • You have a choice about some things that your action cannot change or even cause

  • One consequence is a derivable prescription to sometimes cooperate with others even when doing so causes net harm to your goals and net benefits to their goals

  • The same false intuition that makes all choice seem impossible or futile given determinism also makes it seem futile to act for the sake of the million dollars in Newcomb's Problem, or for the sake of another player's cooperation in the Prisoner's Dilemma, since you cannot cause, or even change, what you act for the sake of

Making a fully convincing case for all of this would require a convincing theory of what the "would be-if" relation, the counterfactual or subjunctive relation, consists of, which I have not presented. What this talk outlined instead is a glimpse of some consequences that such a theory would arguably have to lead to, some answers the theory would have to give in some key examples, if the theory can avoid putting us in the position of lamenting our own rationality.

As for an underlying theory, my book Good and Real sketches what could be seen as a modified evidentialist theory, for those familiar with that concept. But there is some exciting work being pursued now by Eliezer Yudkowsky and others at the Singularity Institute and elsewhere that may be converging on a much more rigorous and elegant underlying theory, and hopefully we'll be hearing more about that in the not-too-distant future.

Thoughts on a possible solution to Pascal's Mugging

2 Dolores1984 01 August 2012 12:32PM

For those who aren't familiar, Pascal's Mugging is a simple thought experiment that seems to demonstrate an intuitive flaw in naive expected utility maximization.  In the classic version, someone walks up to you on the street, and says, 'Hi, I'm an entity outside your current model of the universe with essentially unlimited capabilities.  If you don't give me five dollars, I'm going to use my powers to create 3^^^^3 people, and then torture them to death.'  (For those not familiar with Knuth up-arrow notation, see here).  The idea being that however small your probability is that the person is telling the truth, they can simply state a number that's grossly larger -  and when you shut up and multiply, expected utility calculations say you should give them the five dollars, along with pretty much anything else they ask for.  

Intuitively, this is nonsense.  However, an AI under construction doesn't have a piece of code that lights up when exposed to nonsense.  Not unless we program one in.  And formalizing why, exactly, we shouldn't listen to the mugger is not as trivial as it sounds.  The actual underlying problem has to do with how we handle arbitrarily small probabilities.  There are a number of variations you could construct on the original problem that present the same paradoxical results.  There are also a number of simple hacks you could undertake that produce the correct results in this particular case, but these are worrying (not to mention unsatisfying) for a number of reasons.
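
For concreteness, here is what the naive calculation looks like in Python. 3^^^^3 cannot be represented directly, so the sketch substitutes 10^100 as a stand-in, and the prior below is an arbitrary assumption; the real numbers would only make the imbalance more extreme:

    # Naive expected-utility comparison for the mugger's offer.
    prior_truthful = 1e-30        # assumed, absurdly skeptical prior
    lives_at_stake = 10.0 ** 100  # stand-in for 3^^^^3
    cost_of_paying = 5.0          # five dollars, treated as 5 units of utility

    eu_pay = -cost_of_paying
    eu_refuse = -prior_truthful * lives_at_stake  # one unit of utility per life
    print(eu_pay, eu_refuse)  # -5.0 vs -1e+70: naive EU says hand over the $5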

So, with the background out of the way, let's move on to a potential approach to solving the problem which occurred to me about fifteen minutes ago while I was lying in bed with a bad case of insomnia at about five in the morning.  If it winds up being incoherent, I blame sleep deprivation.  If not, I take full credit.   

 

Let's take a look at a new thought experiment.  Let's say someone comes up to you and tells you that they have magic powers, and will make a magic pony fall out of the sky.  Let's say that, through some bizarrely specific priors, you decide that the probability that they're telling the truth (and, therefore, the probability that a magic pony is about to fall from the sky) is exactly 1/2^100.  That's all well and good.

Now, let's say that later that day, someone comes up to you, and hands you a fair quarter and says that if you flip it one hundred times, the probability that you'll get a straight run of heads is 1/2^100.  You agree with them, chat about math for a bit, and then leave with their quarter.  

I propose that the probability value in the second case, while superficially identical to the probability value in the first case, represents a fundamentally different kind of claim about reality than the first case.  In the first case, you believe, overwhelmingly, that a magic pony will not fall from the sky.  You believe, overwhelmingly, that the probability (in underlying reality, divorced from the map and its limitations) is zero.  It is only grudgingly that you inch even a tiny morsel of probability into the other hypothesis (that the universe is structured in such a way as to make the probability non-zero).  

In the second case, you also believe, overwhelmingly, that you will not see the event in question (a run of heads).  However, you don't believe that the probability is zero.  You believe it's 1/2^100.  You believe that, through only the lawful operation of the universe that actually exists, you could be surprised, even if it's not likely.  You believe that if you ran the experiment in question enough times, you would probably, eventually, see a run of one hundred heads.  This is not true for the first case.  No matter how many times somebody pulls the pony trick, a rational agent is never going to get their hopes up.      
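
This difference shows up mechanically under repetition. In the Python sketch below, the pony case is modeled (as one toy possibility) as a mixture: with prior 1/2^100 the man has powers and the pony always falls, and otherwise it never does. The trial count is chosen just to make the coin run likely:

    import math

    p = 2.0 ** -100     # the headline probability in both stories
    trials = 2 ** 102   # repeat each experiment an enormous number of times

    # Coin case: independent chances accumulate, so success eventually arrives.
    p_coin = 1 - math.exp(trials * math.log1p(-p))
    print(p_coin)       # about 0.98

    # Pony case: the uncertainty is about which world you are in, not about
    # the trial, so repetition never lifts the total above the prior itself.
    p_pony = p * 1.0 + (1 - p) * 0.0
    print(p_pony)       # still about 7.9e-31, no matter how many tricks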

 

I would like, at this point, to talk about the notion of metaconfidence.  When we talk to the crazy pony man, and to the woman with the coin, what we leave with are two identical numerical probabilities.  However, those numbers do not represent the sum total of the information at our disposal.  In the two cases, we have differing levels of confidence in our levels of confidence.  And, furthermore, this difference has actual ramifications for what a rational agent should expect to observe.  In other words, even from a very conservative perspective, metaconfidence intervals pay rent.  By treating the two probabilities as identical, we are needlessly throwing away information.  I'm honestly not sure if this topic has been discussed before.  I am not up to date on the literature on the subject.  If the subject has already been thoroughly discussed, I apologize for the waste of time.

Disclaimer aside, I'd like to propose that we push this a step further, and say that metaconfidence should play a role in how we calculate expected utility.  If we have a very small probability of a large payoff (positive or negative), we should behave differently when metaconfidence is high than when it is low.          

From a very superficial analysis, lying in bed, metaconfidence appears to be directional.  A low metaconfidence, in the case of the pony claim, should not increase the probability that the probability of a pony dropping out of the sky is HIGHER than our initial estimate.  It works the other way as well: if we have a very high degree of confidence in some event (the sun rising tomorrow), and we get some very suspect evidence to the contrary (an ancient civilization predicting the end of the world tonight), and we update our probability downward slightly, our low metaconfidence should not make us believe that the sun is less likely to rise tomorrow than we thought.  Low metaconfidence should move our effective probability estimate against the direction of the evidence that we have low confidence in: the pony is less likely, and the sunrise is more likely, than a naive probability estimate would suggest.
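
One crude way to operationalize this directionality (my own toy encoding in Python, not something worked out in the post) is to blend the low-confidence estimate back toward the pre-evidence baseline:

    # Toy directional adjustment: with low confidence in an estimate, fall
    # partly back to the baseline the suspect evidence moved you away from.
    def effective_probability(estimate, baseline, meta):
        # meta = 1: trust the estimate fully; meta = 0: ignore the evidence.
        return meta * estimate + (1 - meta) * baseline

    # Pony claim: baseline ~0; dubious evidence nudged the estimate upward.
    print(effective_probability(2.0 ** -100, 0.0, meta=0.01))  # pony now less likely
    # Sunrise: baseline ~1; a dubious prophecy nudged the estimate downward.
    print(effective_probability(0.999, 1.0, meta=0.01))        # sunrise more likely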

So, a claim like the pony claim (or Pascal's mugging), where you have a very low estimated probability and a very low metaconfidence, should be treated as dramatically less likely to actually happen, in the real world, than a case in which we have a low estimated probability but a very high confidence in that probability.  See the pony versus the coins.  Rationally, we can only mathematically justify so low a confidence in the crazy pony man's claims.  However, in the territory, you can add enough coins that the two probabilities are mathematically equal, and you are still more likely to get a run of heads than you are to have a pony magically drop out of the sky.  I am proposing metaconfidence weighting as a way to get around this issue, and to allow our map to more accurately reflect the underlying territory.  It's not perfect, since metaconfidence is still, ultimately, calculated from our map of the territory, but it seems to me, based on my extremely brief analysis, that it is at least an improvement on the current model.

Essentially, this idea is based on the understanding that the numbers that we generate and call probability do not, in fact, correspond to the actual rules of the territory.  They are approximations, they are perturbed by observation, and our finite data set limits the resolution of the probability intervals we can draw.  This causes systematic distortions at the extreme ends of the probability spectrum, especially at the small end, where the relative scale of the distortion rises dramatically as the actual probability shrinks.  I believe that the apparently absurd behavior demonstrated by an expected-utility agent exposed to Pascal's mugging is a result of these distortions.  I am proposing we attempt to compensate by filling in the missing information at the extreme ends of the bell curve with data from our model about our sources of evidence, and about the underlying nature of the territory.  In other words, this is simply a way to use our available evidence more efficiently, and I suspect that, in practice, it eliminates many of the Pascal's-mugging-style problems we encounter currently.

I apologize for not having worked the math out completely.  I would like to reiterate that it is six thirty in the morning, and I've only been thinking about the subject for about a hundred minutes.  That said, I'm not likely to get any sleep either way, so I thought I'd jot the idea down and see what you folks thought.  Having outside eyes is very helpful, when you've just had a Brilliant New Idea.  

[video] Paul Christiano's impromptu tutorial on AIXI and TDT

7 lukeprog 19 March 2012 05:20PM

Paul Christiano was about to give a tutorial on AIXI and TDT, so I whipped out my iPhone and recorded it. His tutorial wasn't carefully planned or executed, but it may still be useful to some. Note that when Paul writes "UDT" on a piece of paper he really means "TDT." :)

 

HD video download links: 1, 2.

The Noddy problem

35 Apprentice 12 January 2012 10:18PM

An episode of the Noddy animated series has the following plot.

Noddy needs to go pick up Martha Monkey at the station. But it's such a nice, sunny day that he would prefer to play around outside. He gets an idea to solve this dilemma. He casts a duplication spell on himself and his car and tells the duplicate to go fetch Martha while he goes out to play. Later, Noddy is out having fun when he suddenly spots his duplicate. It turns out that the duplicate also preferred playing outside to doing the errand so he also cast a duplication spell. Then they see another duplicate, and another...

I think this story makes for a nice simple illustration of one of our perennial decision-theoretic issues: when making decisions you should take into account that agents identical to yourself will make the same decision in the same situation. A common real-life example of the Noddy problem is when we try to pawn off our dietary problems on our future selves.

preferences:decision theory :: data:code

3 ArthurB 19 February 2011 07:45AM

 

I'd like to present a couple of thoughts. While I am somewhat confident in my reasoning, my conclusions strongly contradict what I perceive (possibly incorrectly) to be the consensus around decision theory on LessWrong. This consensus has been formed by people who have spent more time than me thinking about it, and who are more intelligent than I am. I am aware that this is strong evidence that I am mistaken or stating the obvious. I believe nonetheless that the argument I'm about to make is valuable and should be heard.

It is argued that the key difference between Newcomb's problem and Solomon's problem is that precommitment is useful in the former and useless in the latter. I agree that the problems are indeed different, but I do not think that is the fundamental reason. The devil is in the details.

Solomon's problem states that

 - There is a gene that causes people to chew gum and to develop throat cancer
 - Chewing gum benefits everyone

It is generally claimed that EDT would decide not to chew gum, because doing so would be evidence that the agent is in a state where its expected utility is reduced. This seems incorrect to me. The ambiguity is in what is meant by "causes people to chew gum". If the gene really causes people to chew gum, then that gene by definition affects the agent's decision theory, and the hypothesis that the agent is also following EDT is contradictory. What is generally meant is that having this gene induces a preference to chew gum, which is then acted upon by whatever decision algorithm is used. An EDT agent must be fully aware of its own preferences; otherwise it could not calculate its own utility. Therefore, the expected utility of chewing gum must be calculated conditional on having a preexisting or non-preexisting taste for gum. In a nutshell, an EDT agent updates not on its action of chewing gum, but on its desire to do so.
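
That distinction can be made concrete with a toy joint model in Python, where every number below is invented for illustration. Once the agent conditions on the taste, the gene is screened off, and chewing strictly adds utility:

    # Toy joint model of Solomon's problem; all parameters are made up.
    p_gene = 0.5
    p_taste  = {True: 0.9, False: 0.1}   # P(taste for gum | gene?)
    p_cancer = {True: 0.8, False: 0.1}   # P(cancer | gene?)
    u_cancer, u_gum = -1000.0, 1.0       # chewing itself is a small pure benefit

    def p_gene_given_taste(taste):
        # Bayes: the desire, not the act of chewing, carries the gene evidence.
        like = lambda gene: p_taste[gene] if taste else 1 - p_taste[gene]
        joint = p_gene * like(True)
        return joint / (joint + (1 - p_gene) * like(False))

    def eu(chew, taste):
        pg = p_gene_given_taste(taste)
        cancer_term = (pg * p_cancer[True] + (1 - pg) * p_cancer[False]) * u_cancer
        return cancer_term + (u_gum if chew else 0.0)

    for taste in (True, False):
        print(taste, eu(True, taste) - eu(False, taste))  # +1.0 either way: chew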

I've established here a distinction between preferences and decision theory. In fact, the two are interchangeable. It is always possible to hard code preferences in the decision theory, and vice versa. The distinction is very similar to the one drawn between code and data. It is an arbitrary but useful distinction. Intuitively, I believe hard coding preferences in the decision algorithm is poor design, though I do not have a clear argument why that is.

If we insist on preferences being part of the decision algorithm, the best decision algorithm for Solomon's problem is the one that doesn't have a cancer-causing gene. If the algorithm is EDT, then liking gum is a preference, and EDT makes the same decision as CDT.

Let's now look at Newcomb's problem. Omega's decision is clearly not based on a subjective preference for one-boxing or two-boxing (an aesthetic preference, say). Omega's decision is based on our decision algorithm itself. This is the key difference between the two problems, and this is why precommitment works for Newcomb's and not Solomon's.

Solomon's problem is equivalent to this problem, which is not Newcomb's

- If Omega thinks you were born loving Beige, he puts $1,000 in box Beige and nothing in box Aquamarine.
- Otherwise, he puts $1,000 in box Beige and $1,000,000 in box Aquamarine.

In this problem, both CDT and EDT (correctly) two box. Again, this is because EDT knows that it loves beige.

Now the real Newcomb's problem. I argue that an EDT agent should integrate his own decision as evidence. 

 - If EDT's decision is to two-box, then Omega's prediction is that EDT two-boxes, and EDT should indeed two-box.
 - If EDT's decision is to one-box, then Omega's prediction is that EDT one-boxes, and EDT should two-box.

Since EDT reflects on its own decision, it can only settle on the unique fixed point, which is to two-box.
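
The fixed-point search can be written out explicitly. Here is a Python sketch of the argument, assuming a perfectly accurate Omega and the usual payoffs:

    # Omega's prediction equals whatever the agent actually decides, so a
    # decision is stable only if it is the best reply to its own prediction.
    def payoff(action, prediction):
        box_b = 1_000_000 if prediction == "one-box" else 0
        return box_b if action == "one-box" else box_b + 1_000

    for decision in ("one-box", "two-box"):
        best_reply = max(("one-box", "two-box"),
                         key=lambda a: payoff(a, decision))
        stable = " (fixed point)" if best_reply == decision else ""
        print(decision, "-> best reply:", best_reply + stable)
    # Only two-boxing is stable: against either prediction, it pays $1,000 more.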

Both CDT and EDT decide to chew gum and to two-box.

If we're out shopping for decision algorithms (TDT, UDT...), we might as well shop for a set of preferences, since they can be interchangeable. It is clear that some preferences allow winning when variable-sum games are involved. This has been implemented by evolution as moral preferences, not as decision algorithms. One useful preference is the preference to keep one's word. Such a preference allows one to pay Parfit's hitchhiker without involving any preference reversal. Once you're safe, you do not try to avoid paying, because you genuinely prefer not breaking your promise to keeping the money. Yes, you could have preferences to two-box, but there is no reason why you should cater in advance to crazy cosmic entities rewarding certain algorithms or preferences. Omega is no more likely than the TDT and UDT minimizer, an evil entity known for torturing TDT and UDT practitioners.

 

Edit: meant to write EDT two-boxes, which is the only fixed point.

 

Punishing future crimes

3 Bongo 28 January 2011 09:00PM

Here's an edited version of a puzzle from the book "Chuck Klosterman four" by Chuck Klosterman.

It is 1933. Somehow you find yourself in a position where you can effortlessly steal Adolf Hitler's wallet. The theft will not affect his rise to power, the nature of WW2, or the Holocaust. There is no important identification in the wallet, but the act will cost Hitler forty dollars and completely ruin his evening. You don't need the money. The odds that you will be caught committing the crime are negligible. Do you do it?

When should you punish someone for a crime they will commit in the future? Discuss.

Discussion for Eliezer Yudkowsky's paper: Timeless Decision Theory

10 Alexei 06 January 2011 12:28AM

I have not seen any place to discuss Eliezer Yudkowsky's new paper, titled Timeless Decision Theory, so I decided to create a discussion post. (Have I missed an already existing post or discussion?)

Question about self modifying AI getting "stuck" in religion

5 [deleted] 01 January 2011 12:22AM

Hey. I'm relatively new around here. I have read the core reading of the Singularity Institute, and quite a few Less Wrong articles, and Eliezer Yudkowsky's essay on Timeless Decision Theory. This question is phrased through Christianity, because that's where I thought of it, but it's applicable to lots of other religions and nonreligious beliefs, I think.

According to Christianity, belief makes you stronger and better. The Bible claims that people who believe are substantially better off both while living and after death. So if a self-modifying decision maker decides for a second that the Christian faith is accurate, won't he modify his decision-making algorithm to never doubt the truth of Christianity? Given what he knows, it is the best decision.

And so, if we build a self-modifying AI, switch it on, and in the first ten milliseconds it comes to believe in the Christian god, wouldn't that permanently cripple it, as well as probably causing it to fail most definitions of Friendly AI?

When designing an AI, how do you counter this problem? Have I missed something?

Thanks, GSE

EDIT: Yep, I had misunderstood what TDT was. I just meant self modifying systems. Also, I'm wrong.