Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

The Absent-Minded Driver

27 Post author: Wei_Dai 16 September 2009 12:51AM

This post examines an attempt by professional decision theorists to treat an example of time inconsistency, and asks why they failed to reach the solution (i.e., TDT/UDT) that this community has more or less converged upon. (Another aim is to introduce this example, which some of us may not be familiar with.) Before I begin, I should note that I don't think "people are crazy, the world is mad" (as Eliezer puts it) is a good explanation. Maybe people are crazy, but unless we can understand how and why people are crazy (or to put it more diplomatically, "make mistakes"), how can we know that we're not being crazy in the same way or making the same kind of mistakes?

The problem of the ‘‘absent-minded driver’’ was introduced by Michele Piccione and Ariel Rubinstein in their 1997 paper "On the Interpretation of Decision Problems with Imperfect Recall". But I'm going to use "The Absent-Minded Driver" by Robert J. Aumann, Sergiu Hart, and Motty Perry instead, since it's shorter and more straightforward. (Notice that the authors of this paper worked for a place called Center for the Study of Rationality, and one of them won a Nobel Prize in Economics for his work on game theory. I really don't think we want to call these people "crazy".)

Here's the problem description:

An absent-minded driver starts driving at START in Figure 1. At X he
can either EXIT and get to A (for a payoff of 0) or CONTINUE to Y. At Y he
can either EXIT and get to B (payoff 4), or CONTINUE to C (payoff 1). The
essential assumption is that he cannot distinguish between intersections X
and Y, and cannot remember whether he has already gone through one of
them.

graphic description of the problem

At START, the problem seems very simple. If p is the probability of choosing CONTINUE at each intersection, then the expected payoff is p2+4(1-p)p, which is maximized at p = 2/3. Aumann et al. call this the planning-optimal decision.

The puzzle, as Piccione and Rubinstein saw it, is that once you are at an intersection, you should think that you have some probability α of being at X, and 1-α of being at Y. Your payoff for choosing CONTINUE with probability p becomes α[p2+4(1-p)p] + (1-α)[p+4(1-p)], which doesn't equal p2+4(1-p)p unless α = 1. So, once you get to an intersection, you'd choose a p that's different from the p you thought optimal at START.

Aumann et al. reject this reasoning and instead suggest a notion of action-optimality, which they argue should govern decision making at the intersections. I'm going to skip explaining its definition and how it works (read section 4 of the paper if you want to find out), and go straight to listing some of its relevant properties:

  1. It still involves a notion of "probability of being at X".
  2. It's conceptually more complicated than planning-optimality.
  3. Mathematically, it has the same first-order necessary conditions as planning-optimality, but different sufficient conditions.
  4. If mixed strategies are allowed, any choice that is planning-optimal is also action-optimal.
  5. A choice that is action-optimal isn't necessarily planning-optimal. (In other words, there can be several action-optimal choices, only one of which is planning-optimal.)
  6. If we are restricted to pure strategies (i.e., p has to be either 0 or 1) then the set of action-optimal choices in this example is empty, even though there is still a planning-optimal one (namely p=1).

In problems like this one, UDT is essentially equivalent to planning-optimality. So why did the authors propose and argue for action-optimality despite its downsides (see 2, 5, and 6 above), instead of the alternative solution of simply remembering or recomputing the planning-optimal decision at each intersection and carrying it out?

Well, the authors don't say (they never bothered to argue against it), but I'm going to venture some guesses:

  • That solution is too simple and obvious, and you can't publish a paper arguing for it.
  • It disregards "the probability of being at X", which intuitively ought to play a role.
  • The authors were trying to figure out what is rational for human beings, and that solution seems too alien for us to accept and/or put into practice.
  • The authors were not thinking in terms of an AI, which can modify itself to use whatever decision theory it wants to.
  • Aumann is known for his work in game theory. The action-optimality solution looks particularly game-theory like, and perhaps appeared more natural than it really is because of his specialized knowledge base.
  • The authors were trying to solve one particular case of time inconsistency. They didn't have all known instances of time/dynamic/reflective inconsistencies/paradoxes/puzzles laid out in front of them, to be solved in one fell swoop.

Taken together, these guesses perhaps suffice to explain the behavior of these professional rationalists, without needing to hypothesize that they are "crazy". Indeed, many of us are probably still not fully convinced by UDT for one or more of the above reasons.

EDIT: Here's the solution to this problem in UDT1. We start by representing the scenario using a world program:

def P(i, j):
    if S(i) == "EXIT":
        payoff = 0
    elif S(j) == "EXIT":
        payoff = 4
    else:
        payoff = 1

(Here we assumed that mixed strategies are allowed, so S gets a random string as input. Get rid of i and j if we want to model a situation where only pure strategies are allowed.) Then S computes that payoff at the end of P, averaged over all possible i and j, is maximized by returning "EXIT" for 1/3 of its possible inputs, and does that.

Comments (141)

Comment author: Eliezer_Yudkowsky 16 September 2009 07:41:57AM 23 points [-]

There's an old story about a motorist who gets a flat tire next to a mental hospital. He goes over to change the tire, putting the lug nuts into the wheel cap... and sees that one of the patients is staring at him from behind the fence. Rattled, he steps on the wheel cap, and the lug nuts go into the storm sewer.

The motorist is staring, disheartened, at the sewer drain, when the patient speaks up from behind the fence: "Take one lug nut off each of the other wheels, and use those to keep on the spare tire until you get home."

"That's brilliant!" says the motorist. "What are you doing in a mental hospital?"

"I'm here because I'm crazy," says the patient, "not because I'm stupid."

Robert Aumann, Nobel laureate, is an Orthodox Jew. And Isaac Newton wasted a lot of his life on Christian mysticism. I don't think theists, or Nobel laureate physicists who don't accept MWI, are stupid. I think they're crazy. There's a difference.

Remind me to post at some point about how rationalists, in a certain stage in their development, must, to progress further, get over their feeling of nervousness about leaving the pack and just break with the world once and for all.

Comment author: Wei_Dai 16 September 2009 01:59:57PM 21 points [-]

If crazy people can nevertheless reach brilliant and correct solutions, then in what sense is their craziness an explanation for the fact that they failed to reach some solution? I really don't see what Aumann's religiousness has to do with the question I asked in this post, not to mention that he's just one person who worked on this problem. (Google Scholar lists 171 citations for P&R's paper.)

To put it another way, if we add "Aumann is religious" to the list of possible explanations I gave, that seems to add very little, if any, additional explanatory value.

Comment author: Eliezer_Yudkowsky 16 September 2009 07:41:27PM 16 points [-]

Because crazy smart people don't consistently reach solutions. It's not surprising when they're right, but it's not surprising when they're wrong, either. There are very few people I know such that I'm surprised when they seem to get something wrong, and the key factor in that judgment is high sanity, more than high intelligence.

I'm also beginning to have a very strange thought that a reddit-derived blog system with comment upvoting and karma is just a vastly more effective way of researching decision-theory problems than publication in peer-reviewed journals.

Comment author: Christian_Szegedy 17 September 2009 09:02:47PM *  15 points [-]

Chess World Champions are sometimes notoriously superstitious, you can still rely on the consistency of their chess moves.

Comment author: LordTC 05 May 2010 05:33:12PM 10 points [-]

They really ought to be, what's the rational value in putting the time and effort into chess to become a world champion at it.

I played it semi-seriously when I was young, but gave it up when in order to get to the next level I'd have to study more than play. Most of the people I know who were good at a competitive intellectual game dropped out of school to pursue it, because they couldn't handle studying at that level for both.

I find it rather difficult to believe that pursuing chess over school is the rationally optimal choice, so I wouldn't be remotely surprised to find that those who get to that level are irrational or superstitious when it comes to non-chess problems.

Comment author: CarlShulman 17 September 2009 11:21:46PM 6 points [-]

Chess provides very strong objective feedback on what does and doesn't work.

Comment author: Christian_Szegedy 18 September 2009 12:52:09AM 0 points [-]

... as opposed to what?

Comment author: Bo102010 18 September 2009 02:42:32AM 2 points [-]

Psychotherapy - recommended reading is Robyn Dawes' House of Cards.

Comment author: jimrandomh 18 September 2009 03:39:36AM *  1 point [-]

Chess World Champions are sometimes notoriously superstitious, you can still rely on the consistency of their chess moves.

No, you can't. In 2006, world chess champion Vladimir Kramnik accidentally left himself open to mate in one when playing against computer program Deep Fritz (http://www.chessbase.com/newsdetail.asp?newsid=3509). Even the very best individual humans are all subject to simple mistakes of types that computers simply don't ever make.

Comment author: Christian_Szegedy 18 September 2009 06:59:40AM *  8 points [-]

This is irrelevant. Human players make mistakes. The question is whether being superstitious makes them make more mistakes.

Comment author: taw 19 September 2009 04:17:08AM 0 points [-]

It's not just chess - here's two 9dan go players, one of them misthinking and killing his own group: http://www.youtube.com/watch?v=qt1FvPxmmfE

Such spectacular mistakes are not entirely unknown in go, even in top level title matches.

In pro-level shogi it's even worse, as illegal moves (which are instant lose) are supposedly not at all uncommon.

Comment author: Christian_Szegedy 19 September 2009 04:55:57AM *  4 points [-]

The original question was not whether humans make mistakes (they do in every area, this is undisputed) but whether irrationality in one domain makes more unreliable in others.

Comment author: jimrandomh 19 September 2009 01:03:33PM *  1 point [-]

No, the original question was whether we should be surprised when humans make mistakes, and what influences the probability of them doing so. The occasional grandmaster bluder shows that even for extremely smart humans within their field of expertise, the human mind effectively has a noise floor - ie, some minimum small probability of making stupid random decisions. Computers, on the other hand, have a much lower noise floor (and can be engineered to make it arbitrarily low).

Comment author: Tyrrell_McAllister 05 October 2009 11:25:46PM *  6 points [-]

You shouldn't be surprised that a chess world champion has made a mistake over the course of their entire career. However, given a specific turn, you should be surprised if the world champion made a mistake in that turn. That is, given any turn, you can rely on their making a good move on that turn. You can't rely with perfect confidence, of course, but that wasn't the claim.

Comment author: CronoDAS 19 September 2009 04:41:58AM 2 points [-]
Comment author: CarlShulman 17 September 2009 01:49:32AM *  8 points [-]

You should be emailing people like Adam Elga and such to invite them to participate then.

Comment author: matt 23 September 2009 09:42:49PM 7 points [-]

Surely Hanson's favorite (a market) is worth a try here. You're more raging (as he does) against the increasingly obvious inefficiency of peer reviewed journals than discovering Reddit + mods as a particularly good solution, no?.

Comment author: gwern 03 January 2010 03:27:41PM 4 points [-]

An interesting question: is there any way to turn a karma system into a prediction market?

The obvious way to me is to weight a person's voting influence by how well their votes track the majority, but that just leads to groupthink.

The key to prediction markets, as far as I can tell, is that predictions unambiguously come true or false and so the correctness of a prediction-share can be judged without reference to the share-price (which is determined by everyone else in what could be a bubble even) - but there is no similar outside objective check on LW postings or comments, is there?

Comment author: matt 10 January 2010 10:15:25PM 5 points [-]

I'd love to do a real money prediction market. Unfortunately western governments seek to protect their citizens from the financial consequences of being wrong (except in state sponsored lotteries… those are okay), and the regulatory costs (financial plus the psychic pain of navigating bureaucracy) of setting one up are higher than the payback I expect from the exercise.

Comment author: LordTC 05 May 2010 05:23:11PM 4 points [-]

The UBC is able to do a non-profit elections prediction market, and it generally does better than the average of the top 5 pollsters.

The popular vote market is you pay $1 for 1 share of CON, LIB, NDP, Green, Other, and you can trade shares like a stockmarket.

The ending payout is $1 * % of popular vote that group gets.

There are other markets such as a seat market, and a majority market.

The majority market pays 50/50 if no majority is reached, and 100/0 otherwise, which makes it pretty awkward in some respects. Generally predicting a minority government the most profitable action is to try and trade for shares of the loser. This is probably the main reason its restricted to the two parties with a chance of winning one if it were the same 5 way system, trading LIB and CON for GREEN, OTHER and NDP to exploit a minority government would probably bias the results. In this case in a minority the payout would be 20/20/20/20/20, but many traders would be willing to practically throw away shares of GREEN, OTHER and NDP because they "know" those parties have a 0% chance of winning a majority. This leads to artificial devaluation and bad prediction information.

By trading 1 share of CON for 5 GREEN and 5 OTHER, you just made 10 times the money in a minority government, and that's the payoff you're looking for instead of saying that you think the combined chances of Green and Other winning a majority is 1/6th that of the conservatives winning.

Of course they still have this problem with Liberals and Conservatives where trading out of a party at a favorable rate might just be betting minority.

I think the problem with a prediction market is you need a payout mechanism, that values the shares at the close of business, for elections there is a reasonable structure.

For situations where there isn't a clear solution or termination that gets much more complicated.

Comment author: Douglas_Knight 18 September 2009 01:38:11AM 3 points [-]

I'm also beginning to have a very strange thought that a reddit-derived blog system with comment upvoting and karma is just a vastly more effective way of researching decision-theory problems than publication in peer-reviewed journals.

How relevant is the voting, as opposed to just the back and forth?

I think the voting does send a strong signal to people to participate (there's a lot more participation here than at OB). If this is working better than mailing lists, it may be the karma, but it may also be that it can support more volume by making it easier to ignore threads.

Comment author: CarlShulman 16 September 2009 02:21:24AM *  14 points [-]

"one of them won a Nobel Prize in Economics for his work on game theory. I really don't think we want to call these people "crazy".)"

Aumann is a religious Orthodox Jew who has supported Bible Code research. He's brilliant and an expert, but yes, he's crazy according to local mores.

Of course, that doesn't mean we should dismiss his work. Newton spent much of his life on Christian mysticism.

Comment author: arundelo 16 September 2009 03:53:22AM *  8 points [-]
Comment author: CronoDAS 16 September 2009 05:34:45AM 3 points [-]

As I'm sure we all know, the so-called "bible code" turned out to be a way to find just about any phrase you want from any document whatsoever.

Comment author: taw 16 September 2009 02:49:44AM 40 points [-]

You cannot assume any α, and choose p based on it, for α depends on p. You just introduced a time loop into your example.

Comment author: pengvado 16 September 2009 04:40:07AM *  49 points [-]

Indeed, though it doesn't have to be a time loop, just a logical dependency. Your expected payoff is α[p^2+4(1-p)p] + (1-α)[p+4(1-p)]. Since you will make the same decision both times, the only coherent state is α=1/(p+1). Thus expected payoff is (8p-6p^2)/(p+1), whose maximum is at about p=0.53. What went wrong this time? Well, while this is what you should use to answer bets about your payoff (assuming such bets are offered independently at every intersection), it is not the quantity you should maximize: it double counts the path where you visit both X and Y, which involves two instances of the decision but pays off only once.

Comment author: Eliezer_Yudkowsky 16 September 2009 08:12:13AM *  7 points [-]

Mod parents WAY up! I should've tried to solve this problem on my own, but I wasn't expecting it to be solved in the comments like that!

Awesome. I'm steadily upgrading my expected utilities of handing decision-theory problems to Less Wrong.

EDIT 2016: Wei Dai below is correct, this was my first time encountering this problem and I misunderstood the point Wei Dai was trying to make.

Comment author: ciphergoth 19 September 2009 03:18:18PM 23 points [-]

You make it sound as if you expect to expect a higher utility in the future than you currently expect...

Comment author: Wei_Dai 18 September 2009 10:07:56AM 7 points [-]

The parents that you referred to are now at 17 and 22 points, which seems a bit mad to me. Spotting the errors in P&R's reasoning isn't really the problem. The problem is to come up with a general decision algorithm that both works (in the sense of making the right decisions) and (if possible) makes epistemic sense.

So far, we know that UDT works but it doesn't compute or make use of "probability of being at X" so epistemically it doesn't seem very satisfying. Does TDT give the right answer when applied to this problem? If so, how? (It's not specified formally enough that I can just apply it mechanically.) Does this problem suggest any improvements or alternative algorithms?

Awesome. I'm steadily upgrading my expected utilities of handing decision-theory problems to Less Wrong.

Again, that seems to imply that the problem is solved, and I don't quite see how the parent comments have done that.

Comment author: SilasBarta 18 September 2009 03:53:12PM 1 point [-]

The problem is to come up with a general decision algorithm that both works (in the sense of making the right decisions) and (if possible) makes epistemic sense.

I presented a solution in a comment here which I think satisfies these: It gives the right answer and consistently handles the case of "partial knowledge" about one's intersection, and correctly characterizes your epistemic condition in the absent-minded case.

Comment author: entirelyuseless 02 July 2017 12:32:24AM 0 points [-]

I don't see why the problem is not solved. The probability of being at X depends directly on how I am deciding whether to turn. So I cannot possibly use that probability to decide whether to turn; I need to decide on how I will turn first, and then I can calculate the probability of being at X. This results in the original solution.

This also shows that Eliezer was mistaken in claiming that any algorithm involving randomness can be improved by making it deterministic.

Comment author: justinpombrio 01 July 2017 10:14:30PM *  0 points [-]

And then you can correct for the double-counting. When would you like to count your chickens? It's safe to count them at X or Y.

If you count them at X, then how much payoff do you expect at the end? Relative to when you'll be counting your payoff, the relative likelihood that you are at X is 1. And the expected payoff if you are at X is p^2 + 4p(1-p). This gives a total expected payoff of P(X) * E(X) = 1 * (p^2 + 4p(1-p)) = p^2 + 4p(1-p).

If you count them at Y, then you much payoff do you expect at the end? Relative to when you'll be counting your payoff, the relative likelihood that you are at Y is p. And the expected payoff if you are at Y is p + 4(1-p). This gives a total expected payoff of P(Y) * E(Y) = p * (p + 4(1-p)) = p^2 + 4(1-p).

I'm annoyed that English requires a tense on all verbs. "You are" above should be tenseness.

EDIT: formatting

Comment author: casebash 23 April 2016 10:27:32AM 0 points [-]

One way to describe this is to note that choosing the action that maximises the expectation of value is not the same as choosing that action that can be expected to produce the most value. So choosing p=0.53 maximises our expectations, not our expectation of production of value.

Comment author: Antisuji 16 September 2009 06:32:41PM 0 points [-]

Since you will make the same decision both times, the only coherent state is α=1/(p+1).

I'm curious how you arrived at this. Shouldn't it be α = (1/2)p + (1 - p) = 1 - p/2? (The other implies that you are a thirder in the Sleeping Beauty Problem -- didn't Nick Bostrum have the last word on that one?) The payoff becomes α[p^2+4p(1-p)] + (1-α)[p+4(1-p)] = (1 - p/2)(4p - 3p^2) + (p/2)(4 - 3p) = 6p - (13/2)p^2 + (3/2)p^3, which has a (local) maximum around p = 0.577.

The conclusion remains the same, of course.

Comment author: PhilGoetz 20 September 2009 07:52:00AM *  2 points [-]

alpha = 1/(p+1) because the driver is at Y p times for every 1 time the driver is at X; so times the driver is at X / (times the driver is at X or Y) = 1 / (p+1).

The problem with pengvado's calculation isn't double counting. It purports to give an expected payoff when made at X, which doesn't count the expected payoff at Y. The problem is that it doesn't really give an expected payoff. alpha purports to be the probability that you are at X; yet the calculation must be made at X, not at Y (where alpha will clearly be wrong). This means we can't speak of a "probability of being at X"; alpha simply is 1 if we use this equation and believe it gives us an expected value.

Or look at it this way: Before you introduce alpha into the equation, you can solve it and get the actual optimal value for p. Once you introduce alpha into the equation, you guarantee that the driver will have false beliefs some of the time, because alpha = 1/(p+1), and so the driver can't have the correct alpha both at X and at Y. You have added a source of error, and will not find the optimal solution.

Comment author: casebash 23 April 2016 10:44:49AM *  0 points [-]

If you want to find the value of p that leads to the optimal decision you need to look at the impact on expected value of choosing one p or another, not just consider expected value at the end. Currently, it maximises expectations, not value created, with situations where you pass through X and Y being double counted.

Comment author: pengvado 17 September 2009 12:46:27AM *  2 points [-]

I'm a "who's offering the bet"er on Sleeping Beauty (which Bostrom has said is consistent with, though not identical to, his own model). And in this case I specified bets offered and paid separately at each intersection, which corresponds to the thirder conclusion.

Comment author: Jonathan_Graehl 17 September 2009 12:05:50AM 0 points [-]

The paper covered that, but good point.

Comment author: Christian_Szegedy 16 September 2009 01:39:47AM *  7 points [-]

I could add some groundless speculation, but my general advice would be: Ask, don't guess!

You probably won't get answers like "I wanted to publish a paper.". But I'd bet, it would be very enlightening regardless. It'd be surprising if all of them were so extremely busy that you can't approach anybody in the area. But don't settle for PhD students, go for senior professors.

Comment author: Wei_Dai 16 September 2009 02:27:40PM 5 points [-]

Please go ahead and add your groundless speculation. (I just added a couple more of my own to the post.) I'm actually more interested in the set of plausible explanations, than the actual explanations. A possible explanation for an error can perhaps teach us something, even if it wasn't the one responsible for it in this world.

For example, "The authors were trying to solve one particular case of time inconsistency." suggests that maybe we shouldn't try to solve ethical dilemmas one at a time, but accumulate as many of them as we can without proposing any solutions, and then see if there is a single solution to all of them.

Comment author: Christian_Szegedy 16 September 2009 05:19:30PM *  1 point [-]

I'm actually more interested in the set of plausible explanations, than the actual explanations. A possible explanation for an error can perhaps teach us something, even if it wasn't the one responsible for it in this world.

I don't really expect you to find explanations, but you could get insights which would help you to interpret their works in the right context.

I had the experience several times that I could move from fuzzy to definite feelings over topics, just by talking to the right people.

Comment author: CarlShulman 16 September 2009 01:29:02PM 0 points [-]

I strongly endorse this proposal.

Comment author: samsonalva 26 March 2010 03:58:29PM 6 points [-]

Wei, the solution that makes immediate sense is the one proposed by Uzi Segal in Economic Letters, Vol 67, 2000 titled "Don't Fool Yourself to Believe You Won't Fool Yourself Again".

You find yourself at an intersection, and you have no idea whether it is X or Y. You believe that you are at intersection X with probability α. Denote by q the probability of continuing to go straight rather than taking a left. What lottery do you face? Well, if you are at Y, then EXITING will yield a payoff of 4, and CONTinuing will yield a payoff of 1. If you are at X, then EXITING will yield a payoff of 0, and CONTinuing on straight will take you to Y. However, when you do in fact reach Y you will not realize you have reached Y and not X. In other words, you realize that going straight if you are at X will mean that you will return to the same intersection dilemma. So, suppose we denote the lottery you currently face at the intersection by Z. Then the description of the lottery is Z = ( (Z, q; 0, (1-q)), α; (1, q; 4, (1-q)), 1 - α). This is a recursive lottery. The recursive structure might seem paradoxical since there are a finite number of intersections, but is simply a result of the absent-mindedness. When you are at an intersection, you realize that any action you take at the moment that will put you at an intersection will not be distinguishable from the present moment in time, and that when you contemplate your choice at this subsequent intersection, you will not have the knowledge that you had already visited an intersection, inducing this "future" you to account for the belief that the second intersection is yet to be reached. The next crucial step is to recognize that beliefs about where you are ought to be consistent with your strategy. You only ever have to make a choice when you are at an intersection, and since all choice nodes are indistinguishable from each other (from your absent-minded perspective), your strategy can only be a mixture over EXIT and CONT. Note that the only way to be at intersection Y is by having chosen CONT at intersection X. Thus, if you believe that you are at intersection X with probability α, then the probability of being at Y is the probability that you ascribed to being at X when you previously made the choice to CONT multiplied by the probability on CONTinuing, as described by your optimal strategy. That is, Prob(Y) = Prob(X) * Prob(CONT). So, consistency of beliefs with your strategy requires that (1 - α) = α * q. Then, the value of the lottery Z, V(Z) as a function of q is described implicitly by: V(Z) = αqV(Z) + α(1-q)(0) + (1-α)q(1) + (1-α)(1-q)4. Solving for V(Z), and using the consistent beliefs condition yields V(Z) = (1 + 3q)(1-q), and so the optimal q is 2/3, which is the same mixed strategy as you chose when you were at START.

Of course, Aumann et. al. obtain that optimality of the planning-stage choice, but the argument they make is distinct from the one presented here.

Comment author: CarlShulman 16 September 2009 04:02:11PM 6 points [-]

"The authors were trying to figure out what is rational for human beings, and that solution seems too alien for us to accept and/or put into practice."

I'm not sure about that. A lot of people intuitively endorse one-boxing on Newcomb, and probably a comparable fraction would endorse the 2/3 strategy for Absent-Minded Driver.

"Well, the authors don't say (they never bothered to argue against it)"

They do mention and dismiss mystical /psychic causation, the idea that in choosing what we will do we also choose for all identical minds/algorithms

"The authors were trying to solve one particular case of time inconsistency. They didn't have all known instances of time/dynamic/reflective inconsistencies/paradoxes/puzzles laid out in front of them, to be solved in one fell swoop."

Decision theorists have a lot of experience with paradoxical-seeming results of standard causal decision theory where 'rational agents' lose in certain ways. Once that conclusion has been endorsed by the field in some cases, it's easy to dismiss further such results: "we already know rational agents lose on all sorts of seemingly easy problems, such that they would precommit/self-modify to avoid making rational decisions, so how is this further instance a reason to change the very definition of rationality?" There could be substantial path-dependence here.

Comment author: Wei_Dai 16 September 2009 06:47:11PM *  3 points [-]

I'm not sure about that. A lot of people intuitively endorse one-boxing on Newcomb, and probably a comparable fraction would endorse the 2/3 strategy for Absent-Minded Driver.

Aumann et al.'s solution is also p=2/3. They just propose to use a roundabout (but perhaps more intuitive?) algorithm to compute it.

They do mention and dismiss mystical /psychic causation, the idea that in choosing what we will do we also choose for all identical minds/algorithms

That was in the context of arguing against P&R's reasoning, which leads to p=4/9.

There could be substantial path-dependence here.

Yes, and that argues for not proposing solutions until we can see the whole problem (or set of problems) and solve it all at once. Well maybe that's kind of unrealistic, so perhaps just keeping in mind the possible path-dependence and try to mitigate it.

BTW, did you know that you can quote people by using ">"? Click on the "help" link under the comment edit box for more info.

Comment author: PhilGoetz 20 September 2009 07:05:56AM *  12 points [-]

Your payoff for choosing CONTINUE with probability p becomes α[p^2+4(1-p)p] + (1-α)[p+4(1-p)], which doesn't equal p^2+4(1-p)p unless α = 1.

No. This statement of the problem pretends to represent the computation performed by the driver at an intersection - but it really doesn't. The trouble has to do with the semantics of alpha. Alpha is not the actual probability that the driver is at point X; it's the driver's estimate of that probability. The driver knows ahead of time that he's going to make the same calculation again at intersection Y, using the same value of alpha, which will be wrong. Therefore, he can't pretend that the actual payoff is alpha x (payoff if I am at X) + (1-alpha) x (payoff if I am at Y). Half the time, that payoff calculation will be wrong.

Perhaps a clearer way of stating this, is that the driver, being stateless, must believe P(I am at X) to be the same at both intersections. If you allow the driver to use alpha=.7 when at X, and alpha=.3 when at Y, then you've given the driver information, and it isn't the same problem anymore. If you allow the driver to use alpha=.7 when at X, and alpha=.7 again when at Y, then the driver at X is going to make a decision using the information that he's probably at X and should continue ahead, without taking into account that he's going to make a bad decision at Y because he will be computing with faulty information. That's not an alternative logic; it's just wrong.

The "correct" value for alpha is actually 1/(p+1), for those of us outside the problem; the driver is at Y p times for every 1 time he's at X, so he's at X 1 out of p+1 times. But to the driver, who is inside the problem, there is no correct value of alpha to use at both X and Y. This means that if the driver introduces alpha into his calculations, he is knowingly introducing error. The driver will be using bad information at at least one of X and Y. This means that the answer arrived at must be wrong at at least either X or Y. Since the correct answer is the same in both places, the answer arrived at must be wrong in both places. Using alpha simply adds a source of guaranteed error, and prevents finding the optimal solution.

Does the driver at each intersection have the expected payoff alpha x [p^2+4(1-p)p] + (1-alpha)*[p+4(1-p)] at each intersection? No; at X he has the actual expected payoff p^2+4(1-p)p, and at Y he has the expected payoff p+4(1-p). But he only gets to Y p/1 of the time. The other 1-p of the time, he's parked at A. So, if you want him to make a computation at each intersection, he maximizes the expected value averaging together the times he is at X and Y, knowing he is at Y p times for every one time he is at X:

p^2+4(1-p)p + p[p+4(1-p)] = 2p[p+4(1-p)]

And he comes up with p = 2/3.

There's just no semantically meaningful way to inject the alpha into the equation.

Comment author: pengvado 21 September 2009 03:13:58AM *  2 points [-]

Alpha is not the actual probability that the driver is at point X; it's the driver's estimate of that probability.

What do you mean by probability, if not "someone's estimate of something-or-other"?

But to the driver, who is inside the problem, there is no correct value of alpha to use at both X and Y.

There's also no correct value of p to use both when you'll continue and when you won't. But that doesn't mean you should omit p from the calculation.

This means that the answer arrived at must be wrong at at least either X or Y. Since the correct answer is the same in both places, the answer arrived at must be wrong in both places.

The driver is computing an expectation. A value of an expectation can be wrong for X, and wrong for Y, and right for the union of X and Y.

(I agree, of course, that the formula involving alpha isn't the right computation for the original problem. But that's separate from whether it's computing something interesting.)

Comment author: SilasBarta 23 September 2009 09:30:31PM 0 points [-]

That's a very good explanation. I tried to generalize the problem to the case of partial additional knowledge about one's intersection, and I invite you to take a look at it to see if it makes the same kind of error. For the case of "ignorance about one's intersection", my solution yields "continue with probability 2/3 at any intersection", just the same as everyone else, and it does so by introducing the parameter r for "probability of guessing intersection correctly". In the problem as stated, r=1/2.

Comment author: gwern 02 February 2011 09:37:57PM *  4 points [-]

In #lesswrong, me, Boxo, and Hyphen-ated each wrote a simple simulation/calculation of the absent-minded driver and checked how the various strategies did. Our results:

  • 1/3 = 1.3337, 1.3338174; Hyphen-ated: 1.33332; Boxo: 1.3333333333333335
  • 1/4 = 1.3127752; Hyphen-ated: 1.31247; Boxo: 1.3125
  • 2/3 = 0.9882
  • 4/9 = 1.2745, 1.2898, 1.299709, 1.297746, 1.297292, 1.2968815; Hyphen-ated: 1.29634; Boxo: 1.2962962962962963

As expected, 4/9 does distinctly worse than 1/3, which is the best strategy of the ones we tested. I'm a little surprised at Piccione & Rubinstein - didn't they run any quick* little simulation to see whether 4/9 was beaten by another probability or did they realize this and had some reason bad performance was justifiable? (I haven't read P&R's paper.)

Boxo used his UDT in Haskell code ([eu (Strategy $ const $ chance p) absentMinded id | p <- [0, 1/2, 1/3, 1/4, 4/9]]); I used some quick and dirty Haskell pasted below; and I don't know what Hyphen-ated used.

import System.Random
import Control.Monad
main = foo >>= print -- test out the 1/4 strategy. obvious how to adapt
foo = let n = 10000000 in fmap (\x -> fromIntegral (sum x) / fromIntegral n) $ replicateM n run
run = do x <- getStdRandom (randomR (1,4))
y <- getStdRandom (randomR (1,4))
return $ trial x y
trial :: Int -> Int -> Int
trial x y | x == 1 = 0 -- exit at first turn with reward 0
| y == 1 = 4 -- second turn, reward 4
| otherwise = 1 -- missed both, reward 1

* It's not like this is a complex simulation. I'm not familiar with Haskell's System.Random so it took perhaps 20 or 30 minutes to get something working and then another 20 or 30 minutes to run the simulations with 1 or 10 million iterations because it's not optimized code, but Boxo and Hyphen-ated worked even faster than that.

Comment author: Hyphen-ated 02 February 2011 09:55:56PM *  2 points [-]

I slapped together some C++ in under ten minutes, with an eye towards making it run fast. This runs a billion iterations in slightly under one minute on my machine.

#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;
long long threshhold = 4 * (RAND_MAX / 9);
int main() {
srand((unsigned)time(0));
long long utility = 0;
long long iterations = 1000000000;
for(long long i = 0; i < iterations; ++i) {
if(rand() < threshhold)
continue;
else if(rand() < threshhold) {
utility += 4;
continue;
}
++utility;
}
cout << "avg utility: " << (double)utility/iterations << endl;
return 0;
}
Comment author: HoverHell 31 December 2011 08:23:18PM 1 point [-]

Another variation of simulation, in Python, and bit more defined strategy-abstraction (though written for improving own understanding anyway):

from __future__ import division
import random
def run(strt): # compute (possibly stochastic) outcome of strategy execution
docont = strt() # ... no arguments.
if not docont: # at intersection X
return 0
docont = strt()
if not docont:
return 4
return 1
def evst(strt, fun=run, n=100): # evaluate strategy by running `n` times
r = 0.0
for i in range(n):
r += fun(strt)
return r/n
""" # sample use:
> ra = [(v/30, evst(lambda: random.random() < v/30, n=30000)) for v in range(30)]
> max(ra, key=lambda v: v[1]) # maximal stochastic value and its `p`
(0.6666666666666666, 1.3515666666666666)
"""
Comment author: gwern 19 January 2016 04:50:46PM *  0 points [-]

To return to this a little, here's an implementation in R:

absentMindedDriver <- function(p1, p2) { if(p1==1) { 0; } else { if(p2==1) { 4; } else { 1; } } }
runDriver <- function(p=(1/3), iters=100000) { p1 <- rbinom(n=iters, size=1, prob=p)
p2 <- rbinom(n=iters, size=1, prob=p)
return(mapply(absentMindedDriver, p1, p2)) }
sapply(c(0, 1/2, 1/3, 1/4, 4/9, 1), function(p) { mean(runDriver(p=p)); })
# [1] 1.00000 1.25183 1.33558 1.31156 1.29420 0.00000
## Plot payoff curve:
actions <- seq(0, 1, by=0.01)
ground <- data.frame(P=actions, Reward=sapply(actions, function(p) {mean(runDriver(p))}))
plot(ground)
# <https://i.imgur.com/sQGDjEU.png>

Payoffs to each probability

An agent could solve this for the optimal action in a number of ways. For example, taking a reinforcement learning perspective, we could discretize the action space into 100 probabilities: 0, 0.01, 0.02...0.98, 0.99, 1.0, and treat it as a multi-armed bandit problem with k=100 independent arms (without any assumptions of smoothness or continuity or structure over 0..1).

Implementation-wise, we can treat the arm as a level in a categorical variable, so instead of writing out Reward ~ Beta_p0 + Beta_p0.01 + ... Beta_p1.0 we can write out Reward ~ P and let the library expand it out into 100 dummy variables. So here's a simple Thompson sampling MAB in R using MCMCglmm rather than JAGS since it's shorter:

testdata <- data.frame(P=c(rep(0.00, 10), rep(0.33, 10), rep(0.66, 10), rep(0.9, 10)), Reward=c(runDriver(p=0, iters=10), runDriver(p=(0.33), iters=10), runDriver(p=(0.66), iters=10), runDriver(p=(0.9), iters=10))); testdata
# P Reward
# 1 0.00 1
# 2 0.00 1
# 3 0.00 1
# ...
summary(lm(Reward ~ as.factor(P), data=testdata))
# ...Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.0000000 0.3836955 2.60623 0.013232
# as.factor(P)0.33 0.7000000 0.5426274 1.29002 0.205269
# as.factor(P)0.66 -0.5000000 0.5426274 -0.92144 0.362954
# as.factor(P)0.9 -0.6000000 0.5426274 -1.10573 0.276178
library(MCMCglmm)
m <- MCMCglmm(Reward ~ as.factor(P), data=testdata)
summary(m$Sol)
# ...1. Empirical mean and standard deviation for each variable,
# plus standard error of the mean:
# Mean SD Naive SE Time-series SE
# (Intercept) 1.0215880 0.4114639 0.01301163 0.01301163
# as.factor(P)0.33 0.6680714 0.5801901 0.01834722 0.01632753
# as.factor(P)0.66 -0.5157360 0.5635405 0.01782072 0.01782072
# as.factor(P)0.9 -0.6056746 0.5736586 0.01814068 0.01884525
#
# 2. Quantiles for each variable:
# 2.5% 25% 50% 75% 97.5%
# (Intercept) 0.2440077 0.7633342 1.0259299 1.2851243 1.8066701
# as.factor(P)0.33 -0.4521587 0.2877682 0.6515954 1.0344911 1.8871638
# as.factor(P)0.66 -1.5708577 -0.8923558 -0.5300312 -0.1495413 0.5953416
# as.factor(P)0.9 -1.7544003 -0.9542541 -0.6397946 -0.2389132 0.6077439
# k=100
actions <- seq(0, 1, by=0.01)
# We seed each action with 1 trial, to make sure each action shows up as a level of the categorical variable:
mab <- data.frame(P=actions, Reward=sapply(actions, function(a) { runDriver(p=a, iters=1) }))
# We'll run for 10,000 sequential actions total (or ~100 trials per arm on average)
iters <- 10000
for (i in 1:iters) {
m <- MCMCglmm(Reward ~ as.factor(P), data=mab, verbose=FALSE)
estimates <- data.frame()
for (col in 2:ncol(m$Sol)) {
# take 1 estimate from MCMCglmm's posterior samples for each level of the categorical variable P:
estimates <- rbind(estimates,
data.frame(P=as.numeric(sub("as.factor\\(P\\)", "", names(m$Sol[1,col]))),
RewardSample=m$Sol[1,col]))
}
# since reward is already defined as utility, our utility function is just the identity:
utilityFunction <- function(estimate) { identity(estimate); }
actionChoice <- estimates[which.max(utilityFunction(estimates$RewardSample)),]$P
result <- runDriver(p=actionChoice, iters=1)
mab <- rbind(mab, data.frame(P=actionChoice, Reward=result))
cat(paste0("I=", i, "; tried at ", actionChoice, ", got reward: ", result, "\n"))
}
# ...I=9985; tried at 0.23, got reward: 1
# I=9986; tried at 0.45, got reward: 4
# I=9987; tried at 0.43, got reward: 4
# I=9988; tried at 0.39, got reward: 1
# I=9989; tried at 0.39, got reward: 0
# I=9990; tried at 0.53, got reward: 0
# I=9991; tried at 0.44, got reward: 0
# I=9992; tried at 0.84, got reward: 0
# I=9993; tried at 0.32, got reward: 4
# I=9994; tried at 0.16, got reward: 4
# I=9995; tried at 0.09, got reward: 1
# I=9996; tried at 0.15, got reward: 1
# I=9997; tried at 0.02, got reward: 1
# I=9998; tried at 0.28, got reward: 1
# I=9999; tried at 0.35, got reward: 1
# I=10000; tried at 0.35, got reward: 4
## Point-estimates of average reward from each arm:
plot(mab$P)
# <https://i.imgur.com/6dTvXC8.png>

Mean reward for each of the 100 arms in the MAB for the Absent-Minded Driver

We can see just from the sheer variability of the actions>0.5 that the MAB was shifting exploration away from them, accepting much larger uncertainty in exchange for focusing on the 0.2-0.4 where the optimal action lies in; it didn't manage to identify 0.33 as that optimum due to what looks like a few unlucky samples in this run which put 0.33 below, but it was definitely getting there. How much certainty did we gain about 0.33 being optimal as compared to a naive strategy of just allocating 10000 samples over all arms?

fixed <- data.frame()
for (a in actions) { fixed <- rbind(fixed, data.frame(P=a, Reward=runDriver(p=a, iters=100))); }
## We can see major variability compared to the true curve:
plot(aggregate(Reward ~ P, fixed, mean))
# <https://i.imgur.com/zmNdwfC.png>

Mean reward for each arm as estimated by a fixed-sample trial allocating 100 samples to each arm.

We can ask the posterior probability estimate of 0.33 being optimal by looking at a set of posterior sample estimates for arms 1-100 and seeing in how many sets 0.33 has the highest parameter estimate (=maximum reward); we can then compare the posterior from the MAB with the posterior from the fixed-sample trial:

posteriorProbabilityOf0.33 <- function (d) {
mcmc <- MCMCglmm(Reward ~ as.factor(P), data=d, verbose=FALSE)
posteriorEstimates <- mcmc$Sol[,-1]
trueOptimum <- which(colnames(posteriorEstimates) == "as.factor(P)0.33")
return(table(sapply(1:nrow(posteriorEstimates), function(i) { which.max(posteriorEstimates[i,]) == trueOptimum; }))) }
posteriorProbabilityOf0.33(mab)
# FALSE TRUE
# 998 2
posteriorProbabilityOf0.33(fixed)
# FALSE
# 1000

So the MAB thinks 0.33 is more than twice as likely to be optimal than the fixed trial did, despite some bad luck. It also did so while gaining a much higher reward, roughly equivalent to choosing p=0.25 each time (optimal would've been ~1.33, so our MAB's regret is (1.33-1.20) * 10000 = 1300 vs the fixed's in-trial regret of 3400 and infinite regret thereafter since it did not find the optimum at all):

mean(fixed$Reward) # [1] 0.9975247525
mean(df$Reward) # [1] 1.201861202
Comment author: gwern 19 January 2016 04:57:39PM 0 points [-]

While I was at it, I thought I'd give some fancier algorithms a spin using Karpathy's <code>Reinforce.js</code> RL library (Github). The DQN may be able to do much better since it can potentially observe the similarity of payoffs and infer the underlying function to maximize.

First a JS implementation of the Absent-Minded Driver simulation:

function getRandBinary(p=0.5) { rand = Math.random(); if (rand >= p) { return 0; } else { return 1; } }
function absentMindedDriver(p1, p2) { if(p1==1) { return 0; } else { if (p2==1) { return 4; } else { return 1; } } }
function runDriver(p) { p1 = getRandBinary(p)
p2 = getRandBinary(p)
return absentMindedDriver(p1, p2); }
function runDrivers(p=(1/3), iters=1000) {
var rewards = [];
for (i=0; i < iters; i++) {
rewards.push(runDriver(p));
}
return rewards;
}

Now we can use Reinforce.js to set up and run a deep Q-learning agent (but not too deep since this runs in-browser and there's no need for hundreds or thousands of neurons for a tiny problem like this):

// load Reinforce.js
var script = document.createElement("script");
script.src = "<https://raw.githubusercontent.com/karpathy/reinforcejs/master/lib/rl.js>";
document.body.appendChild(script);
// set up a DQN RL agent and run
var env = {};
env.getNumStates = function() { return 0; }
env.getMaxNumActions = function() { return 100; } // so actions are 1-100?
// create the DQN agent: relatively small, since this is an easy problem; greedy, since only one step; hold onto all data & process heavily
var spec = { num_hidden_units:25, experience_add_every:1, learning_steps_per_iteration:100, experience_size:10100 }
agent = new RL.DQNAgent(env, spec);
// OK, now let it free-run; we give it an extra 100 actions for equality with the MAB's initial 100 exploratory pulls
for (i=0; i < 10100; i++) {
s = [] // MAB is one-shot, so no global state
var action = agent.act([]);
//... execute action in environment and get the reward
reward = runDriver(action/100);
console.log("Action: " + action + "; reward: " + reward);
agent.learn(reward); // the agent improves its Q,policy,model, etc. reward is a float
}
// ...Action: 15; reward: 1
// Action: 34; reward: 0
// Action: 34; reward: 4
// Action: 21; reward: 4
// Action: 15; reward: 1
// Action: 43; reward: 0
// Action: 21; reward: 1
// Action: 43; reward: 1
// Action: 34; reward: 4
// Action: 15; reward: 1
// Action: 78; reward: 0
// Action: 15; reward: 0
// Action: 15; reward: 1
// Action: 34; reward: 1
// Action: 43; reward: 1
// Action: 34; reward: 4
// Action: 24; reward: 1
// Action: 0; reward: 1
// Action: 34; reward: 0
// Action: 3; reward: 1
// Action: 51; reward: 4
// Action: 3; reward: 1
// Action: 11; reward: 4
// Action: 3; reward: 1
// Action: 11; reward: 1
// Action: 43; reward: 4
// Action: 36; reward: 4
// Action: 36; reward: 0
// Action: 21; reward: 4
// Action: 77; reward: 0
// Action: 21; reward: 4
// Action: 26; reward: 4

Overall, I am not too impressed by the DQN. It looks like it is doing worse than the MAB was, despite having a much more flexible model to use. I don't know if this reflects the better exploration of Thompson sampling compared to the DQN's epsilon-greedy, or if my fiddling with the hyperparameters failed to help. It's a small enough problem that a Bayesian NN is probably computationally feasible, but I dunno how to use those.

Comment author: gwern 11 August 2010 11:46:47AM *  8 points [-]

then the expected payoff is p^2^+4(1-p)p

For anyone whose eyes glazed over and couldn't see how this was derived:

There are 3 possible outcomes:

  1. you miss both turns
    The probability of missing both turns for a p is p*p (2 turns, the same p each time), and the reward is 1. Expected utility is probability*reward, so 2*p*1. Which is just 2*p or p^2
  2. you make the second turn.
    The probability of making the second turn is the probability of missing the first turn and making the second. Since p is for a binary choice, there's only one other probability, q, of missing the turn; by definition all probabilities add to 1, so p+q=1, or q=1-p. So, we have our q*p (for the missed and taken turn), and our reward of 4. As above, the expected utility is q*p*4, and substituting for q gives us (1-p)*p*4, or rearranging, 4*(1-p)*p.
  3. or, you make the first turn
    The probability of making the first turn is just p-1 as before, and the reward is 0. So the expected utility is (p-1)*0 or just 0.

Our 3 possibilities are exhaustive, so we just add them together:

p^2 + 0 + 4*(1-p)*p

0 drops out, leaving us with the final result given in the article:

p^2 + 4*(1-p)*p

Comment author: TobyBartels 23 August 2010 04:54:46PM *  3 points [-]
  1. The probability of missing both turns for a p is 2*p […] Which is just 2*p or p^2

  2. The probability of making the first turn is just p […] So the expected utility is p*0 or just 0.

In (1), instead of 2*p, you want p*p. In (2), you want 1 – p instead of p. The final results are correct, however.

Comment author: gwern 24 August 2010 04:19:55AM *  1 point [-]
  1. I feel kind of silly now; what was I thinking in writing '2*p or just 2*p'?
  2. Right, right. I had difficulty remembering whether p was the chance of making a turn or missing a turn. Good thing the multiplication by 0 makes the difference moot.
Comment author: TobyBartels 24 August 2010 02:51:53PM 0 points [-]

Ah well, it happens.

Comment author: Wei_Dai 25 September 2009 09:24:03AM 3 points [-]

It occurs to me that perhaps "professional rationalist" is a bit of an oxymoron. In today's society, a professional rationalist is basically someone who is paid to come up with publishable results in the academic fields related to rationality, such as decision theory and game theory.

I've always hated the idea of being paid for my ideas, and now I think I know why. When you're being paid for you ideas, you better have "good" ideas or you're going to starve (or at least suffer career failure). But good ideas can't be produced on a schedule, so the only way to guarantee academic survival is to learn the art of self-promotion. Your papers better make your mediocre ideas look good, and your good ideas look great. And having doubts about your ideas isn't going to help your career, even if they are rationally justified.

Given this reality, how can any professional rationalist truly be rational?

Comment author: gwern 03 January 2010 03:38:55PM 2 points [-]

"How long did it take you to learn this new thing?"

Paradigmatic career-making ideas and papers are rare and take years or decades to produce. But academia doesn't run on that coin - just regular papers and small ideas.

Is it too much to suggest that 1 every few months is possible for someone who ought to be a professional rationalist, who reads the literature carefully and thinks through all their ideas, who keeps a pad & pen handy to jot down random notes, who listens to the occasional off-kilter but insightful student questions, and so on?

That is 4 ideas a year, and over a graduate student's 4-year term, 16 possible papers. Not counting any prevarication.

Comment author: SilasBarta 16 September 2009 02:57:58PM *  5 points [-]

What I'm more interested in is: doesn't the UDT/planning-optimal solution imply that injecting randomness can improve an algorithm, which is a big no-no? Because you're saying (and you're right AFAICT) that the best strategy is to randomly choose whether to continue, with a bias in favor of continuing.

Also, could someone go through the steps of how UDT generates this solution, specifically, how it brings to your attention the possibility of expressing the payoff as a function of p? (Sorry, but I'm a bit of a straggler on these decision theory posts.)

Comment author: orthonormal 16 September 2009 04:41:09PM 5 points [-]

Jaynes' principle is that injecting randomness shouldn't help you figure out what's true; implementing a mixed strategy of action is a different matter.

Although the distinction is a bit too fuzzy (in my head) for comfort.

Comment author: SilasBarta 16 September 2009 07:55:54PM 0 points [-]

I'm concerned about more than just Jaynes's principle. Injecting noise between your decision, and what you turn out to do, shouldn't be able to help you either. Choosing "continue" 2/3 of the time is the same as "Always continue, but add a random disturbance that reverses this 1/3 of the time."

How can that addition of randomness help you?

Comment author: JGWeissman 16 September 2009 08:06:08PM 0 points [-]

The randomness helps in this case because the strategy of determining your actions by which intersection you are at is not available.

Comment author: SilasBarta 16 September 2009 08:14:59PM 1 point [-]

Yes, the problem deletes the knowledge of what intersection I'm at. How does it help to further delete knowledge of what my decision is?

Comment author: JGWeissman 16 September 2009 09:15:44PM 4 points [-]

There are 3 possible sequences of actions:

  1. exit at the first intersection
  2. continue at the first intersection, exit at the second intersection
  3. continue at the first intersection, continue at the third intersection

The payoffs are such that sequence 2 is the best available, but, having complete knowledge of your decision and no knowledge of which intersection you are at makes sequence 2 impossible. By sacrificing that knowledge, you make available the best sequence, though you can no longer be sure you will take it. In this case, the possibility of performing the best sequence is more valuable than full knowledge of which sequence you actually perform.

Comment author: SilasBarta 17 September 2009 08:59:38PM 3 points [-]

By the way, I went ahead and calculated what effect probabilistic knowledge of one's current intersection has on the payoff, so as to know the value of this knowledge. So I calculated the expected return, given that you have a probability r (with 0.5<r<=1) of correctly guessing what intersection you're in, and choose the optimal path based on whichever is most likely.

In the original problem the max payoff (under p=2/3) is 4/3. I found that to beat that, you only need r to be greater than 52.05%, barely better than chance. Alternately, that's only 0.0012 bits of the 1 bit of information contained by the knowledge of which intersection you're at! (Remember that if you have less than .0012 bits, you can just revert to the p=2/3 method from the original problem, which is better than trying to use your knowledge.)

Proof: At X, you have probability r of continuing, then at Y you have a probability r of exiting and (1-r) of continuing.

Thus, EU(r) = r(4r + (1-r) ) = r(3r+1).

Then solve for when EU(r)=4/3, the optimum in the fully ignorant case, which is at r = 0.5205.

Comment author: Wei_Dai 17 September 2009 10:30:06PM *  11 points [-]

You made a mistake here, which is assuming that when you guess you are at X, you should choose CONTINUE with probability 1, and when you guess you are at Y, you should choose EXIT with probability 1. In fact you can improve your expected payoff using a mixed strategy, in which case you can always do better when you have more information.

Here's the math. Suppose when you are at an intersection, you get a clue that reads either 'X' or 'Y'. This clue is determined by a dice roll at START. With probability .49, you get 'X' at both intersections. With probability .49, you get 'Y' at both intersections. With probability .02, you get 'X' at the X intersection, and 'Y' at the Y intersection.

Now, at START, your decision consists of a pair <p,q> of probabilities, where p is your probability to CONTINUE after seeing 'X', and q is your probability to CONTINUE after seeing 'Y'. Your expected payoff is:

.02 * (p*q + 4*(p*(1-q))) + .49 * (p*p + 4*(p*(1-p))) + .49 * (q*q + 4*(q*(1-q)))

which is maximized at p=0.680556, q=0.652778. And your expected payoff is 1.33389 which is > 4/3.

Comment author: SilasBarta 17 September 2009 11:11:22PM *  0 points [-]

Wow, good catch! (In any case, I had realized that if you have probability less than 52.05%, you shouldn't go with the most likely, but rather, revert the original p=2/3 method at the very least.)

The formula you gave for the mixed strategy (with coefficients .02, .49, .49) corresponds to a 51% probability of guessing right at any given light. (If the probability of guessing right is r, the coefficients should be 2r-1,1-r,1-r.) It actually raises the threshold for which choosing based on the most probable, with no other randomness, becomes the better strategy, but not by much -- just to about 52.1%, by my calculations.

So that means the threshold is 0.0013 bits instead of 0.0012 :-P

(Yeah, I did guess and check because I couldn't think of a better way on this computer.)

Comment author: Wei_Dai 17 September 2009 11:20:54PM *  4 points [-]

I think you might still be confused, but the nature of your confusion isn't quite clear to me. Are you saying that if r>52.1%, the best strategy is a pure one again? That's not true. See this calculation with coefficients .2, .4, .4.

ETA: Also, I think talking about this r, which is supposed to be a guess of "being at X", is unnecessarily confusing, because how that probability should be computed from a given problem statement, and whether it's meaningful at all, are under dispute. I suggest thinking in terms of what you should plan to do when you are at START.

Comment author: SilasBarta 16 September 2009 10:17:18PM 0 points [-]

In this case, the possibility of performing the best sequence is more valuable than full knowledge of which sequence you actually perform.

Okay, that's a more specific, helpful answer.

Comment author: Eliezer_Yudkowsky 16 September 2009 07:45:15PM 5 points [-]

imply that injecting randomness can improve an algorithm

Hm. This would actually appear to be a distinct class of cases in which randomness is "useful", but it's important to note that a simple deterministic algorithm would do better if we were allowed to remember our own past actions - i.e. this is a very special case. I should probably think about it further, but offhand, it looks like the reason for randomized-optimality is that the optimal deterministic algorithm has been prohibited in a way that makes all remaining deterministic algorithms stupid.

Comment author: Technologos 17 September 2009 07:27:08PM 5 points [-]

More generally, game theory often produces situations where randomization is clearly the rational strategy when you do not know with certainty the other player's action. The example given in this post is straightforwardly convertible into a game involving Nature, as many games do, where Nature plays first and puts you into one of two states; if you are in X and play Continue, Nature plays again and the game loops. The solution to that game is the same as that derived above, I believe.

The point is, for a broad class of games/decision problems where Nature has a substantial role, we might run into similar issues that can be resolved a similar way.

Comment author: Wei_Dai 16 September 2009 08:34:50PM 4 points [-]

I think the motivation for this problem is that memory in general is a limited resource, and a decision theory should be able to handle cases were recall is imperfect. I don't believe that there was a deliberate attempt to prohibit the optimal deterministic algorithm in order to make randomized algorithms look good.

Comment author: SilasBarta 17 September 2009 09:05:08PM *  2 points [-]

I don't think that resolves the issue. As I demonstrated in this comment, if you have some probabilistic knowledge of which intersection you're at, you can do better than the p=2/3 method. Specifically, as long as you have 0.0012 bits of information about which intersection you're at (i.e. assign a greater than 52% chance of guessing correctly), you're better off choosing based on what seems most likely.

However -- and this is the kicker -- that means that if you have between 0 and 0.0012 bits of information about your intersection, you're best off throwing that information away entirely and going with the method that's optimal for when you're fully forgetful. So it's still a case where throwing away information helps you.

ETA: False alarm; Wei_Dai corrects me here -- you can still use your knowledge to do better than 4/3 when your probability of guessing right is between 50% and 52.05%.

Comment author: Tyrrell_McAllister 16 September 2009 10:01:26PM 1 point [-]

It's probably better not to think of this as a randomized algorithm. Here is a simpler example of what I mean.

Suppose you have two urns in front of you. One urn is full of N white marbles, and the other urn is empty. Your task is to take a marble out of the first urn, paint it either red or blue, place it in the second urn, and then repeat this process until the first urn is empty. Moreover, when you are done, something very close to two-thirds of the marbles in the second urn must be red.

The catch, of course, is that you have very poor short-term memory, so you can never remember how many marbles you've painted or what colors you've painted them.

The "randomized algorithm" solution would be to use a pseudo-random number generator to produce a number x between 0 and 1 for each marble, and to paint that marble red if and only if x < 2/3.

But there is a non-random way to think of that procedure. Suppose instead that, before you start painting your N marbles, you set out a box of N poker chips, of which you know (that is, have reason to be highly confident) that very nearly two-thirds are red. You then proceed to paint marbles according to the following algorithm. After taking a marble in hand, you select a poker chip non-randomly from the box, and then paint the marble the same color as that poker chip.

This is a non-random algorithm that you can use with confidence, but which requires no memory. And, as I see it, the method with the pseudo-random number generator amounts to the same thing. By deciding to use the generator, you are determining N numbers: the next N numbers that the generator will produce. Moreover, if you know how the generator is constructed, you know (that is, have reason to be highly confident) that very nearly two-thirds of those numbers will be less than 2/3. To my mind, this is functionally identical to the poker chip procedure.

Comment author: SilasBarta 16 September 2009 10:15:53PM 3 points [-]

Very clever! It is indeed true that if you forget all previous marble paintings, the best way to ensure that 2/3 get painted one color is to paint it that color with p = 2/3.

And interestingly, I can think of several examples of my own life when I've been in that situation. For example, when I'm playing Alpha Centauri, I want to make sure I have a good mix of artillery, infantry, and speeders, but it's tedious to keep track of how many I have of each, so I just pick in a roughly random way, but biased toward those that I want in higher proportions.

I'm going to see if I can map the urn/marble-painting problem back onto the absent-minded driver problem.

Comment author: JGWeissman 16 September 2009 10:21:34PM 4 points [-]

This is a non-random algorithm that you can use with confidence, but which requires no memory.

You mean it does not require memory in your brain, because you implemented your memory with the poker chips. It is quite convenient they were available.

Comment author: Tyrrell_McAllister 16 September 2009 10:25:06PM *  5 points [-]

My point is that it's no more convenient than having the pseudo-random number generator available. I maintain that the generator is implementing your memory in functionally the same sense. For example, you are effectively guaranteed not to get the same number twice, just as you are effectively guaranteed not to get the same poker chip twice.

ETA: After all, something in the generator must be keeping track of the passage of the marbles for you. Otherwise the generator would keep producing the same number over and over.

Comment author: DonGeddis 16 September 2009 11:57:33PM 4 points [-]

Rather than using a PRNG (which, as you say, requires memory), you could use a source of actual randomness (e.g. quantum decay). Then you don't really have extra memory with the randomized algorithm, do you?

Comment author: JGWeissman 17 September 2009 12:24:31AM 2 points [-]

I thought of this as well, but it does not really matter because it is the ability to produce the different output in each case event that gives part of the functionality of memory, that is, the ability to distinguish between events. Granted, this is not as effective as deterministically understood memory, where you know in advance which output you get at each event. Essentially, it is memory with the drawback that you don't understand how it works to the extent that you are uncertain how it correlates with what you wanted to remember.

Comment author: gwern 11 August 2010 01:51:07PM 0 points [-]

After all, something in the generator must be keeping track of the passage of the marbles for you. Otherwise the generator would keep producing the same number over and over.

Randomness 'degenerates' (perhaps by action of a malicious daemon) into non-randomness, and so it can do better and no worse than a non-random approach?

(If the environments and agents are identical down to the source of randomness, then the agent defaults to a pure strategy; but with 'genuine' randomness, random sources that are different between instances of the agent, the agent can actually implement the better mixed strategy?)

Comment author: Tyrrell_McAllister 12 August 2010 05:45:51AM *  1 point [-]

I'm having trouble parsing your questions. You have some sentences that end in question marks. Are you asking whether I agree with those sentences? I'm having trouble understanding the assertions made by those sentences, so I can't tell whether I agree with them (if that was what you were asking).

The claim that I was making could be summed up as follows. I described an agent using a PRNG to solve a problem involving painting marbles. The usual way to view such a solution is as

deterministic amnesiac agent PLUS randomness.

My suggestion was instead to view the solution as

deterministic amnesiac agent

PLUS

a particular kind of especially limited memory

PLUS

an algorithm that takes the contents of that memory as input and produces an output that is almost guaranteed to have a certain property.

The especially limited memory is the part of the PRNG that remembers what the next seed should be. If there weren't some kind of memory involved in the PRNG's operation, the PRNG would keep using the same seed over and over again, producing the same "random" number again and again.

The algorithm is the algorithm that the PRNG uses to turn the first seed into a sequence of pseudo-random numbers.

The certain property of that sequence is the property of having two-thirds of its terms being less than 2/3.

Comment author: gwern 12 August 2010 06:20:48AM 0 points [-]

OK, that's clearer. And different from what I thought you were saying.

Comment author: JGWeissman 16 September 2009 10:41:36PM 0 points [-]

I maintain that the generator is implementing your memory in functionally the same sense.

That is fair enough, though the reason I find scenario at all interesting is that it illustrates the utility of a random strategy under certain conditions.

Comment author: Tyrrell_McAllister 17 September 2009 04:55:57PM *  0 points [-]

For me, finding an equivalent nonrandom strategy helps to dispel confusion.

I like your characterization above that the PRNG is "memory with the drawback that you don't understand how it works to the extent that you are uncertain how it correlates with what you wanted to remember." Another way to say it is that the PRNG gives you exactly what you need with near-certainty, while normal memory gives you extra information that happens to be useless for this problem.

What is "random" about the PRNG (the exact sequence of numbers) is extra stuff that you happen not to need. What you need from the PRNG (N numbers, of which two-thirds are less than 2/3) is not random but a near-certainty. So, although you're using a so-called pseudo-random number generator, you're really using an aspect of it that's not random in any significant sense. For this reason, I don't think that the PRNG algorithm should be called "random", any more than is the poker chip algorithm.

Comment author: Luke_A_Somers 06 June 2012 01:41:12PM 2 points [-]

Take 3 marbles out of the urn. Paint one of them blue, and then the other two red. Put all three in the second urn. Repeat

Yeah, yeah - white marbles are half-medusas, so if you can see more than one you die, or something.

Comment author: Tyrrell_McAllister 06 June 2012 02:44:25PM 0 points [-]

Taking more than one marble out of the urn at a time violates the task description. Don't fight the hypothetical!

Comment author: Luke_A_Somers 07 June 2012 03:03:58PM 1 point [-]

... That's what the second paragraph is for...

Comment author: christopherj 04 April 2014 05:43:22PM *  1 point [-]

If you're allowed to use external memory, why not just write down how many you painted of each color? Note that memory is different from a random number generator; for example, a random number generator can be used (imperfectly) to coordinate with a group of people with no communication, whereas memory would require communication but could give perfect results.

Comment author: Tyrrell_McAllister 05 April 2014 04:03:55AM 0 points [-]

Have you read this comment of mine from another branch of this conversation?

Comment author: Wei_Dai 16 September 2009 06:30:57PM 0 points [-]

Also, could someone go through the steps of how UDT generates this solution

I just added to the post with an answer.

Comment author: SilasBarta 16 September 2009 08:17:26PM 0 points [-]

Thanks! :-) But I still don't understand what made you express the payoff as a function of p. Was it just something you thought of when applying UDT (perhaps after knowing that's how someone else approached the problem), or is there something about UDT that required you to do that?

Comment author: Vladimir_Nesov 16 September 2009 08:29:00PM *  0 points [-]

What do you mean? p is the only control parameter... You consider a set of "global" mixed strategies, indexed by p, and pick one that leads to the best outcome, without worrying about where your mind that does this calculation is currently located and under what conditions you are thinking this thought.

Comment author: SilasBarta 16 September 2009 08:36:02PM 0 points [-]

What do you mean? p is the only control parameter...

Perhaps, but it's an innovation to think of the problem in terms of "solving for the random fraction of times I'm going to do them". That is, even considering that you should add randomness in between your decision and what you do, is an insight. What focused your attention on optimizing with respect to p?

Comment author: Vladimir_Nesov 16 September 2009 08:42:53PM *  1 point [-]

Mixed strategy is a standard concept, so here we are considering a set S of all (global) mixed strategies available for the game. When you are searching for the best strategy, you are maximizing the payoff over S. You are searching for the mixed strategy that gives the best payoff. What UDT tells is that you should just do that, even if you are considering what to do in a situation where some of the options have run out, and, as here, even if you have no idea where you are. "The best strategy" quite literally means

The only parameter for a given strategy is the probability of turning, so it's natural to index the strategies by that probability. This indexing is a mapping t:[0,1]->S that places a mixed strategy in correspondence with a value of turning probability. Now, we can rewrite the expected utility maximization in terms of probability:

For a strategy corresponding to turning probability p, it's easy to express corresponding expected utility:

We now can find the optimal strategy as

Comment author: SilasBarta 16 September 2009 10:11:47PM 1 point [-]

Okay, that's making more sense -- the part where you get to parameterizing p as a real is what I was interested in.

But do you do the same thing when applying UDT to Newcomb's problem? Do you consider it a necessary part of UDT that you take p (with 0<=p<=1) as a continuous parameter to maximize over, where p is the probability of one-boxing?

Comment author: Vladimir_Nesov 17 September 2009 02:40:49AM *  1 point [-]

Fundamentally, this depends on the setting -- you might not be given a random number generator (randomness is defined with respect to the game), and so the strategies that depend on a random value won't be available in the set of strategies to choose from. In Newcomb's problem, the usual setting is that you have to be fairly deterministic or Omega punishes you (so that a small probability of two-boxing may even be preferable to pure one-boxing, or not, depending on Omega's strategy), or Omega may be placed so that your strategy is always deterministic for it (effectively, taking mixed strategies out of the set of allowed ones).

Comment author: Wei_Dai 16 September 2009 08:25:13PM 0 points [-]

S() is suppose to be an implementation of UDT. By looking at the world program P, it should determine that among all possible input-output mappings, those that return "EXIT" for 1/3 of all inputs (doesn't matter which ones) maximize average payoff. What made me express the payoff as a function of p is by stepping through what S is supposed to do as an implementation of UDT.

Does that make sense?

Comment author: SilasBarta 16 September 2009 08:33:00PM 0 points [-]

I'm still confused. Your response seems to just say, "I did it because it works." -- which is a great reason! But I want to know if UDT gave you more guidance than that.

Does UDT require that you look at the consequences of doing something p% of the time (irrespective of which ones), on all problems?

Basically, I'm in the position of that guy/gal that everyone here probably helped out in high school:

"How do you do the proof in problem 29?" "Oh, just used identities 3 and 5, solve for t, and plug it back into the original equation." "But how did you know to do that?"

Comment author: Wei_Dai 16 September 2009 08:46:05PM 0 points [-]

Does UDT require that you look at the consequences of doing something p% of the time (irrespective of which ones), on all problems?

No, UDT (at least in my formulation) requires that you look at all possible input-output mappings, and choose the one that is optimal. In this case it so happens that any function that returns "EXIT" for 1/3 of inputs is optimal.

Comment author: ThrustVectoring 30 June 2013 11:23:20PM 1 point [-]

It's kind of an old thread, but I know people browse the recently posted list and I have a good enough understanding of what exactly the decision theorists are doing wrong that I can explain it in plain English.

First of all, alpha can only consistently be one number: 1/(1+p). And once you substitute that into α[p2+4(1-p)p] + (1-α)[p+4(1-p)], you get a peculiar quantity: (2/1+p) * [p2 + 4(-1p)p]. Where does the 2/1+p come from? Well, every time you go through the first node, you add up the expected result from the first node and the second node, and you also add up the expected result when you visit the second node. This is a double counting. The 2/1+p is the number of times you visit a node and make a choice, per time you enter the whole thing.

If we assume we're limited to a certain number of experimental runs and not a certain number of continue/exit choices, we need to divide out the 2/(1+p) to compensate for the fact that you make N*(2/1+p) choices in N games.

Comment author: yboris 12 September 2016 02:46:47AM *  0 points [-]

[deleted] had an error

Comment author: Stuart_Armstrong 25 September 2009 11:32:22AM 0 points [-]

Your payoff for choosing CONTINUE with probability p becomes α[p2+4(1-p)p] + (1-α)[p+4(1-p)]

I think that's twice as much as it should be, since neither your decision at X, nor at Y, contribute entirely to your utility result.

Comment author: conchis 16 September 2009 02:47:54AM 0 points [-]

Isn't the claim in 6 (that there is a planning-optimal choice, but no action-optimal choice) inconsistent with 4 (a choice that is planning optimal is also action optimal)?

Comment author: CarlShulman 16 September 2009 02:52:21AM 0 points [-]

4 is true if randomized strategies are allowed?

Comment author: Wei_Dai 16 September 2009 05:19:45AM 0 points [-]

Yes, that's right. I've made an edit to clarify that.

Comment author: Jonathan_Graehl 16 September 2009 02:15:35AM *  0 points [-]

From the paper:

At START, he is given the option to be woken either at both intersections, or only at X. In the first option he is absent-minded: when waking up, he does not know at which intersection he is. We call the second option ‘‘clear-headedness.’’ As in the previous discussion, the question at X is not operative (what to do) but only whether it makes sense to be ‘‘sorry.’’ If he chose clear-headedness, his expectation upon reaching X is 1/4. If he had chosen absent-mindedness, then when reaching X he would have attributed probability 2/3 to being at X and 1/3 to being at Y.

I disagree. If I opt to be woken at X, then when I'm woken at intersection X, I know I'm at X. Even if the idea is that I'll immediately go back to sleep, forget everything, and wake again at my payoff/destination, I still knew when I was woken at an intersection that it was X.

Unless I chose to be a) woken at X and b) have no memory of having chosen that, in which case it's fair to say that "I'm at some intersection" (but they suppose I remember the road map I was given indicating intersectons/payoffs X,Y). I guess that's what must have been meant.

Comment author: gwern 03 January 2010 03:44:49PM 0 points [-]

I think it must've.

If you remember your choice, then there's no reason to ever choose 'wake me up only at X'. If you wake up only at X, then you will either take action to turn off and score 0 points, or you will go back to sleep and continue to C and score 1 point.

0 or 1 is pretty bad, since you could then do better just by randomly picking from A, B, or C (as (1/3)*1 + (1/3)*4 + (1/3)*0 = 1.6...). Eliminating one branch entirely eliminates the problem.