It's kind of an old thread, but I know people browse the recently posted list and I have a good enough understanding of what exactly the decision theorists are doing wrong that I can explain it in plain English.
First of all, alpha can only consistently be one number: 1/(1+p). And once you substitute that into α[p^2+4(1-p)p] + (1-α)[p+4(1-p)], you get a peculiar quantity: (2/(1+p)) * [p^2 + 4(1-p)p]. Where does the 2/(1+p) come from? Well, every time you go through the first node, you add up the expected result from the first node and the second node, and you also add up the expected result when you visit the second node. This is double counting. The 2/(1+p) is the number of times you visit a node and make a choice, per time you enter the whole thing.
If we assume we're limited to a certain number of experimental runs and not a certain number of continue/exit choices, we need to divide out the 2/(1+p) to compensate for the fact that you make N*(2/(1+p)) choices in N games.
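For anyone who wants to check the algebra, here is a minimal sketch in Python. It only verifies the identity (substituting α = 1/(1+p) yields 2/(1+p) times the START payoff) and says nothing about which interpretation of that factor is right. The function names are made up for illustration, and exact fractions are used so the equality test is not affected by rounding.

```python
from fractions import Fraction

def planning_payoff(p):
    # Expected payoff as seen from START: p^2 + 4(1-p)p
    return p * p + 4 * (1 - p) * p

def intersection_payoff(p, alpha):
    # alpha * [p^2 + 4(1-p)p] + (1 - alpha) * [p + 4(1-p)]
    return alpha * planning_payoff(p) + (1 - alpha) * (p + 4 * (1 - p))

def consistent_alpha(p):
    # The value of alpha claimed above: 1/(1+p)
    return 1 / (1 + p)

def matches_scaled_planning_payoff(p):
    # True if the substituted expression equals (2/(1+p)) * planning payoff
    lhs = intersection_payoff(p, consistent_alpha(p))
    rhs = 2 / (1 + p) * planning_payoff(p)
    return lhs == rhs

# Check the identity on a grid of exact probabilities 0, 1/10, ..., 1
print(all(matches_scaled_planning_payoff(Fraction(k, 10)) for k in range(11)))
```

Dividing the substituted expression by 2/(1+p) therefore gives back p^2 + 4(1-p)p exactly, which is the compensation step described above.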
This post examines an attempt by professional decision theorists to treat an example of time inconsistency, and asks why they failed to reach the solution (i.e., TDT/UDT) that this community has more or less converged upon. (Another aim is to introduce this example, which some of us may not be familiar with.) Before I begin, I should note that I don't think "people are crazy, the world is mad" (as Eliezer puts it) is a good explanation. Maybe people are crazy, but unless we can understand how and why people are crazy (or to put it more diplomatically, "make mistakes"), how can we know that we're not being crazy in the same way or making the same kind of mistakes?
The problem of the "absent-minded driver" was introduced by Michele Piccione and Ariel Rubinstein in their 1997 paper "On the Interpretation of Decision Problems with Imperfect Recall". But I'm going to use "The Absent-Minded Driver" by Robert J. Aumann, Sergiu Hart, and Motty Perry instead, since it's shorter and more straightforward. (Notice that the authors of this paper worked for a place called Center for the Study of Rationality, and one of them won a Nobel Prize in Economics for his work on game theory. I really don't think we want to call these people "crazy".)
Here's the problem description:
At START, the problem seems very simple. If p is the probability of choosing CONTINUE at each intersection, then the expected payoff is p^2+4(1-p)p, which is maximized at p = 2/3. Aumann et al. call this the planning-optimal decision.
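The maximization at START can be checked with a small sketch. It is a plain grid search over exact fractions, not anything taken from the paper; the function names and the grid size of 300 are arbitrary choices for illustration.

```python
from fractions import Fraction

def planning_payoff(p):
    # Expected payoff at START when CONTINUE is chosen with probability p
    return p * p + 4 * (1 - p) * p

def planning_optimal(steps=300):
    # Search p = 0, 1/steps, ..., 1 for the payoff-maximizing value.
    # The payoff simplifies to 4p - 3p^2, which is strictly concave,
    # so the maximum on the grid is unique.
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return max(grid, key=planning_payoff)

best = planning_optimal()
print(best, planning_payoff(best))
```

The same answer falls out by hand: the derivative of 4p - 3p^2 is 4 - 6p, which is zero at p = 2/3.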
The puzzle, as Piccione and Rubinstein saw it, is that once you are at an intersection, you should think that you have some probability α of being at X, and 1-α of being at Y. Your payoff for choosing CONTINUE with probability p becomes α[p^2+4(1-p)p] + (1-α)[p+4(1-p)], which doesn't equal p^2+4(1-p)p unless α = 1. So, once you get to an intersection, you'd choose a p that's different from the p you thought optimal at START.
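To make the inconsistency concrete, here is a rough sketch that maximizes the intersection payoff over p while holding α fixed. Treating α as a constant independent of p is a simplifying assumption made here for illustration, and the function names are hypothetical.

```python
from fractions import Fraction

def intersection_payoff(p, alpha):
    # alpha * [p^2 + 4(1-p)p] + (1 - alpha) * [p + 4(1-p)]
    return alpha * (p * p + 4 * (1 - p) * p) + (1 - alpha) * (p + 4 * (1 - p))

def best_p_at_intersection(alpha):
    # Derivative in p: alpha*(4 - 6p) - 3*(1 - alpha),
    # which is zero at p = (7*alpha - 3) / (6*alpha).
    if alpha == 0:
        return Fraction(0)  # payoff is 4 - 3p, decreasing in p
    stationary = (7 * alpha - 3) / (6 * alpha)
    # Clamp to [0, 1] because p is a probability
    return min(Fraction(1), max(Fraction(0), stationary))

for alpha in (Fraction(1), Fraction(3, 4), Fraction(1, 3)):
    print(alpha, best_p_at_intersection(alpha))
```

Under that assumption, only α = 1 reproduces the p = 2/3 chosen at START; any smaller fixed α pushes the preferred p lower.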
Aumann et al. reject this reasoning and instead suggest a notion of action-optimality, which they argue should govern decision making at the intersections. I'm going to skip explaining its definition and how it works (read section 4 of the paper if you want to find out), and go straight to listing some of its relevant properties:
In problems like this one, UDT is essentially equivalent to planning-optimality. So why did the authors propose and argue for action-optimality despite its downsides (see 2, 5, and 6 above), instead of the alternative solution of simply remembering or recomputing the planning-optimal decision at each intersection and carrying it out?
Well, the authors don't say (they never bothered to argue against it), but I'm going to venture some guesses:
Taken together, these guesses perhaps suffice to explain the behavior of these professional rationalists, without needing to hypothesize that they are "crazy". Indeed, many of us are probably still not fully convinced by UDT for one or more of the above reasons.
EDIT: Here's the solution to this problem in UDT1. We start by representing the scenario using a world program:
def P(i, j):
    if S(i) == "EXIT":
        payoff = 0
    elif S(j) == "EXIT":
        payoff = 4
    else:
        payoff = 1
(Here we assumed that mixed strategies are allowed, so S gets a random string as input. Get rid of i and j if we want to model a situation where only pure strategies are allowed.) Then S computes that payoff at the end of P, averaged over all possible i and j, is maximized by returning "EXIT" for 1/3 of its possible inputs, and does that.
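The averaging that S is described as doing can be sketched as a brute-force enumeration. This is only an illustration under some assumptions not in the world program itself: the random input is modeled as an integer index rather than a string, S is passed into P as an argument, P returns its payoff so it can be averaged, and the helper names are made up.

```python
from fractions import Fraction
from itertools import product

def make_strategy(num_exit_inputs):
    # Hypothetical S: returns "EXIT" on the first num_exit_inputs inputs
    def S(x):
        return "EXIT" if x < num_exit_inputs else "CONTINUE"
    return S

def P(S, i, j):
    # Same branches as the world program, with S passed in explicitly
    if S(i) == "EXIT":
        payoff = 0
    elif S(j) == "EXIT":
        payoff = 4
    else:
        payoff = 1
    return payoff

def average_payoff(num_exit_inputs, num_inputs):
    # Average payoff over every possible pair of inputs (i, j)
    S = make_strategy(num_exit_inputs)
    total = sum(P(S, i, j) for i, j in product(range(num_inputs), repeat=2))
    return Fraction(total, num_inputs ** 2)

def best_exit_count(num_inputs):
    # Number of inputs mapped to "EXIT" that maximizes the average payoff
    return max(range(num_inputs + 1),
               key=lambda k: average_payoff(k, num_inputs))

n = 30
k = best_exit_count(n)
print(Fraction(k, n), average_payoff(k, n))
```

With an exit fraction q, the average works out to 4q(1-q) + (1-q)^2, which peaks at q = 1/3 and matches the planning-optimal CONTINUE probability of 2/3.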