SilasBarta comments on The Absent-Minded Driver - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (139)
What I'm more interested in is: doesn't the UDT/planning-optimal solution imply that injecting randomness can improve an algorithm, which is a big no-no? Because you're saying (and you're right AFAICT) that the best strategy is to randomly choose whether to continue, with a bias in favor of continuing.
Also, could someone go through the steps of how UDT generates this solution, specifically, how it brings to your attention the possibility of expressing the payoff as a function of p? (Sorry, but I'm a bit of a straggler on these decision theory posts.)
Jaynes' principle is that injecting randomness shouldn't help you figure out what's true; implementing a mixed strategy of action is a different matter.
Although the distinction is a bit too fuzzy (in my head) for comfort.
I'm concerned about more than just Jaynes's principle. Injecting noise between your decision, and what you turn out to do, shouldn't be able to help you either. Choosing "continue" 2/3 of the time is the same as "Always continue, but add a random disturbance that reverses this 1/3 of the time."
How can that addition of randomness help you?
The randomness helps in this case because the strategy of determining your actions by which intersection you are at is not available.
Yes, the problem deletes the knowledge of what intersection I'm at. How does it help to further delete knowledge of what my decision is?
There are 3 possible sequences of actions:
The payoffs are such that sequence 2 is the best available, but, having complete knowledge of your decision and no knowledge of which intersection you are at makes sequence 2 impossible. By sacrificing that knowledge, you make available the best sequence, though you can no longer be sure you will take it. In this case, the possibility of performing the best sequence is more valuable than full knowledge of which sequence you actually perform.
By the way, I went ahead and calculated what effect probabilistic knowledge of one's current intersection has on the payoff, so as to know the value of this knowledge. So I calculated the expected return, given that you have a probability r (with 0.5<r<=1) of correctly guessing what intersection you're in, and choose the optimal path based on whichever is most likely.
In the original problem the max payoff (under p=2/3) is 4/3. I found that to beat that, you only need r to be greater than 52.05%, barely better than chance. Alternately, that's only 0.0012 bits of the 1 bit of information contained by the knowledge of which intersection you're at! (Remember that if you have less than .0012 bits, you can just revert to the p=2/3 method from the original problem, which is better than trying to use your knowledge.)
Proof: At X, you have probability r of continuing, then at Y you have a probability r of exiting and (1-r) of continuing.
Thus, EU(r) = r(4r + (1-r) ) = r(3r+1).
Then solve for when EU(r)=4/3, the optimum in the fully ignorant case, which is at r = 0.5205.
You made a mistake here, which is assuming that when you guess you are at X, you should choose CONTINUE with probability 1, and when you guess you are at Y, you should choose EXIT with probability 1. In fact you can improve your expected payoff using a mixed strategy, in which case you can always do better when you have more information.
Here's the math. Suppose when you are at an intersection, you get a clue that reads either 'X' or 'Y'. This clue is determined by a dice roll at START. With probability .49, you get 'X' at both intersections. With probability .49, you get 'Y' at both intersections. With probability .02, you get 'X' at the X intersection, and 'Y' at the Y intersection.
Now, at START, your decision consists of a pair <p,q> of probabilities, where p is your probability to CONTINUE after seeing 'X', and q is your probability to CONTINUE after seeing 'Y'. Your expected payoff is:
.02 * (p*q + 4*(p*(1-q))) + .49 * (p*p + 4*(p*(1-p))) + .49 * (q*q + 4*(q*(1-q)))
which is maximized at p=0.680556, q=0.652778. And your expected payoff is 1.33389 which is > 4/3.
Wow, good catch! (In any case, I had realized that if you have probability less than 52.05%, you shouldn't go with the most likely, but rather, revert the original p=2/3 method at the very least.)
The formula you gave for the mixed strategy (with coefficients .02, .49, .49) corresponds to a 51% probability of guessing right at any given light. (If the probability of guessing right is r, the coefficients should be 2r-1,1-r,1-r.) It actually raises the threshold for which choosing based on the most probable, with no other randomness, becomes the better strategy, but not by much -- just to about 52.1%, by my calculations.
So that means the threshold is 0.0013 bits instead of 0.0012 :-P
(Yeah, I did guess and check because I couldn't think of a better way on this computer.)
I think you might still be confused, but the nature of your confusion isn't quite clear to me. Are you saying that if r>52.1%, the best strategy is a pure one again? That's not true. See this calculation with coefficients .2, .4, .4.
ETA: Also, I think talking about this r, which is supposed to be a guess of "being at X", is unnecessarily confusing, because how that probability should be computed from a given problem statement, and whether it's meaningful at all, are under dispute. I suggest thinking in terms of what you should plan to do when you are at START.
Okay, that's a more specific, helpful answer.
Hm. This would actually appear to be a distinct class of cases in which randomness is "useful", but it's important to note that a simple deterministic algorithm would do better if we were allowed to remember our own past actions - i.e. this is a very special case. I should probably think about it further, but offhand, it looks like the reason for randomized-optimality is that the optimal deterministic algorithm has been prohibited in a way that makes all remaining deterministic algorithms stupid.
More generally, game theory often produces situations where randomization is clearly the rational strategy when you do not know with certainty the other player's action. The example given in this post is straightforwardly convertible into a game involving Nature, as many games do, where Nature plays first and puts you into one of two states; if you are in X and play Continue, Nature plays again and the game loops. The solution to that game is the same as that derived above, I believe.
The point is, for a broad class of games/decision problems where Nature has a substantial role, we might run into similar issues that can be resolved a similar way.
I think the motivation for this problem is that memory in general is a limited resource, and a decision theory should be able to handle cases were recall is imperfect. I don't believe that there was a deliberate attempt to prohibit the optimal deterministic algorithm in order to make randomized algorithms look good.
I don't think that resolves the issue. As I demonstrated in this comment, if you have some probabilistic knowledge of which intersection you're at, you can do better than the p=2/3 method. Specifically, as long as you have 0.0012 bits of information about which intersection you're at (i.e. assign a greater than 52% chance of guessing correctly), you're better off choosing based on what seems most likely.
However -- and this is the kicker -- that means that if you have between 0 and 0.0012 bits of information about your intersection, you're best off throwing that information away entirely and going with the method that's optimal for when you're fully forgetful. So it's still a case where throwing away information helps you.
ETA: False alarm; Wei_Dai corrects me here -- you can still use your knowledge to do better than 4/3 when your probability of guessing right is between 50% and 52.05%.
It's probably better not to think of this as a randomized algorithm. Here is a simpler example of what I mean.
Suppose you have two urns in front of you. One urn is full of N white marbles, and the other urn is empty. Your task is to take a marble out of the first urn, paint it either red or blue, place it in the second urn, and then repeat this process until the first urn is empty. Moreover, when you are done, something very close to two-thirds of the marbles in the second urn must be red.
The catch, of course, is that you have very poor short-term memory, so you can never remember how many marbles you've painted or what colors you've painted them.
The "randomized algorithm" solution would be to use a pseudo-random number generator to produce a number x between 0 and 1 for each marble, and to paint that marble red if and only if x < 2/3.
But there is a non-random way to think of that procedure. Suppose instead that, before you start painting your N marbles, you set out a box of N poker chips, of which you know (that is, have reason to be highly confident) that very nearly two-thirds are red. You then proceed to paint marbles according to the following algorithm. After taking a marble in hand, you select a poker chip non-randomly from the box, and then paint the marble the same color as that poker chip.
This is a non-random algorithm that you can use with confidence, but which requires no memory. And, as I see it, the method with the pseudo-random number generator amounts to the same thing. By deciding to use the generator, you are determining N numbers: the next N numbers that the generator will produce. Moreover, if you know how the generator is constructed, you know (that is, have reason to be highly confident) that very nearly two-thirds of those numbers will be less than 2/3. To my mind, this is functionally identical to the poker chip procedure.
Very clever! It is indeed true that if you forget all previous marble paintings, the best way to ensure that 2/3 get painted one color is to paint it that color with p = 2/3.
And interestingly, I can think of several examples of my own life when I've been in that situation. For example, when I'm playing Alpha Centauri, I want to make sure I have a good mix of artillery, infantry, and speeders, but it's tedious to keep track of how many I have of each, so I just pick in a roughly random way, but biased toward those that I want in higher proportions.
I'm going to see if I can map the urn/marble-painting problem back onto the absent-minded driver problem.
You mean it does not require memory in your brain, because you implemented your memory with the poker chips. It is quite convenient they were available.
My point is that it's no more convenient than having the pseudo-random number generator available. I maintain that the generator is implementing your memory in functionally the same sense. For example, you are effectively guaranteed not to get the same number twice, just as you are effectively guaranteed not to get the same poker chip twice.
ETA: After all, something in the generator must be keeping track of the passage of the marbles for you. Otherwise the generator would keep producing the same number over and over.
Rather than using a PRNG (which, as you say, requires memory), you could use a source of actual randomness (e.g. quantum decay). Then you don't really have extra memory with the randomized algorithm, do you?
I thought of this as well, but it does not really matter because it is the ability to produce the different output in each case event that gives part of the functionality of memory, that is, the ability to distinguish between events. Granted, this is not as effective as deterministically understood memory, where you know in advance which output you get at each event. Essentially, it is memory with the drawback that you don't understand how it works to the extent that you are uncertain how it correlates with what you wanted to remember.
Randomness 'degenerates' (perhaps by action of a malicious daemon) into non-randomness, and so it can do better and no worse than a non-random approach?
(If the environments and agents are identical down to the source of randomness, then the agent defaults to a pure strategy; but with 'genuine' randomness, random sources that are different between instances of the agent, the agent can actually implement the better mixed strategy?)
I'm having trouble parsing your questions. You have some sentences that end in question marks. Are you asking whether I agree with those sentences? I'm having trouble understanding the assertions made by those sentences, so I can't tell whether I agree with them (if that was what you were asking).
The claim that I was making could be summed up as follows. I described an agent using a PRNG to solve a problem involving painting marbles. The usual way to view such a solution is as
My suggestion was instead to view the solution as
The especially limited memory is the part of the PRNG that remembers what the next seed should be. If there weren't some kind of memory involved in the PRNG's operation, the PRNG would keep using the same seed over and over again, producing the same "random" number again and again.
The algorithm is the algorithm that the PRNG uses to turn the first seed into a sequence of pseudo-random numbers.
The certain property of that sequence is the property of having two-thirds of its terms being less than 2/3.
OK, that's clearer. And different from what I thought you were saying.
That is fair enough, though the reason I find scenario at all interesting is that it illustrates the utility of a random strategy under certain conditions.
For me, finding an equivalent nonrandom strategy helps to dispel confusion.
I like your characterization above that the PRNG is "memory with the drawback that you don't understand how it works to the extent that you are uncertain how it correlates with what you wanted to remember." Another way to say it is that the PRNG gives you exactly what you need with near-certainty, while normal memory gives you extra information that happens to be useless for this problem.
What is "random" about the PRNG (the exact sequence of numbers) is extra stuff that you happen not to need. What you need from the PRNG (N numbers, of which two-thirds are less than 2/3) is not random but a near-certainty. So, although you're using a so-called pseudo-random number generator, you're really using an aspect of it that's not random in any significant sense. For this reason, I don't think that the PRNG algorithm should be called "random", any more than is the poker chip algorithm.
Take 3 marbles out of the urn. Paint one of them blue, and then the other two red. Put all three in the second urn. Repeat
Yeah, yeah - white marbles are half-medusas, so if you can see more than one you die, or something.
Taking more than one marble out of the urn at a time violates the task description. Don't fight the hypothetical!
... That's what the second paragraph is for...
If you're allowed to use external memory, why not just write down how many you painted of each color? Note that memory is different from a random number generator; for example, a random number generator can be used (imperfectly) to coordinate with a group of people with no communication, whereas memory would require communication but could give perfect results.
Have you read this comment of mine from another branch of this conversation?
I just added to the post with an answer.
Thanks! :-) But I still don't understand what made you express the payoff as a function of p. Was it just something you thought of when applying UDT (perhaps after knowing that's how someone else approached the problem), or is there something about UDT that required you to do that?
What do you mean? p is the only control parameter... You consider a set of "global" mixed strategies, indexed by p, and pick one that leads to the best outcome, without worrying about where your mind that does this calculation is currently located and under what conditions you are thinking this thought.
Perhaps, but it's an innovation to think of the problem in terms of "solving for the random fraction of times I'm going to do them". That is, even considering that you should add randomness in between your decision and what you do, is an insight. What focused your attention on optimizing with respect to p?
Mixed strategy is a standard concept, so here we are considering a set S of all (global) mixed strategies available for the game. When you are searching for the best strategy, you are maximizing the payoff over S. You are searching for the mixed strategy that gives the best payoff. What UDT tells is that you should just do that, even if you are considering what to do in a situation where some of the options have run out, and, as here, even if you have no idea where you are. "The best strategy" quite literally means
The only parameter for a given strategy is the probability of turning, so it's natural to index the strategies by that probability. This indexing is a mapping t:[0,1]->S that places a mixed strategy in correspondence with a value of turning probability. Now, we can rewrite the expected utility maximization in terms of probability:
For a strategy corresponding to turning probability p, it's easy to express corresponding expected utility:
We now can find the optimal strategy as
Okay, that's making more sense -- the part where you get to parameterizing p as a real is what I was interested in.
But do you do the same thing when applying UDT to Newcomb's problem? Do you consider it a necessary part of UDT that you take p (with 0<=p<=1) as a continuous parameter to maximize over, where p is the probability of one-boxing?
Fundamentally, this depends on the setting -- you might not be given a random number generator (randomness is defined with respect to the game), and so the strategies that depend on a random value won't be available in the set of strategies to choose from. In Newcomb's problem, the usual setting is that you have to be fairly deterministic or Omega punishes you (so that a small probability of two-boxing may even be preferable to pure one-boxing, or not, depending on Omega's strategy), or Omega may be placed so that your strategy is always deterministic for it (effectively, taking mixed strategies out of the set of allowed ones).
S() is suppose to be an implementation of UDT. By looking at the world program P, it should determine that among all possible input-output mappings, those that return "EXIT" for 1/3 of all inputs (doesn't matter which ones) maximize average payoff. What made me express the payoff as a function of p is by stepping through what S is supposed to do as an implementation of UDT.
Does that make sense?
I'm still confused. Your response seems to just say, "I did it because it works." -- which is a great reason! But I want to know if UDT gave you more guidance than that.
Does UDT require that you look at the consequences of doing something p% of the time (irrespective of which ones), on all problems?
Basically, I'm in the position of that guy/gal that everyone here probably helped out in high school:
"How do you do the proof in problem 29?" "Oh, just used identities 3 and 5, solve for t, and plug it back into the original equation." "But how did you know to do that?"
No, UDT (at least in my formulation) requires that you look at all possible input-output mappings, and choose the one that is optimal. In this case it so happens that any function that returns "EXIT" for 1/3 of inputs is optimal.