In this post, I clarify how far we are from a complete solution to decision theory, and how the high-level philosophy relates to the mathematical formalism. I've personally been confused about this in the past, and I think this could be useful to people who casually follow the field. I also link to some less well-publicized approaches.
The first disagreement you might encounter when reading about alignment-related decision theory is the one between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and the logical decision theories emerging from MIRI and LessWrong, such as Functional Decision Theory (FDT) and Updateless Decision Theory (UDT). The disagreement plays out in how to act in problems such as Newcomb's problem, the smoking lesion problem, and the prisoner's dilemma. MIRI's paper on FDT presents this debate from MIRI's perspective, and, as exemplified by the philosopher who refereed that paper, academic philosophy is far from having settled on how to act in these problems.
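To make the disagreement concrete, here is a minimal numerical sketch of Newcomb's problem (the predictor accuracy and payoffs are illustrative, not taken from any of the papers). CDT evaluates the acts with the box contents held fixed, so 2-boxing dominates; EDT conditions on the act, so 1-boxing wins; FDT agrees with EDT's verdict here, though for different reasons.

```python
# Toy expected-utility comparison for Newcomb's problem.
# Illustrative numbers: a 99%-accurate predictor, $1M in the opaque box
# (if the predictor foresaw 1-boxing), $1K in the transparent box.

ACCURACY = 0.99
BIG = 1_000_000   # opaque box, if predicted to 1-box
SMALL = 1_000     # transparent box, always present

def cdt_utilities(credence_box_is_full):
    """CDT: the box contents are causally fixed, independent of the act."""
    one_box = credence_box_is_full * BIG
    two_box = credence_box_is_full * BIG + SMALL  # dominates for any credence
    return one_box, two_box

def edt_utilities():
    """EDT: choosing to 1-box is strong evidence that the box is full."""
    one_box = ACCURACY * BIG
    two_box = (1 - ACCURACY) * BIG + SMALL
    return one_box, two_box

print(cdt_utilities(credence_box_is_full=0.5))  # (500000.0, 501000.0)
print(edt_utilities())                          # (990000.0, 11000.0)
```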
I'm quite confident that the FDT paper gets those problems right, and as such, I used to be pretty happy with the state of decision theory. Sure, the paper mentions logical counterfactuals as an open problem, and sure, it only talks about a few toy problems, but the rest is just formalism, right?
As it turns out, there are a few caveats to this:
- CDT, EDT, FDT, and UDT are high-level clusters of ways to go about decision theory. They have multiple attempted formalisms, and it’s unclear to what extent different formalisms recommend the same things. For FDT and UDT in particular, it’s unclear whether any one attempted formalism (e.g. the graphical models in the FDT paper) will be successful. This is because:
- Logical counterfactuals are a really difficult problem, and it's unclear whether a natural solution exists. Moreover, any non-natural, arbitrary details in a potential solution are problematic, since some formalisms require everybody to know that everybody else uses a sufficiently similar algorithm. This highlights that:
- The toy problems are radically simpler than the actual problems that agents might encounter in the future. For example, it's unclear how they generalise to acausal cooperation between different civilisations. Such civilisations could use implicitly implemented algorithms that are more or less similar to each other's, may or may not be trying (and succeeding) to predict each other's actions, and might be in asymmetric situations with far more options than just cooperating and defecting. This poses a lot of problems that don't appear when you consider pure copies in symmetric situations, or pure predictors with known intentions; the sketch after this list gives one concrete flavour of the similarity issue.
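To give one concrete flavour of the similarity issue mentioned above, here is a toy calculation for a prisoner's dilemma against an agent whose decision is merely correlated with yours, rather than being a perfect copy. The payoffs are the usual illustrative ones, and the single parameter match_prob is a made-up stand-in for "how similar our algorithms are", something real agents could only estimate.

```python
# Expected payoffs in a prisoner's dilemma where the other agent's act
# matches yours with probability match_prob (and is the opposite otherwise).

T_PAYOFF, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker

def expected_payoffs(match_prob):
    cooperate = match_prob * R + (1 - match_prob) * S
    defect = match_prob * P + (1 - match_prob) * T_PAYOFF
    return cooperate, defect

for q in (1.0, 0.9, 0.75, 0.5):
    c, d = expected_payoffs(q)
    print(f"match_prob={q}: cooperate={c:.2f}, defect={d:.2f}")

# Cooperating only wins above q = (T-S)/((T-S)+(R-P)) = 5/7 here: with pure
# copies (q=1) the answer is clear, but merely "similar" agents face a much
# messier quantitative question.
```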
As a consequence, knowing what philosophical position to take in the toy problems is only the beginning. There’s no formalised theory that returns the right answers to all of them yet, and if we ever find a suitable formalism, it’s very unclear how it will generalise.
If you want to dig into this more, Abram Demski mentions some open problems in this comment. Some attempts at better formalisations include Logical Induction Decision Theory (which uses the same decision procedure as evidential decision theory, but handles logical uncertainty via logical induction), and a potential modification of it, Asymptotic Decision Theory. There's also a proof-based approach called Modal UDT; a good place to start is the 3rd section of this collection of links. Another surprising avenue is that some formalisations of the high-level clusters suggest they're all the same. If you want to know more about the differences between Timeless Decision Theory (TDT), FDT, and versions 1.0, 1.1, and 2 of UDT, this post might be helpful.
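As one example of what these formalisations look like in miniature, here is a caricature of the LIDT decision rule on Newcomb's problem: take the action with the highest expected utility conditional on taking it. The belief function below is a hand-written stub; in the real theory those conditional probabilities would come from a logical inductor's market prices, which it learns over time.

```python
# Caricature of LIDT on Newcomb's problem. `belief` is a stub standing in
# for a logical inductor's conditional prices P(box state | action).

BIG, SMALL = 1_000_000, 1_000

def belief(box_state, given_action):
    prices = {"one_box": 0.99, "two_box": 0.01}  # stub P(full | action)
    p_full = prices[given_action]
    return p_full if box_state == "full" else 1 - p_full

def utility(action, box_state):
    base = BIG if box_state == "full" else 0
    return base + (SMALL if action == "two_box" else 0)

def lidt_choose(actions=("one_box", "two_box")):
    def eu(a):
        return sum(belief(s, a) * utility(a, s) for s in ("full", "empty"))
    return max(actions, key=eu)

print(lidt_choose())  # "one_box", given this stub's prices
```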
I agree in the case where the predictor is only good at all later, i.e., its initial guess is random/fixed, which is consistent with the way I originally brought up ASP (Agent Simulates Predictor). In that case, it absolutely makes sense to 2-box if your discounting is steep enough.
But deliberation takes time. We can think of logical inductors, or Skyrms-style deliberation setups (Skyrms, The Dynamics of Rational Deliberation). The agent has a certain amount of time to think. If it spends all that time calculating probabilities (as in LIDT) and only makes a decision at the end, then it makes sense to say you can't successfully 1-box in a way which makes you predictably do so. But if you model the decision as being made over a span of time, then you could make the decision early in your deliberation, where the predictor can see it. So we can think of something like a commitment which you can make while still refining your probabilities.
Again, I agree this doesn't make sense if the predictor is absolutely immovable (i.e., it only takes our choice as evidence for refining its predictions in later rounds). But if the predictor is something like a logical inductor, then even if it runs much slower than the agent, it can "see" some of the early things the agent does.
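Here is a toy simulation of that dynamic, with made-up parameters: the slow predictor can only "see" the first PREDICTOR_HORIZON steps of the agent's much longer deliberation, and guesses randomly about anything decided later. The point is just that committing to 1-box early, while the predictor can still see it, beats either action taken at the end.

```python
# Toy ASP setup: a slow predictor sees only the start of the deliberation.
import random

BIG, SMALL = 1_000_000, 1_000
PREDICTOR_HORIZON = 10   # predictor sees the agent's first 10 steps
AGENT_TIME = 1_000       # the agent itself has far longer to think

def payoff(action, prediction):
    box = BIG if prediction == "one_box" else 0
    return box + (SMALL if action == "two_box" else 0)

def run(commit_step, action):
    if commit_step <= PREDICTOR_HORIZON:
        prediction = action  # the predictor saw the commitment
    else:
        prediction = random.choice(["one_box", "two_box"])  # random/fixed guess
    return payoff(action, prediction)

random.seed(0)
trials = 10_000
for commit_step, action in [(5, "one_box"),
                            (AGENT_TIME, "one_box"),
                            (AGENT_TIME, "two_box")]:
    avg = sum(run(commit_step, action) for _ in range(trials)) / trials
    print(f"commit at step {commit_step}, {action}: average {avg:,.0f}")
# Early commitment to 1-boxing averages ~1,000,000; both late options ~500,000.
```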
Of course, there is the question of how you would know that making an early commitment, rather than thinking longer, would be a good strategy. I'm imagining the two agents get each other's source code. We can think about it in a learning-theoretic way as follows: suppose I want good behavior on problems where two AIs exchange source code and then have to play a game (with a certain amount of time to think before they have to act). The system can do anything with its time: think longer, make early commitments, whatever.
Actually, let's simplify a little: since the predictor is not fully an agent, we can just think in terms of building a system which is given source code which outputs the end utility (so the source code includes the predictor, the agent itself, everything). I want a system which does well on such problems if it has enough time to think. In order to get that, what I'm going to do is have it spend the first fraction of its thinking time training itself on smaller problems sampled at random. (It could and should choose instances intelligently, but let's say it doesn't.) It treats these entirely episodically; in each instance it only tries to maximize the score in that one case.
Nonetheless, it can eventually learn to make its decision early in cases where it is beneficial to be predictable.
This is all supposed to be a model of what our agent can do in order to come to a decision in ASP. It can think about similar problems. It can notice a trend where it's useful to make certain decisions early. Sure, it might figure that out too late, in which case the best it can do is probably 2-box. But it might also figure it out in time.
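As a sketch of what that episodic training could look like, with a placeholder problem distribution and just two placeholder strategies rather than a real exchange-source-code setup:

```python
# Episodic self-training on small, randomly sampled ASP-like instances.
# Each episode is scored on its own; a running average is kept per strategy.
import random

def sample_instance():
    # A small random instance: how much of the deliberation the predictor
    # sees ("horizon"), and whether the predictor is movable at all.
    return {"horizon": random.randint(0, 20), "movable": random.random() < 0.8}

def play(strategy, inst):
    # "commit_early" fixes 1-boxing at step 5; "think_long" deliberates past
    # any horizon and then 2-boxes. Utilities are the usual illustrative ones.
    if strategy == "commit_early" and inst["movable"] and inst["horizon"] >= 5:
        return 1_000_000   # the predictor saw the commitment; box is full
    if strategy == "think_long":
        return 1_000       # unpredictable, so (say) empty box plus small prize
    return 0               # committed too late, or predictor immovable

random.seed(0)
totals = {"commit_early": [0, 0], "think_long": [0, 0]}  # [sum, count]
for _ in range(5_000):  # the "first fraction of thinking time"
    inst = sample_instance()
    s = random.choice(list(totals))  # explore strategies uniformly
    totals[s][0] += play(s, inst)
    totals[s][1] += 1

for s, (tot, n) in totals.items():
    print(f"{s}: average utility {tot / n:,.0f}")
# On this distribution it learns that committing early scores far better,
# and can deploy that strategy when the real instance arrives.
```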
It can do that, which would make it unpredictable. However, it can also not do that; it has no special motivation to, so there's no reason to, which means there's no reason for the predictor to expect it to.
Not sure whether this was a point you were trying to make, but I agree that in the case of a perfect predictor, learning theory does a fine job of 1-boxing regardless of temporal discounting or any commitment mechanisms.
I'm worried that the overarching thread may get lost. I think this point about ASP is an important sub-point, for a few reasons, but I hope the discussion doesn't dead-end here. My picture of where we're at:
I've gravitated to ASP partly because I hope it illustrates something about how I see logic as relevant to a learning-theoretic approach to decision theory. But I'm still interested more broadly in what class of decision problems you think you can do well on, with and without commitment mechanisms.