In this post, I clarify how far we are from a complete solution to decision theory, and how the high-level philosophy relates to the mathematical formalism. I’ve personally been confused about this in the past, and I think it could be useful to people who casually follow the field. I also link to some less well-publicized approaches.
The first disagreement you might encounter when reading about alignment-related decision theory is the one between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and the various logical decision theories emerging from MIRI and LessWrong, such as Functional Decision Theory (FDT) and Updateless Decision Theory (UDT). It is characterized by disagreements about how to act in problems such as Newcomb’s problem, the smoking lesion, and the prisoner's dilemma. MIRI’s paper on FDT presents this debate from MIRI’s perspective, and, as exemplified by the philosopher who refereed that paper, academic philosophy is far from having settled how to act in these problems.
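To make the disagreement concrete, here is a minimal sketch of the expected-utility calculations on Newcomb's problem, using the usual $1,000/$1,000,000 payoffs and an assumed 99% predictor accuracy (the numbers are only illustrative, and FDT one-boxes for different reasons than EDT, though their recommendations coincide here):

```python
# Minimal sketch of why CDT and EDT/FDT diverge on Newcomb's problem.
# Payoffs and the 0.99 predictor accuracy are illustrative assumptions.

SMALL = 1_000        # transparent box, always available
BIG = 1_000_000      # opaque box, filled iff the predictor predicted one-boxing
ACCURACY = 0.99      # assumed probability that the predictor predicts correctly

def cdt_expected_utility(action, p_box_filled):
    """CDT-style: the box's contents are causally fixed before the choice,
    so condition on a fixed probability that it is already filled."""
    base = p_box_filled * BIG
    return base + SMALL if action == "two-box" else base

def edt_expected_utility(action):
    """EDT/FDT-style: the action is evidence about (or logically tied to)
    the prediction, so condition the box's contents on the action."""
    if action == "one-box":
        return ACCURACY * BIG
    else:  # two-box
        return (1 - ACCURACY) * BIG + SMALL

if __name__ == "__main__":
    for p in (0.01, 0.5, 0.99):
        # For every fixed p, two-boxing dominates by exactly SMALL, so CDT two-boxes.
        print(p, cdt_expected_utility("one-box", p), cdt_expected_utility("two-box", p))
    # Conditioning on the action instead recommends one-boxing: ~$990,000 vs ~$11,000.
    print(edt_expected_utility("one-box"), edt_expected_utility("two-box"))
```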
I’m quite confident that the FDT paper gets those problems right, and as such, I used to be pretty happy with the state of decision theory. Sure, the FDT paper mentions logical counterfactuals as a problem, and sure, the paper only talks about a few toy problems, but the rest is just formalism, right?
As it turns out, there are a few caveats to this:
- CDT, EDT, FDT, and UDT are high-level clusters of ways to go about decision theory. They have multiple attempted formalisms, and it’s unclear to what extent different formalisms recommend the same things. For FDT and UDT in particular, it’s unclear whether any one attempted formalism (e.g. the graphical models in the FDT paper) will be successful. This is because:
- Logical counterfactuals are a really difficult problem, and it’s unclear whether a natural solution exists. Moreover, any non-natural, arbitrary details in a potential solution are problematic, since some formalisms require everybody to know that everybody uses sufficiently similar algorithms. This highlights that:
- The toy problems are radically simpler than the actual problems that agents might encounter in the future. For example, it’s unclear how they generalise to acausal cooperation between different civilisations. Such civilisations could use implicitly implemented algorithms that are more or less similar to each other’s, may or may not be trying (and succeeding) to predict each other’s actions, and might be in asymmetric situations with far more options than just cooperating and defecting. This poses a lot of problems that don’t appear when you consider pure copies in symmetric situations, or pure predictors with known intentions. (A toy contrast is sketched just after this list.)
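As a toy illustration of that last bullet (with entirely made-up payoffs): against an exact copy in a symmetric prisoner's dilemma, the reasoning is easy, because whatever you choose, your copy chooses too; against a merely similar agent in an asymmetric game with extra options, that shortcut disappears.

```python
# Toy contrast between the symmetric copy case and an asymmetric variant.
# All payoffs are made up purely for illustration.

# Prisoner's dilemma against an exact copy: whatever I choose, my copy chooses
# too, so the decision reduces to comparing the diagonal outcomes (C,C) vs (D,D).
pd = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
      ("D", "C"): (3, 0), ("D", "D"): (1, 1)}
best_vs_copy = max(["C", "D"], key=lambda a: pd[(a, a)][0])
print(best_vs_copy)  # -> "C": cooperating with an exact copy is straightforward

# An asymmetric game against a merely similar agent with different options:
# there is no diagonal to appeal to, and the right move depends on how strongly
# my decision correlates with theirs and on what they believe about me.
# That is exactly the information the toy problems assume away.
asymmetric = {
    ("share", "accept"): (4, 3), ("share", "exploit"): (0, 5),
    ("hoard", "accept"): (6, 1), ("hoard", "exploit"): (1, 1),
    ("trade", "accept"): (5, 4), ("trade", "exploit"): (2, 2),
}
my_options = {a for a, _ in asymmetric}
print(sorted(my_options))  # three options for me, two for them: no symmetry to exploit
```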
As a consequence, knowing what philosophical position to take in the toy problems is only the beginning. There’s no formalised theory that returns the right answers to all of them yet, and if we ever find a suitable formalism, it’s very unclear how it will generalise.
If you want to dig into this more, Abram Demski mentions some open problems in this comment. Some attempts at better formalisations include Logical Induction Decision Theory (which uses the same decision procedure as evidential decision theory, but handles logical uncertainty using logical induction), and a potential modification, Asymptotic Decision Theory. There’s also a proof-based approach called Modal UDT, for which a good place to start is the 3rd section of this collection of links. Another surprising avenue is that some formalisations of the high-level clusters suggest that they're all the same. If you want to know more about the differences between Timeless Decision Theory (TDT), FDT, and versions 1.0, 1.1, and 2 of UDT, this post might be helpful.
Yeah, that's what I meant.
Suppose the interval between encounters with the predictor is long enough that, due to the agent's temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, while also preferring to have been the sort of agent who predictably one-boxes, and preferring to pre-commit to one-boxing on the next encounter if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
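As a rough numerical illustration (the numbers are made up, and the predictor-strength subtleties are ignored): if two-boxing gains an extra $1,000 now at the cost of the next encounter's $1,000,000, which arrives T steps later and is discounted by a factor of gamma per step, then the per-encounter preference flips once the interval is long enough.

```python
# Made-up illustration of the discounting condition described above.
# Two-boxing now yields an extra `small` immediately; one-boxing instead
# secures the big prize at the *next* encounter, T steps later.

small, big = 1_000, 1_000_000

def prefers_two_boxing(gamma, T):
    """Myopic preference in a single encounter: immediate extra `small`
    versus the discounted loss of the next encounter's `big` prize."""
    return small > (gamma ** T) * big

print(prefers_two_boxing(0.99, 1000))  # long interval: True, two-boxing looks better now
print(prefers_two_boxing(0.99, 10))    # short interval: False, one-boxing still wins
```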
But I wasn't sure this would be the result for your agent, since you described the agent as using the hypothesis which gives the best picture of achievable utility.
As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn't literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but "smaller") scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.
Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I'm fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.
Yeah, I guess we can do a variety of things:
I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.
Thinking about LI policy selection, where we choose a slow-growing function f(n) which determines how long we think before we choose the policy to follow on day n: there's this weird trade-off between how (apparently) "good" the updatelessness is and how long it takes to be any good at all. I'm fine with notions of rationality being parameterized by an ordinal or some such if it's just a choose-the-largest-number game. But in this case, choosing too slow-growing a function makes you worse off; so the fact that the rationality principle is parameterized (by the slow-growing function) is problematic. Choosing a commitment scheme seems similar.
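For what it's worth, here is a toy numerical sketch of that trade-off. The `quality` function below is a made-up stand-in for how good the beliefs are after a given amount of thinking (it is not real logical induction); the point is only the shape of the trade-off: a slower-growing f keeps the vantage point more "updateless", but the beliefs used to select day-n's policy take far longer to become any good.

```python
# Toy, made-up illustration: on day n the policy is chosen using the beliefs
# available after f(n) steps of thinking. `quality` is a stand-in for belief
# quality, chosen only to show the shape of the trade-off.

import math

def quality(t):
    """Stand-in for belief quality after t steps of thinking: climbs toward 1."""
    return t / (t + 100)

schedules = {
    "slow   f(n)=log(n)":  lambda n: math.log(n + 1),
    "medium f(n)=sqrt(n)": lambda n: math.sqrt(n),
    "fast   f(n)=n":       lambda n: n,
}

for name, f in schedules.items():
    for n in (10, 1_000, 100_000):
        # Slower-growing f: earlier (more "updateless") vantage point, but the
        # beliefs used to pick day-n's policy stay poor for much longer.
        print(f"{name:20s} n={n:7d}  thinks for {f(n):10.1f}  quality={quality(f(n)):.3f}")
```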
So it would be nice to have a rationality notion which clarified this situation.
My main concern here is that the case for empirical updatelessness seems strong in realizable situations, where the prior is meaningful. Things aren't as nice in non-realizable cases, such as logical uncertainty. But it doesn't make sense to abandon updateless principles altogether because of this!