In this post, I clarify how far we are from a complete solution to decision theory, and how the high-level philosophy relates to the mathematical formalism. I’ve personally been confused about this in the past, and I think it could be useful to people who casually follow the field. I also link to some less well-publicized approaches.
The first disagreement you might encounter when reading about alignment-related decision theory is the one between Causal Decision Theory (CDT), Evidential Decision Theory (EDT), and the various logical decision theories emerging from MIRI and LessWrong, such as Functional Decision Theory (FDT) and Updateless Decision Theory (UDT). It is characterized by disagreements about how to act in problems such as Newcomb’s problem, the smoking lesion problem, and the prisoner's dilemma. MIRI’s paper on FDT presents this debate from MIRI’s perspective, and, as exemplified by the philosopher who refereed that paper, academic philosophy is far from having settled on how to act in these problems.
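To make the disagreement concrete, here is a toy expected-utility calculation for Newcomb’s problem. This is a minimal sketch: the predictor accuracy and payoff values are made-up illustrative numbers (not taken from the FDT paper), and the function names are just for exposition. EDT conditions on the action, so choosing to one-box is evidence that the opaque box is full; CDT holds the prediction fixed, so two-boxing always adds the small prize.

```python
# Toy expected-utility calculation for Newcomb's problem.
# Illustrative numbers only (predictor accuracy, payoffs); not taken from the FDT paper.

ACCURACY = 0.99          # probability the predictor predicts your actual choice
BIG = 1_000_000          # opaque box: filled iff one-boxing was predicted
SMALL = 1_000            # transparent box: always contains this amount

def edt_value(action):
    """EDT conditions on the action: choosing it is evidence about the prediction."""
    if action == "one-box":
        return ACCURACY * BIG
    else:  # two-box
        return (1 - ACCURACY) * BIG + SMALL

def cdt_value(action, p_box_full):
    """CDT holds the prediction fixed: the box's content doesn't causally depend on the choice."""
    expected_big = p_box_full * BIG
    return expected_big + (SMALL if action == "two-box" else 0)

print("EDT:", {a: edt_value(a) for a in ["one-box", "two-box"]})
for p in (0.0, 0.5, 1.0):  # whatever you believe about the box, two-boxing adds SMALL under CDT
    print(f"CDT (P(box full)={p}):", {a: cdt_value(a, p) for a in ["one-box", "two-box"]})
```

With these numbers, EDT favours one-boxing, while CDT two-boxes for every fixed belief about the box; FDT agrees with EDT here but diverges from it in problems like the smoking lesion.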
I’m quite confident that the FDT paper gets those problems right, and as such, I used to be pretty happy with the state of decision theory. Sure, the FDT paper mentions logical counterfactuals as a problem, and sure, the paper only talks about a few toy problems, but the rest is just formalism, right?
As it turns out, there are a few caveats to this:
- CDT, EDT, FDT, and UDT are high-level clusters of ways to go about decision theory. They have multiple attempted formalisms, and it’s unclear to what extent different formalisms recommend the same things. For FDT and UDT in particular, it’s unclear whether any one attempted formalism (e.g. the graphical models in the FDT paper) will be successful. This is because:
- Logical counterfactuals are a really difficult problem, and it’s unclear whether a natural solution exists. Moreover, any non-natural, arbitrary details in potential solutions are problematic, since some formalisms require everybody to know that everybody uses sufficiently similar algorithms. This highlights that:
- The toy problems are radically simpler than the actual problems that agents might encounter in the future. For example, it’s unclear how results about them generalise to acausal cooperation between different civilisations. Such civilisations could use implicitly implemented algorithms that are more or less similar to each other’s, may or may not be trying (and succeeding) to predict each other’s actions, and might be in asymmetric situations with far more options than just cooperating and defecting. This poses a lot of problems that don’t appear when you consider pure copies in symmetric situations, or pure predictors with known intentions.
As a consequence, knowing what philosophical position to take in the toy problems is only the beginning. There’s no formalised theory that returns the right answers to all of them yet, and if we ever find a suitable formalism, it’s very unclear how it will generalise.
If you want to dig into this more, Abram Demski mentions some open problems in this comment. Some attempts at making better formalisations include Logical Induction Decision Theory (which uses the same decision procedure as evidential decision theory, but handles logical uncertainty using logical induction) and a potential modification, Asymptotic Decision Theory. There’s also a proof-based approach called Modal UDT, for which a good place to start would be the 3rd section in this collection of links. Another surprising possibility, suggested by some formalisations of the high-level clusters, is that they’re all the same. If you want to know more about the differences between Timeless Decision Theory (TDT), FDT, and versions 1.0, 1.1, and 2 of UDT, this post might be helpful.
I had an interesting thought that made me update towards your position. Recall my old post about "metathreat equilibria", an idea I developed with Scott's help during my trip to Los Angeles. I just realized that the same principle can be realized in the setting of repeated games. In particular, I am rather confident that the following, if formulated a little more rigorously, is a theorem:
Consider the Iterated Prisoner's Dilemma in which strategies are constrained to depend only on the action of the opponent in the previous round. Then, in the limit γ→1 (shallow time discount), the only thermodynamic equilibrium is mutual cooperation.
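To make the setting of this theorem concrete (only the setting, not the "thermodynamic equilibrium" claim itself), here is a minimal sketch of reactive strategies in the Iterated Prisoner's Dilemma, each given by a pair of cooperation probabilities conditioned on the opponent's previous action. For an ergodic pair of strategies, the normalized discounted payoff in the limit γ→1 equals the average payoff under the stationary distribution of the Markov chain over joint outcomes that the two strategies induce. The payoff values and the example strategies below are standard illustrative choices, not anything specific to the theorem.

```python
# Sketch of the setting above: two reactive strategies in the Iterated Prisoner's Dilemma,
# each a pair (P(cooperate | opponent cooperated), P(cooperate | opponent defected)).
# In the gamma -> 1 limit, the normalized discounted payoff equals the average payoff
# under the stationary distribution of the induced Markov chain over joint outcomes.
import numpy as np

R, S, T, P = 3, 0, 5, 1   # standard PD payoffs (illustrative numbers)

def transition_matrix(p, q):
    """Markov chain over joint outcomes (CC, CD, DC, DD) for reactive strategies p and q.
    p = (pC, pD): player 1's cooperation probability given player 2's last move C or D."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]   # 0 = C, 1 = D, as (my move, opponent's move)
    M = np.zeros((4, 4))
    for i, (a1, a2) in enumerate(states):
        x = p[a2]          # P(player 1 cooperates | player 2's last move)
        y = q[a1]          # P(player 2 cooperates | player 1's last move)
        for j, (b1, b2) in enumerate(states):
            M[i, j] = (x if b1 == 0 else 1 - x) * (y if b2 == 0 else 1 - y)
    return M

def average_payoff(p, q):
    """Player 1's long-run average payoff (the gamma -> 1 limit, assuming a unique stationary distribution)."""
    M = transition_matrix(p, q)
    # Stationary distribution: left eigenvector of M with eigenvalue 1.
    vals, vecs = np.linalg.eig(M.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi /= pi.sum()
    payoffs = np.array([R, S, T, P])   # player 1's payoff in CC, CD, DC, DD
    return pi @ payoffs

# A generous-tit-for-tat-like strategy against itself vs. against unconditional defection:
gtft = (1.0, 0.1)
alld = (0.0, 0.0)
print(average_payoff(gtft, gtft), average_payoff(gtft, alld))
```

(The eigenvector computation assumes the stationary distribution is unique; fully deterministic strategy pairs can break that assumption, which is part of why the equilibrium notion in the theorem needs more care than this sketch provides.)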
Then, there is the question of how to generalize it. Obviously we want to consider more general games and more general strategies, but allowing fully general strategies probably won't work (just like allowing arbitrary programs doesn't work in the "self-modification" setting). One obvious generalization is allowing the strategy to depend on some finite suffix of the history. But, this isn't natural in the context of agent theory: why would agents forget everything that happened before a certain time? Instead, we can constrain the strategy to be finite-state, and maybe require it to be communicating (i.e. forbid "grim triggers" where players change their behavior forever). On the game side, we can consider arbitrary repeated games, or communicating stochastic games (they have to be communicating because otherwise we can represent one-shot games), or even communicating partially observable stochastic games. This leads me to the following bold conjecture:
Consider any suitable (as above) game in which strategies are constrained to be (communicating?) finite-state. Then, in the limit γ→1, all thermodynamic equilibria are Pareto efficient.
This would be the sort of result that seems like it should be applicable to learning agents. That is, if the conjecture is true, there is good hope that appropriate learning agents are guaranteed to converge to Pareto efficient outcomes. Even if the conjecture requires some moderately strong additional assumptions, it seems worth studying.
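As a concrete illustration of the strategy class in the conjecture, here is a sketch of finite-state strategies as automata, together with a check for the "communicating" condition. I'm reading "communicating" as strong connectivity of the automaton's transition graph, so that no state change is irreversible (ruling out grim triggers); that reading, and the tit-for-tat/grim-trigger examples, are my own interpretation rather than definitions taken from a paper.

```python
# A sketch of the "finite-state, communicating strategy" class from the conjecture above.
# A strategy is a finite automaton: an action per state, plus a transition on the opponent's
# observed action. "Communicating" is read here as: the transition graph (over all possible
# observations) is strongly connected, so no state change is irreversible (no grim triggers).

def is_communicating(transitions, states, observations):
    """Check strong connectivity of the automaton's transition graph."""
    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            s = stack.pop()
            for o in observations:
                t = transitions[(s, o)]
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    return all(reachable(s) == set(states) for s in states)

C, D = "C", "D"

# Tit-for-tat: play whatever the opponent played last; every state is recoverable.
tft_states = ["playC", "playD"]
tft_transitions = {("playC", C): "playC", ("playC", D): "playD",
                   ("playD", C): "playC", ("playD", D): "playD"}

# Grim trigger: cooperate until the opponent defects once, then defect forever.
grim_states = ["nice", "punish"]
grim_transitions = {("nice", C): "nice", ("nice", D): "punish",
                    ("punish", C): "punish", ("punish", D): "punish"}

print(is_communicating(tft_transitions, tft_states, [C, D]))    # True
print(is_communicating(grim_transitions, grim_states, [C, D]))  # False: "punish" is absorbing
```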