## An Introduction to Löb's Theorem in MIRI Research

Would you like to see a primer on several MIRI research topics (assuming only the background of having taken a course with proofs in math or computer science)? Or are you curious why MIRI does so much with mathematical logic, and why people on Less Wrong keep referring to Löb's Theorem?

If you answered yes to either question, you may be interested in my lecture notes, An Introduction to Löb's Theorem in MIRI Research! These came out of an introductory talk that I gave at a MIRIx workshop.

Since I've got some space here, I'll just copy and paste the table of contents and the introduction section...

## New forum for MIRI research: Intelligent Agent Foundations Forum

Today, the Machine Intelligence Research Institute is launching a new forum for research discussion: the Intelligent Agent Foundations Forum! It's already been seeded with a bunch of new work on MIRI topics from the last few months.

We've covered most of the (what, why, how) subjects on the forum's new welcome post and the How to Contribute page, but this post is an easy place to comment if you have further questions (or if, maths forbid, there are technical issues *with* the forum instead of *on* it).

But before that, go ahead and check it out!

(Major thanks to Benja Fallenstein, Alice Monday, and Elliott Jin for their work on the forum code, and to all the contributors so far!)

**EDIT 3/22:** Jessica Taylor, Benja Fallenstein, and I wrote forum digest posts summarizing and linking to recent work (on the IAFF and elsewhere) on reflective oracle machines, on corrigibility, utility indifference, and related control ideas, and on updateless decision theory and the logic of provability, respectively! These are pretty excellent resources for reading up on those topics, in my biased opinion.

## "Solving" selfishness for UDT

*With many thanks to Beluga and lackofcheese.*

When trying to decide between SIA and SSA, two anthropic probability theories, I concluded that the question of anthropic probability is badly posed and that it depends entirely on the values of the agents. When debating the issue of personal identity, I concluded that the question of personal identity is badly posed and depends entirely on the values of the agents. When the issue of selfishness in UDT came up recently, I concluded that the question of selfishness is...

But let's not get ahead of ourselves.

## Proper value learning through indifference

*A putative new idea for AI control; index here.*

Many designs for creating AGIs (such as OpenCog) rely on the AGI deducing moral values as it develops. This is a form of value loading (or value learning), in which the AGI updates its values through various methods, generally including feedback from trusted human sources. This is closely analogous to how human infants (approximately) integrate the values of their society.

The great challenge of this approach is that it relies upon an AGI which already has an interim system of values, being able and willing to correctly update this system. Generally speaking, humans are unwilling to easily update their values, and we would want our AGIs to be similar: values that are too unstable aren't values at all.

So the aim is to clearly separate the conditions under which values should be kept stable by the AGI, and conditions when they should be allowed to vary. This will generally be done by specifying criteria for the variation ("only when talking with Mr and Mrs Programmer"). But, as always with AGIs, unless we program those criteria perfectly (hint: we won't) the AGI will be motivated to interpret them differently from how we would expect. It will, as a natural consequence of its program, attempt to manipulate the value updating rules according to its current values.

How could it do that? A very powerful AGI could do the time-honoured "take control of your reward channel", by either threatening humans to give it the moral answer it wants, or replacing humans with "humans" (constructs that pass the programmed requirements of being human, according to the AGI's programming, but aren't actually human in practice) willing to give it these answers. A weaker AGI could instead use social manipulation and leading questioning to achieve the morality it desires. Even more subtly, it could tweak its internal architecture and updating process so that it updates values in its preferred direction (even something as simple as choosing the order in which to process evidence). This will be hard to detect, as a smart AGI might have a much clearer impression of how its updating process will play out in practice than its programmers would.

The problems with value loading have been cast into the various "Cake or Death" problems. We have some idea what criteria we need for safe value loading, but as yet we have no candidates for such a system. This post will attempt to construct one.

## Timelessness as a Conservative Extension of Causal Decision Theory

Author's Note: Please let me know in the comments exactly what important background material I have missed, and *exactly* what I have misunderstood, and please try not to mind that everything here is written in the academic voice.

**Abstract:** Timeless Decision Theory often seems like the correct way to handle many game-theoretical dilemmas, but has not quite been satisfactorily formalized and still handles certain problems the wrong way. We present an intuition that helps us extend Causal Decision Theory towards Timeless Decision Theory while adding rigor, and then formalize this intuition. Along the way, we describe how this intuition can guide both us and programmed agents in various Newcomblike games.

## Introduction

One day, a Time Lord called Omega drops out of the sky, walks up to me on the street, and places two boxes in front of me. One of these is opaque, the other is transparent and contains $1000. He tells me I can take either the opaque box alone, or both boxes, but that if and only if he predicted, using his Time Lord Science, that I would take just the opaque box, it contains $1,000,000. He then flies away back to his home-world of Gallifrey. I know that whatever prediction he made was/will be correct, because after all he is a Time Lord.

The established, gold-standard algorithm of Causal Decision Theory fails to win the maximum available sum of money on this problem, just as it fails on a symmetrical one-shot Prisoner's Dilemma. In fact, as human beings, we can say that CDT fails miserably, because while a programmed agent goes "inside the game" and proceeds to earn a good deal less money than it could, we human observers are sitting outside, carefully drawing outcome tables that politely inform us of just how much money our programmed agents are leaving on the table. While purely philosophical controversies abound in the literature about the original Newcomb's Problem, it is generally obvious from our outcome tables in the Prisoners' Dilemma that "purely rational" CDT agents would very definitely benefit by cooperating, and that actual human beings asked to play the game calculate outcomes as if forming coalitions rather than as if maximizing personal utility -- thus cooperating and winning. Even in the philosophical debates, it is generally agreed that one-boxers in Newcomb's Problem are, in fact, obtaining more money.

While some have attempted to define rationality as the outputs of specific decision algorithms, we hold with the school of thought that rationality means minimizing regret: a rational agent should select its decision algorithms in order to win as much as it will know it could have won ex-post-facto. Failing perfection, this optimum should be approximated as closely as possible.

Yudkowsky's Timeless Decision Theory approaches this problem by noting that many so-called decisions are actually outcomes from concurrent or separated instantiations of a single algorithm, that Timeless Decision Theory itself is exactly such an algorithm, and that many decisions (that actually are decisions in the sense that the algorithm deciding them is a utility-maximizing decision-theory) are acausally, timelessly connected. Agents running TDT will decide not as if they are determining one mere assignment to one mere variable in a causal graph but as if they're determining the output of the computation they implement, and thus of every logical node in the entire graph derived from their computation. However, it still has some kinks to work out:

> Yudkowsky (2010) shows TDT succeeding in the original Newcomb’s problem. Unfortunately, deciding exactly when and where to put the logical nodes, and what conditional probabilities to place on them, is not yet an algorithmic process.
>
> How would TDT look if instantiated in a more mature application? Given a very large and complex network, TDT would modify it in the following way: It would investigate each node, noting the ones that were results of instantiated calculations. Then it would collect these nodes into groups where every node in a group was the result of the same calculation. (Recall that we don’t know what the result is, just that it comes from the same calculation.) For each of these groups, TDT would then add a logical node representing the result of the abstract calculation, and connect it as a parent to each node in the group. Priors over possible states of the logical nodes would have to come from some other reasoning process, presumably the one that produces causal networks in the first place. Critically, one of these logical nodes would be the result of TDT’s own decision process in this situation. TDT would denote that as the decision node and use the resulting network to calculate the best action by equation 1.1.

The bolding is added by the present authors, as it highlights the issue we intend to address here. Terms like "timeless" and "acausal" have probably caused more confusion around Timeless Decision Theory than any other aspect of what is actually an understandable and reasonable algorithm. I will begin by presenting a clearer human-level intuition behind the correct behavior in Newcomb's Problem and the Prisoner's Dilemma, and will then proceed to formalize that intuition in Coq and apply it to sketch a more rigorously algorithmic Timeless Decision Theory. The formalization of this new intuition avoids problems of infinite self-reference or infinite recursion in reasoning about the algorithms determining decisions of oneself or others.

## Timeless decisions are actually entangled with each-other

The kind of apparent retrocausality present in Newcomb's Problem makes no intuitive sense whatsoever. Not only our intuitions but all our knowledge of science tell us that (absent the dubious phenomenon of closed timelike curves) causal influences always and only flow from the past to the future, never the other way around. Nonetheless, in the case of Newcomb-like problems, it has been seriously argued that:

> the Newcomb problem cannot but be retrocausal, if there is genuine evidential dependence of the predictor’s behaviour on the agent’s choice, from the agent’s point of view.

We do not believe in retrocausality, at least not as an objective feature of the world. Any subjectively apparent retrocausality, we believe, must be some sort of illusion that reduces to genuine, right-side-up causality. Timeless or acausal decision-making resolves the apparent retrocausality by noticing that different "agents" in Newcomblike problems are actually reproductions of the same algorithm, and that they can thus be logically correlated without any direct causal link.

We further prime our intuitions about Newcomb-like problems with the observation that CDT-wielding Newcomb players who bind themselves to a precommitment to one-box before Omega predicts their actions will win the $1,000,000:

> At t = 0 you can take a pill that turns you into a “one boxer”. The pill will lead the mad scientist to predict (at t = ½) that you will take one box, and so will cause you to receive £1,000,000 but will also cause you to leave a free £1,000 on the table at t = 1. CDT tells you to take the pill at t = 0: it is obviously the act, among those available at t = 0, that has the best overall causal consequences.

The "paradox", then, lies in how the CDT agent comes to believe that their choice is completely detached from which box contains how much money, when in fact Omega's prediction of their choice was accurate, and directly caused Omega to place money in boxes accordingly, all of this despite no retrocausality occurring. Everything makes perfect sense prior to Omega's prediction.

What, then, goes wrong with CDT? CDT agents will attempt to cheat against Omega: to be predicted as a one-boxer and then actually take both boxes. If given a way to obtain more money by precommitting to one-boxing, they will do so, but will subsequently feel regret over having followed their precommitment and "irrationally" taken only one box when both contained money. They may even begin to complain about the presence or absence of free will, as if this could change the game and enable their strategy to actually work.

When we cease such protestations and accept that CDT behaves irrationally, the real question becomes: which outcomes are genuinely possible in Newcomb's Problem, which outcomes are preferable, and why does CDT fail to locate these?

Plainly if we believe that Omega has a negligible or even null error rate, then in fact only two outcomes are possible:

- Our agent is predicted to take both boxes, and does so, receiving only $1000 since Omega has not filled the opaque box.
- Our agent is predicted to take the opaque box, which Omega fills, and the agent does take the opaque box, receiving $1,000,000.

Plainly, $1 million is a greater sum than $1000, and the latter outcome (being predicted to one-box, and doing so) is thus preferable to the former. We require an algorithm that can search out and select this outcome based on general principles, in any Newcomblike game, rather than based on special-case heuristics.
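As a small worked comparison, suppose Omega mispredicts with probability $\varepsilon$ regardless of which choice the agent makes. This symmetric error rate $\varepsilon$ is an assumption added here for illustration; the text above only stipulates that the error rate is negligible or null. The expected payoffs in dollars are then:

```latex
\begin{align*}
\mathbb{E}[U \mid \text{one-box}] &= (1-\varepsilon)\cdot 1{,}000{,}000 + \varepsilon \cdot 0, \\
\mathbb{E}[U \mid \text{two-box}] &= (1-\varepsilon)\cdot 1{,}000 + \varepsilon \cdot 1{,}001{,}000
                                   = 1{,}000 + \varepsilon \cdot 1{,}000{,}000, \\
\mathbb{E}[U \mid \text{one-box}] > \mathbb{E}[U \mid \text{two-box}]
  &\iff \varepsilon < \frac{1{,}000{,}000 - 1{,}000}{2 \cdot 1{,}000{,}000} = 0.4995 .
\end{align*}
```

At $\varepsilon = 0$ this reduces to the two outcomes listed above, 1,000,000 against 1,000.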

Whence, then, a causal explanation of what to do? The authors' intuition was sparked by a bit of reading about the famously "spooky" phenomenon of quantum entanglement, also sometimes theorized to involve retrocausality. Two particles interact and become entangled; from then on, their quantum states will remain correlated until measurement collapses the wave-function of one particle or the other. Neither party performing a measurement will ever be able to tell which measurement took place first in time, but both measurements will always yield correlated results. This occurs despite the fact that experiments have ruled out local hidden variables, and even when special relativity's light-speed limit on the transmission of information prevents the entangled particles from "communicating" any quantum information. A paradox is apparent and most people find it scientifically unaesthetic.

In reality, there is no paradox at all. All that has happened is that the pair of particles are in quantum superposition together: their observables are mutually governed by a single joint probability distribution. The measured observable states do not go from "randomized" to "correlated" as the measurement is made. The measurement only "samples" a single classical outcome governing both particles from the joint probability distribution that is actually there. The joint probability distribution was actually caused by the 100% local and slower-than-light interaction that entangled the two particles in the first place.

Likewise for Newcomb's Problem in decision theory. As the theorists of precommitment had intuited, the outcome is not actually caused when the CDT agent believes itself to be making a decision. Instead, the outcome was caused when Omega measured the agent and predicted its choice ahead of time: the state of the agent at this time causes both Omega's prediction and the agent's eventual action.

We thus develop an intuition that like a pair of particles, the two correlated decision processes behind Omega's prediction and behind the agent's "real" choice are in some sense entangled: correlated due to a causal interaction in their mutual past. All we then require to win at Newcomb's Problem is a rigorous conception of such entanglement and a way of handling it algorithmically to make regret-minimizing decisions when entangled.

## Formalized decision entanglement

Let us begin by assuming that an agent can be defined as a function from a set of Beliefs and a Decision to an Action. There will not be very much actual proof-code given here, and what is given was written in the Coq proof assistant. The proofs, short though they be, were thus mechanically checked before being given here; "do try this at home, kids."

```coq
Definition Agent (Beliefs Decision Action: Type) : Type := Beliefs -> Decision -> Action.
```

We can then broaden and redefine our definition of decision entanglement as saying, essentially, "Two agents are entangled when either one of them would do what the other is doing, were they to trade places and thus beliefs but face equivalent decisions." More simply, if a certain two agents are entangled over a certain two equivalent decisions, any differences in what decisions they actually make arise from differences in beliefs.

```coq
Inductive entangled {Beliefs Decision Action} (a1 a2: Agent Beliefs Decision Action) d1 d2 :=
  | ent : (forall (b: Beliefs), a1 b d1 = a2 b d2) -> d1 = d2 -> entangled a1 a2 d1 d2.
```
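To see the definition above in use, here is a minimal sketch of a toy Newcomb setting. The names `Choice`, `real_agent` and `simulated_agent`, and the choice of `bool` for beliefs and `unit` for the single decision, are hypothetical illustrations rather than part of the original development. The definitions are restated, with explicit types, so that the block checks on its own.

```coq
Definition Agent (Beliefs Decision Action : Type) : Type :=
  Beliefs -> Decision -> Action.

Inductive entangled {Beliefs Decision Action : Type}
    (a1 a2 : Agent Beliefs Decision Action) (d1 d2 : Decision) : Prop :=
| ent : (forall (b : Beliefs), a1 b d1 = a2 b d2) ->
        d1 = d2 -> entangled a1 a2 d1 d2.

(* Hypothetical action type for Newcomb's Problem. *)
Inductive Choice := OneBox | TwoBox.

(* Beliefs are a single bool ("do I trust Omega?");
   there is only one decision, so Decision is unit. *)
Definition real_agent : Agent bool unit Choice :=
  fun trusts_omega _ => if trusts_omega then OneBox else TwoBox.

(* Omega's simulation: written differently, same behavior. *)
Definition simulated_agent : Agent bool unit Choice :=
  fun b _ => match b with
             | false => TwoBox
             | true => OneBox
             end.

Example newcomb_entangled :
  entangled real_agent simulated_agent tt tt.
Proof.
  constructor.
  - intros b. destruct b; reflexivity.  (* same action under every belief *)
  - reflexivity.                        (* the two decisions coincide *)
Qed.
```

The proof never runs either agent on a particular belief chosen in advance; it only shows that both functions agree for every `b`.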

This kind of entanglement can then, quite quickly, be shown to be an equivalence relation, thus partitioning the set of all logical nodes in a causal graph into Yudkowsky's "groups where every node in a group was the result of the same calculation", with these groups being equivalence classes.

```coq
(* Every agent is entangled with itself on any decision. *)
Theorem entangled_reflexive {B D A} : forall (a: Agent B D A) d,
  entangled a a d d.
Proof.
  intros.
  constructor.
  intros. reflexivity. reflexivity.
Qed.

(* Entanglement does not depend on the order of the two agents. *)
Theorem entangled_symmetric {B D A}: forall (a1 a2: Agent B D A) d1 d2,
  entangled a1 a2 d1 d2 ->
  entangled a2 a1 d2 d1.
Proof.
  intros.
  constructor;
  induction H;
  intros; symmetry.
  apply e. apply e0.
Qed.

(* Entanglement chains through an intermediate agent. *)
Theorem entangled_transitive {B D A}: forall (a1 a2 a3: Agent B D A) d1 d2 d3,
  entangled a1 a2 d1 d2 ->
  entangled a2 a3 d2 d3 ->
  entangled a1 a3 d1 d3.
Proof.
  intros a1 a2 a3 d1 d2 d3 H12 H23.
  constructor;
  induction H12; induction H23; subst.
  intros b. rewrite e. rewrite e1.
  reflexivity. reflexivity.
Qed.
```

Actually proving that this relation holds simply consists of proving that two agents given equivalent decisions will always decide upon the same action (similar to proving program equilibrium) no matter what set of arbitrary beliefs is given them -- hence the usage of a second-order forall. Proving this does not require actually running the decision function of either agent. Instead, it requires demonstrating that the abstract-syntax trees of the two decision functions can be made to unify, up to the renaming of universally-quantified variables. This is what allows us to prove the entanglement relation's symmetry and transitivity: our assumptions give us rewritings known to hold over the universally-quantified agent functions and decisions, thus letting us employ unification as a proof tool without knowing what specific functions we might be handling.

Thanks to employing the unification of syntax trees rather than the actual running of algorithms, we can conservatively extend Causal Decision Theory with logical nodes and entanglement to adequately handle timeless decision-making, without any recourse to retrocausality or to the potentially infinite loops of Sicilian Reasoning. (Potential applications of timeless decision-making to win at Ro Sham Bo remain an open matter for the imagination.)

Decision-theoretically, since our relation doesn't have to know anything about the given functions other than `(forall (b: Beliefs), a1 b d = a2 b d)`, we can test whether our relationship holds over any two logical/algorithm nodes in an arbitrary causal graph, since all such nodes can be written as functions from their causal inputs to their logical output. We thus do not need a particular conception of what constitutes an "agent" in order to make decisions rigorously: we only need to know what decision we are making, and where in a given causal graph we are making it. From there, we can use simple (though inefficient) pairwise testing to find the equivalence class of all logical nodes in the causal graph equivalent to our decision node, and then select a utility-maximizing output for each of those nodes using the logic of ordinary Causal Decision Theory.
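The pairwise testing step can be sketched as a simple filter over the list of logical nodes. This is only an illustration under a strong assumption: `entangled_b` is a hypothetical boolean test standing in for a successful proof search, since entanglement is not decidable in general. `Node` and `equiv_class` are likewise names introduced here.

```coq
Require Import List.
Import ListNotations.

Section EquivClass.
  (* Hypothetical: the type of logical/algorithm nodes in the causal graph. *)
  Variable Node : Type.
  (* Hypothetical: returns true when a proof of entanglement was found. *)
  Variable entangled_b : Node -> Node -> bool.

  (* Keep every node that tests as entangled with our own decision node. *)
  Definition equiv_class (me : Node) (nodes : list Node) : list Node :=
    filter (entangled_b me) nodes.
End EquivClass.

(* Toy check, using equality on nat as a stand-in for the entanglement test. *)
Example equiv_class_demo :
  equiv_class nat Nat.eqb 2 [1; 2; 3; 2] = [2; 2].
Proof. reflexivity. Qed.
```

Comparing one decision node against every other node costs one test per node; partitioning the whole graph this way is quadratic in the number of nodes, which is the inefficiency noted above.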

The slogan of a Causal Decision Theory with Entanglement (CDT+E) can then be summed up as, "select the decision which maximizes utility for the equivalence class of nodes to which I belong, with all of us acting and exerting our causal effects in concert, across space and time (but subject to our respective belief structures)."

## The performance of CDT with entanglement on common problems

While we have not yet actually programmed a software agent with a CDT+E decision algorithm over Bayesian causal graphs (any readers who can point us to a corpus of preexisting source code for building, testing, and reasoning about decision-theory algorithms will be much appreciated, as we can then replace this wordy section with a formal evaluation), we can provide informal but still somewhat rigorous explanations of what it should do on several popular problems and why.

First, the simplest case: when a CDT+E agent is placed into Newcomb's Problem, provided that the causal graph expresses the "agenty-ness" of whatever code Omega runs to predict our agent's actions, both versions of the agent (the "simulated" and the "real") will look at the causal graph they are given, detect their entanglement with each-other via pairwise checking and proof-searching (which may take large amounts of computational power), and subsequently restrict their decision-making to choose the best outcome over worlds where they both make the same decision. This will lead the CDT+E agent to take only the opaque box (one-boxing) and win $1,000,000. This is the same behavior for the same reasons as is obtained with Timeless Decision Theory, but with less human intervention in the reasoning process.

Provided that the CDT+E agent maintains some model of past events in its causal network, the Parfit’s Hitchhiker Problem trivially falls to the same reasoning as found in the original Newcomb’s Problem.

Furthermore, two CDT+E agents placed into the one-shot Prisoners' Dilemma and given knowledge of each-other's algorithms as embodied logical nodes in the two causal graphs will notice that they are entangled, choose the most preferable action over worlds in which both agents choose identically, and thus choose to cooperate. Should a CDT+E agent playing the one-shot Prisoner's Dilemma against an arbitrary agent with potentially non-identical code fail to prove entanglement with its opponent (fail to prove that its opponent's decisions mirror its own, up to differences in beliefs), it will refuse to trust its opponent and defect. A more optimal agent for the Prisoners' Dilemma would in fact demand from itself a proof that either it is or is not entangled with its opponent, and would be able to reason specifically about worlds in which the decisions made by two nodes cannot be the same. Doing so requires the Principle of the Excluded Middle, an axiom not normally used in the constructive logic of automated theorem-proving systems.
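For readers less familiar with Coq, the following sketch shows what opting in to the excluded middle looks like. It relies on the standard library's `Classical` module and its axiom `classic`; the lemma name `entangled_or_not` is a hypothetical label, and `E` stands for whatever entanglement proposition the agent is reasoning about.

```coq
Require Import Classical.

(* For any proposition E (for instance "I am entangled with my opponent"),
   classical logic lets the agent split into the two cases E and ~E
   without having a proof of either. *)
Lemma entangled_or_not (E : Prop) : E \/ ~ E.
Proof. apply classic. Qed.

(* Lists the axioms the lemma depends on; here, the axiom `classic`. *)
Print Assumptions entangled_or_not.
```

Without the `Require Import Classical.` line, the same lemma cannot be proved for an arbitrary `E`, which is the sense in which the axiom is not normally available in constructive proof search.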

Lastly, different versions of CDT+E yield interestingly different results in the Counterfactual Mugging Problem. Let us assume that the causal graph given to the agent contains three logical nodes: the actual agent making its choice to pay Omega $100, Omega's prediction of what the agent will do in this case, and Omega's imagination of the agent receiving $1,000 had the coin come up the other way. The version of the entanglement relation here quantifies over decisions themselves at the first-order level, and thus the two versions of the agent who are dealing with the prospect of giving Omega $100 will become entangled. Despite being entangled, they will see no situation of any benefit to themselves, and will refuse to pay Omega the money. However, consider the stricter definition of entanglement given below:

```coq
Inductive strongly_entangled {Beliefs Decision Action} (a1 a2: Agent Beliefs Decision Action) :=
  | strong_ent : (forall (b: Beliefs) (d: Decision), a1 b d = a2 b d) -> strongly_entangled a1 a2.
```

This definition says that two agents are strongly entangled when they yield the same actions for every possible pair of beliefs and decision problem that can be given to them. This continues to match our original intuition regarding decision entanglement: that we are dealing with the same algorithm (agent), with the same values, being instantiated at multiple locations in time and space. It is somewhat stronger than the reasoning behind Timeless Decision Theory: it can recognize two instantiations of the same agent that face two different decisions, and enable them to reason that they are entangled with each other.
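That strong entanglement is an equivalence relation can be checked with proofs along the following lines. This is a sketch written for this purpose, not the authors' original proof script. The definition is restated so the block is self-contained, with the sort `Prop` made explicit and the constructor given the illustrative name `strong_ent`.

```coq
Definition Agent (Beliefs Decision Action : Type) : Type :=
  Beliefs -> Decision -> Action.

Inductive strongly_entangled {Beliefs Decision Action : Type}
    (a1 a2 : Agent Beliefs Decision Action) : Prop :=
| strong_ent :
    (forall (b : Beliefs) (d : Decision), a1 b d = a2 b d) ->
    strongly_entangled a1 a2.

Theorem strongly_entangled_reflexive {B D A : Type} :
  forall (a : Agent B D A), strongly_entangled a a.
Proof.
  intros a. constructor. intros b d. reflexivity.
Qed.

Theorem strongly_entangled_symmetric {B D A : Type} :
  forall (a1 a2 : Agent B D A),
    strongly_entangled a1 a2 -> strongly_entangled a2 a1.
Proof.
  intros a1 a2 [H]. constructor. intros b d. symmetry. apply H.
Qed.

Theorem strongly_entangled_transitive {B D A : Type} :
  forall (a1 a2 a3 : Agent B D A),
    strongly_entangled a1 a2 ->
    strongly_entangled a2 a3 ->
    strongly_entangled a1 a3.
Proof.
  intros a1 a2 a3 [H12] [H23]. constructor. intros b d.
  rewrite H12. apply H23.
Qed.
```

With no separate hypothesis relating two decisions, each proof has a single equation to discharge.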

Under this stronger version of the entanglement relation (whose proofs for being an equivalence relation are somewhat simpler, by the way), a CDT+E agent given the Counterfactual Mugging will recognize itself as entangled not only with the predicted factual version of itself that might give Omega $100, but also with the predicted counterfactual version of itself that receives $1000 on the alternate coin flip. Each instance of the agent then independently computes the same appropriate tuple of output actions to maximize profit across the entire equivalence class (namely: predicted-factual gives $100, real-factual gives $100, predicted-counterfactual receives $1000).
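The profit comparison behind that tuple can be written out directly. The sketch below assumes a fair coin and that Omega predicts the agent's policy correctly; the text implies both, but they are stated here as assumptions. It compares the two policies available to the whole equivalence class, in dollars:

```latex
\begin{align*}
\mathbb{E}[U \mid \text{policy: pay}]    &= \tfrac{1}{2}\,(-100) + \tfrac{1}{2}\,(+1000) = 450, \\
\mathbb{E}[U \mid \text{policy: refuse}] &= \tfrac{1}{2}\,(0) + \tfrac{1}{2}\,(0) = 0 .
\end{align*}
```

An agent that evaluates only the branch in which it is asked for $100 sees a comparison of $-100$ against $0$, which is why the weaker relation leads it to refuse.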

Switching entirely to the stronger version of entanglement would cause a CDT+E agent to lose certain games requiring cooperation with other agents that are even trivially different (for instance, if one agent likes chocolate and the other hates it, they are not strongly entangled). These games remain winnable with the weaker, original form of entanglement.

## Future research

Future research could represent the probabilistic possibility of entanglement within a causal graph by writing down multiple parallel logical/algorithm nodes as children of the same parent, each of which exists and acts with a probability conditional on the outcome of the parent node. A proof engine extended with probabilities over logical sentences (which, to the authors' knowledge, is not yet accomplished for second-order constructive logics of the kind used here) could also begin to assign probabilities to entanglement between logical/algorithm nodes. These probabilistic beliefs can then integrate into the action-selection algorithm of Causal Decision Theory just like any other probabilistic beliefs; the case of pure logic and pure proof from axioms merely constitutes assigning a degenerate probability of 1.0 to some belief.

Previous researchers have noted that decision-making over probabilistic acausal entanglement with other agents can be used to represent the notion of "universalizability" from Kantian deontological ethics. We note that entanglements with decision nodes in the past and future of a single given agent actually lead to behavior not unlike a "virtue ethics" (that is, the agent will start trying to enforce desirable properties up and down its own life history). When we begin to employ probabilities on entanglement, the Kantian and virtue-ethical strategies will become more or less decision-theoretically dominant based on the confidence with which CDT+E agents believe they are entangled with other agents or with their past and future selves.

Acausal trade/cooperation with agents other than the given CDT+E agent itself can also be considered, at least under the weaker definition of entanglement. In such cases, seemingly undesirable behaviors such as subjection to acausal versions of Pascal's Mugging could appear. However, entanglements (whether Boolean, constructive, or probabilistically believed-in) occur between logical/decision nodes in the causal graph, which are linked by edges denoting conditional probabilities. Each CDT+E agent will thus weight the other in accordance with their beliefs about the probability mass of causal link from one to the other, making acausal Muggings have the same impact on decision-making as normal ones.

The discovery that games can have different outcomes under different versions of entanglement leads us to believe that our current concept of entanglement between agents and decisions is incomplete. We believe it is possible to build a form of entanglement that will pay Omega in the Counterfactual Mugging without trivially losing at the Prisoners’ Dilemma (as strong entanglement can), but our current attempts to do so sacrifice the transitivity of entanglement. We do not yet know if there are any game-theoretic losses inherent in that sacrifice. Still, we hope that further development of the entanglement concept can lead to a decision theory that will more fully reflect the "timeless" decision-making intuition of retrospectively detecting rational precommitments and acting according to them in the present.

CDT+E opens up room for a fully formal and algorithmic treatment of the "timeless" decision-making processes proposed by Yudkowsky, including acausal "communication" (regarding symmetry or nonsymmetry) and acausal trade in general. However, like the original Timeless Decision Theory, it still does not actually have an algorithmic process for placing the logical/decision nodes into the causal graph -- only for dividing the set of all such nodes into equivalence classes based on decision entanglement. Were such an algorithmic process to be found, it could be used by an agent to locate itself within its model of the world via the stronger definition of entanglement. This could potentially reduce the problem of naturalizing induction to the subproblems of building a causal model that contains logical or algorithmic nodes, locating the node in the present model whose decisions are strongly entangled with those of the agent, and then proceeding to engage in "virtue ethical" planning for near-future probabilistically strongly-entangled versions of the agent's logical node up to the agent's planning horizon.

## Acknowledgements

The authors would like to thank Joshua and Benjamin Fox for their enlightening lectures on Updateless Decision Theory, and additionally thank Benjamin Fox in particular for his abundant knowledge, deep intuition and clear guidance regarding acausal decision-making methods that actually win. Both Benjamin Fox and David Steinberg have our thanks for initial reviewing and help clarifying the text.

## Probability and radical uncertainty

In the previous article in this sequence, I conducted a thought experiment in which simple probability was not sufficient to choose how to act. Rationality required reasoning about *meta-probabilities*, the probabilities of probabilities.

Relatedly, lukeprog has a brief post that explains how this matters; a long article by HoldenKarnofsky makes meta-probability central to utilitarian estimates of the effectiveness of charitable giving; and Jonathan_Lee, in a reply to that, has used the same framework I presented.

In my previous article, I ran thought experiments that presented you with various colored boxes you could put coins in, gambling with uncertain odds.

The last box I showed you was blue. I explained that it had a fixed but unknown probability of a twofold payout, uniformly distributed between 0 and 0.9. The overall probability of a payout was 0.45, so the expectation value for gambling was 0.9—a bad bet. Yet your optimal strategy was to gamble a bit to figure out whether the odds were good or bad.
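The figures for the blue box follow from a short calculation. Writing $p$ for the unknown payout probability (a symbol introduced here), with $p$ uniform on $[0, 0.9]$ and a twofold payout per coin:

```latex
\begin{align*}
\mathbb{E}[p] &= \int_{0}^{0.9} p \cdot \frac{1}{0.9}\, dp = \frac{0.9}{2} = 0.45, \\
\mathbb{E}[\text{return per coin}] &= 2 \cdot \mathbb{E}[p] = 0.9 < 1 .
\end{align*}
```

A single blind gamble therefore loses 0.1 coins on average, while a box whose true $p$ exceeds 0.5 is worth playing repeatedly, which is what makes a little exploratory gambling worthwhile.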

Let’s continue the experiment. I hand you a black box, shaped rather differently from the others. Its sealed faceplate is carved with runic inscriptions and eldritch figures. “I find this one *particularly* interesting,” I say.

## Probability, knowledge, and meta-probability

This article is the first in a sequence that will consider situations where probability estimates are not, by themselves, adequate to make rational decisions. This one introduces a "meta-probability" approach, borrowed from E. T. Jaynes, and uses it to analyze a gambling problem. This situation is one in which reasonably straightforward decision-theoretic methods suffice. Later articles introduce increasingly problematic cases.

## Evidential Decision Theory, Selection Bias, and Reference Classes

**See also:** Does Evidential Decision Theory really fail Solomon's Problem?, What's Wrong with Evidential Decision Theory?

It seems to me that the examples usually given of decision problems where EDT makes the wrong decisions are really examples of performing Bayesian updates incorrectly. The basic problem seems to be that naive EDT ignores a *selection bias* when it assumes that an agent that has just performed an action should be treated as a random sample from the population of all agents who have performed that action. Said another way, naive EDT agents make some unjustified assumptions about what *reference classes* they should put themselves into when considering counterfactuals. A more sophisticated Bayesian agent should make neither of these mistakes, and correcting them should not in principle require moving beyond EDT but just becoming less naive in applying it.

### Elaboration

Recall that an EDT agent attempts to maximize conditional expected utility. The main criticism of EDT is that naively computing conditional probabilities leads to the conclusion that you should perform actions which *are good news* upon learning that they happened, as opposed to actions which *cause good outcomes* (what CDT attempts to do instead). For a concrete example of the difference, let's take the smoking lesion problem:

Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer.

Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?

In the smoking lesion problem, smoking is bad news, but it doesn't cause a bad outcome: learning that someone smokes, in the absence of further information, increases your posterior probability that they have the lesion and therefore cancer, but choosing to smoke cannot in fact alter whether you have the lesion / cancer or not. Naive EDT recommends not smoking, but naive CDT recommends smoking, and in this case it seems that naive CDT's recommendation is correct and naive EDT's recommendation is not.
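To make the contrast concrete, here is a minimal sketch of the two calculations. Every number in it (the base rate of the lesion, the conditional probabilities, the utilities) is a made-up assumption, not part of the original problem statement; only the structure matters.

```python
# Hypothetical parameters for a Smoker's Lesion world.
P_LESION = 0.2
P_SMOKE_GIVEN = {True: 0.8, False: 0.2}    # P(smoke | lesion present?)
P_CANCER_GIVEN = {True: 0.9, False: 0.1}   # P(cancer | lesion present?)
U_SMOKE = 1.0                              # small pleasure from smoking
U_CANCER = -100.0                          # large cost of cancer


def p_lesion_given_action(smoke):
    """Naive EDT update: treat the agent as a random sample of people
    who took this action and apply Bayes' rule."""
    def likelihood(lesion):
        p = P_SMOKE_GIVEN[lesion]
        return p if smoke else 1 - p
    numerator = likelihood(True) * P_LESION
    return numerator / (numerator + likelihood(False) * (1 - P_LESION))


def expected_utility(smoke, p_lesion):
    p_cancer = (p_lesion * P_CANCER_GIVEN[True]
                + (1 - p_lesion) * P_CANCER_GIVEN[False])
    return (U_SMOKE if smoke else 0.0) + U_CANCER * p_cancer


def naive_edt_choice():
    # Conditions the lesion probability on the action being considered.
    return max([True, False],
               key=lambda a: expected_utility(a, p_lesion_given_action(a)))


def cdt_choice():
    # The action cannot change the lesion, so the prior is used for both options.
    return max([True, False], key=lambda a: expected_utility(a, P_LESION))


print("naive EDT smokes:", naive_edt_choice())
print("CDT smokes:", cdt_choice())
```

With these assumed numbers the naive EDT agent abstains and the CDT agent smokes, matching the recommendations described above.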

The naive EDT agent's reasoning process involves considering the following counterfactual: "if I observe myself smoking, that increases my posterior probability that I have the lesion and therefore cancer, and that would be bad. Therefore I will not smoke." But it seems to me that in this counterfactual, the naive EDT agent -- who smokes and then glumly concludes that there is an increased probability that they have cancer -- is performing a Bayesian update incorrectly, and that the incorrectness of this Bayesian update, rather than any fundamental problem with making decisions based on conditional probabilities, is what causes the naive EDT agent to perform poorly.

Here are some other examples of this kind of Bayesian update, all of which seem obviously incorrect to me. They lead to silly decisions because they are silly updates.

- "If I observe myself throwing away expensive things, that increases my posterior probability that I am rich and can afford to throw away expensive things, and that would be good. Therefore I will throw away expensive things." (This example requires that you have some uncertainty about your finances -- perhaps you never check your bank statement and never ask your boss what your salary is.)
- "If I observe myself not showering, that increases my posterior probability that I am clean and do not need to shower, and that would be good. Therefore I will not shower." (This example requires that you have some uncertainty about how clean you are -- perhaps you don't have a sense of smell or a mirror.)
- "If I observe myself playing video games, that increases my posterior probability that I don't have any work to do, and that would be good. Therefore I will play video games." (This example requires that you have some uncertainty about how much work you have to do -- perhaps you write this information down and then forget it.)

### Selection Bias

Earlier I said that *in the absence of further information*, learning that someone smokes increases your posterior probability that they have the lesion and therefore cancer in the smoking lesion problem. But when a naive EDT agent is deciding what to do, they *have* further information: in the counterfactual where they're smoking, they know that they're smoking because they're in a counterfactual about what would happen if they smoked (or something like that). This information should screen off inferences about other possible causes of smoking, which is perhaps clearer in the bulleted examples above. If you consider what would happen if you threw away expensive things, you know that you're doing so because you're considering what would happen if you threw away expensive things and not because you're rich.

Failure to take this information into account is a kind of selection bias: a naive EDT agent considering the counterfactual where they perform some action treats itself as a random sample from the population of similar agents who have performed such actions, but it is not in fact such a random sample! The sampling procedure, which consists of actually performing an action, is undoubtedly biased.

### Reference Classes

Another way to think about the above situation is that a naive EDT agent chooses inappropriate reference classes: when an agent performs an action, the appropriate reference class is not all other agents who have performed that action. It's unclear to me exactly what it is, but at the very least it's something like "other sufficiently similar agents who have performed that action under sufficiently similar circumstances."

This is actually very easy to see in the smoker's lesion problem because of the following observation (which I think I found in Eliezer's old TDT writeup): suppose the world of the smoker's lesion is populated entirely with naive EDT agents who do not know whether or not they have the lesion. Then the above argument suggests that none of them will choose to smoke. But if that's the case, then where does the correlation between the lesion and smoking come from? Any agents who smoke are either not naive EDT agents or know whether they have the lesion. In either case, that makes them inappropriate members of the reference class any reasonable Bayesian agent should be using.

Furthermore, if the naive EDT agents collectively decide to become slightly less naive and restrict their reference class to each other, they now find that smoking no longer gives any information about whether they have the lesion or not! This is a kind of reflective inconsistency: the naive recommendation not to smoke in the smoker's lesion problem has the property that, if adopted by a population of naive EDT agents, it breaks the correlations upon which the recommendation is based.
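The reflective inconsistency can be illustrated with a toy population count. This is only a sketch under strong assumptions: the population size, the lesion rate, and the idea of modelling a decision procedure as a function from "has lesion" to "smokes" are all hypothetical simplifications.

```python
from fractions import Fraction


def make_population(n, n_lesion, policy):
    """Build n agents, the first n_lesion of which have the lesion.
    `policy` maps lesion status to whether the agent smokes."""
    return [{"lesion": i < n_lesion, "smokes": policy(i < n_lesion)}
            for i in range(n)]


def p_lesion_given_smokes(population):
    """Fraction of smokers who have the lesion, or None if nobody smokes."""
    smokers = [agent for agent in population if agent["smokes"]]
    if not smokers:
        return None
    return Fraction(sum(agent["lesion"] for agent in smokers), len(smokers))


# Setup-style world: smoking is driven by the lesion itself.
lesion_driven = make_population(1000, 200, policy=lambda lesion: lesion)
# Everyone runs the same procedure, and it says "smoke" regardless of lesion.
all_smoke = make_population(1000, 200, policy=lambda lesion: True)
# Everyone runs the same procedure, and it says "abstain" (naive EDT).
all_abstain = make_population(1000, 200, policy=lambda lesion: False)

print(p_lesion_given_smokes(lesion_driven))  # smoking is strong evidence
print(p_lesion_given_smokes(all_smoke))      # smoking carries no information
print(p_lesion_given_smokes(all_abstain))    # no smokers to condition on
```

Once the action is fixed by a shared decision procedure rather than by the lesion, conditioning on smoking either returns the base rate or has nothing to condition on.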

### The Tickle Defense

As it happens, there is a standard counterargument in the decision theory literature to the claim that EDT recommends not smoking in the smoking lesion problem. It is known as the "tickle defense," and runs as follows: in the smoking lesion problem, what an EDT agent should be updating on is not the action of smoking but an internal desire, or "tickle," prompting it to smoke, and once the presence or absence of such a tickle has been updated on it screens off any information gained by updating on the act of smoking or not smoking. So EDT + Tickles smokes on the smoking lesion problem. (Note that this prescription *also* has the effect of breaking the correlation claimed in the setup of the smoking lesion problem among a population of EDT + Tickles agents who don't know whether they have the lesion or not. So maybe there's just something wrong with the smoking lesion problem.)
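The screening-off claim can be checked numerically in a toy model. The sketch below assumes a causal chain lesion → tickle → smoking, in which the lesion affects smoking only through the tickle; the chain structure and all probabilities are illustrative assumptions rather than part of the standard problem.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical parameters for the chain lesion -> tickle -> smoke.
P_L = F(1, 5)
P_T_GIVEN_L = {True: F(9, 10), False: F(1, 10)}
P_S_GIVEN_T = {True: F(4, 5), False: F(1, 5)}


def joint(lesion, tickle, smoke):
    """Joint probability of one complete world."""
    p = P_L if lesion else 1 - P_L
    p *= P_T_GIVEN_L[lesion] if tickle else 1 - P_T_GIVEN_L[lesion]
    p *= P_S_GIVEN_T[tickle] if smoke else 1 - P_S_GIVEN_T[tickle]
    return p


def p_lesion(**evidence):
    """P(lesion | evidence), by brute-force enumeration of all eight worlds."""
    worlds = [dict(lesion=l, tickle=t, smoke=s)
              for l, t, s in product([True, False], repeat=3)]
    consistent = [w for w in worlds
                  if all(w[k] == v for k, v in evidence.items())]
    total = sum(joint(**w) for w in consistent)
    with_lesion = sum(joint(**w) for w in consistent if w["lesion"])
    return with_lesion / total


# Without the tickle, smoking is evidence about the lesion...
print(p_lesion(smoke=True), p_lesion(smoke=False))
# ...but once the tickle is known, the act of smoking adds nothing.
print(p_lesion(tickle=True), p_lesion(tickle=True, smoke=True),
      p_lesion(tickle=True, smoke=False))
```

In this model the last three probabilities coincide, which is the sense in which the tickle screens off the action.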

The tickle defense is good in that it encourages ignoring less information than naive EDT, but it strikes me as a patch covering up part of a more general problem, namely the problem of how to choose appropriate reference classes when performing Bayesian updates (or something like that). So I don't find it a satisfactory rescuing of EDT. It doesn't help that there's a more sophisticated version known as the "meta-tickle defense" that recommends two-boxing on Newcomb's problem.

### Sophisticated EDT?

What does a more sophisticated version of EDT, taking the above observations into account, look like? I don't know. I suspect that it looks like some version of TDT / UDT, where TDT corresponds to something like trying to update on "being the kind of agent who outputs this action in this situation" and UDT corresponds to something more mysterious that I haven't been able to find a good explanation of yet, but I haven't thought about this much. If someone else has, let me know.

Here are some vague thoughts. First, I think this comment by Stuart_Armstrong is right on the money:

"I've found that, in practice, most versions of EDT are underspecified, and people use their intuitions to fill the gaps in one direction or the other."

A "true" EDT agent needs to update on all the evidence they've ever observed, and it's very unclear to me how to do this in practice. So it seems that it's difficult to claim with much certainty that EDT will or will not do a particular thing in a particular situation.

CDT-via-causal-networks and TDT-via-causal-networks seem like reasonable candidates for more sophisticated versions of EDT in that they formalize the intuition above about screening off possible causes of a particular action. TDT seems like it better captures this intuition in that it better attempts to update on the cause of an action in a hypothetical about that action (the cause being that TDT outputs that action). My intuition here is that it should be possible to see causal networks as arising naturally out of Bayesian considerations, although I haven't thought about this much either.

AIXI might be another candidate. Unfortunately, AIXI can't handle the smoking lesion problem because it models itself as separate from the environment, whereas a key point in the smoking lesion problem is that an agent in the world of the smoking lesion has some uncertainty about its innards, regarded as part of its environment. Fully specifying sophisticated EDT might involve finding a version of AIXI that models itself as part of its environment.

## Robust Cooperation in the Prisoner's Dilemma

I'm proud to announce the preprint of Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic, a joint paper by Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire (me), and Eliezer Yudkowsky.

This paper was one of three projects to come out of the 2nd MIRI Workshop on Probability and Reflection in April 2013, and had its genesis in ideas about formalizations of decision theory that have appeared on LessWrong. (At the end of this post, I'll include links for further reading.)

Below, I'll briefly outline the problem we considered, the results we proved, and the (many) open questions that remain. Thanks in advance for your thoughts and suggestions!

## Background: Writing programs to play the PD with source code swap

(If you're not familiar with the Prisoner's Dilemma, see here.)

The paper concerns the following setup, which has come up in academic research on game theory: say that you have the chance to write a computer program **X**, which takes in one input and returns either *Cooperate* or *Defect*. This program will face off against some other computer program **Y**, but with a twist: **X** will receive the source code of **Y** as input, and **Y** will receive the source code of **X** as input. And you will be given your program's winnings, so you should think carefully about what sort of program you'd write!

Of course, you could simply write a program that defects regardless of its input; we call this program **DefectBot**, and call the program that cooperates on all inputs **CooperateBot**. But with the wealth of information afforded by the setup, you might wonder if there's some program that might be able to achieve mutual cooperation in situations where **DefectBot** achieves mutual defection, without thereby risking a sucker's payoff. (Douglas Hofstadter would call this a perfect opportunity for superrationality...)

## Previously known: CliqueBot and FairBot

And indeed, there's a way to do this that's been known since at least the 1980s. You can write a computer program that knows its own source code, compares it to the input, and returns *C* if and only if the two are identical (and *D* otherwise). Thus it achieves mutual cooperation in one important case where it intuitively ought to: when playing against itself! We call this program **CliqueBot**, since it cooperates only with the "clique" of agents identical to itself.

There's one particularly irksome issue with **CliqueBot**, and that's the fragility of its cooperation. If two people write functionally analogous but syntactically different versions of it, those programs will defect against one another! This problem can be patched somewhat, but not fully fixed. Moreover, mutual cooperation might be the best strategy against some agents that are not even functionally identical, and extending this approach requires you to explicitly delineate the list of programs that you're willing to cooperate with. Is there a more flexible and robust kind of program you could write instead?
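The source-swap setup, CliqueBot, and its fragility can be sketched in a few lines. This is a toy illustration, not the construction from the paper: programs are Python source strings, and as a shortcut each program is handed its own source as a second argument instead of reconstructing it by quining. The names `load` and `play` are hypothetical.

```python
# Each program is a source string defining bot(opponent_source, own_source).
DEFECT_BOT = 'def bot(opponent_source, own_source):\n    return "D"\n'
COOPERATE_BOT = 'def bot(opponent_source, own_source):\n    return "C"\n'
CLIQUE_BOT = (
    'def bot(opponent_source, own_source):\n'
    '    return "C" if opponent_source == own_source else "D"\n'
)
# Functionally the same test, written with the operands swapped.
CLIQUE_BOT_VARIANT = (
    'def bot(opponent_source, own_source):\n'
    '    return "C" if own_source == opponent_source else "D"\n'
)


def load(source):
    """Compile a program and give it access to its own source text."""
    namespace = {}
    exec(source, namespace)
    bot = namespace["bot"]
    return lambda opponent_source: bot(opponent_source, source)


def play(source_x, source_y):
    """Run X on Y's source and Y on X's source; return both moves."""
    return load(source_x)(source_y), load(source_y)(source_x)


print(play(CLIQUE_BOT, CLIQUE_BOT))          # mutual cooperation with itself
print(play(CLIQUE_BOT, CLIQUE_BOT_VARIANT))  # defects against a trivial rewrite
print(play(CLIQUE_BOT, COOPERATE_BOT))       # exploits CooperateBot
print(play(CLIQUE_BOT, DEFECT_BOT))          # is not exploited by DefectBot
```

The second line of output shows the fragility described above: two programs that behave identically still defect against each other because their source text differs.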

As it turns out, there is: in a 2010 post on LessWrong, cousin_it introduced an algorithm that we now call **FairBot**. Given the source code of **Y**, **FairBot** searches for a proof (of less than some large fixed length) that **Y** returns *C* when given the source code of **FairBot**, and then returns *C* if and only if it discovers such a proof (otherwise it returns *D*). Clearly, if our proof system is consistent, **FairBot** only cooperates when that cooperation will be mutual. But the really fascinating thing is what happens when you play two versions of **FairBot** against each other. Intuitively, it seems that *either* mutual cooperation or mutual defection would be stable outcomes, but it turns out that if their limits on proof lengths are sufficiently high, they will achieve mutual cooperation!

The proof that they mutually cooperate follows from a bounded version of Löb's Theorem from mathematical logic. (If you're not familiar with this result, you might enjoy Eliezer's Cartoon Guide to Löb's Theorem, which is a correct formal proof written in much more intuitive notation.) Essentially, the asymmetry comes from the fact that both programs are searching for the same outcome, so that a short proof that one of them cooperates leads to a short proof that the other cooperates, and vice versa. (The opposite is not true, because the formal system can't know it won't find a contradiction. This is a subtle but essential feature of mathematical logic!)
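For readers who want the shape of the argument, here is a sketch of the idealized version, ignoring the proof-length bounds that the actual result has to handle. The letters $A$ and $B$ and the labels on each step are notation introduced here for illustration; the snippet assumes the `amsmath` package.

```latex
% Löb's theorem, with \Box P read as "P is provable in the system":
\[
  \text{if } \vdash \Box P \rightarrow P, \text{ then } \vdash P .
\]
% Idealized FairBot-vs-FairBot argument (no proof-length bound).
% A: "the first FairBot cooperates with the second"
% B: "the second FairBot cooperates with the first"
\begin{align*}
  &\vdash A \leftrightarrow \Box B, \qquad \vdash B \leftrightarrow \Box A
      && \text{(definition of FairBot)} \\
  &\vdash \Box (A \wedge B) \rightarrow \Box A \wedge \Box B
      && \text{(provability distributes over conjunction)} \\
  &\vdash \Box (A \wedge B) \rightarrow A \wedge B
      && \text{(since } \Box A \rightarrow B \text{ and } \Box B \rightarrow A\text{)} \\
  &\vdash A \wedge B
      && \text{(Löb's theorem with } P = A \wedge B\text{)}
\end{align*}
```

The same reasoning does not go through for mutual defection, because defection corresponds to *failing* to find a proof, and the system cannot prove its own failure to prove something.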

## Generalization: Modal Agents

Unfortunately, **FairBot** isn't what I'd consider an ideal program to write: it happily cooperates with **CooperateBot**, when it could do better by defecting. This is problematic because in real life, the world isn't separated into agents and non-agents, and any natural phenomenon that doesn't predict your actions can be thought of as a **CooperateBot** (or a **DefectBot**). You don't want your agent to be making concessions to rocks that happened not to fall on them. (There's an important caveat: some things have utility functions that you care about, but don't have sufficient ability to predicate their actions on yours. In that case, though, it wouldn't be a true Prisoner's Dilemma if your values actually prefer the outcome (*C*,*C*) to (*D*,*C*).)

However, **FairBot** belongs to a promising class of algorithms: those that decide on their action by looking for short proofs of logical statements that concern their opponent's actions. In fact, there's a really convenient mathematical structure that's analogous to the class of such algorithms: the modal logic of provability (known as GL, for Gödel-Löb).

So that's the subject of this preprint: **what can we achieve in decision theory by considering agents defined by formulas of provability logic?**

## Decision Theory FAQ

Co-authored with crazy88. Please let us know when you find mistakes, and we'll fix them. Last updated 03-27-2013.

**Contents**:

- 1. What is decision theory?
- 2. Is the rational decision always the right decision?
- 3. How can I better understand a decision problem?
- 4. How can I measure an agent's preferences?
- 5. What do decision theorists mean by "risk," "ignorance," and "uncertainty"?
- 6. How should I make decisions under ignorance?
- 7. Can decisions under ignorance be transformed into decisions under uncertainty?
- 8. How should I make decisions under uncertainty?
- 9. Does axiomatic decision theory offer any action guidance?
- 10. How does probability theory play a role in decision theory?
- 11. What about "Newcomb's problem" and alternative decision algorithms?

## 1. What is decision theory?

*Decision theory*, also known as *rational choice theory*, concerns the study of preferences, uncertainties, and other issues related to making "optimal" or "rational" choices. It has been discussed by economists, psychologists, philosophers, mathematicians, statisticians, and computer scientists.

We can divide decision theory into three parts (Grant & Zandt 2009; Baron 2008). *Normative* decision theory studies what an ideal agent (a perfectly rational agent, with infinite computing power, etc.) would choose. *Descriptive* decision theory studies how non-ideal agents (e.g. humans) *actually* choose. *Prescriptive* decision theory studies how non-ideal agents can improve their decision-making (relative to the normative model) despite their imperfections.

For example, one's *normative* model might be expected utility theory, which says that a rational agent chooses the action with the highest expected utility. Replicated results in psychology *describe* humans repeatedly *failing* to maximize expected utility in particular, predictable ways: for example, they make some choices based not on potential future benefits but on irrelevant past efforts (the "sunk cost fallacy"). To help people avoid this error, some theorists *prescribe* some basic training in microeconomics, which has been shown to reduce the likelihood that humans will commit the sunk cost fallacy (Larrick et al. 1990). Thus, through a coordination of normative, descriptive, and prescriptive research we can help agents to succeed in life by acting more in accordance with the normative model than they otherwise would.
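As a rough illustration of the normative model mentioned above, the following sketch picks the action with the highest expected utility. The umbrella scenario, its probabilities, and its utilities are invented for the example, and the function names are hypothetical.

```python
def expected_utility(lottery):
    """lottery: list of (probability, utility) pairs; probabilities sum to 1."""
    return sum(p * u for p, u in lottery)


def best_action(actions):
    """Return the name of the action whose lottery has the highest expected utility."""
    return max(actions, key=lambda name: expected_utility(actions[name]))


# Hypothetical decision: 30% chance of rain.
ACTIONS = {
    "take umbrella": [(0.3, 5), (0.7, 8)],     # dry either way, mildly encumbered
    "leave umbrella": [(0.3, -10), (0.7, 10)],  # soaked if it rains, free otherwise
}

for name, lottery in ACTIONS.items():
    print(name, expected_utility(lottery))
print("choose:", best_action(ACTIONS))
```

Note that nothing in the calculation refers to past costs: a sunk cost appears in every branch of every action equally, so it cannot change which action comes out on top.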

This FAQ focuses on normative decision theory. Good sources on descriptive and prescriptive decision theory include Stanovich (2010) and Hastie & Dawes (2009).

Two related fields beyond the scope of this FAQ are game theory and social choice theory. Game theory is the study of conflict and cooperation among multiple decision makers, and is thus sometimes called "interactive decision theory." Social choice theory is the study of making a collective decision by combining the preferences of multiple decision makers in various ways.

This FAQ draws heavily from two textbooks on decision theory: Resnik (1987) and Peterson (2009). It also draws from more recent results in decision theory, published in journals such as *Synthese* and *Theory and Decision*.
