In our jobs as AI safety researchers, we think a lot about what it means to have reasonable beliefs and to make good decisions. This matters because we want to understand how powerful AI systems might behave. It also matters because we ourselves need to know how to make good decisions in light of tremendous uncertainty about how to shape the long-term future.

It seems to us that there is a pervasive feeling in this community that the way to decide which norms of rationality to follow is to pick the ones that win. When it comes to the choice among CDT, EDT, LDT…, we hear that we can simply choose the one that gets the most utility. When we say that perhaps we ought to be imprecise Bayesians, and therefore be clueless about our effects on the long-term future, we hear that imprecise Bayesianism is “outperformed” by other approaches to decision-making.

On the contrary, we think that “winning” or “good performance” offers very little guidance. On any way of making sense of those words, we end up either calling a very wide range of beliefs and decisions “rational”, or reifying an objective that, absent some substantive assumptions, has nothing to do with our terminal goals. So we also need to look to non-pragmatic principles: in the context of epistemology, for example, things like the principle of indifference or Occam’s razor. Crucially, this opens the door to being guided by non-(precise-)Bayesian principles.

“Winning” gives little guidance

We’ll use “pragmatic principles” to refer to principles according to which belief-forming or decision-making procedures should “perform well” in some sense. We’ll look at various pragmatic principles and argue that they provide little action-guidance.

Avoiding dominated strategies

First, let’s review some basic points about common justifications of epistemic and decision-theoretic norms.

A widely used strategy for arguing for norms of rationality involves avoiding dominated strategies. We can all agree that it’s bad to take a sequence of actions that you’re certain is worse for you than something else.[1] And various arguments take the form: if you don’t conform to particular norms of rationality, you are disposed to act in ways that guarantee that you’re worse off than you could be. A number of arguments for Bayesian epistemology and decision theory (Dutch book arguments, arguments for the axioms of representation theorems, and complete class theorems) are like that.
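To make the flavor of these arguments concrete, here is a minimal sketch of the simplest Dutch book. It rests on illustrative assumptions: the agent treats its credence in a proposition as a fair price for a bet paying one unit if that proposition is true, and the function name `net_payoffs` and all numbers are hypothetical.

```python
from fractions import Fraction


def net_payoffs(credence_a, credence_not_a, stake=Fraction(1)):
    """Net payoff, in each possible world, of buying one bet on A and one on not-A.

    Assumption (illustrative): the agent regards credence * stake as a fair
    price for a bet that pays `stake` if the proposition is true, else 0.
    """
    cost = (credence_a + credence_not_a) * stake
    # Exactly one of the two bets pays out, whichever way A turns out.
    return {"A true": stake - cost, "A false": stake - cost}


# Credences in A and not-A sum to 6/5 > 1: a loss of 1/5 in every world.
incoherent = net_payoffs(Fraction(3, 5), Fraction(3, 5))

# Credences sum to 1: the same pair of bets breaks even in every world.
coherent = net_payoffs(Fraction(3, 5), Fraction(2, 5))
```

The sketch only shows that some combinations of betting prices guarantee a loss. It says nothing about which coherent credences to adopt.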

But what these arguments really show is that you are disposed to play a dominated strategy if we cannot model your behavior as if you were a Bayesian with a certain prior and utility function. They don’t say anything about the procedure by which you need to make your decisions. That is, they don’t say that you have to write down precise probabilities and utilities and make decisions by solving for the Bayes-optimal policy with respect to them. They also don’t tell you that you have to behave as if you have any particular prior. The prior that rationalizes your decisions after the fact might have nothing to do with the beliefs you consciously endorse.

One upshot of this is that you can follow an explicitly non-(precise-)Bayesian decision procedure and still avoid dominated strategies. For example, you might explicitly specify beliefs using imprecise probabilities and make decisions using the “Dynamic Strong Maximality” rule, and still be immune to sure losses. Basically, Dynamic Strong Maximality tells you which plans are permissible given your imprecise credences, and you just pick one. And you could do this “picking” using additional substantive principles. Maybe you want to use another rule for decision-making with imprecise credences (e.g., maximin expected utility or minimax regret). Or maybe you want to account for your moral uncertainty (e.g., picking the plan that respects more deontological constraints). Obviously, avoiding dominated strategies alone doesn’t recommend this procedure. But nor does it recommend “pick some precise prior and optimize with respect to it”.

If we want to argue about whether this procedure is justified, we have to argue at the level of the substantive principles it invokes. (For example, maybe at bottom we like a principle of “simplicity”, and think Bayesianism is the most simple/straightforward route to avoiding dominated strategies. But maybe we find the principles justifying imprecise probabilities plus Dynamic Strong Maximality compelling enough to outweigh this consideration.)
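As a rough illustration of how a decision procedure with imprecise credences can work, the sketch below implements ordinary (static) maximality for a single choice, followed by maximin expected utility as one way of picking among the permissible options. This is a simplification and not the Dynamic Strong Maximality rule itself, which concerns sequential plans. The credal set, action names, and utilities are invented for illustration.

```python
def expected_utility(prior, utilities):
    """Expected utility of an action given one precise prior over states."""
    return sum(p * u for p, u in zip(prior, utilities))


def permissible_by_maximality(actions, credal_set):
    """Keep an action unless some rival has strictly higher EU under every prior."""
    permissible = []
    for name, utils in actions.items():
        dominated = any(
            all(
                expected_utility(p, rival) > expected_utility(p, utils)
                for p in credal_set
            )
            for rival_name, rival in actions.items()
            if rival_name != name
        )
        if not dominated:
            permissible.append(name)
    return permissible


def maximin_choice(candidates, actions, credal_set):
    """Among the candidates, pick the action whose worst-case EU is highest."""
    return max(
        candidates,
        key=lambda a: min(expected_utility(p, actions[a]) for p in credal_set),
    )


# Hypothetical credal set: three priors over two states of the world.
credal_set = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)]

# Hypothetical actions, each with a utility in state 1 and in state 2.
actions = {
    "intervene": (10, -10),
    "abstain": (0, 0),
    "waste": (-1, -1),
}

permissible = permissible_by_maximality(actions, credal_set)
choice = maximin_choice(permissible, actions, credal_set)
```

Here "waste" is ruled out because "abstain" beats it under every prior in the set, while neither "intervene" nor "abstain" beats the other under every prior, so both remain permissible. Choosing between them takes a further principle, which in this sketch is maximin expected utility.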

Heuristics

As humans, we can’t implement the Bayesian algorithm anyway. So you might say that this is all beside the point. As bounded agents we’ve got to use heuristics that lead to “good performance”. Unfortunately, we still don’t see a way of making sense of “good performance” that respects our terminal goals and leads to much action-guidance on its own. Here are some things it could mean.

Convergence to high utility. You might say that a heuristic performs well if its performance (in terms of accuracy for beliefs, or utility for decisions) converges sufficiently quickly to a value that is good, in some sense. An example that’s much discussed in the rationality community is logical induction, which uses a kind of asymptotic non-exploitability criterion. Other examples are heuristics for sequential prediction and for exploration in sequential decision-making (multi-armed bandits, etc.). These are often judged by whether, and how fast, their worst-case regret converges to zero.

What these arguments say is basically: “If you try various strategies, look at how well they’ve done based on observed outcomes, and keep using the ones that have done the best, your performance will converge to the best possible performance (in some sense) in the limit of infinite data”. This doesn’t help us at all, for two reasons.

First, the kinds of outcomes we’re interested in for our terminal goals are things like “did this intervention on an advanced AI system lead to a catastrophic outcome?”. We don’t have any direct observations like that, only proxies. So if we want to draw inferences about our terminal utilities, we need additional assumptions about how to generalize from the domains we’ve observed to those we can’t observe (more on this next). Second, these results assume that you have arbitrarily many opportunities to try different strategies: if you fall into a “trap”, you can always try again. But that’s not the case for us, because of lock-in events. We don’t have arbitrarily many opportunities to try out different strategies for making AI less x- or s-risky and see what happens.
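The role of the “no traps” assumption can be seen in a deliberately simple, deterministic sketch. The learner below tries each option once and then commits to whichever looked best. The payoffs, the arm names, and the function `average_regret` are hypothetical, and real regret analyses deal with noisy rewards and more sophisticated algorithms.

```python
def average_regret(payoffs, trap_arm, horizon):
    """Explore-then-commit: pull each arm once, then repeat the best observed arm.

    If `trap_arm` is not None, pulling it once locks every later payoff to 0
    (a toy stand-in for an irreversible lock-in event).
    Returns per-round regret relative to always pulling the best arm.
    """
    arms = list(payoffs)
    observed = {}
    locked = False
    total = 0.0
    for t in range(horizon):
        if t < len(arms):
            arm = arms[t]  # exploration phase: try each arm once
        else:
            arm = max(observed, key=observed.get)  # commit to best observed
        reward = 0.0 if locked else payoffs[arm]
        if arm == trap_arm:
            locked = True  # irreversible from the next round onward
        observed.setdefault(arm, reward)
        total += reward
    best_possible = max(payoffs.values()) * horizon
    return (best_possible - total) / horizon


payoffs = {"a": 0.3, "b": 0.7}

# No trap: the one-off cost of exploring is averaged away as the horizon grows.
safe = average_regret(payoffs, trap_arm=None, horizon=1000)

# Arm "a" is a trap: exploring it once forfeits everything that follows.
trapped = average_regret(payoffs, trap_arm="a", horizon=1000)
```

Without the trap, average regret is 0.4 divided by the horizon and so shrinks toward zero. With the trap, it stays close to 0.7 however long the horizon is, because the very exploration that makes convergence possible is what triggers the lock-in.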

Doing what’s worked well in the past.[2] We often encounter claims that we ought to use some heuristic because it has worked well in the past. Some examples of statements that might be interpreted in this way (though we’re not sure if this is how they were meant):

  • Cluster thinking: “Cluster thinking is more similar to empirically effective prediction methods.”
  • Using precise probabilities. From Lewis: “In the same way our track record of better-than-chance performance warrants us to believe our guesses on hard geopolitical forecasts, it also warrants us to believe a similar cognitive process will give ‘better than nothing’ guesses on which actions tend to be better than others, as the challenges are similar between both.”  

Maybe the most obvious criticism of this notion of winning is that, if you are a longtermist, you haven’t observed your decisions “work well”, in the sense of leading to good aggregate outcomes across all moral patients for all time. But let’s grant for now that there is some important sense in which we can tell whether our practices have worked well before, either in the sense of making good predictions about things we can observe, or leading to good observable consequences according to proxies for our terminal goals.

Presumably we should only trust a heuristic based on its past performance insofar as we have some reason to think that mechanisms similar to those that caused it to work previously are at play in our current problem. That is, past performance isn’t our terminal goal itself, but rather a potential source of information about future performance with respect to our terminal goals. We might think that “go with your gut” is a good heuristic for making interpersonal judgments, but not for predicting the stock market or geopolitical events. And we can give some rough mechanistic account of this. Our understanding of psychology makes it unsurprising that human intuitions about others’ character would do a decent job of tracking truth, but not so much intuitions about stock-picking. (See also Violet Hour’s discussion of how to update on the track record of superforecasters.)

This is not to say that we always have to form detailed mechanistic models to judge whether a heuristic’s performance will generalize. You don’t have to be a hedgehog to agree with what we’re saying. Even the humblest reference class forecaster has to choose a reference class. And how can they do that except by referring to some (perhaps very vague) beliefs about whether the observations in their reference class are generated by similar mechanisms?

This means that the justification must bottom out not just in the heuristic’s historical performance, but also in our beliefs about the mechanisms which lead to the heuristic performing well.[3] And what justifies such beliefs? It can’t just be the historical performance of our belief-forming processes, or we have a regress. In our view, this all has to bottom out in non-pragmatic principles governing the weights we assign to the relevant mechanisms. We won’t get into the relative merits of different principles here, except to say that we doubt plausible principles will often recommend naive extrapolation from some historical reference class. (Cf. writing on the limitations of “outside view” reasoning, e.g., this.)

Fitting pre-theoretic intuitions about correct behavior. For example,[4] some justifications for cluster thinking over sequence thinking might reduce to pre-theoretic intuitions about what kinds of decision patterns should be avoided, and how to avoid them. From Karnofsky:[5]

  • “A cluster-thinking-style ‘regression to normality’ seems to prevent some obviously problematic behavior relating to knowably impaired judgment.”
  • “Sequence thinking seems to tend toward excessive comfort with ‘ends justify the means’ type thinking.”
    • One interpretation of this claim is that we can recognize “ends justify the means” reasoning as bad in its own right, regardless of whether we have evidence of this reasoning being harmful on average historically. (A fanatic might insist that it’s unsurprising if fanatical bets consistently failed to pay off ex post, so we have no such evidence.)

And, when discussing the view that we ought to have imprecise credences and therefore be clueless about many longtermist questions, we’ve often encountered arguments that might be interpreted this way. For example, we’ve heard things along the lines of, “Your epistemology and/or decision rule must be wrong if it implies you’re clueless about whether actively trying to do things that seem good for your values is good”.

Insofar as we think we ought to assess actions by their consequences, however, it’s not clear what the argument is supposed to be here. Of course, intuitions about what kinds of actions lead to good consequences can guide our reasoning. But that is different from saying that whether a decision rule recommends a particular behavior is itself a criterion for the rationality of a decision rule. To us that looks like a rejection of consequentialism.

Non-pragmatic principles 

We’ve now seen how four notions of “winning” — avoiding dominated strategies, good long-run performance, good observed performance, recommending pre-theoretically endorsed behaviors — don’t do much to constrain how an agent forms beliefs or makes decisions. To say more about that, we will need to turn to non-pragmatic principles, endorsed not because they follow from some objective performance criterion but because our philosophical conscience can’t deny them.

Some examples of non-pragmatic principles: 

  • (Precise) principle of indifference. In the absence of any information, assign equal weights to symmetrical possible outcomes (e.g., the faces of a die);
  • Occam’s razor. We should give less weight to hypotheses which posit a greater number of fundamental entities, more complex laws, etc.;[6]
  • Fit with the evidence. We should give more weight to hypotheses that make our observations more probable;
  • Deference. Deference principles are things like, “If X has much more information about Q than me and is at least as competent a reasoner, I should adopt X’s beliefs about Q instead of going with mine”;
  • Imprecision. If our evidence and other epistemic norms don’t pin down a precise credence, then we ought to have an imprecise epistemic attitude, represented by sets of probabilities;            
  • Regularity. We should have credences strictly between 0 and 1 in logically contingent propositions.

Now, as bounded agents, our decisions will usually not be determined by quantified beliefs, even quantified beliefs over very simple models. We will have some vaguer all-things-considered beliefs that dictate our decision. Still, we might think that these norms can provide some guidance for our vague all-things-considered beliefs. For example:

  • (Vague principle of indifference.) “These outcomes seem roughly symmetrical and their values are roughly opposite, so I’ll treat them as not contributing to the overall decision”;
  • (Vague deference.) “She knows much more about this domain than me, and in cases I know of has come to the same reasoned conclusion as me, so I’ll give her opinion in this case a lot more weight than my gut feeling”;
  • (Vague imprecision.) “There are lots of considerations about the value of actions A and B pointing in different ways with no clear way of weighing them; the outputs of my toy models are highly sensitive to seemingly arbitrary differences in parameters; so I’ll regard it as indeterminate whether A is better than B”.

It’s possible to construe, e.g., “doing what’s worked well in the past” as a non-pragmatic principle. As we’ve argued, though, past performance on local goals isn’t what we ultimately care about, so this principle seems poorly motivated. A better motivation for doing what’s worked in the past would be a belief that the mechanisms governing success at goal achievement in past environments will hold in future environments. But this belief is unappealing as a brute constraint; it should instead be grounded in reasons to expect generalization. In principle, those reasons might come from something like Occam’s razor (“the hypothesis that success will generalize across environments is simpler than alternatives”), though we’re skeptical of that route.

Where does that leave us? Well, say you’re persuaded by the axioms of precise probabilism — you think you should have a precise prior. You might use some form of Occam’s razor to get that prior. And the “fit with evidence” principle gets you to Bayesian epistemology. Given a few other principles (see e.g. here), then, your notion of “achieving terminal goals” is “maximizing expected utility with respect to an Occam prior conditionalized on my evidence”. And we can derive other normative standards from other combinations of principles.  
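As a toy version of this combination of principles, the following sketch builds a prior from complexity scores, conditionalizes on evidence, and then maximizes expected utility. Everything specific in it is an assumption made for illustration: the 2^(-complexity) form of the Occam prior, the two coin hypotheses and their complexity scores, and the bet.

```python
from fractions import Fraction


def occam_prior(complexities):
    """Prior proportional to 2 ** -complexity (one illustrative form of Occam's razor)."""
    weights = {h: Fraction(1, 2 ** k) for h, k in complexities.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def conditionalize(prior, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood of the evidence."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


def expected_utility(credences, utility_by_hypothesis):
    """Credence-weighted average of the utility under each hypothesis."""
    return sum(credences[h] * utility_by_hypothesis[h] for h in credences)


# Hypothetical hypotheses about a coin, with made-up complexity scores.
complexities = {"fair": 1, "biased": 3}
heads_chance = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}

prior = occam_prior(complexities)

# Evidence: three heads in a row.
likelihoods = {h: p ** 3 for h, p in heads_chance.items()}
posterior = conditionalize(prior, likelihoods)

# A bet on the next flip pays +1 on heads and -1 on tails; declining pays 0.
bet_utility = {h: 2 * p - 1 for h, p in heads_chance.items()}
eu_bet = expected_utility(posterior, bet_utility)
best_action = "bet" if eu_bet > 0 else "decline"
```

Each step corresponds to a separate non-pragmatic commitment: the shape of the prior, the update rule, and the decision rule. Swapping any one of them yields a different standard for “achieving terminal goals”.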

Conclusion

So, our beliefs and decisions must be grounded in non-pragmatic principles, not just an objective standard of “winning”.

This doesn’t require a realist stance on which principles are best. All the reasons to doubt that our judgments about ethical principles track some mind-independent truth apply here, too. In some sense, probably, anything goes. But as with ethics, we can still reflect on which principles are ultimately most compelling to us. In ethics we need not just say, “Well, I happen to only care about my neighbors, and that’s that”. Likewise, in epistemology/decision theory, we need not shrug and say “Well, these just happen to be my credences/heuristics”.

For our part, we favor a norm of suspending judgment in cases where other norms don’t pin down a belief or decision. As hinted throughout the post, this means that our beliefs, especially those concerning our effects on the long-run future, will often be severely indeterminate. On the most plausible decision rules for indeterminate beliefs, insofar as we are impartially altruistic, this might well leave us clueless about what to do. Without an objective standard of “winning” to turn to, this leaves us searching for new principles that could guide us in the face of indeterminacy. But that’s all for another post.

Acknowledgments

Thanks to Caspar Oesterheld, Martín Soto, Tristan Cook, Michael St. Jules, Sylvester Kollin, Nicolas Macé, and Mia Taylor for input on this post.  

References

Hedden, Brian. 2015. “Time-Slice Rationality.” Mind 124 (494): 449–91.

Soares, Nate, and Benja Fallenstein. 2015. “Toward Idealized Decision Theory.” arXiv:1507.01986 [cs.AI]. http://arxiv.org/abs/1507.01986.

 

  1. ^

     That said, according to “time-slice rationality” (Hedden 2015), there is no unified decision-maker across different time points. Rather, “you” at time 0 are a different decision-maker from “you” at time 1, and what is rational for you-at-time-1 only depends on you-at-time-0 insofar as you-at-time-0 are part of the decision-making environment for you-at-time-1. On this view, then, arguably you-at-time-1 are not rationally obligated to make decisions that would avoid a sure loss from the perspective of you-at-time-0. Of course, if you-at-time-0 are capable of binding you-at-time-1 to an action that avoids a sure loss from your perspective, you ought to do so. (But in this case, it doesn’t seem appropriate to say the action of you-at-time-1 is a “decision” they themselves make in order to avoid a sure loss.)

  2. ^

     As discussed above, a policy of “doing what worked well in the past” might be argued for on the grounds that it leads to good long-term outcomes. But here we’re talking about “having worked well in the past” as a justification that’s independent of long-run performance arguments.

  3. ^

     Cf. “no free lunch theorems”, which can be interpreted in this context as saying that no matter how well a heuristic did in the past, its performance in the future depends on the distribution of future problems.

  4. ^

     See also the discussion of decision theory performance in, e.g., Soares and Fallenstein (2015). You might have a strong intuition it “wins” not to pay in Evidential Blackmail, and this makes you favor causal decision theory over evidential decision theory all else equal (independently of how much you endorse the foundations of causal decision theory, or its historical track record). See Oesterheld here for why these sorts of intuitions are not objective performance metrics for decision theories.

  5. ^

     We aren’t confident that these arguments were meant to be grounded in pre-theoretic intuitions, rather than “doing what’s worked well in the past” above.

  6. ^

     Pragmatic justifications of Occam’s razor are circular, as noted by Yudkowsky: “You could argue that Occam's Razor has worked in the past, and is therefore likely to continue to work in the future.  But this, itself, appeals to a prediction from Occam's Razor. "Occam's Razor works up to October 8th, 2007 and then stops working thereafter" is more complex, but it fits the observed evidence equally well.” Cf. Hume on the circularity of inductive justifications of induction.