In our jobs as AI safety researchers, we think a lot about what it means to have reasonable beliefs and to make good decisions. This matters because we want to understand how powerful AI systems might behave. It also matters because we ourselves need to know how to make good decisions in light of tremendous uncertainty about how to shape the long-term future.

It seems to us that there is a pervasive feeling in this community that the way to decide which norms of rationality to follow is to pick the ones that win. When it comes to the choice between CDT vs. EDT vs. LDT…, we hear we can simply choose the one that gets the most utility. When we say that perhaps we ought to be imprecise Bayesians, and therefore be clueless about our effects on the long-term future, we hear that imprecise Bayesianism is “outperformed” by other approaches to decision-making.

On the contrary, we think that “winning” or “good performance” offers very little guidance. On any way of making sense of those words, we end up either calling a very wide range of beliefs and decisions “rational”, or reifying an objective that has nothing to do with our terminal goals without some substantive assumptions. We also need to look to non-pragmatic principles — in the context of epistemology, for example, things like the principle of indifference or Occam’s razor. Crucially, this opens the door to being guided by non-(precise-)Bayesian principles.

“Winning” gives little guidance

We’ll use “pragmatic principles” to refer to principles according to which belief-forming or decision-making procedures should “perform well” in some sense. We’ll look at various pragmatic principles and argue that they provide little action-guidance.

Avoiding dominated strategies

First, to review some basic points about common justifications of epistemic and decision-theoretic norms:

A widely-used strategy for arguing for norms of rationality involves avoiding dominated strategies. We can all agree that it’s bad to take a sequence of actions that you’re certain is worse for you than something else.[1] And various arguments take the form: If you don’t conform to particular norms of rationality, you are disposed to act in ways that guarantee that you’re worse off than you could be. A number of arguments for Bayesian epistemology and decision theory (Dutch book arguments, arguments for the axioms of representation theorems, and complete class theorems) are like that.
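To make the shape of a Dutch book argument concrete, here is a minimal sketch. It assumes an agent that will buy a ticket paying 1 unit on an event at a price equal to its credence in that event; the function names and the numbers are illustrative, not taken from any particular presentation of the argument.

```python
from fractions import Fraction

def fair_price(credence, stake=1):
    """Price the agent regards as fair for a ticket paying `stake` if the event occurs."""
    return credence * stake

def net_payoffs(cred_a, cred_not_a, stake=1):
    """The agent buys one ticket on A and one on not-A at its own fair prices.

    Returns the agent's net payoff in each possible world.
    """
    cost = fair_price(cred_a, stake) + fair_price(cred_not_a, stake)
    # Exactly one of the two tickets pays out, whichever world is actual.
    return {"A": stake - cost, "not_A": stake - cost}

# Credences in A and not-A that sum to more than 1 (incoherent).
incoherent = net_payoffs(Fraction(3, 5), Fraction(3, 5))
# Credences that sum to exactly 1 (coherent).
coherent = net_payoffs(Fraction(3, 5), Fraction(2, 5))

print(incoherent)  # the same negative payoff in both worlds: a sure loss
print(coherent)    # zero in both worlds: no sure loss from this pair of bets
```

The sketch only shows that this particular pair of bets guarantees a loss for the incoherent agent. It says nothing about how the agent arrived at its credences, which is the gap the next paragraph points to.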

But what these arguments really show is that you are disposed to play a dominated strategy if we cannot model your behavior as if you were a Bayesian with a certain prior and utility function. They don’t say anything about the procedure by which you need to make your decisions. That is, they don’t say that you have to write down precise probabilities and utilities and then make decisions by solving for the Bayes-optimal policy given those. They also don’t tell you that you have to behave as if you have any particular prior. The prior that rationalizes your decisions after the fact might have nothing to do with the beliefs you consciously endorse.

One upshot of this is that you can follow an explicitly non-(precise-)Bayesian decision procedure and still avoid dominated strategies. For example, you might explicitly specify beliefs using imprecise probabilities and make decisions using the “Dynamic Strong Maximality” rule, and still be immune to sure losses. Basically, Dynamic Strong Maximality tells you which plans are permissible given your imprecise credences, and you just pick one. And you could do this “picking” using additional substantive principles. Maybe you want to use another rule for decision-making with imprecise credences (e.g., maximin expected utility or minimax regret). Or maybe you want to account for your moral uncertainty (e.g., picking the plan that respects more deontological constraints). Obviously, avoiding dominated strategies alone doesn’t recommend this procedure. But neither does it recommend “pick some precise prior and optimize with respect to it”.
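The two-stage procedure described above (first find the permissible plans, then pick among them with a further rule) can be sketched as follows. This is a simplified, static version of maximality for a one-shot choice, not Dynamic Strong Maximality itself, which concerns sequential plans. The plans, utilities, and set of probabilities are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical toy problem: two states of the world, four plans.
# Each plan maps to (utility in state 0, utility in state 1).
UTILITIES = {
    "plan_a": (F(10), F(0)),
    "plan_b": (F(0), F(10)),
    "plan_c": (F(4), F(4)),
    "plan_d": (F(3), F(3)),
}
# Imprecise credence: a set of candidate probabilities for state 0.
REPRESENTOR = [F(3, 10), F(5, 10), F(7, 10)]

def expected_utility(plan, p):
    u0, u1 = UTILITIES[plan]
    return p * u0 + (1 - p) * u1

def permissible(plans, representor):
    """Maximality: a plan is ruled out only if some rival has strictly
    higher expected utility under every probability in the representor."""
    def beaten(x):
        return any(
            all(expected_utility(y, p) > expected_utility(x, p) for p in representor)
            for y in plans if y != x
        )
    return [x for x in plans if not beaten(x)]

def maximin_pick(plans, representor):
    """One possible further rule: best worst-case expected utility."""
    return max(plans, key=lambda x: min(expected_utility(x, p) for p in representor))

allowed = permissible(list(UTILITIES), REPRESENTOR)
choice = maximin_pick(allowed, REPRESENTOR)
print(allowed, choice)
```

In this toy case maximality leaves three plans permissible, and the choice among them comes entirely from the extra rule. Swapping `maximin_pick` for a different rule would change the choice without affecting which plans count as permissible.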

If we want to argue about whether this procedure is justified, we have to argue at the level of the substantive principles it invokes. (For example, maybe at bottom we like a principle of “simplicity”, and think Bayesianism is the most simple/straightforward route to avoiding dominated strategies. But maybe we find the principles justifying imprecise probabilities plus Dynamic Strong Maximality compelling enough to outweigh this consideration.)

Heuristics

As humans, we can’t implement the Bayesian algorithm anyway. So you might say that this is all beside the point. As bounded agents we’ve got to use heuristics that lead to “good performance”. Unfortunately, we still don’t see a way of making sense of “good performance” that respects our terminal goals and leads to much action-guidance on its own. Here are some things it could mean.

Convergence to high utility. You might say that a heuristic performs well if its performance (accuracy for a belief-forming heuristic, utility for a decision-making one) converges sufficiently quickly to a value that is good, in some sense. An example that’s much discussed in the rationality community is logical induction, which uses a kind of asymptotic non-exploitability criterion. Other examples are heuristics for sequential prediction as well as exploration in sequential decision-making (multi-armed bandits, etc.). These are often judged by whether, and how fast, their worst-case regret converges to zero.

What these arguments say is basically: “If you try various strategies, look at how well they’ve done based on observed outcomes, and keep using the ones that have done the best, your performance will converge to the best possible performance (in some sense) in the limit of infinite data”. This doesn’t help us at all, for a few reasons.

First, the kinds of outcomes we’re interested in for our terminal goals are things like “did this intervention on an advanced AI system lead to a catastrophic outcome?”. We don’t have any direct observations like that, only proxies. So if we want to draw inferences about our terminal utilities, we need additional assumptions about how to generalize from the domains we’ve observed to those we can’t (more on this next). Second, these results assume that you have arbitrarily many opportunities to try different strategies: if you fall into a “trap”, you can always try again. But that’s not the case for us, because of lock-in events. We don’t have arbitrarily many opportunities to try out different strategies for making AI less x- or s-risky and see what happens.
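As an illustration of the kind of convergence result at issue, here is a minimal sketch of the Hedge (exponential weights) forecaster from the online learning literature. It is one standard example of a no-regret procedure, chosen by us for illustration. The loss table is hypothetical, and the guarantee assumes losses in [0, 1], a known number of rounds, and that every strategy's loss is observed after every round.

```python
import math

def hedge_regret(losses):
    """Run Hedge on a fixed loss table, where losses[t][i] is the loss
    in [0, 1] of strategy i at round t.

    Returns (regret against the best single strategy, worst-case bound).
    """
    rounds, n = len(losses), len(losses[0])
    eta = math.sqrt(8 * math.log(n) / rounds)  # standard learning-rate choice
    cumulative = [0.0] * n
    learner_loss = 0.0
    for row in losses:
        # Weight each strategy by how well it has done so far.
        weights = [math.exp(-eta * c) for c in cumulative]
        total = sum(weights)
        learner_loss += sum(w / total * l for w, l in zip(weights, row))
        cumulative = [c + l for c, l in zip(cumulative, row)]
    regret = learner_loss - min(cumulative)
    return regret, math.sqrt(rounds * math.log(n) / 2)

def alternating_losses(rounds):
    # Hypothetical environment: strategies 0 and 1 alternate between
    # losing 0 and 1, while strategy 2 always loses 0.4.
    return [[float(t % 2), float((t + 1) % 2), 0.4] for t in range(rounds)]

for rounds in (1, 10, 100, 1000):
    regret, bound = hedge_regret(alternating_losses(rounds))
    print(rounds, regret / rounds, bound / rounds)
```

The per-round bound shrinks only as the number of rounds grows. With a single irreversible round, it is close to the largest loss possible, which is the situation the “trap” point above describes.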

Doing what’s worked well in the past.[2] We often encounter claims that we ought to use some heuristic because it has worked well in the past. Some examples of statements that might be interpreted in this way (though we’re not sure if this is how they were meant):

  • Cluster thinking: “Cluster thinking is more similar to empirically effective prediction methods.”
  • Using precise probabilities. From Lewis: “In the same way our track record of better-than-chance performance warrants us to believe our guesses on hard geopolitical forecasts, it also warrants us to believe a similar cognitive process will give ‘better than nothing’ guesses on which actions tend to be better than others, as the challenges are similar between both.”  

Maybe the most obvious criticism of this notion of winning is that, if you are a longtermist, you haven’t observed your decisions “work well”, in the sense of leading to good aggregate outcomes across all moral patients for all time. But let’s grant for now that there is some important sense in which we can tell whether our practices have worked well before, either in the sense of making good predictions about things we can observe, or leading to good observable consequences according to proxies for our terminal goals.

Presumably we should only trust a heuristic based on its past performance insofar as we have some reason to think that mechanisms similar to those that caused it to work previously are at play in our current problem. That is, past performance isn’t our terminal goal itself, but rather a potential source of information about future performance with respect to our terminal goals. We might think that “go with your gut” is a good heuristic for making interpersonal judgments, but not for predicting the stock market or geopolitical events. And we can give some rough mechanistic account of this. Our understanding of psychology makes it unsurprising that human intuitions about others’ character would do a decent job tracking truth, but not so much with stock-picking. (See also Violet Hour’s discussion of how to update on the track record of superforecasters.)

This is not to say that we always have to form detailed mechanical models to judge whether a heuristic’s performance will generalize. You don’t have to be a hedgehog to agree with what we’re saying. Even the humblest reference class forecaster has to choose a reference class. And how else can they do that besides by referring to some (perhaps very vague) beliefs about whether the observations in their reference class are generated by similar mechanisms? 

This means that the justification must bottom out not just in the heuristic’s historical performance, but also in our beliefs about the mechanisms which lead to the heuristic performing well.[3] And what justifies such beliefs? It can’t just be the historical performance of our belief-forming processes, or we have a regress. In our view, this all has to bottom out in non-pragmatic principles governing the weights we assign to the relevant mechanisms. We won’t get into the relative merits of different principles here, except to say that we doubt plausible principles will often recommend naive extrapolation from some historical reference class. (Cf. writing on the limitations of “outside view” reasoning, e.g., this.)

Fitting pre-theoretic intuitions about correct behavior. For example,[4] some justifications for cluster thinking over sequence thinking might reduce to pre-theoretic intuitions about what kinds of decision patterns should be avoided, and how to avoid them. From Karnofsky:[5]

  • “A cluster-thinking-style ‘regression to normality’ seems to prevent some obviously problematic behavior relating to knowably impaired judgment.”
  • “Sequence thinking seems to tend toward excessive comfort with ‘ends justify the means’ type thinking.”
    • One interpretation of this claim is that we can recognize “ends justify the means” reasoning as bad in its own right, regardless of whether we have evidence of this reasoning being harmful on average historically. (A fanatic might insist that it’s unsurprising if fanatical bets consistently failed to pay off ex post, so we have no such evidence.)

And, when discussing the view that we ought to have imprecise credences and therefore be clueless about many longtermist questions, we’ve often encountered arguments that might be interpreted this way. We’ve often heard things along the lines of, “Your epistemology and/or decision rule must be wrong if it implies you’re clueless about whether actively trying to do things that seem good for your values is good”, for example.

Insofar as we think we ought to assess actions by their consequences, however, it’s not clear what the argument is supposed to be here. Of course, intuitions about what kinds of actions lead to good consequences can guide our reasoning. But that is different from saying that whether a decision rule recommends a particular behavior is itself a criterion for the rationality of a decision rule. To us that looks like a rejection of consequentialism.

Non-pragmatic principles 

We’ve now seen how four notions of “winning” — avoiding dominated strategies, good long-run performance, good observed performance, recommending pre-theoretically endorsed behaviors — don’t do much to constrain how an agent forms beliefs or makes decisions. To say more about that, we will need to turn to non-pragmatic principles, endorsed not because they follow from some objective performance criterion but because our philosophical conscience can’t deny them.

Some examples of non-pragmatic principles: 

  • (Precise) principle of indifference. In the absence of any information, assign equal weights to symmetrical possible outcomes (e.g., the faces of a die);
  • Occam’s razor. We should give less weight to hypotheses which posit a greater number of fundamental entities, more complex laws, etc.;[6]
  • Fit with the evidence. We should give more weight to hypotheses that make our observations more probable;
  • Deference. Deference principles are things like, “If X has much more information about Q than me and is at least as competent a reasoner, I should adopt X’s beliefs about Q instead of going with mine”;
  • Imprecision. If our evidence and other epistemic norms don’t pin down a precise credence, then we ought to have an imprecise epistemic attitude, represented by sets of probabilities;            
  • Regularity. We should have credences different from 0 or 1 in logically possible propositions.
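To show how some of these principles look when written down precisely, here is a minimal sketch covering the precise principle of indifference, fit with the evidence (via Bayes' rule), and imprecision (a set of priors yielding an interval of posteriors). The coin hypotheses and all numbers are hypothetical.

```python
from fractions import Fraction as F

def indifference_prior(hypotheses):
    """Precise principle of indifference: equal weight to each hypothesis."""
    return {h: F(1, len(hypotheses)) for h in hypotheses}

def conditionalize(prior, likelihood):
    """Fit with the evidence: posterior proportional to prior times P(evidence | h)."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(joint.values())
    return {h: v / z for h, v in joint.items()}

def imprecise_posterior(priors, likelihood, hypothesis):
    """Imprecision: update every prior in the set, report the resulting range."""
    posts = [conditionalize(p, likelihood)[hypothesis] for p in priors]
    return min(posts), max(posts)

# Hypothetical evidence: one coin flip lands heads.
# P(heads | fair) = 1/2, P(heads | biased) = 3/4.
LIKELIHOOD = {"fair": F(1, 2), "biased": F(3, 4)}

precise = conditionalize(indifference_prior(["fair", "biased"]), LIKELIHOOD)

# A set of priors on "biased" when nothing pins down a single one.
PRIOR_SET = [{"fair": 1 - q, "biased": q} for q in (F(1, 4), F(1, 2), F(3, 4))]
low, high = imprecise_posterior(PRIOR_SET, LIKELIHOOD, "biased")

print(precise["biased"], (low, high))
```

Every prior in the set also satisfies regularity, since none assigns 0 or 1 to either hypothesis. Which of these representations to adopt is exactly the kind of question the principles themselves, rather than any performance criterion, have to settle.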

Now, as bounded agents, our decisions will usually not be determined by quantified beliefs, even quantified beliefs over very simple models. We will have some vaguer all-things-considered beliefs that dictate our decision. Still, we might think that these norms can provide some guidance for our vague all-things-considered beliefs. For example:

  • (Vague principle of indifference.) “These outcomes seem roughly symmetrical and their values are roughly opposite, so I’ll treat them as not contributing to the overall decision”;
  • (Vague deference.) “She knows much more about this domain than me, and in cases I know of has come to the same reasoned conclusion as me, so I’ll give her opinion in this case a lot more weight than my gut feeling”;
  • (Vague imprecision.) “There are lots of considerations about the value of actions A and B pointing in different ways with no clear way of weighing them; the outputs of my toy models are highly sensitive to seemingly arbitrary differences in parameters; so I’ll regard it as indeterminate whether A is better than B”.

It’s possible to construe, e.g., “doing what’s worked well in the past” as a non-pragmatic principle. As we’ve argued, though, past performance on local goals isn’t what we ultimately care about, so this principle seems poorly motivated. A better motivation for doing what’s worked in the past would be a belief that the mechanisms governing success at goal achievement in past environments will hold in future environments. But this is unappealing as a brute constraint on beliefs, rather than being grounded in reasons to expect generalization. In principle, those reasons might come from something like Occam’s razor (“the hypothesis that success will generalize across environments is simpler than alternatives”), though we’re skeptical of that route.

Where does that leave us? Well, say you’re persuaded by the axioms of precise probabilism — you think you should have a precise prior. You might use some form of Occam’s razor to get that prior. And the “fit with evidence” principle gets you to Bayesian epistemology. Given a few other principles (see e.g. here), then, your notion of “achieving terminal goals” is “maximizing expected utility with respect to an Occam prior conditionalized on my evidence”. And we can derive other normative standards from other combinations of principles.  
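The chain of principles in this paragraph can be sketched end to end. The sketch assumes a prior proportional to 2 to the power of minus a hypothesis's description length, which is one common way of formalizing Occam's razor and not the only one. The hypotheses, description lengths, likelihoods, and utilities are all hypothetical.

```python
from fractions import Fraction as F

# Hypothetical hypotheses: description length in bits, P(evidence | h),
# and the utility of each available action if h is true.
HYPOTHESES = {
    "simple":  {"bits": 1, "likelihood": F(1, 4), "utility": {"act": F(10), "wait": F(0)}},
    "complex": {"bits": 3, "likelihood": F(1, 2), "utility": {"act": F(-10), "wait": F(0)}},
}

def occam_prior(hyps):
    """Weight each hypothesis by 2^(-bits), then normalize."""
    raw = {h: F(1, 2 ** v["bits"]) for h, v in hyps.items()}
    z = sum(raw.values())
    return {h: w / z for h, w in raw.items()}

def posterior(hyps):
    """Conditionalize the Occam prior on the evidence."""
    prior = occam_prior(hyps)
    joint = {h: prior[h] * hyps[h]["likelihood"] for h in hyps}
    z = sum(joint.values())
    return {h: j / z for h, j in joint.items()}

def best_action(hyps, actions):
    """Maximize expected utility with respect to the posterior."""
    post = posterior(hyps)
    eu = {a: sum(post[h] * hyps[h]["utility"][a] for h in hyps) for a in actions}
    return max(actions, key=eu.get), eu

action, expected = best_action(HYPOTHESES, ["act", "wait"])
print(posterior(HYPOTHESES), action, expected)
```

Each function corresponds to one principle, so replacing any of them (a different prior, a set of priors, a different decision rule) yields a different normative standard built from a different combination of principles.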

Conclusion

So, our beliefs and decisions must be grounded in non-pragmatic principles, not just an objective standard of “winning”.

This doesn’t require a realist stance on which principles are best. All the reasons for doubt about our judgments about ethical principles tracking some mind-independent truth apply here, too. In some sense, probably, anything goes. But as with ethics, we can still reflect on which principles are ultimately most compelling to us. In ethics we need not just say, “Well, I happen to only care about my neighbors, and that’s that”. Likewise, in epistemology/decision theory, we need not shrug and say “Well, these just happen to be my credences/heuristics”.

For our part, we favor a norm of suspending judgment in cases where other norms don’t pin down a belief or decision. As hinted throughout the post, this means that our beliefs, especially those concerning our effects on the long-run future, will often be severely indeterminate. On the most plausible decision rules for indeterminate beliefs, insofar as we are impartially altruistic, this might well leave us clueless about what to do. Without an objective standard of “winning” to turn to, this leaves us searching for new principles that could guide us in the face of indeterminacy. But that’s all for another post.

Acknowledgments

Thanks to Caspar Oesterheld, Martín Soto, Tristan Cook, Michael St. Jules, Sylvester Kollin, Nicolas Macé, and Mia Taylor for input on this post.  

References

Hedden, B. 2015. “Time-Slice Rationality.” Mind; a Quarterly Review of Psychology and Philosophy 124 (494): 449–91.

Soares, Nate, and Benja Fallenstein. 2015. “Toward Idealized Decision Theory.” arXiv [cs.AI]. arXiv. http://arxiv.org/abs/1507.01986.

 

  1. ^

     That said, according to “time-slice rationality” (Hedden 2015), there is no unified decision-maker across different time points. Rather, “you” at time 0 are a different decision-maker from “you” at time 1, and what is rational for you-at-time-1 only depends on you-at-time-0 insofar as you-at-time-0 are part of the decision-making environment for you-at-time-1. On this view, then, arguably you-at-time-1 are not rationally obligated to make decisions that would avoid a sure loss from the perspective of you-at-time-0. Of course, if you-at-time-0 are capable of binding you-at-time-1 to an action that avoids a sure loss from your perspective, you ought to do so. (But in this case, it doesn’t seem appropriate to say the action of you-at-time-1 is a “decision” they themselves make in order to avoid a sure loss.)

  2. ^

     As discussed above, a policy of “doing what worked well in the past” might be argued for on the grounds that it leads to good long-term outcomes. But, here we’re talking about “having worked well in the past” as a justification that’s independent of long-run performance arguments.

  3. ^

     Cf. “no free lunch theorems”, which can be interpreted in this context as saying that no matter how well a heuristic did in the past, its performance in the future depends on the distribution of future problems.

  4. ^

     See also the discussion of decision theory performance in, e.g., Soares and Fallenstein (2015). You might have a strong intuition it “wins” not to pay in Evidential Blackmail, and this makes you favor causal decision theory over evidential decision theory all else equal (independently of how much you endorse the foundations of causal decision theory, or its historical track record). See Oesterheld here for why these sorts of intuitions are not objective performance metrics for decision theories.

  5. ^

     We aren’t confident that these arguments were meant to be grounded in pre-theoretic intuitions, rather than “doing what’s worked well in the past” above.

  6. ^

     Pragmatic justifications of Occam’s razor are circular, as noted by Yudkowsky: “You could argue that Occam's Razor has worked in the past, and is therefore likely to continue to work in the future.  But this, itself, appeals to a prediction from Occam's Razor. "Occam's Razor works up to October 8th, 2007 and then stops working thereafter" is more complex, but it fits the observed evidence equally well.” Cf. Hume on the circularity of inductive justifications of induction.

Comments

I think I care a bunch about the subject matter of this post, but something about the way this post is written leaves me feeling confused and ungrounded.

Before reading this post, my background beliefs were:

  1. Rationality doesn't (quite) equal Systemized Winning. Or, rather, that focusing on this seems to lead people astray more than helps them.
  2. There's probably some laws of cognition to be discovered, about what sort of cognition will have various good properties, in idealized situations.
  3. There's probably some messier laws of cognition that apply to humans (but those laws are maybe more complicated).
  4. Neither set of laws necessarily has a simple unifying framework that accomplishes All the Things (although I think the search for simplicity/elegance/all-inclusiveness is probably a productive search, i.e., it tends to yield good stuff along the way; "more elegance" is usually achievable on the margin).
  5. There might be heuristics that work moderately well for humans much of the time, which approximate those laws.
    1. there are probably Very Rough heuristics you can tell an average person without lots of dependencies, and somewhat better heuristics you can give to people who are willing to learn lots of subskills.

Given all that... is there anything in particular I am meant to take from this post? (So far I have only skimmed it; it felt effortful to comb for the novel bits.) I can't tell whether the few concrete bits are particularly important, or just illustrative examples.

The key claim is: You can’t evaluate which beliefs and decision theory to endorse just by asking “which ones perform the best?” Because the whole question is what it means to systematically perform better, under uncertainty. Every operationalization of “systematically performing better” we’re aware of is either:

  • Incomplete — like “avoiding dominated strategies”, which leaves a lot unconstrained;
  • A poorly motivated proxy for the performance we actually care about — like “doing what’s worked in the past”; or
  • Secretly smuggling in nontrivial non-pragmatic assumptions — like “doing what’s worked in the past, not because that’s what we actually care about, but because past performance predicts future performance”

This is what we meant to convey with this sentence: “On any way of making sense of those words, we end up either calling a very wide range of beliefs and decisions “rational”, or reifying an objective that has nothing to do with our terminal goals without some substantive assumptions.”

(I can't tell from your comment if you agree with all of that. But, if this was all obvious to you, great! But we’ve often had discussions where someone appealed to “which ones perform the best?” in a way that misses these points.)

My understanding from discussions with the authors (but please correct me):

This post is less about pragmatically analyzing which particular heuristics work best for ideal or non-ideal agents in common environments (assuming a background conception of normativity), and more about the philosophical underpinnings of normativity itself.

Maybe it's easiest if I explain what this post grows out of:

There seems to be a widespread vibe amongst rationalists that "one-boxing in Newcomb is objectively better, because you simply obtain more money, that is, you simply win". This vibe is no coincidence, since Eliezer and Nate, in some of their writing about FDT, use language strongly implying that decision theory A is objectively better than decision theory B because it just wins more. Unfortunately, this intuitive notion of winning cannot actually be made into a philosophically valid objective metric. (In more detail, a precise definition of winning is already decision-theory-complete, so these arguments beg the question.) This point is well-known in philosophical academia, and was already succinctly explained in a post by Caspar (which the authors mention).

In the current post, the authors extend a similar philosophical critique to other widespread uses of winning, or background assumptions about rationality. For example, some people say that "winning is about not playing dominated strategies"... and the authors agree about avoiding dominated strategies, but point out that this is not too action-guiding, because it is consistent with many policies. Or also, some people say that "rationality is about implementing the heuristics that have worked well in the past, and/or you think will lead to good future performance"... but these utterances hide other philosophical assumptions, like assuming the same mechanisms are at play in the past and future, which are especially tenuous for big problems like x-risk. Thus, vague references to winning aren't enough to completely pin down and justify behavior. Instead, we fundamentally need additional constraints or principles about normativity, what the authors call non-pragmatic principles. Of course, these principles cannot themselves be justified in terms of past performance (which would lead to circularity), so they instead need to be taken as normative axioms (just like we need ethical axioms, because ought cannot be derived from is).

some people say that "winning is about not playing dominated strategies"

I do not believe this statement. As in, I do not currently know of a single person, associated either with LW or with decision-theory academia, that says "not playing dominated strategies is entirely action-guiding." So, as Raemon pointed out, "this post seems like it’s arguing with someone but I’m not sure who."

In general, I tend to mildly disapprove of words like "a widely-used strategy", "we often encounter claims" etc, without any direct citations to the individuals who are purportedly making these mistakes. If it really was that widely-used, surely it would be trivial for the authors to quote a few examples off the top of their head, no? What does it say about them that they didn't?

mildly disapprove of words like "a widely-used strategy"

The text says “A widely-used strategy for arguing for norms of rationality involves avoiding dominated strategies”, which is true* and something we thought would be familiar to everyone who is interested in these topics. For example, see the discussion of Dutch book arguments in the SEP entry on Bayesianism and all of the LessWrong discussion on money pump/dominance/sure loss arguments (e.g., see all of the references in and comments on this post). But fair enough, it would have been better to include citations.

"we often encounter claims"

We did include (potential) examples in this case. Also, similarly to the above, I would think that encountering claims like “we ought to use some heuristic because it has worked well in the past” is commonplace among readers so didn’t see the need to provide extensive evidence.

*Granted, we are using “dominated strategy” in the wide sense of “strategy that you are certain is worse than something else”, which glosses over technical points like the distinction between dominated strategy and sure loss.

Adding to Jesse's comment, the "We’ve often heard things along the lines of..." line refers both to personal communications and to various comments we've seen, e.g.:

  • [link]: "Since this intuition leads to the (surely false) conclusion that a rational beneficent agent might just as well support the For Malaria Foundation as the Against Malaria Foundation, it seems to me that we have very good reason to reject that theoretical intuition"
  • [link]: "including a few mildly stubborn credence functions in some judiciously chosen representors can entail effective altruism from the longtermist perspective is a fool’s errand. Yet this seems false"
  • [link]: "I think that if you try to get any meaningful mileage out of the maximality rule ... basically everything becomes permissible, which seems highly undesirable"
    • (Also, as we point out in the post, this is only true insofar as you only use maximality, applied to total consequences. You can still regard obviously evil things as unacceptable on non-consequentialist grounds, for example.)

Thanks, this gave me the context I needed.

Put another way: this post seems like it’s arguing with someone but I’m not sure who.

A choice can influence the reality of the situation where it could be taken. Thus a "dominated strategy" can be winning when choosing the "better possibilities" prevents the situation where you would be considering the decision from occurring. Problem statements in classical forms (such as payoff matrices of games) prohibit such considerations. In Newcomb's problem, where "winning" is a good way of looking at what's wrong with two-boxing, the issue is that the game theory way of framing possible outcomes doesn't recognize that some of the outcomes refute the situation where the outcomes are being chosen. This is clearer in examples like Transparent Newcomb. Overall behavior of an algorithm influences whether it's given the opportunity to run in the first place.

So the relevance of "winning" isn't so much about balancing the many senses of winning across the many possibilities where some winning occurs or doesn't, expected utility vs. other framings. It's more about paying attention to which possibilities are real, and whether winning in the more central senses occurs on those possibilities or not.

I'm confused by what you mean by "non-pragmatic". For example, what makes "avoiding dominated strategies" pragmatic but "deference" non-pragmatic? 

(It seems like the pragmatic ones help you decide what to do and the non-pragmatic ones help you decide what to believe, but then this doesn't answer how to make good decisions.)

Sorry this was confusing! From our definition here:

We’ll use “pragmatic principles” to refer to principles according to which belief-forming or decision-making procedures should “perform well” in some sense.

  • "Avoiding dominated strategies" is pragmatic because it directly evaluates a decision procedure or set of beliefs based on its performance. (People do sometimes apply pragmatic principles like this one directly to beliefs, see e.g. this work on anthropics.)
  • Deference isn't pragmatic, because the appropriateness of your beliefs is evaluated by how your beliefs relate to the person you're deferring to. Someone could say, "You should defer because this tends to lead to good consequences," but then they're not applying deference directly as a principle — the underlying principle is "doing what's worked in the past."

I don't fully understand the post. Without a clear definition of "winning," the points you're trying to make — as well as the distinction between pragmatic and non-pragmatic principles (which also aligns with strategies and knowledge formation) — aren't totally clear. For instance, "winning," in some vague sense, probably also includes things like "fitting with evidence," taking advice from others, and so on. You don't necessarily need to turn to non-pragmatic principles or those that don’t derive from the principle of winning. "Winning" is a pretty loose term.

Without a clear definition of "winning,"

This is part of the problem we're pointing out in the post. We've encountered claims of this "winning" flavor that haven't been made precise, so we survey different things "winning" could mean more precisely, and argue that they're inadequate for figuring out which norms of rationality to adopt.

Without an objective standard of “winning” to turn to, this leaves us searching for new principles that could guide us in the face of indeterminacy. But that’s all for another post.

First time ever I am left hanging by a LW post. Genuinely.