In Infinite Ethics, Nick Bostrom argues against using discount factors to resolve the problems created by the temporal and spatial infinities that arise in moral theories. While this argument is compelling in the context of aggregative consequentialism, it is less compelling under ethical egoism. Incorporating discount factors into ethical models handles many of the problematic infinities and allows us to analyze individual ethical decisions mathematically. In this post I present a model for decision making under uncertainty that incorporates the possibility of eternal reward or punishment.

Motivation

Material conditions for humanity have never been so good, yet many people focus primarily on worldly rewards - many of which have diminishing marginal utility. With such abundance for so many, I would expect humanity to become more religious and more concerned about what happens after we die. Why has the opposite happened? Perhaps we lack the models needed for evaluating existential matters. Every day we face ethical choices that may or may not have eternal consequences. Despite believing in these consequences, even deeply religious people sometimes sin. They risk eternal punishment for some worldly benefit - we should be able to model this behavior. Maybe sin is sometimes rational. Maybe even Faust's bargain was utility maximizing.

In his work Bostrom identifies two types of discount factors: spatial and temporal. The former is easily addressed through ethical egoism: if I am unconcerned about the well-being of others, except insofar as it impacts my own well-being, then my spatial discount factor is zero and one class of infinities is resolved. The latter can be addressed with temporal discount factors less than one. In economics, temporal discount factors are widely used and empirically verified. While Bostrom writes that temporal discount factors are "viewed with great suspicion", empirical studies show that people do use discount factors when making decisions. Discount factors are commonly used to evaluate potential reward and punishment in the economics-of-crime literature. For example, in Crime and Human Nature (1985), Wilson and Herrnstein argue that individuals who discount the future steeply - that is, those with low temporal discount factors - are more prone to criminal behavior. If sins are crimes that are only punished after we die, then we can apply the same decision-theoretic framework to sins as we do to crimes.

Reformulating Pascal's Wager

To begin constructing a model of decision making under existential uncertainty, I use Pascal's Wager as a starting point. The original wager involves a single choice with four possible outcomes. I propose that it is more useful to treat it as an infinite-horizon, discrete-time decision problem. This formulation addresses the two most common objections to the wager: (1) that the expected reward is infinite and (2) that the decision table is incomplete. After addressing these objections I demonstrate that dynamic programming can produce reasonable policies for how to act under existential uncertainty.

The presence of an infinite reward in the wager has always been contentious. Some philosophers have argued that humans cannot experience or appreciate an infinite reward, and therefore it must be finite. However, the reward is eternal, so even a finite reward experienced over an infinite horizon would be infinite. In my formulation I posit that the human capacity for experiencing reward is finite over any finite interval of time - essentially, that utility over any finite interval is bounded. Since the reward is eternal, the total reward is still infinite, as in the original version of the wager. The difference is that the infinite reward is now expressed as an infinite series of finite rewards.

When deciding on the wager, a rational actor will calculate the total utility of the reward series. In the case of a constant finite reward H in each time interval, the infinite sum of this constant value is infinite - as in the original wager. However, since many of these rewards occur very far in the future, a rational actor will incorporate his time preferences when calculating the total utility of the reward. Humans show a clear time preference, consistently preferring rewards today to those in the distant future. Discount factors measure the magnitude of these time preferences, where γ<1 indicates a preference for rewards that occur sooner. The lower the value of γ, the stronger the preference for immediate rewards. A rational actor will calculate the total utility of the infinite reward stream using his discount factor and find that it has a finite utility of Hγ/(1-γ). The reward is essentially a perpetuity; its total utility is only infinite when the discount factor equals one.
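As a quick numerical check, a short sketch shows the discounted series converging to the perpetuity value Hγ/(1-γ); the values of H and γ here are illustrative, not estimates:

```python
# Numerical check: the discounted value of a constant eternal reward H
# converges to the perpetuity value H*gamma/(1-gamma) when gamma < 1.
# H and gamma are illustrative values, not estimates.

H = 1.0       # finite reward per period
gamma = 0.95  # temporal discount factor

# Partial sum of the series H*gamma + H*gamma^2 + ... (reward starts next period)
partial = sum(H * gamma**t for t in range(1, 10_000))
closed_form = H * gamma / (1 - gamma)

print(partial, closed_form)  # both ≈ 19.0
```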

With a discount factor less than one, the decision table contains only finite entries. Using a simplified example, a rational actor can weigh the reward against any costs c incurred for wagering on God (praying, attending church, not sinning, etc.). Here I assume that the reward for wagering against God is always zero (no punishment for non-believers).

                     God exists       God does not exist
Wager on God         Hγ/(1-γ) - c     -c
Wager against God    0                0

Given this decision table, a rational actor will wager on God if p*Hγ/(1-γ) > c, where p is the probability that God exists. All else equal, lower costs, higher discount factors, a higher capacity for reward, and a higher probability of God existing all make the wager more appealing. While exact values for these four parameters are unknowable, estimating reasonable ranges can still provide a useful model for approaching the wager.
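The single-shot decision rule can be sketched as a small function; all parameter values below are illustrative guesses, not estimates:

```python
# Sketch of the single-shot decision rule: wager on God iff
# p * H * gamma / (1 - gamma) > c. All parameter values are illustrative.

def wager_on_god(p: float, H: float, gamma: float, c: float) -> bool:
    """True if the expected discounted reward of wagering exceeds its cost."""
    expected_reward = p * H * gamma / (1 - gamma)
    return expected_reward > c

# A patient actor (gamma near 1) accepts the wager even at low p...
print(wager_on_god(p=0.01, H=1.0, gamma=0.95, c=0.1))  # True (0.19 > 0.1)
# ...while an impatient actor (low gamma) declines it.
print(wager_on_god(p=0.01, H=1.0, gamma=0.5, c=0.1))   # False (0.01 < 0.1)
```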

Multiple wagers

Next I extend this formulation to address the second class of objections - that the decision table is incomplete. Specifically, I incorporate (1) the possibility of infinite punishment and (2) the fact that the wager involves a series of actions over the course of a life. The possibility of infinite punishment is handled analogously to infinite reward: if eternal punishment is treated as an infinite series of finite punishments, then its discounted value is finite for γ<1.

To expand the decision table to multiple actions over multiple periods, I reframe the wager as a Markov decision process. At each state s an actor can choose an action a. The choice triggers a transition to state s' with probability P(s' | s, a) and the actor receives a reward R(s, a). This model allows for an arbitrary number of actions and states. In each state a rational actor aims to choose the action that maximizes the expected value of current and future rewards V according to the Bellman equation:

V(s) = max_a [ R(s, a) + γ Σ_s' P(s' | s, a) V(s') ]

While this model requires more parameters and assumptions than the original wager, it more closely resembles human decision making. Wagering on God involves more than a simple baptism at birth or a deathbed confession. To illustrate the decision process I use a simple parameterization of the model:

  • Actions: good or evil.
  • States: alive, dead, heaven or hell.
  • Transitions: dead, heaven and hell are absorbing states with no outgoing transitions. There are four transitions from alive, where q is the probability of death, p is the probability God exists and J(A) is the judgment function, which gives the probability of entering heaven.

      From alive to:   alive    dead      heaven    hell
      Probability:     1-q      q(1-p)    qpJ(A)    qp(1-J(A))
  • Rewards:
    • R(alive, good) = c
    • R(alive, evil) = v
    • R(heaven) = H
    • R(hell) = -H
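The transition structure above can be encoded directly; q, p and the judgment value J are free parameters, and the numeric values used here are illustrative:

```python
# The transition probabilities out of the 'alive' state, as listed above.
# q, p and J are parameters; the numeric values used below are illustrative.

def alive_transitions(q: float, p: float, J: float) -> dict:
    """Transition distribution from 'alive': q is the probability of death,
    p the probability God exists, J the probability of entering heaven."""
    return {
        "alive":  1 - q,            # survive this period
        "dead":   q * (1 - p),      # die, and God does not exist
        "heaven": q * p * J,        # die, God exists, judged worthy
        "hell":   q * p * (1 - J),  # die, God exists, judged unworthy
    }

probs = alive_transitions(q=0.1, p=0.01, J=0.9)
print(probs)
print(sum(probs.values()))  # the four branches form a probability distribution (sums to 1)
```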

Dynamic programming

Now consider a rational actor seeking to maximize rewards, choosing between a good action and an evil action in the final period of his life. For simplicity I assume his maximum lifespan is 10 periods and that the judgment function equals the fraction of past actions that have been good: J(A) = A. If he chooses the good action then his value function evaluates to:

V(alive | good) = c + γ[(1-q)V(alive) + q(1-p)V(dead) + qpA·V(heaven) + qp(1-A)·V(hell)]

Since his maximum lifespan is 10 periods, in the final period there is no possibility of remaining alive, so q=1. Next I assume that if God does not exist then V(dead) = 0, that the states of heaven and hell are permanent, and that the expected values of those eternal rewards are Hγ/(1-γ) and -Hγ/(1-γ) respectively. Using these assumptions, the expected value of the good action is:

V(alive | good) = c + γp[A - (1-A)]·Hγ/(1-γ) = c + (2A-1)·pHγ²/(1-γ)

If he chooses the evil action then the value function evaluates to:

V(alive | evil) = v + (2A'-1)·pHγ²/(1-γ)

Here A' is the fraction of good actions given a final choice of the evil action. Choosing an action requires evaluating and comparing V(alive | evil) and V(alive | good). This choice can be expressed as an inequality: to choose evil, its immediate reward v must satisfy:

v > c + 2p(A - A')·Hγ²/(1-γ)

Since these rewards reflect utility and are difficult to measure directly, the inequality is better expressed in terms of the difference between the reward for evil and the reward for good as a fraction of the greatest possible reward H:

(v - c)/H > 2p(A - A')·γ²/(1-γ)

Suppose the parameter values are estimated to be p=0.01, γ=0.95, A=0.9 and A'=0.8; then the inequality evaluates to:

(v - c)/H > 2(0.01)(0.9 - 0.8)(0.95)²/(1 - 0.95) ≈ 0.0361

Therefore, if the reward for evil exceeds the reward for good by more than 3.61% of the maximum possible reward H, a rational actor should choose evil.
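The worked example can be verified numerically; this sketch reproduces the threshold using the parameter estimates from the text:

```python
# Reproducing the worked example: the evil action is optimal when
# (v - c)/H exceeds 2*p*(A - A')*gamma^2/(1 - gamma).
# Parameter values are the illustrative estimates used in the text.

def evil_threshold(p: float, gamma: float, A: float, A_prime: float) -> float:
    """Minimum (v - c)/H at which choosing evil maximizes expected value."""
    return 2 * p * (A - A_prime) * gamma**2 / (1 - gamma)

t = evil_threshold(p=0.01, gamma=0.95, A=0.9, A_prime=0.8)
print(round(t, 4))  # 0.0361, i.e. 3.61% of the maximum reward H
```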

This example illustrates how to use dynamic programming to solve for the optimal action in one state over one period. Solving for a comprehensive optimal policy generally requires numerical methods or reinforcement learning. However, my goal is not to solve for a comprehensive policy but to create a model for incorporating the possibility of eternal reward or punishment into rational decision making. While this model does not explain which actions are ethically correct, it describes a framework for how humans might actually make ethical decisions. Further analysis could yield valuable heuristics for making better decisions under existential uncertainty.
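To hint at what such a numerical solution looks like, here is a minimal value-iteration sketch for the four-state model. It makes a strong simplifying assumption not in the model above: the judgment probability is a fixed constant per action (J_good, J_evil) rather than a function of the action history, which keeps the problem stationary. All numeric values are illustrative.

```python
# Minimal value-iteration sketch for the four-state model, assuming the
# judgment probability is a fixed constant per action rather than a
# function of the action history (a simplification of the text's model).
# All numeric values are illustrative.

def value_iteration(q, p, J, c, v, H, gamma, n=500):
    """Return (V_alive, best_action). J maps action -> prob of heaven."""
    V_alive = 0.0
    V_heaven = H * gamma / (1 - gamma)   # discounted eternal reward
    V_hell = -H * gamma / (1 - gamma)    # discounted eternal punishment
    R = {"good": c, "evil": v}

    def Q(a):
        # Expected value of action a; V(dead) = 0 so its term vanishes.
        cont = (1 - q) * V_alive + q * p * (J[a] * V_heaven + (1 - J[a]) * V_hell)
        return R[a] + gamma * cont

    for _ in range(n):  # contraction with factor gamma*(1-q) < 1, so this converges
        V_alive = max(Q("good"), Q("evil"))
    best = "good" if Q("good") >= Q("evil") else "evil"
    return V_alive, best

# Good behavior strongly improves judgment, so 'good' wins despite v > c:
_, best = value_iteration(q=0.1, p=0.5, J={"good": 0.9, "evil": 0.1},
                          c=0.01, v=0.02, H=1.0, gamma=0.95)
print(best)  # good
```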
