## Publication of "Anthropic Decision Theory"

8 points · 20 September 2017 03:41PM

My paper "Anthropic decision theory for self-locating beliefs", based on posts here on Less Wrong, has been published as a Future of Humanity Institute tech report. Abstract:

This paper sets out to resolve how agents ought to act in the Sleeping Beauty problem and various related anthropic (self-locating belief) problems, not through the calculation of anthropic probabilities, but through finding the correct decision to make. It creates an anthropic decision theory (ADT) that decides these problems from a small set of principles. By doing so, it demonstrates that the attitude of agents with regards to each other (selfish or altruistic) changes the decisions they reach, and that it is very important to take this into account. To illustrate ADT, it is then applied to two major anthropic problems and paradoxes, the Presumptuous Philosopher and Doomsday problems, thus resolving some issues about the probability of human extinction.

Most of these ideas are also explained in this video.

To situate Anthropic Decision Theory within the UDT/TDT family: it's essentially a piece of UDT applied to anthropic problems, where the UDT approach can be justified using fewer, and more natural, assumptions than UDT itself requires.

## Simplified Anthropic Doomsday

1 point · 02 September 2017 08:37PM

Here is a simplified version of the Doomsday argument in Anthropic Decision Theory, intended to make the intuitions easier to grasp.

Assume a single agent A exists, an average utilitarian, with utility linear in money. Their species survives with 50% probability; denote this event by S. If the species survives, there will be 100 people total; otherwise the average utilitarian is the only one of its kind. An independent coin lands heads with 50% probability; denote this event by H.

Agent A must price a coupon CS that pays out €1 on S, and a coupon CH that pays out €1 on H. The coupon CS pays out only on S, so the reward exists only in a world with a hundred people; to an average utilitarian, €1 in such a world is worth (€1)/100. Hence the expected worth of CS is (1/2)(€1)/100 = (€1)/200 = (€2)/400.

But H is independent of S, so (H,S) and (H,¬S) both have probability 25%. In (H,S), there are a hundred people, so CH is worth (€1)/100. In (H,¬S), there is one person, so CH is worth (€1)/1=€1. Thus the expected value of CH is (€1)/4+(€1)/400 = (€101)/400. This is more than 50 times the value of CS.

Note that C¬S, the coupon that pays out on doom, has an even higher expected value of (€1)/2=(€200)/400.
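The pricing above can be checked numerically. Here is a minimal sketch (the variable names are my own) that computes the average-utilitarian expected values of the three coupons over the four equiprobable (S, H) worlds:

```python
# Expected values of the coupons for average utilitarian A.
# Worlds: the four (S, H) combinations, each with probability 0.25;
# population is 100 if the species survives (S), else 1.

p = 0.25
worlds = [(s, h, 100 if s else 1) for s in (True, False) for h in (True, False)]

def price(pays_out):
    """Expected average-utilitarian value of a coupon paying €1 when pays_out(s, h) holds."""
    return sum(p * (1 / pop) for s, h, pop in worlds if pays_out(s, h))

E_CS = price(lambda s, h: s)          # pays on survival
E_CH = price(lambda s, h: h)          # pays on heads
E_CnotS = price(lambda s, h: not s)   # pays on doom

print(E_CS, E_CH, E_CnotS)  # 0.005 (= €2/400), 0.2525 (= €101/400), 0.5 (= €200/400)
```

The figures match the text: CH is worth 50.5 times CS, despite H and S having identical probability.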

So, H and S have identical probability, but A assigns CS and CH different expected utilities, with a higher value to CH, simply because S is correlated with survival while H is independent of it (and A assigns an even higher value to C¬S, which is anti-correlated with survival). This is a phrasing of the Doomsday Argument in ADT.

## The Doomsday argument in anthropic decision theory

5 points · 31 August 2017 01:44PM

EDIT: added a simplified version here.

Crossposted at the intelligent agents forum.

In Anthropic Decision Theory (ADT), behaviours that resemble the Self Sampling Assumption (SSA) derive from average utilitarian preferences (and from certain specific selfish preferences).

However, SSA implies the doomsday argument, and, to date, I hadn't found a good way to express the doomsday argument within ADT.

This post will fill that gap, by showing that there is a natural doomsday-like behaviour for average utilitarian agents within ADT.

## [Link] Two Major Obstacles for Logical Inductor Decision Theory

1 point · 10 June 2017 05:48AM

## [Link] Anthropic uncertainty in the Evidential Blackmail problem

4 points · 14 May 2017 04:43PM

## Agents that don't become maximisers

8 points · 07 April 2017 12:56PM

Cross-posted at the Intelligent Agent forum.

According to the basic AI drives thesis, (almost) any agent capable of self-modification will self-modify into an expected utility maximiser.

The typical examples are inconsistent utility maximisers, satisficers, and unexploitable agents, and it's easy to think that all agents fall roughly into these broad categories. There's also the observation that, when looking at full policies rather than individual actions, many biased agents behave like expected utility maximisers (unless they want to lose pointlessly).

Nevertheless... there is an entire category of agents that generically do not seem to self-modify into maximisers. These are agents that attempt to maximise f(E(U)), where U is some utility function, E(U) is its expectation, and f is a function that is neither wholly increasing nor wholly decreasing.
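A minimal sketch of the idea (the particular target-shaped f and the policy names are my own choices for illustration): an agent maximising f(E(U)) for a non-monotonic f - here, f peaks when E(U) is exactly 5 - prefers a policy tuned to hit that expectation, so self-modifying into a plain E(U) maximiser would strictly lower its score:

```python
# An agent maximising f(E(U)) where f is non-monotonic: f(x) = -(x - 5)^2.
# Such an agent wants E(U) close to 5, not as high as possible.

def f(expected_u):
    return -(expected_u - 5.0) ** 2

# Candidate policies, described by the expected utility E(U) each achieves.
policies = {"cautious": 5.0, "greedy": 10.0, "timid": 1.0}

# An expected-utility maximiser would self-modify into the "greedy" policy...
eu_choice = max(policies, key=policies.get)
# ...but the f(E(U)) agent ranks policies by f instead.
f_choice = max(policies, key=lambda name: f(policies[name]))

print(eu_choice, f_choice)  # greedy cautious
```

Since f(5) > f(10), the agent has no incentive to become a maximiser of E(U) itself.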

## Making equilibrium CDT into FDT in one+ easy step

6 points · 21 March 2017 02:42PM

In this post, I'll argue that Joyce's equilibrium CDT (eCDT) can be made into FDT (functional decision theory) with the addition of an intermediate step - a step that should have no causal consequences. This would show that eCDT is unstable under causally irrelevant changes, and is in fact a partial version of FDT.

Joyce's principle is:

> Full Information. You should act on your time-t utility assessments only if those assessments are based on beliefs that incorporate all the evidence that is both freely available to you at t and relevant to the question about what your acts are likely to cause.

When confronted by a problem with a predictor (such as Death in Damascus or the Newcomb problem), this allows eCDT to recursively update its probabilities of the predictor's behaviour, based on its own estimates of its own actions, until this process reaches equilibrium. This allows it to behave like FDT/UDT/TDT on some (but not all) problems. I'll argue that you can modify the setup to make eCDT into a full FDT.

## Death in Damascus

In this problem, Death has predicted whether the agent will stay in Damascus (S) tomorrow or flee to Aleppo (F), and has promised to be in the same city as the agent (D or A, respectively), to kill them. Having made its prediction, Death then travels to that city to wait for the agent. Death is known to be a perfect predictor; the agent values survival at $1,000, while fleeing costs $1.

Then eCDT recommends fleeing to Aleppo with probability 999/2000. To check this, let x be the probability of fleeing to Aleppo (F), and y the probability of Death being there (A). The expected utility is then

• 1000(x(1-y)+(1-x)y)-x                                                    (1)

Differentiating this with respect to x gives 999-2000y, which is zero for y=999/2000. Since Death is a perfect predictor, y=x and eCDT's expected utility is 499.5.

The true expected utility, however, is -999/2000, since Death will get the agent anyway, and the only cost is the trip to Aleppo.
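These figures can be verified with a quick numerical sketch, using the same symbols as equation (1):

```python
# eCDT's expected utility in Death in Damascus, equation (1):
# U(x, y) = 1000*(x*(1-y) + (1-x)*y) - x,
# where x = P(flee to Aleppo) and y = P(Death in Aleppo).

def U(x, y):
    return 1000 * (x * (1 - y) + (1 - x) * y) - x

def dU_dx(y):
    # Derivative of U with respect to x, holding y fixed: 999 - 2000*y.
    return 999 - 2000 * y

x = 999 / 2000
print(dU_dx(x))   # 0.0: the eCDT equilibrium, where the derivative vanishes at y = x
print(U(x, x))    # 499.5: eCDT's (mistaken) expected utility
print(-x)         # -0.4995: the true expected utility (Death always finds you)
```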

## Delegating randomness

The eCDT decision process seems rather peculiar. It seems to allow updating of the value of y dependent on the value of x - hence allow acausal factors to be considered - but only in a narrow way. Specifically, it requires that the probability of F and A be equal, but that those two events remain independent. And it then differentiates utility according to the probability of F only, leaving that of A fixed. So, in a sense, x correlates with y, but small changes in x don't correlate with small changes in y.

That's somewhat unsatisfactory, so consider the problem now with an extra step. The eCDT agent no longer considers whether to stay or flee; instead, it outputs X, a value between 0 and 1. There is a uniform random process Z, also valued between 0 and 1. If Z<X, then the agent flees to Aleppo; if not, it stays in Damascus.

This seems identical to the original setup, for the agent. Instead of outputting a decision as to whether to flee or stay, it outputs the probability of fleeing. This has moved the randomness in the agent's decision from inside the agent to outside it, but this shouldn't make any causal difference, because the agent knows the distribution of Z.

Death remains a perfect predictor, which means that it can predict X and Z, and will move to Aleppo if and only if Z<X.

Now let the eCDT agent consider outputting X=x for some x. In that case, it updates its opinion of Death's behaviour, expecting that Death will be in Aleppo if and only if Z<x. Then it can calculate the expected utility of setting X=x, which is simply 0 (Death will always find the agent) minus x (the expected cost of fleeing to Aleppo), hence -x. Among the "pure" strategies, X=0 is clearly the best.

Now let's consider mixed strategies, where the eCDT agent can consider a distribution PX over values of X (this is a sort of second order randomness, since X and Z already give randomness over the decision to move to Aleppo). If we wanted the agent to remain consistent with the previous version, the agent then models Death as sampling from PX, independently of the agent. The probability of fleeing is just the expectation of PX; but the higher the variance of PX, the harder it is for Death to predict where the agent will go. The best option is as before: PX will set X=0 with probability 1001/2000, and X=1 with probability 999/2000.

But is this a fair way of estimating mixed strategies?

## Average Death in Aleppo

Consider a weaker form of Death, Average Death. Average Death cannot predict X, but can predict PX, and will use that to determine its location, sampling independently from it. Then, from eCDT's perspective, the mixed-strategy behaviour described above is the correct way of dealing with Average Death.

But that means that the agent above is incapable of distinguishing between Death and Average Death. Joyce argues strongly for considering all the relevant information, and the distinction between Death and Average Death is relevant. Thus it seems that, when considering mixed strategies, the eCDT agent must instead look at the pure strategies, compute their value (-x in this case), and then consider the distribution over them.

One might object that this is no longer causal, but the whole equilibrium approach undermines the strictly causal aspect anyway. It feels daft to be allowed to update on Average Death predicting PX, but not on Death predicting X. Especially since moving from PX to X is simply some random process Z' that samples from the distribution PX. So Death is allowed to predict PX (which depends on the agent's reasoning) but not Z'. It's worse than that, in fact: Death can predict PX and Z', and the agent can know this, but the agent isn't allowed to make use of this knowledge.

Given all that, it seems that in this situation, the eCDT agent must be able to compute the mixed strategies correctly and realise (like FDT) that staying in Damascus (X=0 with certainty) is the right decision.

## Let's recurse again, like we did last summer

This deals with Death, but not with Average Death. Ironically, the "X=0 with probability 1001/2000..." solution is not the correct solution for Average Death. To get that, we need to take equation (1), set x=y first, and then differentiate with respect to x. This gives x=1999/4000, so setting "X=0 with probability 2001/4000 and X=1 with probability 1999/4000" is actually the FDT solution for Average Death.
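To verify this: substituting y=x into equation (1) first gives g(x) = 2000x(1-x) - x, and a sketch of a grid search confirms the maximum sits at x = 1999/4000:

```python
# Average Death: set y = x in equation (1) *before* optimising.
# g(x) = 1000*(x*(1-x) + (1-x)*x) - x = 2000*x*(1-x) - x

def g(x):
    return 2000 * x * (1 - x) - x

# g'(x) = 1999 - 4000*x, so the maximum should be at x = 1999/4000 = 0.49975.
best = max((i / 4000 for i in range(4001)), key=g)
print(best)  # 0.49975
```

The grid deliberately includes 1999/4000 exactly; g is concave, so this is the global maximum.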

And we can make the eCDT agent reach that. Simply recurse to the next level, and have the agent choose PX directly, via a distribution PPX over possible PX.

But these towers of recursion are clunky and unnecessary. It's simpler to state that eCDT is unstable under recursion, and that it's a partial version of FDT.

## [Stub] Newcomb problem as a prisoners' dilemma/anti-coordination game

2 points · 21 March 2017 10:34AM

You should always cooperate with an identical copy of yourself in the prisoner's dilemma. This is obvious, because you and the copy will reach the same decision.

That justification implicitly assumes that you and your copy are (somewhat) antagonistic: that you have opposite aims. But the conclusion doesn't require that at all. Suppose that you and your copy were instead trying to ensure that one of you got maximal reward (it doesn't matter which). Then you should still jointly cooperate, because (C,C) is possible, while (C,D) and (D,C) are not (I'm ignoring randomising strategies for the moment).

Now look at the Newcomb problem. Your decision enters twice: once when you decide how many boxes to take, and once when Omega is simulating or estimating you to decide how much money to put in box B. You would dearly like your two "copies" (one of which may just be an estimate) to be out of sync - for the estimate to 1-box while the real you two-boxes. But without any way of distinguishing between the two, you're stuck with taking the same action - (1-box,1-box). Or, seeing it another way, (C,C).

This also makes the Newcomb problem into an anti-coordination game, where you and your copy/estimate try to pick different options. But, since this is not possible, you have to stick to the diagonal. This is why the Newcomb problem can be seen both as an anti-coordination game and a prisoners' dilemma - the differences only occur in the off-diagonal terms that can't be reached.
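One way to see this (the payoff numbers are illustrative, using the standard $1M/$1K Newcomb values in thousands): write the game between you and your estimate; the two framings differ only in the off-diagonal cells, which can never be reached:

```python
# Newcomb's problem as a two-player game between "you" and "your estimate".
# Actions: 1 = one-box (C), 2 = two-box (D). Your payoff in $1000s:
# if the estimate one-boxes, Omega fills box B.
payoff = {
    (1, 1): 1000,  # (C,C): full box B, you take only it
    (2, 1): 1001,  # unreachable: estimate one-boxes, you two-box
    (1, 2): 0,     # unreachable: estimate two-boxes, you one-box
    (2, 2): 1,     # (D,D): empty box B, you two-box
}

# Since you and the estimate necessarily act alike, only the diagonal exists:
reachable = {a: payoff[(a, a)] for a in (1, 2)}
best = max(reachable, key=reachable.get)
print(best)  # 1: one-boxing wins on the reachable diagonal
```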

## [Error]: Statistical Death in Damascus

3 points · 20 March 2017 07:17PM

Note: This post is in error, I've put up a corrected version of it here. I'm leaving the text in place, as historical record. The source of the error is that I set Pa(S)=Pe(D) and then differentiated with respect to Pa(S), while I should have differentiated first and then set the two values to be the same.

Nate Soares and Ben Levinstein have a new paper out on Functional Decision Theory, the most recent development of UDT and TDT.

This post is about further analysing the "Death in Damascus" problem, and to show that Joyce's "equilibrium" version of CDT (causal decision theory) is in a certain sense intermediate between CDT and FDT. If eCDT is this equilibrium theory, then it can deal with a certain class of predictors, which I'll call distribution predictors.

## Death in Damascus

In the original Death in Damascus problem, Death is a perfect predictor. It finds you in Damascus, and says that it has already planned its trip for tomorrow - and it'll be in the same place you will be.

You value surviving at $1,000, and can flee to Aleppo for $1.

Classical CDT will put some prior P over Death being in Damascus (D) or Aleppo (A) tomorrow. And then, if P(A)>999/2000, you should stay (S) in Damascus, while if P(A)<999/2000, you should flee (F) to Aleppo.

FDT estimates that Death will be wherever you will be, and thus there's no point in F, as that will just cost you $1 for no reason.

But it's interesting what eCDT produces. This decision theory requires that Pe (the equilibrium probability of A and D) be consistent with the action distribution that eCDT computes. Let Pa(S) be the action probability of S. Since Death knows what you will do, Pa(S)=Pe(D). The expected utility is 1000.Pa(S)Pe(A)+1000.Pa(F)Pe(D)-Pa(F). At equilibrium, this is 2000.Pe(A)(1-Pe(A))-Pe(A). And that quantity is maximised when Pe(A)=1999/4000 (and thus the probability of you fleeing is also 1999/4000). This is still the wrong decision, as paying the extra $1 is pointless, even if you don't pay it with certainty.

So far, nothing interesting: both CDT and eCDT fail. But consider the next example, on which eCDT does not fail.

## Statistical Death in Damascus

Let's assume now that Death has an assistant, Statistical Death, who is not a perfect predictor, but is a perfect distribution predictor: it can predict the distribution of your actions, but not your actual decision. Essentially, you have access to a source of true randomness that it cannot predict.

It informs you that its probability over whether to be in Damascus or Aleppo will follow exactly the same distribution as yours.

Classical CDT follows the same reasoning as before. So does eCDT: Pa(S)=Pe(D) still holds, since Statistical Death follows the same distribution as you do.

But what about FDT? Well, note that FDT will reach the same conclusion as eCDT. This is because 1000.Pa(S)Pe(A)+1000.Pa(F)Pe(D)-Pa(F) is the correct expected utility, the Pa(S)=Pe(D) assumption is correct for Statistical Death, and (S,F) is independent of (A,D) once the action probabilities have been fixed.

So on the Statistical Death problem, eCDT and FDT say the same thing.

## Factored joint distributions versus full joint distributions

What's happening is that there is a joint distribution over (S,F) (your actions) and (D,A) (Death's actions). FDT is capable of reasoning over all types of joint distributions, and fully assessing how its choice of Pa acausally affects Death's choice of Pe.

But eCDT is only capable of reasoning over ones where the joint distribution factors into a distribution over (S,F) times a distribution over (D,A). Within the confines of that limitation, it is capable of (acausally) changing Pe via its choice of Pa.

Death in Damascus does not factor into two distributions, so eCDT fails on it. Statistical Death in Damascus does so factor, so eCDT succeeds on it. Thus eCDT seems to be best conceived of as a version of FDT that is strangely limited in terms of which joint distributions it's allowed to consider.
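The distinction can be sketched numerically (my own framing of the two computations): take the same marginals P(F) = P(A) = x; the full joint distribution for perfect-predictor Death is perfectly correlated, while eCDT can only evaluate the factored product of the marginals:

```python
# Expected utility of a fleeing-probability x against Death, under two joint
# distributions over (flee, Death-in-Aleppo) with identical marginals.

def utility(p_joint):
    """p_joint maps (flee, aleppo) -> probability; you survive iff they differ."""
    return sum(p * (1000 * (flee != aleppo) - (1 * flee))
               for (flee, aleppo), p in p_joint.items())

x = 999 / 2000
# Full joint: perfectly correlated (Death predicts the actual action).
full = {(True, True): x, (False, False): 1 - x,
        (True, False): 0.0, (False, True): 0.0}
# Factored: product of independent marginals (eCDT's restriction).
factored = {(f, a): (x if f else 1 - x) * (x if a else 1 - x)
            for f in (True, False) for a in (True, False)}

print(utility(full))      # -0.4995: you always die, and pay to flee with probability x
print(utility(factored))  # 499.5: eCDT's mistaken valuation
```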

## [Link] The price you pay for arriving to class on time

0 points · 24 February 2017 02:11PM

## [Link] “Betting on the Past” – a decision problem by Arif Ahmed

2 points · 07 February 2017 09:14PM

## Is Evidential Decision Theory presumptuous?

3 points · 02 February 2017 01:41PM

I recently had a conversation with a staunch defender of EDT who maintained that EDT gives the right answer in the Smoker’s Lesion and even Evidential Blackmail. I came up with the following, even more counterintuitive, thought experiment:

--

By doing research, you've found out that there is either

(A) only one universe or

(B) a multiverse.

You also found out that the cosmological theory has a slight influence (via different physics) on how your brain works. If (A) holds, you will likely decide to give away all your money to random strangers on the street; if there is a multiverse, you will most likely not do that. Of course, causality flows in one direction only, i.e. your decision does not determine how many universes there are.

Suppose you have a very strong preference for (A) (e.g. because a multiverse would contain infinite suffering) so that it is more important to you than your money.

Do you give away all your money or not?

--

This is structurally equivalent to the Smoker's lesion, but what's causing your action is the cosmological theory, not a lesion or a gene. CDT, TDT, and UDT would not give away the money because there is no causal (or acausal) influence on the number of universes. EDT would reason that giving the money away is evidence for (A) and therefore choose to do so.

Apart from the usual "managing the news" point, this highlights another flaw in EDT: its presumptuousness. The EDT agent thinks that her decision spawns or destroys the entire multiverse, or at least reasons as if it did. In other words, EDT acts as if it affects astronomical stakes with a single thought.

I find this highly counterintuitive.

What makes it even worse is that this is not even a contrived thought experiment. Our brains are in fact shaped by physics, and it is plausible that different physical theories or constants both make an agent decide differently and make the world better or worse according to one’s values. So, EDT agents might actually reason in this way in the real world.
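The presumptuous reasoning above can be sketched as a conditional-expectation calculation. All the numbers here are made up for illustration: V is the value you place on (A), m your money, and the conditional probabilities stand in for the assumed physics-mediated correlation:

```python
# EDT in the one-universe-vs-multiverse problem, with illustrative numbers.

V = 10**9   # value of living in a single universe (A); dwarfs your money
m = 10**4   # your money
p_A_given_give = 0.9   # assumed correlation: those who give mostly live in (A)-worlds
p_A_given_keep = 0.1

edt_give = p_A_given_give * V        # giving: strong "evidence" for (A), money gone
edt_keep = p_A_given_keep * V + m    # keeping: weak evidence for (A), money kept

print(edt_give > edt_keep)  # True: EDT gives the money away, "managing the news"
```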

## Did EDT get it right all along? Introducing yet another medical Newcomb problem

10 points · 24 January 2017 11:43AM

One of the main arguments given against Evidential Decision Theory (EDT) is that it would "one-box" in medical Newcomb problems. Whether this is the winning action has been a hotly debated issue on LessWrong. A majority, including experts in the area such as Eliezer Yudkowsky and Wei Dai, seem to think that one should two-box (see e.g. Yudkowsky 2010, p. 67). Others have tried to argue in favor of EDT by claiming that the winning action would be to one-box, or by offering reasons why EDT would in some cases two-box after all.

In this blog post, I want to argue that EDT gets it right: one-boxing is the correct action in medical Newcomb problems. I introduce a new thought experiment, the Coin Flip Creation problem, in which I believe the winning move is to one-box. This new problem is structurally similar to other medical Newcomb problems such as the Smoking Lesion, though it might elicit the intuition to one-box even in people who would two-box in some of the other problems. I discuss both how EDT and other decision theories would reason in the problem and why people's intuitions might diverge in different formulations of medical Newcomb problems.

## Two kinds of Newcomblike problems

There are two different kinds of Newcomblike problems. In Newcomb's original paradox, both EDT and Logical Decision Theories (LDT), such as Timeless Decision Theory (TDT), would one-box and therefore, unlike CDT, win $1 million. In medical Newcomb problems, EDT's and LDT's decisions diverge. This is because in the latter, a (physical) causal node that isn't itself a decision algorithm influences both the current world state and our decisions – resulting in a correlation between action and environment but, unlike the original Newcomb, no "logical" causation.

It's often unclear exactly how a causal node can exert influence on our decisions. Does it change our decision theory, utility function, or the information available to us? In the case of the Smoking Lesion problem, it seems plausible that it's our utility function that is being influenced. But then it seems that as soon as we observe our utility function ("notice a tickle"; see Eells 1982), we lose "evidential power" (Almond 2010a, p. 39), i.e. there's nothing new to learn about our health by acting a certain way if we already know our utility function. In any case, as long as we don't know our utility function and therefore still have the evidential power, I believe we should use it.

The Coin Flip Creation problem is an adaptation of Caspar Oesterheld's "Two-Boxing Gene" problem and, like the latter, attempts to take Newcomb's original problem and turn it into a medical Newcomb problem, triggering the intuition that we should one-box. In Oesterheld's Two-Boxing Gene, it's stated that a certain gene correlates with our decision to one-box or two-box in Newcomb's problem, and that Omega, instead of simulating our decision algorithm, just looks at this gene. Unfortunately, it's not specified how the correlation between two-boxing and the gene arises, casting doubt on whether it's a medical Newcomb problem at all, and on whether other decision algorithms would disagree with one-boxing.
Wei Dai argues that in the Two-Boxing Gene, if Omega conducts a study to find out which genes correlate with which decision algorithm, then Updateless Decision Theory (UDT) could just commit to one-boxing and thereby determine that all the genes UDT agents have will always correlate with one-boxing. So in some sense, UDT’s genes will still indirectly constitute a “simulation” of UDT’s algorithm, and there is a logical influence between the decision to one-box and Omega’s decision to put$1 million in box A. Similar considerations could apply for other LDTs.

The Coin Flip Creation problem is intended as an example of a problem in which EDT would give the right answer, but all causal and logical decision theories would fail. It works explicitly through a causal influence on the decision theory itself, thus reducing ambiguity about the origin of the correlation.

One day, while pondering the merits and demerits of different acausal decision theories, you're visited by Omega, a being assumed to possess flawless powers of prediction and absolute trustworthiness. You're presented with Newcomb's paradox, but with one additional caveat: Omega informs you that you weren't born like a normal human being, but were instead created by Omega. On the day you were born, Omega flipped a coin: If it came up heads, Omega created you in such a way that you would one-box when presented with the Coin Flip Creation problem, and it put $1 million in box A. If the coin came up tails, you were created such that you'd two-box, and Omega didn't put any money in box A. We don't know how Omega made sure what your decision would be. For all we know, it may have inserted either CDT or EDT into your source code, or even just added one hard-coded decision rule on top of your messy human brain. Do you choose both boxes, or only box A?

It seems like EDT gets it right: one-boxing is the winning action here. There's a correlation between our decision to one-box, the coin flip, and Omega's decision to put money in box A. Conditional on us one-boxing, the probability that there is money in box A increases, and we "receive the good news" – that is, we discover that the coin must have come up heads, and we thus get the million dollars. In fact, we can be absolutely certain of the better outcome if we one-box. The argument persists even if the correlation between our actions and the content of box A isn't perfect: as long as the correlation is high enough, it is better to one-box.

Nevertheless, neither causal nor logical counterfactuals seem to imply that we can determine whether there is money in box A. The coin flip isn't a decision algorithm itself, so we can't determine its outcome. The logical uncertainty about our own decision output doesn't seem to coincide with the empirical uncertainty about the outcome of the coin flip.
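EDT's verdict can be sketched as a conditional-expectation calculation. I assume the correlation is perfect, as in the problem statement, and that the transparent box holds the usual $1,000:

```python
# Coin Flip Creation: heads -> agent one-boxes and box A holds $1M;
# tails -> agent two-boxes and box A is empty. The transparent box holds $1K.

def conditional_payoff(action):
    """Expected payoff conditional on observing yourself take `action`."""
    if action == "one-box":
        return 1_000_000      # heads world: full box A
    return 0 + 1_000          # tails world: empty box A plus the transparent box

print(conditional_payoff("one-box"), conditional_payoff("two-box"))
```

Conditioning on the action pins down which world you are in, which is exactly the "receiving the good news" reasoning above.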
In the absence of a causal or logical link between their decision and the content of box A, CDT and TDT would two-box.

## Updateless Decision Theory

As far as I understand, UDT would come to a similar conclusion. AlephNeil writes in a post about UDT:

> In the Smoking Lesion problem, the presence of a 'lesion' is somehow supposed to cause Player's to choose to smoke (without altering their utility function), which can only mean that in some sense the Player's source code is 'partially written' before the Player can exercise any control over it. However, UDT wants to 'wipe the slate clean' and delete whatever half-written nonsense is there before deciding what code to write. Ultimately this means that when UDT encounters the Smoking Lesion, it simply throws away the supposed correlation between the lesion and the decision and acts as though that were never a part of the problem.

This approach seems wrong to me. If we use an algorithm that changes our own source code, then this change, too, has been physically determined and can therefore correlate with events that aren't copies of our own decision algorithm. If UDT reasons as though it could just rewrite its own source code and discard the correlation with the coin flip altogether, then UDT two-boxes and thus by definition ends up in the world where there is no money in box A.

Note that updatelessness seemingly makes no difference in this problem, since it involves no a priori decision: before the coin flip, there's a 50% chance of becoming either a one-boxing or a two-boxing agent. In any case, we can't do anything about the coin flip, and therefore also can't influence whether box A contains any money.

I am uncertain how UDT works, though, and would be curious about other people's thoughts. Maybe UDT reasons that by one-boxing, it becomes a decision theory of the sort that would never be installed into an agent in a tails world, thus rendering impossible all hypothetical tails worlds with UDT agents in them.
But if so, why wouldn't UDT "one-box" in the Smoking Lesion? As far as the thought experiments are specified, the causal connection between coin flip and two-boxing in the Coin Flip Creation appears to be no different from the connection between gene and smoking in the Smoking Lesion.

More adaptations and different formalizations of LDTs exist, e.g. Proof-Based Decision Theory. I could very well imagine that some of those might one-box in the thought experiment I presented. If so, then I'm once again curious as to where the benefits of such decision theories lie in comparison to plain EDT (aside from updatelessness – see Concluding thoughts).

## Coin Flip Creation, Version 2

Let's assume UDT would two-box in the Coin Flip Creation. We could alter our thought experiment a bit so that UDT would probably one-box after all:

The situation is identical to the Coin Flip Creation, with one key difference: After Omega flips the coin and creates you with the altered decision algorithm, it actually simulates your decision, just as in Newcomb's original paradox. Only after Omega has determined your decision via simulation does it decide whether to put money in box A, conditional on your decision. Do you choose both boxes, or only box A?

Here is a causal graph for the first and second version of the Coin Flip Creation problem. In the first version, a coin flip determines whether there is money in box A. In the second one, a simulation of your decision algorithm decides:

Since in Version 2 there's a simulation involved, UDT would probably one-box. I find this to be a curious conclusion. The situation remains exactly the same – we can rule out any changes in the correlation between our decision and our payoff. It seems confusing to me, then, that the optimal decision should be a different one.

## Copy-altruism and multi-worlds

The Coin Flip Creation problem assumes a single world and an egoistic agent.
In the following, I want to include a short discussion of how the Coin Flip Creation would play out in a multi-world environment. Suppose Omega's coin is based on a quantum number generator and produces 50% heads worlds and 50% tails worlds. If we're copy-egoists, EDT still recommends one-boxing, since doing so would reveal to us that we're in one of the branches in which the coin came up heads.

If we're copy-altruists, then in practice, we'd probably care a bit less about copies whose decision algorithms have been tampered with, since they would make less effective use of the resources they gain than we ourselves would (i.e. their decision algorithm sometimes behaves differently). But in theory, if we care about all the copies equally, we should be indifferent between one-boxing and two-boxing, since there will always be 50% of us in either of the worlds no matter what we do. The two groups always take the opposite action; the only thing we can change is whether our own copy belongs to the tails or the heads group.

To summarize: UDT and EDT would both be indifferent in the altruistic multi-world case, but UDT would (presumably) two-box, and EDT would one-box, both in the copy-egoistic multi-world case and in the single-world case.

## "But I don't have a choice"

There seems to be an especially strong intuition of "absence of free will" inherent to the Coin Flip Creation problem. When presented with the problem, many respond that if someone had created their source code, they didn't have any choice to begin with. But that's the exact situation in which we all find ourselves at all times! Our decision architecture and choices are determined by physics, just like a hypothetical AI's source code, and all of our choices will thus be determined by our "creator." When we're confronted with the two boxes, we know that our decisions are predetermined, just like every word of this blogpost has been predetermined. But that knowledge alone won't help us make any decision.
As far as I'm aware, even an agent with complete knowledge of its own source code would have to treat its own decision outputs as uncertain, or it would fail to implement a decision algorithm that takes counterfactuals into account. Note that our decision in the Coin Flip Creation is also no less determined than in Newcomb's paradox. In both cases, the prediction has been made, and physics will guide our thoughts and our decision in a deterministic and predictable manner. Nevertheless, we can still assume that we have a choice until we make our decision, at which point we merely "find out" what has been our destiny all along.

## Concluding thoughts

I hope that the Coin Flip Creation motivates some people to reconsider EDT's answers in Newcomblike problems. A thought experiment somewhat similar to the Coin Flip Creation can be found in Arif Ahmed 2014. Of course, the particular setup of the Coin Flip Creation means it isn't directly relevant to the question of which decision theory we should program into an AI. We obviously wouldn't flip a coin before creating an AI. Also, the situation doesn't really look like a decision problem from the outside; an impartial observer would just see Omega forcing you to pick either A or B.

Still, the example demonstrates that from the inside view, evidence from the actions we take can help us achieve our goals better. Why shouldn't we use this information? And if evidential knowledge can help us, why shouldn't we allow a future AI to take it into account? In any case, I'm not overly confident in my analysis and would be glad to have any mistakes pointed out to me.

Medical Newcomb is also not the only class of problems that challenge EDT. Evidential blackmail is an example of a different problem, wherein giving the agent access to specific compromising information is used to extract money from EDT agents.
The problem attacks EDT from a different angle, though: namely by exploiting its lack of updatelessness, similar to the challenges in Transparent Newcomb, Parfit’s Hitchhiker, Counterfactual Mugging, and the Absent-Minded Driver. I plan to address questions related to updatelessness, e.g. whether it makes sense to give in to evidential blackmail if you already have access to the information and haven’t precommitted not to give in, at a later point. ## Two-boxing, smoking and chewing gum in Medical Newcomb problems 15 29 June 2015 10:35AM I am currently learning about the basics of decision theory, most of which is common knowledge on LW. I have a question, related to why EDT is said not to work. Consider the following Newcomblike problem: A study shows that most people who two-box in Newcomblike problems such as the following have a certain gene (and one-boxers don't have the gene). Now, Omega could put you into something like Newcomb's original problem, but instead of having run a simulation of you, Omega has only looked at your DNA: If you don't have the "two-boxing gene", Omega puts $1M into box B, otherwise box B is empty. And there is $1K in box A, as usual. Would you one-box (take only box B) or two-box (take boxes A and B)? Here's a causal diagram for the problem: Since Omega does not do much other than translating your genes into money under a box, it does not seem to hurt to leave it out: I presume that most LWers would one-box. (And as I understand it, not only CDT but also TDT would two-box, am I wrong?) Now, how does this problem differ from the smoking lesion or Yudkowsky's (2010, p.67) chewing gum problem? Chewing gum (or smoking) seems to be like taking box A to get at least an additional $1K, the two-boxing gene is like the CGTA gene, the illness itself (the abscess or lung cancer) is like not having $1M in box B.
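To make the evidential reasoning concrete, here is a rough sketch of the EDT calculation for the genetic Newcomb problem above. The 95%/5% gene-action correlations are my own illustrative assumptions, not numbers from the study described in the post.

```python
# Hypothetical sketch of the EDT calculation for the genetic Newcomb problem.
# The 0.95/0.05 correlations between the gene and the action are assumed
# numbers for illustration only.
P_GENE = {"one-box": 0.05, "two-box": 0.95}  # P(two-boxing gene | action)

def edt_value(action):
    """Expected payout conditional on the action, EDT-style."""
    p_full_box_b = 1 - P_GENE[action]        # box B holds $1M iff you lack the gene
    payout = p_full_box_b * 1_000_000
    if action == "two-box":
        payout += 1_000                      # box A always holds $1K
    return payout

print(edt_value("one-box"), edt_value("two-box"))
```

Under these assumed numbers, conditioning on one-boxing is strong evidence of a full box B, so EDT prefers one-boxing by a wide margin.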
Here's another causal diagram, this time for the chewing gum problem: As far as I can tell, the difference between the two problems is some additional, unstated intuition in the classic medical Newcomb problems. Maybe the additional assumption is that the actual evidence lies in the "tickle", or that knowing and thinking about the study results causes some complications. In EDT terms: The intuition is that neither smoking nor chewing gum gives the agent additional information. ## Acausal trade barriers 9 11 March 2015 01:40PM A putative new idea for AI control; index here. Many of the ideas presented here require AIs to be antagonistic towards each other - or at least hypothetically antagonistic towards hypothetical other AIs. This can fail if the AIs engage in acausal trade, so it would be useful if we could prevent such things from happening. Now, I have to admit I'm still quite confused by acausal trade, so I'll simplify it to something I understand much better, an anthropic decision problem. ## Staples and paperclips, cooperation and defection Clippy has a utility function p, linear in paperclips, while Stapley has a utility function s, linear in staples (and both p and s are normalised to zero with one additional item adding 1 utility). They are not causally connected, and each must choose "Cooperate" or "Defect". If they "Cooperate", they create 10 copies of the items they do not value (so Clippy creates 10 staples, Stapley creates 10 paperclips). If they choose defect, they create one copy of the item they value (so Clippy creates 1 paperclip, Stapley creates 1 staple). Assume both agents know these facts, both agents use anthropic decision theories, and both agents are identical apart from their separate locations and distinct utility functions. Then the outcome is easy: both agents will consider that "cooperate-cooperate" or "defect-defect" are the only two possible options, "cooperate-cooperate" gives them the best outcome, so they will both cooperate.
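A minimal sketch of the linked calculation (my own illustration): because the two agents are identical, the only reachable joint outcomes are cooperate-cooperate and defect-defect, so each agent just compares its own payoff under the two joint options.

```python
# Payoffs to Clippy (in paperclips); Stapley's situation is symmetric.
def clippy_payoff(joint_action):
    if joint_action == "cooperate":  # Stapley creates 10 paperclips
        return 10
    return 1                         # Clippy creates 1 paperclip itself

best = max(("cooperate", "defect"), key=clippy_payoff)
print(best)  # cooperate
```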
It's a sweet story of cooperation and trust between lovers that never agree and never meet. ## Breaking cooperation How can we demolish this lovely agreement? As I often do, I will assume that there is some event X that will turn Clippy on, with P(X) ≈ 1 (hence P(¬X) << 1). Similarly there is an event Y that turns Stapley on. Since X and Y are almost certain, they should not affect the results above. If the events don't happen, the AIs will never get turned on at all. Now I am going to modify utility p, replacing it with p' = p - E(p|¬X). This is p with a single term subtracted off: the expected value of p given that Clippy has not been turned on. This term feels like a constant, but isn't exactly, as we shall see. Do the same modification to utility s, using Y: s' = s - E(s|¬Y). Now contrast "cooperate-cooperate" and "defect-defect". If Clippy and Stapley are both cooperators, then p=s=10. However, if the (incredibly unlikely) ¬X were to happen, then Clippy would not exist, but Stapley would still cooperate (as Stapley has no way of knowing about Clippy's non-existence), and create ten paperclips. So E(p|¬X) = E(p|X) ≈ 10, and p' ≈ 0. Similarly s' ≈ 0. If both agents are defectors, though, then p=s=1. Since each agent creates its own valuable object, E(p|¬X) = 0 (Clippy cannot create a paperclip if Clippy does not exist) and similarly E(s|¬Y)=0. So p'=s'=1, and both agents will choose to defect. If this is a good analogue for acausal decision making, it seems we can break that, if needed. ## [LINK] The P + epsilon Attack (Precommitment in cryptoeconomics) 18 29 January 2015 02:02AM Vitalik Buterin has a new post about an interesting theoretical attack against Bitcoin. The idea relies on the assumption that the attacker can credibly commit to something quite crazy. The crazy thing is this: paying out 25.01 BTC to all the people who help him in his attack to steal 25 BTC from everyone, but only if the attack fails.
This leads to a weird payoff matrix where the dominant strategy is to help him in the attack. The attack succeeds, and no payout is made. Of course, smart contracts make such crazy commitments perfectly possible, so this is a bit less theoretical than it sounds. But even as an abstract thought experiment about decision theories, it looks pretty interesting. By the way, Vitalik Buterin is really on a roll. Just a week ago he had a thought-provoking blog post about how Decentralized Autonomous Organizations could possibly utilize a concept often discussed here: decision theory in a setup where agents can inspect each others' source code. It was shared on LW Discussion, but earned less exposure than I think it deserved. EDIT 1: One smart commenter on the original post spotted that an isomorphic, extremely cool game was already proposed by billionaire Warren Buffett. Does this thing already have a name in game theory maybe? EDIT 2: I wrote the game up in detail for some old-school game theorist friends: The attacker orchestrates a game with 99 players. The attacker himself does not participate in the game. Rules: Each of the players can either defect or cooperate, in the usual game theoretic setup where they announce their decisions simultaneously, without side channels. We call "aggregate outcome" the decision that was made by the majority of the players. If the aggregate outcome is defection, we say that the attack succeeds. A player's payoff consists of two components:

1. If her decision coincides with the aggregate outcome, the player gets 10 utilons;

and simultaneously:

2. if the attack succeeds, the attacker takes 1 utilon from each of the 99 players, regardless of their own decision.

|                 | Cooperate | Defect |
|-----------------|-----------|--------|
| Attack fails    | 10        | 0      |
| Attack succeeds | -1        | 9      |

There are two equilibria, but the second payoff component breaks the symmetry, and everyone will cooperate. Now the attacker spices things up, by making a credible commitment before the game.
("Credible" simply means that somehow they make sure that the promise cannot be broken. The classic way to achieve such things is an escrow, but so-called smart contracts are emerging as a method for making fully unbreakable commitments.) The attacker's commitment is quite counterintuitive: he promises that he will pay 11 utilons to each of the defecting players, but only if the attack fails. Now the payoff looks like this:

|                 | Cooperate | Defect |
|-----------------|-----------|--------|
| Attack fails    | 10        | 11     |
| Attack succeeds | -1        | 9      |

Defection became a dominant strategy. The clever thing, of course, is that if everyone defects, then the attacker reaches his goal without paying out anything. ## Less exploitable value-updating agent 5 13 January 2015 05:19PM My indifferent value learning agent design is in some ways too good. The agent transfers perfectly from u-maximiser to v-maximiser - but this makes it exploitable, as Benja has pointed out. For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow. One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example. As before, u values paperclips and v values staples.
Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing paperclips/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclip/staple created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple created/destroyed above five, they will gain/lose one half utilon. We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose). The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted). So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X). This looks a lot like a resource-conserving value-learning agent. It doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent.
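The arithmetic behind the agent's plan can be checked with a short sketch (my own, using the utility schedule above); the plan names are hypothetical labels.

```python
def clip_utility(n):
    """Utilons from creating (or destroying) n paperclips or staples:
    one utilon each up to five, half a utilon each beyond that."""
    return min(n, 5) + 0.5 * max(n - 5, 0)

# Plan "wait": £5 on paperclips on day one; after X is revealed, £5 on
# staples (created if X = +1, destroyed if X = -1), so X*v contributes +5.
plan_wait = clip_utility(5) + 5

# Plan "spend it all now": £10 on paperclips on day one.
plan_all_clips = clip_utility(10)

# Plan "guess": £5 on paperclips plus £5 creating staples on day one;
# X = +1 and X = -1 are equally likely, so the staples cancel in expectation.
plan_guess = clip_utility(5) + 0.5 * 5 + 0.5 * (-5)

print(plan_wait, plan_all_clips, plan_guess)  # 10.0 7.5 5.0
```

Waiting for X dominates: spending everything on day one runs into diminishing returns, and guessing about the staples earns nothing in expectation.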
But it still seems to have interesting features of non-exploitability and value-learning that are worth exploring. Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example. ## Lying in negotiations: a maximally bad problem 13 17 November 2014 03:17PM In a previous post, I showed that the Nash Bargaining Solution (NBS), the Kalai-Smorodinsky Bargaining Solution (KSBS) and my own Mutual Worth Bargaining Solution (MWBS) were all maximally vulnerable to lying. Here I can present a more general result: all bargaining solutions are maximally vulnerable to lying. Assume that players X and Y have settled on some bargaining solution (which only cares about the defection point and the utilities of X and Y). Assume further that player Y knows player X's utility function. Let player X look at the possible outcomes, and let her label any outcome O "admissible" if there is some possible bargaining partner YO with utility function uO such that O would be the outcome of the bargain between X and YO. For instance, in the case of NBS and KSBS, the admissible outcomes would be the outcomes Pareto-better than the disagreement point. The MWBS has a slightly larger set of admissible outcomes, as it allows players to lose utility (up to the maximum they could possibly gain).
Then the general result is: If player Y is able to lie about his utility function while knowing player X's true utility (and player X is honest), he can freely select his preferred outcome among the outcomes that are admissible. The proof of this is also derisorily brief: player Y need simply claim to have utility uO, in order to force outcome O. Thus, if you've agreed on a bargaining solution, all that you've done is determined the set of outcomes among which your lying opponent will freely choose. There may be a subtlety: your system could make use of an objective (or partially objective) disagreement point, which your opponent is powerless to change. This doesn't change the result much: If player Y is able to lie about his utility function while knowing player X's true utility (and player X is honest), he can freely select his preferred outcome among the outcomes that are admissible given the disagreement point. ## Exploitation and gains from trade Note that the above result did not make any assumptions about the outcome being Pareto - giving up Pareto doesn't make you non-exploitable (or "strategyproof" as it is often called). But note also that the result does not mean that the system is exploitable! In the random dictator setup, you randomly assign power to one player, who then makes all the decisions. In terms of expected utility, this is pUX+(1-p)UY, where UX is the best outcome ("Utopia") for X and UY the best outcome for Y, and p the probability that X is the random dictator. The theorem still holds for this setup: player X knows that player Y will be able to select freely among the admissible outcomes, which is the set S={pUX+(1-p)O | O an outcome}. However, player X knows that player Y will select pUX+(1-p)UY as this maximises his expected utility. So a bargaining solution which has a particular selection of admissible outcomes can be strategyproof. However, it seems that the only strategyproof bargaining solutions are variants of random dictators!
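Here is a rough sketch of that random-dictator calculation with made-up numbers (the value of p, Y's utilities, and the "compromise" outcome are all assumptions for illustration):

```python
# Made-up numbers: X dictates with probability p; Y's utility for each
# admissible outcome O (including a hypothetical compromise) is listed below.
p = 0.5
y_utility = {"U_X": 0.0, "U_Y": 1.0, "compromise": 0.6}

def y_expected(outcome):
    """Y's expected utility if Y's (possibly dishonest) report steers the
    non-dictated part of the bargain to `outcome`."""
    return p * y_utility["U_X"] + (1 - p) * y_utility[outcome]

best = max(y_utility, key=y_expected)
print(best)  # U_Y
```

Lying buys Y nothing beyond honestly naming its own Utopia, which is why the random dictator is strategyproof.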
These solutions do not allow much gain from trade. Conversely, the more you open your bargaining solution up to gains from trade, the more exploitable you become from lying. This can be seen in the examples above: my MWBS tried to allow greater gains (in expectation) by not restricting to strict Pareto improvements from the disagreement point. As a result, it makes itself more vulnerable to liars. ## What to do What can be done about this? There seem to be several possibilities:

1. Restrict to bargaining solutions difficult to exploit. This is the counsel of despair: give up most of the gains from trade, to protect yourself from lying. But there may be a system where the tradeoff between exploitability and potential gains is in some sense optimal.

2. Figure out your opponent's true utility function. The other obvious solution: prevent lying by figuring out what your opponent really values, by inspecting their code, their history, their reactions, etc... This could be combined with refusing to trade with those who don't make their true utility easy to discover (or only using non-exploitable trades with those).

3. Hide your own true utility. The above approach only works because the liar knows their opponent, and their opponent doesn't know them. If both utilities are hidden, it's not clear how exploitable the system really is.

4. Play only multi-player. If there are many different trades with many different people, it becomes harder to construct a false utility that exploits them all. This is in a sense a variant of "hiding your own true utility": in that situation, the player has to lie given their probability distribution of your possible utilities; in this situation, they have to lie given the known distribution of multiple true utilities.

So there does not seem to be a principled way of getting rid of liars.
But the multi-player (or hidden utility function) setup may point to a single "best" bargaining solution: the one that minimises the returns to lying and maximises the gains to trade, given ignorance of the other's utility function. ## Blackmail, continued: communal blackmail, uncoordinated responses 11 22 October 2014 05:53PM The heuristic that one should always resist blackmail seems a good one (no matter how tricky blackmail is to define). And one should be public about this, too; then, one is very unlikely to be blackmailed. Even if one speaks like an emperor. But there's a subtlety: what if the blackmail is being used against a whole group, not just against one person? The US justice system is often seen to function like this: prosecutors pile on ridiculous numbers of charges, threatening uncounted millennia in jail, in order to get the accused to settle for a lesser charge and avoid the expenses of a trial. But for this to work, they need to occasionally find someone who rejects the offer, put them on trial, and slap them with a ridiculous sentence. Therefore by standing up to them (or proclaiming in advance that you will reject such offers), you are not actually making yourself immune to their threats. You're setting yourself up to be the sacrificial one made an example of. Of course, if everyone were a UDT agent, the correct decision would be for everyone to reject the threat. That would ensure that the threats are never made in the first place. But - and apologies if this shocks you - not everyone in the world is a perfect UDT agent. So the threats will get made, and those resisting them will get slammed to the maximum. Of course, if everyone could read everyone's mind and was perfectly rational, then they would realise that making examples of UDT agents wouldn't affect the behaviour of non-UDT agents. In that case, UDT agents should resist the threats, and the perfectly rational prosecutor wouldn't bother threatening UDT agents.
However - and sorry to shock your views of reality three times in one post - not everyone is perfectly rational. And not everyone can read everyone's minds. So even a perfect UDT agent must, it seems, sometimes succumb to blackmail. ## Value learning: ultra-sophisticated Cake or Death 9 17 June 2014 04:36PM Many mooted AI designs rely on "value loading", the update of the AI’s preference function according to evidence it receives. This allows the AI to learn "moral facts" by, for instance, interacting with people in conversation ("this human also thinks that death is bad and cakes are good – I'm starting to notice a pattern here"). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do. But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful) there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask or refrain from asking about key issues. In extreme cases, it could break out to seize control of the system, threatening or imitating humans so it could give itself the answers it desired. Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if: Expectation(p(C(u)) | a) = p(C(u)). Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI was 50% convinced about u, then it could certainly ask a question that would resolve its doubts, and put p(C(u)) at 100% or 0%. 
But only as long as it didn't know which moral outcome was more likely. That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b, Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b). How would this work in practice? Well, suppose an AI was uncertain whether cake or death was the proper thing, but it knew that if it took action a:"Ask a human", the human would answer "cake", and it would then update its values to reflect that cake was valuable but death wasn't. However, the above condition means that if the AI instead chose the action b:"don't ask", exactly the same thing would happen. In practice, this means that as soon as the AI knows that a human would answer "cake", it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way. ## SUDT: A toy decision theory for updateless anthropics 15 23 February 2014 11:50PM The best approach I know for thinking about anthropic problems is Wei Dai's Updateless Decision Theory (UDT). We aren't yet able to solve all problems that we'd like to—for example, when it comes to game theory, the only games we have any idea how to solve are very symmetric ones—but for many anthropic problems, UDT gives the obviously correct solution. However, UDT is somewhat underspecified, and cousin_it's concrete models of UDT based on formal logic are rather heavyweight if all you want is to figure out the solution to a simple anthropic problem. In this post, I introduce a toy decision theory, Simple Updateless Decision Theory or SUDT, which is most definitely not a replacement for UDT but makes it easy to formally model and solve the kind of anthropic problems that we usually apply UDT to. (And, of course, it gives the same solutions as UDT.) I'll illustrate this with a few examples.
This post is a bit boring, because all it does is take a bit of math that we already implicitly use all the time when we apply updateless reasoning to anthropic problems, and spell it out in excruciating detail. If you're already well-versed in that sort of thing, you're not going to learn much from this post. The reason I'm posting it anyway is that there are things I want to say about updateless anthropics, with a bit of simple math here and there, and while the math may be intuitive, the best thing I can point to in terms of details are the posts on UDT, which contain lots of irrelevant complications. So the main purpose of this post is to save people from having to reverse-engineer the simple math of SUDT from the more complex / less well-specified math of UDT. (I'll also argue that Psy-Kosh's non-anthropic problem is a type of counterfactual mugging, I'll use the concept of l-zombies to explain why UDT's response to this problem is correct, and I'll explain why this argument still works if there aren't any l-zombies.) * I'll introduce SUDT by way of a first example: the counterfactual mugging. In my preferred version, Omega appears to you and tells you that it has thrown a very biased coin, which had only a 1/1000 chance of landing heads; however, in this case, the coin has in fact fallen heads, which is why Omega is talking to you. It asks you to choose between two options, (H) and (T). If you choose (H), Omega will create a Friendly AI; if you choose (T), it will destroy the world. However, there is a catch: Before throwing the coin, Omega made a prediction about which of these options you would choose if the coin came up heads (and it was able to make a highly confident prediction). If the coin had come up tails, Omega would have destroyed the world if it had predicted that you'd choose (H), and it would have created a Friendly AI if it had predicted (T).
(Incidentally, if it hadn't been able to make a confident prediction, it would just have destroyed the world outright.)

|  | Coin falls heads (chance = 1/1000) | Coin falls tails (chance = 999/1000) |
|---|---|---|
| You choose (H) if coin falls heads | Positive intelligence explosion | Humanity wiped out |
| You choose (T) if coin falls heads | Humanity wiped out | Positive intelligence explosion |

In this example, we are considering two possible worlds: $w_H$ and $w_T$. We write $W$ (no pun intended) for the set of all possible worlds; thus, in this case, $W = \{w_H, w_T\}$. We also have a probability distribution over $W$, which we call $P$. In our example, $P(w_H) = 1/1000$ and $P(w_T) = 999/1000$. In the counterfactual mugging, there is only one situation you might find yourself in in which you need to make a decision, namely when Omega tells you that the coin has fallen heads. In general, we write $I$ for the set of all possible situations in which you might need to make a decision; the $I$ stands for the information available to you, including both sensory input and your memories. In our case, we'll write $I = \{i_H\}$, where $i_H$ is the single situation where you need to make a decision. For every $i \in I$, we write $A(i)$ for the set of possible actions you can take if you find yourself in situation $i$. In our case, $A(i_H) = \{(H), (T)\}$. A policy (or "plan") is a function $\pi$ that associates to every situation $i \in I$ an action $\pi(i) \in A(i)$ to take in this situation. We write $\Pi$ for the set of all policies. In our case, $\Pi = \{\pi_H, \pi_T\}$, where $\pi_H(i_H) = (H)$ and $\pi_T(i_H) = (T)$. Next, there is a set of outcomes, $O$, which specify all the features of what happens in the world that make a difference to our final goals, and the outcome function $o : W \times \Pi \to O$, which for every possible world $w$ and every policy $\pi$ specifies the outcome $o(w, \pi)$ that results from executing $\pi$ in the world $w$. In our case, $O = \{\mathrm{FAI}, \mathrm{DOOM}\}$ (standing for a Friendly AI and for doom), and $o(w_H, \pi_H) = o(w_T, \pi_T) = \mathrm{FAI}$ and $o(w_H, \pi_T) = o(w_T, \pi_H) = \mathrm{DOOM}$. Finally, we have a utility function $U : O \to \mathbb{R}$. In our case, $U(\mathrm{FAI}) = 1$ and $U(\mathrm{DOOM}) = 0$. (The exact numbers don't really matter, as long as $U(\mathrm{FAI}) > U(\mathrm{DOOM})$, because utility functions don't change their meaning under affine transformations, i.e.
when you add a constant to all utilities or multiply all utilities by a positive number.) Thus, an SUDT decision problem consists of the following ingredients: The sets $W$, $I$ and $O$ of possible worlds, situations you need to make a decision in, and outcomes; for every $i \in I$, the set $A(i)$ of possible actions in that situation; the probability distribution $P$; and the outcome and utility functions $o$ and $U$. SUDT then says that you should choose a policy $\pi$ that maximizes the expected utility $\mathbb{E}[U(o(w, \pi))]$, where $\mathbb{E}$ is the expectation with respect to $P$, and $w$ is the true world. In our case, $\mathbb{E}[U(o(w, \pi))]$ is just the probability of the good outcome $\mathrm{FAI}$, according to the (prior) distribution $P$. For the policy $\pi_H$ of choosing (H), that probability is 1/1000; for $\pi_T$, it is 999/1000. Thus, SUDT (like UDT) recommends choosing (T). If you set up the problem in SUDT like that, it's kind of hidden why you could possibly think that's not the right thing to do, since we aren't distinguishing situations $i$ that are "actually experienced" in a particular possible world $w$; there's nothing in the formalism that reflects the fact that Omega never asks us for our choice if the coin comes up tails. In my post on l-zombies, I've argued that this makes sense because even if there's no version of you that actually consciously experiences being in the heads world, this version still exists as a Turing machine and the choices that it makes influence what happens in the real world. If all mathematically possible experiences exist, so that there aren't any l-zombies, but some experiences are "experienced more" (have more "magical reality fluid") than others, the argument is even clearer—even if there's some anthropic sense in which, upon being told that the coin fell heads, you can conclude that you should assign a high probability of being in the heads world, the same version of you still exists in the tails world, and its choices influence what happens there. And if everything is experienced to the same degree (no magical reality fluid), the argument is clearer still.
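The SUDT calculation for the counterfactual mugging is simple enough to spell out as code; the following sketch (my own) just enumerates the two policies and compares their expected utilities:

```python
P = {"heads": 1 / 1000, "tails": 999 / 1000}   # prior over the two worlds
U = {"FAI": 1, "DOOM": 0}

def outcome(world, action_if_heads):
    """Outcome function o(w, pi); a policy is just the action in situation i_H."""
    if world == "heads":
        return "FAI" if action_if_heads == "H" else "DOOM"
    # tails: Omega acted on its prediction of your heads-choice
    return "FAI" if action_if_heads == "T" else "DOOM"

def expected_utility(policy):
    return sum(P[w] * U[outcome(w, policy)] for w in P)

best = max(("H", "T"), key=expected_utility)
print(best)  # T
```

As in the text, the expected utility is just the prior probability of FAI: 1/1000 for choosing (H), 999/1000 for choosing (T).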
* From Vladimir Nesov's counterfactual mugging, let's move on to what I'd like to call Psy-Kosh's probably counterfactual mugging, better known as Psy-Kosh's non-anthropic problem. This time, you're not alone: Omega gathers you together with 999,999 other advanced rationalists, all well-versed in anthropic reasoning and SUDT. It places each of you in a separate room. Then, as before, it throws a very biased coin, which has only a 1/1000 chance of landing heads. If the coin does land heads, then Omega asks all of you to choose between two options, (H) and (T). If the coin falls tails, on the other hand, Omega chooses one of you at random and asks that person to choose between (H) and (T). If the coin lands heads and you all choose (H), Omega will create a Friendly AI; same if the coin lands tails, and the person who's asked chooses (T); else, Omega will destroy the world.

|  | Coin falls heads (chance = 1/1000) | Coin falls tails (chance = 999/1000) |
|---|---|---|
| Everyone chooses (H) if asked | Positive intelligence explosion | Humanity wiped out |
| Everyone chooses (T) if asked | Humanity wiped out | Positive intelligence explosion |
| Different people choose differently | Humanity wiped out | (Depends on who is asked) |

We'll assume that all of you prefer a positive FOOM over a gloomy DOOM, which means that all of you have the same values as far as the outcomes of this little dilemma are concerned: $O = \{\mathrm{FAI}, \mathrm{DOOM}\}$, as before, and all of you have the same utility function, given by $U(\mathrm{FAI}) = 1$ and $U(\mathrm{DOOM}) = 0$. As long as that's the case, we can apply SUDT to find a sensible policy for everybody to follow (though when there is more than one optimal policy, and the different people involved can't talk to each other, it may not be clear how one of the policies should be chosen). This time, we have a million different people, who can in principle each make an independent decision about what to answer if Omega asks them the question. Thus, we have $I = \{1, \dots, 1000000\}$.
Each of these people can choose between (H) and (T), so $A(i) = \{(H), (T)\}$ for every person $i$, and a policy $\pi$ is a function that returns either (H) or (T) for every $i$. Obviously, we're particularly interested in the policies $\pi_H$ and $\pi_T$ satisfying $\pi_H(i) = (H)$ and $\pi_T(i) = (T)$ for all $i$. The possible worlds are $W = \{w_H, w_{T,1}, \dots, w_{T,1000000}\}$, where $w_H$ is the world in which the coin falls heads, and $w_{T,j}$ is the world in which the coin falls tails and person $j$ is chosen; their probabilities are $P(w_H) = 1/1000$ and $P(w_{T,j}) = (999/1000) \cdot (1/1000000)$. The outcome function is as follows: $o(w_H, \pi_H) = \mathrm{FAI}$; $o(w_H, \pi) = \mathrm{DOOM}$ for $\pi \neq \pi_H$; $o(w_{T,j}, \pi) = \mathrm{FAI}$ if $\pi(j) = (T)$, and $o(w_{T,j}, \pi) = \mathrm{DOOM}$ otherwise. What does SUDT recommend? As in the counterfactual mugging, $\mathbb{E}[U(o(w, \pi))]$ is the probability of the good outcome $\mathrm{FAI}$, under policy $\pi$. For $\pi = \pi_H$, the good outcome can only happen if the coin falls heads: in other words, with probability 1/1000. If $\pi \neq \pi_H$, then the good outcome cannot happen if the coin falls heads, because in that case everybody gets asked, and at least one person chooses (T). Thus, in this case, the good outcome will happen only if the coin comes up tails and the randomly chosen person answers (T); this probability is $(999/1000) \cdot (k/1000000)$, where $k$ is the number of people answering (T). Clearly, this is maximized for $\pi = \pi_T$, where $k = 1000000$; moreover, in this case we get the probability 999/1000, which is better than for $\pi_H$, so SUDT recommends the plan $\pi_T$. Again, when you set up the problem in SUDT, it's not even obvious why anyone might think this wasn't the correct answer. The reason is that if Omega asks you, and you update on the fact that you've been asked, then after updating, you are quite certain that the coin has landed heads: yes, your prior probability was only 1/1000, but if the coin has landed tails, the chances that you would be asked were only one in a million, so the posterior odds are about 1000:1 in favor of heads. So, you might reason, it would be best if everybody chose (H); and moreover, all the people in the other rooms will reason the same way as you, so if you choose (H), they will as well, and this maximizes the probability that humanity survives. This relies on the fact that the others will choose the same way as you, but since you're all good rationalists using the same decision theory, that's going to be the case.
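For concreteness, the probabilities in the SUDT computation above can be checked with a short sketch (my own):

```python
N = 1_000_000       # participants
P_HEADS = 1 / 1000  # prior probability the biased coin lands heads

def p_good(k):
    """P(FAI) when exactly k of the N people would answer (T) if asked."""
    heads_term = P_HEADS if k == 0 else 0.0   # heads: FAI iff everyone answers (H)
    tails_term = (1 - P_HEADS) * k / N        # tails: the chosen person must answer (T)
    return heads_term + tails_term

print(p_good(0))   # the all-(H) policy
print(p_good(N))   # the all-(T) policy
```

The all-(T) policy's 999/1000 beats the all-(H) policy's 1/1000, and any intermediate k does worse than k = N.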
But in the worlds where the coin comes up tails, and Omega chooses someone other than you, the version of you that gets asked for its decision still "exists"... as an l-zombie. You might think that what this version of you does or doesn't do doesn't influence what happens in the real world; but if we accept the earlier argument that your decisions are "linked" to those of the other people in the experiment, then they're still linked if the version of you making the decision is an l-zombie: If we see you as a Turing machine making a decision, that Turing machine should reason, "If the coin came up tails and someone else was chosen, then I'm an l-zombie, but the person who is actually chosen will reason exactly the same way I'm doing now, and will come to the same decision; hence, my decision influences what happens in the real world even in this case, and I can't do an update and just ignore those possible worlds." I call this the "probably counterfactual mugging" because in the counterfactual mugging, you are making your choice because of its benefits in a possible world that is ruled out by your observations, while in the probably counterfactual mugging, you're making it because of its benefits in a set of possible worlds that is made very improbable by your observations (because most of the worlds in this set are ruled out). As with the counterfactual mugging, this argument is just all the stronger if there are no l-zombies because all mathematically possible experiences are in fact experienced. * As a final example, let's look at what I'd like to call Eliezer's anthropic mugging: the anthropic problem that inspired Psy-Kosh's non-anthropic one. This time, you're alone again, except that there are many of you: Omega is creating a million copies of you. It flips its usual very biased coin, and if that coin falls heads, it places all of you in exactly identical green rooms.
If the coin falls tails, it places one of you in a green room, and all the others in red rooms. It then asks all copies in green rooms to choose between (H) and (T); if your choice agrees with the coin, FOOM, else DOOM.

|  | Coin falls heads (chance = 1/1000) | Coin falls tails (chance = 999/1000) |
|---|---|---|
| Green roomers choose (H) | Positive intelligence explosion | Humanity wiped out |
| Green roomers choose (T) | Humanity wiped out | Positive intelligence explosion |

Our possible worlds are back to being $\{\text{heads}, \text{tails}\}$, with probabilities $1/1000$ and $999/1000$. We are also back to being able to make a choice in only one particular situation, namely when you're a copy in a green room. Actions are $\{(H), (T)\}$, outcomes $\{\text{FOOM}, \text{DOOM}\}$, utilities $U(\text{FOOM}) = 1$ and $U(\text{DOOM}) = 0$, and the outcome function is given by $o(\text{heads}, (H)) = o(\text{tails}, (T)) = \text{FOOM}$ and $o(\text{heads}, (T)) = o(\text{tails}, (H)) = \text{DOOM}$. In other words, from SUDT's perspective, this is exactly identical to the situation with the counterfactual mugging, and thus the solution is the same: Once more, SUDT recommends choosing (T). On the other hand, the reason why someone might think that (H) could be the right answer is closer to that for Psy-Kosh's probably counterfactual mugging: After waking up in a green room, what should be your posterior probability that the coin has fallen heads? Updateful anthropic reasoning says that you should be quite sure that it has fallen heads. If you plug those probabilities into an expected utility calculation, it comes out as in Psy-Kosh's case, heavily favoring (H). But even if these are good probabilities to assign epistemically (to satisfy your curiosity about what the world probably looks like), in light of the arguments from the counterfactual and the probably counterfactual muggings (where updating definitely is the right thing to do epistemically, but plugging these probabilities into the expected utility calculation gives the wrong result), it doesn't seem strange to me to come to the conclusion that choosing (T) is correct in Eliezer's anthropic mugging as well.

## What should superrational players do in asymmetric games?
10 24 January 2014 07:42AM

Rereading Hofstadter's essays on superrationality prompted me to wonder what strategies superrational agents would want to commit to in asymmetric games. In symmetric games, everyone can agree on the outcome they'd like to jointly achieve, leaving the decision-theoretic question of whether the players can commit or not. In asymmetric games, life becomes murkier. There are typically many Pareto-efficient outcomes, and we enter the wilds of cooperative game theory and bargaining solutions trying to identify the right one. While, say, the Nash bargaining solution is appealing on many levels, I have a hard time connecting the logic of superrationality to any particular solution. Recently though, I found some insight in "Cooperation in Strategic Games Revisited" by Adam Kalai and Ehud Kalai (working paper version and three-page summary version) for the special case of two-player games with side transfers. Just to make sure everyone's on common ground, the prototypical game examined in the argument for superrationality is the prisoners' dilemma:

| Alice \ Bob | Cooperate | Defect |
|---|---|---|
| Cooperate | 10 / 10 | 0 / 12 |
| Defect | 12 / 0 | 4 / 4 |

The unique dominant-strategy equilibrium is (Defect, Defect). However, Hofstadter argues that "superrational" players would recognize the symmetry in reasoning processes between each other and thus conclude that cooperating is in their interest. The argument is not in favor of unconditional cooperation. Instead, the reasoning is closer to "I cooperate if and only if I expect you to cooperate if and only if I cooperate". Many bits have been devoted to formalizing this reasoning in timeless decision theory and other variants. The symmetry in the prisoners' dilemma makes it easy to pick out (Cooperate, Cooperate) as the action profile each player ideally wants to see happen.
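The "linked reasoning" intuition can be spelled out in a few lines of Python (a sketch with my own encoding of the payoffs): if both players are guaranteed to make the same choice, the comparison reduces to the two diagonal cells.

```python
# Payoff table for the prisoners' dilemma above: (Alice, Bob) payoffs.
pd = {("C", "C"): (10, 10), ("C", "D"): (0, 12),
      ("D", "C"): (12, 0), ("D", "D"): (4, 4)}

# Superrational/linked play: both make the same move, so compare diagonals.
best_linked = max(["C", "D"], key=lambda m: pd[(m, m)][0])
print(best_linked)   # 'C' -- 10 beats 4 when moves are guaranteed to match

# Without the linking assumption, Defect strictly dominates for Alice:
assert pd[("D", "C")][0] > pd[("C", "C")][0]   # 12 > 10
assert pd[("D", "D")][0] > pd[("C", "D")][0]   # 4 > 0
```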
Consider instead the following skewed prisoners' dilemma:

| Alice \ Bob | Cooperate | Defect |
|---|---|---|
| Cooperate | 2 / 18 | 0 / 12 |
| Defect | 12 / 0 | 4 / 4 |

The (Cooperate, Cooperate) outcome still has the highest total benefit, but (Defect, Defect) is also Pareto-efficient. With this asymmetry, it seems reasonable for Alice to Defect, even as someone who would cooperate in the original prisoners' dilemma. Suppose, however, that players can also agree to transfer utility between themselves on a 1-to-1 basis (as if they value cash equally and can make side-payments). Then (Cooperate, Cooperate) with a transfer between 2 and 14 from Bob to Alice dominates (Defect, Defect). The size of the transfer is still up in the air, although a transfer of 8 (leaving both with a payoff of 10) is appealing since it takes us back to the original symmetric game. I feel confident suggesting this as an outcome the players should commit to if possible. While the former game could be symmetrized in a nice way, what about more general games where payoffs could look even more askew or strategy sets could be completely different? Let A be the payoff matrix for Alice and B be the payoff matrix for Bob in any given game. Kalai and Kalai point out that the game (A, B) can be decomposed into the sum of two games: $(A,B)=\left(\frac{A+B}{2},\frac{A+B}{2}\right)+\left(\frac{A-B}{2},\frac{B-A}{2}\right),$ where payoffs are identical in the first game (the team game) and zero-sum in the second (the advantage game). Consider playing these games separately. In the team game, Alice and Bob both agree on the action profile that maximizes their payoff with no controversy. In the advantage game, preferences are exactly opposed, so each can play their maximin strategy, again with no controversy. Of course, the rub is that the team game strategy profile could be very different from the advantage game strategy profile. Suppose Alice and Bob could commit to playing each game separately.
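As a quick sanity check, here is the decomposition applied to the skewed prisoners' dilemma above (a plain-Python sketch; the variable names are mine):

```python
# Rows/columns: index 0 = Cooperate, 1 = Defect.
A = [[2, 0], [12, 4]]    # Alice's payoffs in the skewed dilemma
B = [[18, 12], [0, 4]]   # Bob's payoffs

team = [[(A[i][j] + B[i][j]) / 2 for j in range(2)] for i in range(2)]
adv  = [[(A[i][j] - B[i][j]) / 2 for j in range(2)] for i in range(2)]

# The two component games sum back to the original game:
assert all(team[i][j] + adv[i][j] == A[i][j] for i in range(2) for j in range(2))
assert all(team[i][j] - adv[i][j] == B[i][j] for i in range(2) for j in range(2))

print(team)  # [[10.0, 6.0], [6.0, 4.0]]  -- identical interests: both want (C, C)
print(adv)   # [[-8.0, -6.0], [6.0, 0.0]] -- Alice's side of the zero-sum game
```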
Kalai and Kalai define the payoffs each gets between the two games as $\textrm{coco-value}(A,B)=\textrm{maxmax}\left(\frac{A+B}{2},\frac{A+B}{2}\right)\;+\;\textrm{maximin}\left(\frac{A-B}{2},\frac{B-A}{2}\right)$ where coco stands for cooperative/competitive. We don't actually have two games to be played separately, so the way to achieve these payoffs is for Alice and Bob to actually play the team game actions and hypothetically play the advantage game. Transfers then even out the gains from the team game results and add in the hypothetical advantage game results. Even though the original game might be asymmetric, this simple decomposition allows players to cooperate exactly where interests are aligned and compete exactly where interests are opposed. For example, consider two hot dog vendors. There are 40 potential customers at the airport and 100 at the beach. If both choose the same location, they split the customers there evenly. Otherwise, the vendor at each location sells to everyone at that place. Alice turns a profit of $2 per customer, while Bob turns a profit of $1 per customer. Overall this yields the payoffs:

| Alice \ Bob | Airport | Beach |
|---|---|---|
| Airport | 40 / 20 | 80 / 100 |
| Beach | 200 / 40 | 100 / 50 |

The game decomposes into the team game:

| Alice \ Bob | Airport | Beach |
|---|---|---|
| Airport | 30 / 30 | 90 / 90 |
| Beach | 120 / 120 | 75 / 75 |

and the advantage game:

| Alice \ Bob | Airport | Beach |
|---|---|---|
| Airport | 10 / -10 | -10 / 10 |
| Beach | 80 / -80 | 25 / -25 |

The maximizing strategy profile for the team game is (Beach, Airport) with payoffs (120, 120). The maximin strategy profile for the advantage game is (Beach, Beach) with payoffs (25, -25). In total, this game has a coco-value of (145, 95), which would be realized by Alice selling at the beach, Bob selling at the airport, and Alice transferring 55 to Bob. Alice generates most of the profits in this situation, but Bob has to be compensated for his credible threat to start selling at the beach too.
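Here's a sketch of that computation for the hot dog game (plain Python, my own variable names; pure-strategy maximin happens to suffice because this particular advantage game has a saddle point):

```python
# Rows are Alice's actions, columns Bob's: index 0 = Airport, 1 = Beach.
A = [[40, 80], [200, 100]]   # Alice's payoffs
B = [[20, 100], [40, 50]]    # Bob's payoffs

team = [[(A[i][j] + B[i][j]) / 2 for j in range(2)] for i in range(2)]
adv  = [[(A[i][j] - B[i][j]) / 2 for j in range(2)] for i in range(2)]

maxmax = max(max(row) for row in team)   # 120.0, at (Beach, Airport)
maximin = max(min(row) for row in adv)   # 25.0, at (Beach, Beach)

coco = (maxmax + maximin, maxmax - maximin)
print(coco)   # (145.0, 95.0)

# Realizing the value: actually play (Beach, Airport) for payoffs (200, 40),
# then Alice transfers 200 - 145 = 55 to Bob.
assert A[1][0] - coco[0] == 55.0
```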
The bulk of the Kalai and Kalai article is extending the coco-value to incomplete information settings. For instance, each vendor might have some private information about the weather tomorrow, which will affect the number of customers at the airport and the beach. The Kalais prove that being able to publicly observe the payoffs for the chosen actions is sufficient for agents to commit themselves to the coco-value ex-ante (before receiving any private information) and that being able to publicly observe all hypothetical payoffs from alternative action profiles is sufficient for commitment even after agents have private information. The Kalais provide an axiomatization of the coco-value, showing it is the payoff pair that uniquely satisfies all of the following:

1. Pareto optimality: The sum of the values is maximal.
2. Shift invariance: Increasing a player's payoff by a constant amount in each cell increases their value by the same amount.
3. Payoff dominance: If one player always gets more than the other in each cell, that player can't get a smaller value for the game.
4. Invariance to redundant strategies: Adding a new action that is a convex combination of the payoffs of two other actions can't change the value.
5. Monotonicity in actions: Removing an action from a player can't increase their value for the game.
6. Monotonicity in information: Giving a player strictly less information can't increase their value for the game.

The coco-value is also easily computable, unlike Nash equilibria in general. I'm hard-pressed to think of any more I could want from it (aside from easy extensions to bigger classes of games). Given its simplicity, I'm surprised it wasn't hit upon earlier.

## Game theory and expected opponents

1 14 November 2013 11:26PM

Thanks to V_V and Emile for some great discussion. Since writing up a post seems to reliably spark interesting comments, that's what I'll do!
Summary If I wanted to write down a decision theory that gets the correct answer to game-theoretic problems (like playing the middle Nash equilibrium in a blind chicken-like game), it would have to, in a sense, implement all of game theory. This is hard because human-generated solutions to games use a lot of assumptions about what the other players will do, and putting those assumptions into our algorithm is a confusing problem. In order to tell what's really going on, we need to make that information more explicit. Once we do that, maybe we can get a UDT-like algorithm to make good moves in tricky games. Newcomb's Problem For an example of a game with unusually good information about our opponent, how about Newcomb's problem. Is it really a game, you ask? Sure, I say! In the payoff matrix to the right, you play red and Omega plays blue. The numbers for Omega just indicate that Omega only wants to put in the million dollars if you will 1-box. If this was a normal game-theory situation, you wouldn't easily know what to do - your best move depends on Omega's move. This is where typical game theory procedure would be to say "well, that's silly, let's specify some extra nice properties the choice of both players should have so that we get a unique solution." But the route taken in Newcomb's problem is different - we pick out a unique solution by increasing how much information the players have about each other. Omega knows what you will play, and you know that Omega knows what you will play. Now all we need to figure out what to do is some information like "If Omega has an available strategy that will definitely get it the highest possible payoff, it will take it." The best strategy, of course, is to one-box so that Omega puts in the million dollars. Newcomb's Game vs. an Ignorant opponent Consider another possible opponent in this game - one who has no information about what your move will be. 
Whereas Omega always knows your move, an ignorant opponent has no idea what you will play - they have no basis to think you're more likely to 1-box than 2-box, or vice versa. Interestingly, for this particular payoff matrix this makes you ignorant too - you have no basis to think the ignorant opponent would rather put the money in than not, or vice versa. So you assign a 50% chance to each (probability being quantified ignorance) and find that two-boxing has the highest rewards. This didn't even require the sophistication of taking into account your own action, like the game against Omega did, since an ignorant opponent can't respond to your action. Human opponents Ok, so we've looked at a super-knowledgeable opponent, and a super-ignorant opponent, what does a more typical game theory situation look like? Well, it's when our opponent is more like us - someone trying to pick the strategy that gets them the best reward, with similar information to what we have. In typical games between humans, both know that the other is a human player - and they know that it's known, etc. In terms of what we know, we know that our opponent is drawn from some distribution of opponents that are about as good as we are at games, and that they have the same information about us that we have about them. What information do we mean we have when we say our opponent is "good at games"? I don't know. I can lay out some possibilities, but this is the crux of the post. I'll frame our possible knowledge in terms of past games, like how one could say of Newcomb's problem "you observe a thousand games, and Omega always predicts right." Possibility 1: We know our opponent has played a lot of games against completely unknown opponents in the past, and has a good record, where "good" means "as good or better than the average opponent." Possibility 2: We know our opponent played some games against a closed group of players who played each other, and that group collectively had a good record. 
Possibility 3: We know our opponent is a neural net that's been trained in some standard way to be good at playing a variety of games, or some sort of hacked-together implementation of game theory, or a UDT agent if that's a good idea. (Seems more complicated than necessary, but on the other hand opponents are totally allowed to be complicated) Suppose we know information set #2. I think it's the most straightforward. All we have to do to turn this information into a distribution over opponents is to figure out what mixtures of players get above-average group results, then average those together. Once we know who our opponent is on average, we just follow the strategy that gets the best average payoff. Does the strategy picked this way look like what game theory would say? Not quite - it assumes that the opponent has a medium chance of being stupid. And in some games, like the prisoner's dilemma, the best-payoff groups are actually the ones you can exploit the most. So on closer examination, someone in a successful group isn't the game-theory opponent we're looking for.

## Kidnapping and the game of Chicken

15 03 November 2013 06:29AM

Observe the payoff matrix at right (the unit of reward? Cookies.). Each player wants to play 'A', but only so long as the two players play different moves. Suppose that Red got to move first. There are some games where moving first is terrible - take Rock Paper Scissors for example. But in this game, moving first is great, because you get to narrow down your opponent's options! If Red goes first, Red picks 'A', and then Blue has to pick 'B' to get a cookie. This is basically kidnapping. Red has taken all three cookies hostage, and nobody gets any cookies unless Blue agrees to Red's demands for two cookies. Whoever gets to move first plays the kidnapper, and the other player has to decide whether to accede to their ransom demand in exchange for a cookie.
What if neither player gets to move before the other, but instead they have their moves revealed at the same time? Pre-Move Chat:

Red: "I'm going to pick A, you'd better pick B."
Blue: "I don't care what you pick, I'm picking A. You can pick A too if you really want to get 0 cookies."
Red: "Okay I'm really seriously going to pick A. Please pick B."
Blue: "Nah, don't think so. I'll just pick A. You should just pick B."

And so on. They are now playing a game of Chicken. Whoever swerves first is worse off, but if neither of them give in, they crash into each other and die and get no cookies. So, The Question: is it better to play A, or to play B? This is definitely a trick question, but it can't be too trickish because at some point Red and Blue will have to figure out what to do. So why is it a trick question? Because this is a two-player game, and whether it's good to play A or not depends on what your opponent will do. A thought experiment: suppose we threw a party where you could only get dessert (cookies!) by playing this game. At the start, people are unfamiliar with the game, but they recognize that A has higher payoffs than B, so they all pick A all the time. But alas! When both people pick A, neither get anything, so no cookies are distributed. We decide that everyone can play as much as they want until we run out of cookies. Quite soon, one kind soul decides that they will play B, even though it has a lower payoff. A new round of games is begun, and each person gets a turn to play against our kind altruist. Soon, each other person has won their game, and they have 2 cookies each, while our selfless altruist has just one cookie per match they played. So, er, 11 or so cookies? Many of the other party-goers are enlightened by this example. They, too, want to be selfless and altruistic so that they can acquire 11 cookies / win at kidnapping.
But a funny thing happens as each additional person plays B - the people playing A win two more cookies per round (one round is everyone getting to play everyone else once), and the people playing B win one fewer cookie, since nobody gets cookies when both play B either. Eventually, there are eight people playing A and four people playing B, and all of them nom 8 cookies per round. It's inevitable that the people playing B eventually get the same number of cookies as the people playing A - if there was a cookie imbalance, then people would switch to the better strategy until cookies were balanced again. Playing A has a higher payoff, but all that really means is that there are eight people playing A and only 4 playing B. It's like B has an ecological niche, and that niche is only of a certain size. What does the party case say about what Red and Blue should do when playing a one-shot game? The ratios of players turn into probabilities: if you're less than 67% sure the other person will play A, you should play A. If you're more than 67% sure, you should play B. This plan only works for situations similar to drawing an opponent out of a pool of deterministic players, though. Stage two of the problem: what if we allow players access to each others' source code? While you can still have A-players and B-players, you can now have a third strategy, which is to play B against A-players and play A against B-players. This strategy will have a niche size in between playing A and playing B. What's really great about reading source code, though, is that running into a copy of yourself no longer means duplicate moves and no cookies. The best "A-players" and "B-players" now choose moves against their copies by flipping coins, so that half the time they get at least one cookie. Flipping a coin against a copy of yourself averages 3/4 of a cookie, which is almost good enough to put B-players out of business. 
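A quick numeric sketch of the party dynamics (Python; the payoffs are as I read them from the story: 2 cookies for A against B, 1 for B against A, 0 when moves match):

```python
# With 8 A-players and 4 B-players, everyone nets the same cookies per round:
n_A, n_B = 8, 4
cookies_per_A = 2 * n_B   # an A-player scores only against the B-players
cookies_per_B = 1 * n_A   # a B-player scores only against the A-players
print(cookies_per_A, cookies_per_B)   # 8 8 -- the niche sizes have equalized

# One-shot advice: if p is the chance the opponent plays A, then
# E[play A] = 2(1 - p) and E[play B] = p, which cross at p = 2/3.
p_star = 2 / 3
assert abs(2 * (1 - p_star) - p_star) < 1e-12
```

The crossover at p = 2/3 is the 67% threshold from the one-shot advice above.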
In fact, if we'd chosen our payoff matrix to have a bigger reward for playing A, we actually could put B-players out of business. Fun question: is it possible to decrease the total number of cookies won by increasing the reward for playing A? An interesting issue is how this modification changes the advice for the one-shot game. Our advice against simpler opponents was basically the "predictor" strategy, but that strategy is now in equilibrium with the other two! Good advice now is more like a meta-strategy. If the opponent is likely to be an A-player or a B-player, be a predictor, if the opponent is likely to be a predictor, be an A-player. Now that we've been this cycle before, it should be clearer that this "advice" is really a new strategy that will be introduced when we take the game one meta-level up. The effect on the game is really to introduce gradations of players, where some play A more often and some play B more often, but the populations can be balanced such that each player gets the same average reward. An interesting facet of a competition between predictors is what we might call "stupidity envy" (See ASP). If we use the straightforward algorithm for our predictors (simulate what the opponent will do, then choose the best strategy), then a dumb predictor cannot predict the move of a smarter predictor. This is because the smarter predictor is predicting the dumb one, and you can't predict yourself in less time than you take to run. So the dumber predictor has to use some kind of default move. If its default move is A, then the smarter predictor has no good choice but to take B, and the dumber predictor wins. It's like the dumber predictor has gotten to move first. Being dumb / moving first isn't always good - imagine having to move first in rock paper scissors - but in games where moving first is better, and even a dumb predictor can see why, it's better to be the dumber predictor. 
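The asymmetry can be seen in a toy simulation (a sketch; the agents and payoffs are my own simplification): the dumb predictor can't simulate its smarter opponent, so it falls back on a default move, and the smart predictor is forced to best-respond.

```python
def payoff(me, them):
    # 2 cookies for A against B, 1 for B against A, 0 for matching moves.
    if me == them:
        return 0
    return 2 if me == "A" else 1

def dumb(opponent):
    return "A"   # can't run the smarter agent, so just default to A

def smart(opponent):
    their_move = opponent(smart)   # simulate the opponent playing against me...
    # ...then best-respond to the predicted move.
    return max(["A", "B"], key=lambda m: payoff(m, their_move))

d, s = dumb(smart), smart(dumb)
print(d, s, payoff(d, s), payoff(s, d))   # A B 2 1 -- the dumb agent wins
# (smart(smart) would recurse forever, mirroring the point that you can't
# predict yourself in less time than you take to run.)
```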
On our other axis of smartness, though, the "meta-level," more meta usually produces better head-to-head results - yet the humble A-player gets the best results of all. It's only the fact that A-players do poorly against other A-players that allows a diverse ecology on the B-playing side of the spectrum.

## Of all the SIA-doomsdays in the all the worlds...

4 18 October 2013 12:56PM

Ideas developed with Paul Almond, who kept on flogging a dead horse until it started showing signs of life again.

## Doomsday, SSA and SIA

Imagine there's a giant box filled with people, and clearly labelled (inside and out) "(year of some people's lord) 2013". There's another giant box somewhere else in space-time, labelled "2014". You happen to be currently in the 2013 box. Then the self-sampling assumption (SSA) produces the doomsday argument. It works approximately like this: SSA has a preference for universes with smaller numbers of observers (since it's more likely that you're one-in-a-hundred than one-in-a-billion). Therefore we expect that the number of observers in 2014 is smaller than we would otherwise "objectively" believe: the likelihood of doomsday is higher than we thought. What about the self-indication assumption (SIA) - that makes the doomsday argument go away, right? Not at all! SIA has no effect on the number of observers expected in 2014, but increases the expected number of observers in 2013. Thus we still expect the number of observers in 2014 to be lower than we otherwise thought. There's an SIA doomsday too!

## Enter causality

What's going on? SIA was supposed to defeat the doomsday argument! What happens is that I've implicitly cheated - by naming the boxes "2013" and "2014", I've heavily implied that these "boxes" figuratively correspond to two subsequent years. But then I've treated them as independent for SIA, like two literal distinct boxes.
## What makes us think _any_ of our terminal values aren't based on a misunderstanding of reality?

17 25 September 2013 11:09PM

Let's say Bob's terminal value is to travel back in time and ride a dinosaur. It is instrumentally rational for Bob to study physics so he can learn how to build a time machine. As he learns more physics, Bob realizes that his terminal value is not only utterly impossible but meaningless. By definition, someone in Bob's past riding a dinosaur is not a future evolution of the present Bob. There are a number of ways to create the subjective experience of having gone into the past and ridden a dinosaur. But to Bob, it's not the same because he wanted both the subjective experience and the knowledge that it corresponded to objective fact. Without the latter, he might as well have just watched a movie or played a video game. So if we took the original, innocent-of-physics Bob and somehow calculated his coherent extrapolated volition, we would end up with a Bob who has given up on time travel. The original Bob would not want to be this Bob. But, how do we know that _anything_ we value won't similarly dissolve under sufficiently thorough deconstruction? Let's suppose for a minute that all "human values" are dangling units; that everything we want is as possible and makes as much sense as wanting to hear the sound of blue or taste the flavor of a prime number. What is the rational course of action in such a situation? PS: If your response resembles "keep attempting to XXX anyway", please explain what privileges XXX over any number of other alternatives other than your current preference. Are you using some kind of pre-commitment strategy to a subset of your current goals? Do you now wish you had used the same strategy to precommit to goals you had when you were a toddler?
## The Interrupted Ultimate Newcomb's Problem

3 10 September 2013 11:04PM

While figuring out my error in my solution to the Ultimate Newcomb's Problem, I ran across this (distinct) reformulation that helped me distinguish between what I was doing and what the problem was actually asking. ... but that being said, I'm not sure if my answer to the reformulation is correct either. The question, cleaned for Discussion, looks like this: You approach the boxes and lottery, which are exactly as in the UNP. Before reaching it, you come to a sign with a flashing red light. The sign reads: "INDEPENDENT SCENARIO BEGIN." Omega, who has predicted that you will be confused, shows up to explain: "This is considered an artificially independent experiment. Your algorithm for solving this problem will not be used in my simulations of your algorithm for my various other problems. In other words, you are allowed to two-box here but one-box Newcomb's problem, or vice versa." This is motivated by the realization that I've been making the same mistake as in the original Newcomb's Problem, though this justification does not (I believe) apply to the original. The mistake is simply this: that I assumed that I simply appear in medias res. When solving the UNP, it is (seems to be) important to remember that you may be in some very rare edge case of the main problem, and that you are choosing your algorithm for the problem as a whole. But if that's not true - if you're allowed to appear in the middle of the problem, and no counterfactual-yous are at risk - it sure seems like two-boxing is justified - as khafra put it, "trying to ambiently control basic arithmetic". (Speaking of which, is there a write up of ambient decision theory anywhere? For that matter, is there any compilation of decision theories?) EDIT: (Yes to the first, though not under that name: Controlling Constant Programs.)

## Duller blackmail definitions

7 15 July 2013 10:08AM

For a more parable-ic version of this, see here.
Suppose I make a precommitment P to take action X unless you take action Y. Action X is not in my interest: I wouldn't do it if I knew you'd never take action Y. You would want me to not precommit to P. Is this blackmail? Suppose we've been having a steamy affair together, and I have the letters to prove it. It would be bad for both of us if they were published. Then X={Publish the letters} and Y={You pay me money} is textbook blackmail. But suppose I own a MacGuffin that you want (I value it at £9). If X={Reject any offer} and Y={You offer more than £10}, is this still blackmail? Formally, it looks the same. What about if I bought the MacGuffin for £500 and you value it at £1000? This makes no difference to the formal structure of the scenario. Then my behaviour feels utterly reasonable, rather than vicious and blackmail-ly. What is the meaningful difference between the two scenarios? I can't really formalise it.

## Countess and Baron attempt to define blackmail, fail

11 15 July 2013 10:07AM

For a more concise version of this argument, see here. We meet our heroes, the Countess of Rectitude and Baron Chastity, as they continue to investigate the mysteries of blackmail by sleeping together and betraying each other. The Baron had a pile of steamy letters between him and the Countess: it would be embarrassing to both of them if these letters got out. Yet the Baron confided the letters to a trusted Acolyte, with strict instructions. The Acolyte was to publish these letters, unless the Countess agreed to give the Baron her priceless Ping Vase. This seems a perfect example of blackmail:

• The Baron is taking a course of action that is intrinsically negative for him. This behaviour only makes sense if it forces the Countess to take a specific action which benefits him. The Countess would very much like it if the Baron couldn't do such things.

As it turns out, a servant broke the Ping Vase while chasing the Countess's griffon.
The servant was swiftly executed, but the Acolyte had to publish the letters as instructed, to great embarrassment all around (sometimes precommitments aren't what they're cracked up to be). After six days of exile in the Countess's doghouse (a luxurious, twenty-room affair) and eleven days of make-up sex, the Baron was back to planning against his lover.

## My Take on a Decision Theory

2 09 July 2013 10:46AM

Finding a good decision theory is hard. Previous attempts, such as Timeless Decision Theory, work, it seems, in providing a stable, effective decision theory, but are mathematically complicated. Simpler theories, like CDT or EDT, are much more intuitive, but have deep flaws. They fail at certain problems, and thus violate the maxim that rational agents should win. This makes them imperfect. But it seems to me that there is a relatively simple fix one could make to them, in the style of TDT, to extend their power considerably. Here I will show an implementation of such an extension of CDT, that wins on the problems that classic CDT fails on. It quite possibly could turn out that this is not as powerful as TDT, but it is a significant step in that direction, starting only from the naivest of decision theories. It also could turn out that this is nothing more than a reformulation of TDT or a lesser version thereof. In that case, this still has some value as a simpler formulation, easier to understand. Because as it stands, TDT seems like a far cry from a trivial extension of the basic, intuitive decision theories, as this hopes to be. We will start by remarking that when CDT (or EDT) tries to figure out the expected value of an action or outcome, the naive way in which it does so drops crucial information, which is what TDT manages to preserve. As such, I will try to calculate a CDT with this information not dropped. This information is, for CDT, the fact that Omega has simulated you and figured out what you are going to do.
Why does a CDT agent automatically assume that it is the "real" one, so to speak? This trivial tweak seems powerful. I will, for the purpose of this post, call this tweaked version of CDT "Simulationist Causal Decision Theory", or SCDT for short.

Let's run this tweaked version through Newcomb's problem. Let Alice be a SCDT agent. Before the problem begins, as is standard in Newcomb's problem, Omega looks at Alice and calculates what choice Alice will make in the game. Without too much loss of generality, we can assume that Omega directly simulates Alice, and runs the simulation through a simulation of the game, in order to make the determination of what choice Alice will make. In other formulations of Newcomb's problem, Omega figures out in some other way what Alice will do, say by doing a formal analysis of her source code, but that seems intuitively equivalent. This is a possible flaw, but if the different versions of Newcomb's problem are equivalent (as they seem to be) this point evaporates, and so we will put it aside for now, and continue.

We will call the simulated agent SimAlice. SimAlice does not know, of course, that she is being simulated, and is an exact copy of Alice in all respects. In particular, she also uses the same SCDT thought processes as Alice, and she has the same utility function as Alice.

So, Alice (or SimAlice, she doesn't know which one she is) is presented with the game. She reasons thusly: There are two possible cases: either I am Alice or I am SimAlice.

• If I am Alice: Choosing both boxes will always get me exactly $1000 more than choosing just one. Regardless of whether or not there is $1,000,000 in box 2, by choosing box 1 as well, I am getting an extra $1000. (Note that this is exactly the same reasoning standard CDT uses!)
• If I am SimAlice: Then "I" don't actually get any money in this game, regardless of what I choose. But my goal is not SimAlice getting money, it is Alice getting money, by the simple fact that this is what Alice wants, and we assumed above that SimAlice uses the same utility function as Alice. And depending on what I choose now, that will affect the way Omega sets up the boxes, and so affects the amount of money Alice will get. Specifically, if I one-box, Omega will put an extra $1,000,000 in box 2, and so Alice will get an extra $1,000,000, no matter what she chooses. (Because in both the choices Alice could make (taking either box 2 or boxes 1&2), she takes box 2, and so will wind up with a bonus $1,000,000 above what she would get if box 2 was empty, which is what would happen if SimAlice two-boxed.)

So, as I don't know whether I am Alice or SimAlice, and as there is one of each, there is a 0.5 probability of me being either one. So by the law of total expectation,

E[money|I one-box] = 0.5 * E[money|(I one-box)&(I am Alice)] + 0.5 * E[money|(I one-box)&(I am SimAlice)]

So my expected return of one-boxing (above what I would get by two-boxing) is 0.5 * -$1000 + 0.5 * $1,000,000 = $499,500, which is positive, so I should one-box.

As you can see, just by acknowledging the rules of the game, by admitting that Omega has the power to simulate her (as the rules of Newcomb's problem insist), she will one box. This is unlike a CDT agent, which would ignore Omega's power to simulate her (or otherwise figure out what she will do), and say "Hey, what's in the boxes is fixed, and my choice does not affect it". That is only valid reasoning if you know you are the "original" agent, and Alice herself uses that reasoning, but only in the case where she is assuming she is the "original". She takes care, unlike a CDT agent, to multiply the conditional expected value by the chance of the condition occurring.
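The expected-value bookkeeping above can be sketched in a few lines of Python (the probabilities and payoffs are the post's; the function name and the "relative to two-boxing" framing are mine):

```python
# A sketch of the SCDT computation for Newcomb's problem.

def scdt_value(one_box: bool) -> float:
    """Expected payoff of a choice relative to two-boxing, averaged over
    'I am Alice' and 'I am SimAlice' (one of each, so 0.5 credence in both)."""
    # If I am the real Alice: one-boxing forgoes the guaranteed extra $1000.
    value_if_real = -1000 if one_box else 0
    # If I am SimAlice: my choice fixes box 2, adding $1,000,000 for Alice iff I one-box.
    value_if_sim = 1_000_000 if one_box else 0
    return 0.5 * value_if_real + 0.5 * value_if_sim

advantage = scdt_value(True) - scdt_value(False)
print(advantage)  # 499500.0 — one-boxing wins
```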

This is not only limited to Newcomb's problem. Let's take a look at Parfit's Hitchhiker, another scenario CDT has trouble with. There are again two identical agents making decisions: the "real" Alice, as soon as she gets home; and the Alice-after-she-gets-home as simulated by the driver offering her a ride, which I will again call SimAlice for short.

Conditional on an agent being Alice and not SimAlice, paying the driver loses that agent her $100 and gains her nothing compared to refusing to pay. Conditional on an agent being SimAlice and not Alice, agreeing to pay the driver loses her nothing (as she, being a simulation, cannot give the driver real money), and gains her a trip out of the desert, and so her life. So, again, the law of total expectation gives us that the expected value of paying the driver (given that you don't know which one you are) is 0.5 * -$100 + 0.5 * (Value of Alice's life). This gives us that Alice should pay if and only if she values her life at more than $100, which is, once again, the correct answer.

So, to sum up, we found that SCDT can not only solve Newcomb's problem, which standard CDT cannot, but also Parfit's Hitchhiker, which neither CDT nor EDT can do. It does so at almost no cost in complexity compared to CDT, unlike, say, TDT, which is rather more complex. In fact, it is entirely possible that SCDT is nothing more than a special case of something similar to TDT. But even if it is, it is a very nice, simple, and relatively easy to understand special case, and so may deserve a look for that alone.

There are still open problems for SCDT. If, rather than being simulated, you are analysed in a more direct way, should that change anything? What if, in Newcomb's problem, Omega runs many simulations of you in parallel? Should that change the weights you place on the expected values? This ties in deeply with the philosophical problem of how you assign measure to identical, independent agents. I cannot give a simple answer, and a simple answer to those questions is needed before SCDT is complete. But if we can figure out the answers to these questions, or otherwise bypass them, we have a trivial extrapolation of CDT, the naivest decision theory, which correctly solves most or all of the problems that trip up CDT. That seems quite worthwhile.
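The hitchhiker calculation works the same way; a minimal sketch, with the post's 50/50 split between Alice and SimAlice and the $100 price (the function name is mine):

```python
def scdt_pay_value(value_of_life: float, price: float = 100.0) -> float:
    """Expected value of paying minus refusing, not knowing whether you are
    the real Alice (already home) or the driver's simulation of her."""
    # Real Alice: paying just loses the price of the ride.
    # SimAlice: "paying" costs nothing real, but makes the driver rescue Alice.
    return 0.5 * (-price) + 0.5 * value_of_life

print(scdt_pay_value(1_000_000) > 0)  # True: pay, since life is worth more than $100
print(scdt_pay_value(50.0) > 0)       # False: a $50 life wouldn't justify the fare
```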
## Another perspective on resolving the Prisoner's dilemma

11 04 June 2013 04:13PM

Sometimes I see new ideas that, without offering any new information, offer a new perspective on old information, and a new way of thinking about an old problem. So it is with this lecture and the prisoner's dilemma. Now, I've worked a lot with the prisoner's dilemma: with superrationality, negotiations, fairness, retaliation, Rawlsian veils of ignorance, etc. I've studied the problem, and its possible resolutions, extensively. But the perspective of that lecture was refreshing and new to me:

The prisoner's dilemma is resolved only when the off-diagonal outcomes of the dilemma are known to be impossible.

The "off-diagonal outcomes" are the "(Defect, Cooperate)" and "(Cooperate, Defect)" squares, where one person walks away with all the benefit and the other has none:

| (Baron, Countess) | Cooperate | Defect |
|---|---|---|
| Cooperate | (3,3) | (0,5) |
| Defect | (5,0) | (1,1) |

Facing an identical (or near identical) copy of yourself? Then the off-diagonal outcomes are impossible, because you're going to choose the same thing. Facing Tit-for-tat in an iterated prisoner's dilemma? Well, the off-diagonal squares cannot be reached consistently. Is the other prisoner a Mafia don? Then the off-diagonal outcomes don't exist as written: there's a hidden negative term (you being horribly murdered) that isn't taken into account in that matrix. Various agents with open code are essentially publicly declaring the conditions under which they will not reach for the off-diagonal. The point of many contracts and agreements is to make the off-diagonal outcomes impossible or expensive.

As I said, nothing fundamentally new, but I find the perspective interesting. To my mind, it suggests that when resolving the prisoner's dilemma with probabilistic outcomes allowed, I should be thinking "blocking off possible outcomes", rather than "reaching agreement".
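The claim that removing the off-diagonal squares flips the rational choice can be checked directly with the payoffs above (the helper below evaluates each move by its worst reachable payoff, which agrees with the dominance argument when the full matrix is open; the code structure is mine):

```python
# Payoffs from the table, indexed by (my_move, their_move); 'C' = cooperate, 'D' = defect.
payoff = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def best_move(reachable):
    """Pick my move, judging each one by its worst payoff among reachable outcomes.

    With all four squares reachable this agrees with dominance reasoning
    (defecting beats cooperating against either response); with only the
    diagonal reachable, each move has a single possible outcome.
    """
    def worst_case(move):
        return min(payoff[o] for o in reachable if o[0] == move)
    return max('CD', key=worst_case)

full = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]
diagonal = [('C', 'C'), ('D', 'D')]  # facing a copy: off-diagonal is impossible

print(best_move(full))      # D
print(best_move(diagonal))  # C
```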
## The VNM independence axiom ignores the value of information

10 02 March 2013 02:36PM

Followup to: Is risk aversion really irrational?

After reading the decision theory FAQ and re-reading The Allais Paradox, I realized I still don't accept the VNM axioms, especially the independence one, and I started thinking about what my true rejection could be. And then I realized I had already somewhat explained it in my Is risk aversion really irrational? article, but that article didn't make it obvious how this relates to VNM - it wasn't obvious to me at the time.

Here is the core idea: information has value. Uncertainty therefore has a cost. And that cost is not linear in uncertainty.

Let's take a first example: A is being offered a trip to Ecuador, B is being offered a great new laptop and C is being offered a trip to Iceland. My own preference is A > B > C. I love Ecuador - it's a fantastic country. But I prefer a laptop over a trip to Iceland, because I'm not fond of cold weather (well, actually Iceland is pretty cool too, but let's assume for the sake of the article that A > B > C is my preference).

But now I'm offered D = (50% chance of A, 50% chance of B) or E = (50% chance of A, 50% chance of C). The VNM independence principle says I should prefer D > E. But doing so forgets the cost of information/uncertainty. By choosing E, I'm sure I'll be offered a trip - I don't know where to, but I know I'll be offered a trip, not a laptop. By choosing D, I have no idea about the nature of the present. I have much less information about my future - and that lack of information has a cost. If I know I'll be offered a trip, I can already ask for days off at work, I can go buy a backpack, I can start doing the paperwork to get my passport. And if I know I won't be offered a laptop, I may decide to buy one - maybe not as great as the one I would have been offered, but I can still buy one. But if I choose D, I have much less information about my future, and I can't optimize it as much.
The same goes for the Allais paradox: the certitude of receiving a significant amount of money ($24,000) has a value, which is present in choice 1A but not in the others (1B, 2A, 2B).

And I don't see why a "rational agent" should neglect the value of this information, as the VNM axioms imply. Any thoughts about that?
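The preparation effect can be given a toy model (all utility numbers and names below are mine, purely for illustration): knowing the category of the prize in advance unlocks a bonus, which reverses the ranking that independence would force.

```python
# Toy model: base utilities for each prize, plus a preparation bonus that is
# only available if you know in advance which *category* (trip vs laptop)
# of prize you will receive.
base = {'ecuador': 10.0, 'laptop': 6.0, 'iceland': 5.0}
PREP_BONUS = 2.0  # booking days off, buying a backpack, passport paperwork...

def lottery_value(prizes):
    """Expected utility of a 50/50 lottery over two prizes, with the bonus
    iff both possible prizes fall in the same category."""
    categories = {'laptop' if p == 'laptop' else 'trip' for p in prizes}
    bonus = PREP_BONUS if len(categories) == 1 else 0.0
    return sum(base[p] for p in prizes) / len(prizes) + bonus

D = lottery_value(['ecuador', 'laptop'])   # 8.0 — mixed categories, no preparation
E = lottery_value(['ecuador', 'iceland'])  # 9.5 — certainly a trip, so prepare
print(D < E)  # True: E beats D, though independence (with A > B > C) demands D > E
```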

## Proof of fungibility theorem

3 12 January 2013 09:26AM

Appendix to: A fungibility theorem

Suppose that $P$ is a set and we have functions $v_1, \dots, v_n : P \to \mathbb{R}$. Recall that for $p, q \in P$, we say that $p$ is a Pareto improvement over $q$ if for all $i$, we have $v_i(p) \geq v_i(q)$. And we say that it is a strong Pareto improvement if in addition there is some $i$ for which $v_i(p) > v_i(q)$. We call $p$ a Pareto optimum if there is no strong Pareto improvement over it.

Theorem. Let $P$ be a set and suppose $v_i: P \to \mathbb{R}$ for $i = 1, \dots, n$ are functions satisfying the following property: For any $p, q \in P$ and any $\alpha \in [0, 1]$, there exists an $r \in P$ such that for all $i$, we have $v_i(r) = \alpha v_i(p) + (1 - \alpha) v_i(q)$.

Then if an element $p$ of $P$ is a Pareto optimum, then there exist nonnegative constants $c_1, \dots, c_n$ such that the function $\sum c_i v_i$ achieves a maximum at $p$.
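A small numerical illustration of the theorem for $n = 2$ (the pure policies and weights below are my own toy choices, not from the post): the mixture-closure hypothesis holds by construction, the middle pure policy is a Pareto optimum, and the weights $c = (1, 1)$ make it the maximizer of the weighted sum.

```python
# Achievable utility vectors: mixtures of three "pure" policies for two
# utility functions v1, v2 (the numbers are mine, chosen for illustration).
pure = [(4.0, 0.0), (3.0, 3.0), (0.0, 4.0)]

points = []  # a grid of mixtures; by the theorem's hypothesis all are achievable
for ia in range(11):
    for ib in range(11 - ia):
        a, b, c = ia / 10, ib / 10, (10 - ia - ib) / 10
        points.append(tuple(a * x + b * y + c * z for x, y, z in zip(*pure)))

def pareto_optimal(p, pts, eps=1e-9):
    """True iff no point in pts is a strong Pareto improvement over p."""
    return not any(all(q[i] >= p[i] - eps for i in range(2)) and
                   any(q[i] > p[i] + eps for i in range(2)) for q in pts)

p = (3.0, 3.0)  # the middle pure policy: a Pareto optimum here
weighted = lambda q: 1.0 * q[0] + 1.0 * q[1]  # weights c = (1, 1)

print(pareto_optimal(p, points))       # True
print(max(points, key=weighted) == p)  # True: p maximizes the weighted sum
```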

## Math appendix for: "Why you must maximize expected utility"

8 13 December 2012 01:11AM

This is a mathematical appendix to my post "Why you must maximize expected utility", giving precise statements and proofs of some results about von Neumann-Morgenstern utility theory without the Axiom of Continuity. I wish I had the time to make this post more easily readable, giving more intuition; the ideas are rather straight-forward and I hope they won't get lost in the line noise!

The work here is my own (though closely based on the standard proof of the VNM theorem), but I don't expect the results to be new.

*

I represent preference relations as total preorders $\preccurlyeq$ on a simplex $\Delta_N$; define $\prec$, $\sim$, $\succcurlyeq$ and $\succ$ in the obvious ways (e.g., $x\sim y$ iff both $x\preccurlyeq y$ and $y\preccurlyeq x$, and $x\prec y$ iff $x\preccurlyeq y$ but not $y\preccurlyeq x$). Write $e^i$ for the $i$'th unit vector in $\mathbb{R}^N$.

In the following, I will always assume that $\preccurlyeq$ satisfies the independence axiom: that is, for all $x,y,z\in\Delta_N$ and $p\in(0,1]$, we have $x\prec y$ if and only if $px + (1-p)z \prec py + (1-p)z$. Note that the analogous statement with weak preferences follows from this: $x\preccurlyeq y$ holds iff $y\not\prec x$, which by independence is equivalent to $py + (1-p)z \not\prec px + (1-p)z$, which is just $px + (1-p)z \preccurlyeq py + (1-p)z$.

Lemma 1 (more of a good thing is always better). If $x\prec y$ and $0\le p < q \le 1$, then $(1-p)x + py\prec (1-q)x + qy$.

Proof. Let $r := q-p$. Then, $(1-p)x + py = \big((1-q)x + py\big) + rx$ and $(1-q)x + qy = \big((1-q)x + py\big) + ry$. Thus, the result follows from independence applied to $x$, $y$, $\textstyle\frac{1}{1-r}\big((1-q)x + py\big)$, and $r$.$\square$

Lemma 2. If $x\preccurlyeq y\preccurlyeq z$ and $x\prec z$, then there is a unique $p\in[0,1]$ such that $(1-q)x + qz \prec y$ for $q\in[0,p)$ and $y\prec (1-q)x + qz$ for $q\in(p,1]$.

Proof. Let $p$ be the supremum of all $r\in[0,1]$ such that $(1-r)x + rz\preccurlyeq y$ (note that by assumption, this condition holds for $r=0$). Suppose that $0\le q<p$. Then there is an $r\in(q,p]$ such that $(1-r)x + rz\preccurlyeq y$. By Lemma 1, we have $(1-q)x + qz \prec (1-r)x + rz$, and the first assertion follows.

Suppose now that $p < q \le 1$. Then by definition of $p$, we do not have $(1-q)x + qz\preccurlyeq y$, which means that we have $(1-q)x + qz\succ y$, which was the second assertion.

Finally, uniqueness is obvious, because if both $p$ and $p'$ satisfied the condition, we would have $\textstyle y \prec \big(1 - \frac{p+p'}2\big)x + \frac{p+p'}2z \prec y$.$\square$

Definition 3. $x$ is much better than $y$, notation $x\succ_* y$ or $y\prec_* x$, if there are neighbourhoods $U$ of $x$ and $V$ of $y$ (in the relative topology of $\Delta_N$) such that we have $x' \succ y'$ for all $x'\in U$ and $y'\in V$. (In other words, the graph of $\succ_*$ is the interior of the graph of $\succ$.) Write $x\preccurlyeq_* y$ or $y\succcurlyeq_* x$ when $x\nsucc_* y$ ($x$ is not much better than $y$), and $x\sim_* y$ ($x$ is about as good as $y$) when both $x\preccurlyeq_* y$ and $x\succcurlyeq_* y$.

Theorem 4 (existence of a utility function). There is a $u\in\mathbb{R}^N$ such that for all $x,y\in\Delta_N$,

$\sum_i x_i\,u_i \;<\; \sum_i y_i\,u_i\;\;\iff\;\; x\prec_* y\;\;\implies\;\;x\prec y.$

Unless $x\sim y$ for all $x$ and $y$, there are $i,j\in\{1,\dotsc,N\}$ such that $u_i\neq u_j$.

Proof. Let $i$ be a worst and $j$ a best outcome, i.e. let $i,j\in\{1,\dotsc,N\}$ be such that $e^i\preccurlyeq e^k\preccurlyeq e^j$ for all $k\in\{1,\dotsc,N\}$. If $e^i\sim e^j$, then $e^i \sim e^k$ for all $k$, and by repeated applications of independence we get $x\sim e^i\sim y$ for all $x,y\in\Delta_N$, and therefore $x\sim_* y$ again for all $x,y\in\Delta_N$, and we can simply choose $u=0$.

Thus, suppose that $e^i\prec e^j$. In this case, let $u$ be such that for every $k\in\{1,\dotsc,N\}$, $u_k$ equals the unique $p$ provided by Lemma 2 applied to $e^i\preccurlyeq e^k\preccurlyeq e^j$ and $e^i\prec e^j$. Because of Lemma 1, $u_i = 0 \neq 1 = u_j$. Let $f(r) := (1-r)e^i + re^j$.

We first show that $\textstyle p := \sum_k x_k\,u_k < \sum_k y_k\,u_k =: q$ implies $x\prec y$. For every $k$, we either have $u_k < 1$, in which case by Lemma 2 we have $e^k \prec f(u_k + \epsilon_k)$ for arbitrarily small $\epsilon_k > 0$, or we have $u_k = 1$, in which case we set $\epsilon_k := 0$ and find $e^k\preccurlyeq e^j = f(u_k + \epsilon_k)$. Set $\textstyle \epsilon := \sum_k x_k\,\epsilon_k$. Now, by independence applied $N-1$ times, we have $\textstyle x = \sum_k x_k\,e^k \preccurlyeq \sum_k x_k f(u_k + \epsilon_k) = f(p+\epsilon)$; analogously, we obtain $y \succcurlyeq f(q-\delta)$ for arbitrarily small $\delta > 0$. Thus, choosing $\epsilon$ and $\delta$ small enough that $p+\epsilon < q-\delta$ (possible since $p<q$), Lemma 1 gives $x\preccurlyeq f(p+\epsilon)\prec f(q-\delta)\preccurlyeq y$ and therefore $x\prec y$ as claimed. Now note that if $\textstyle\sum_k x_k\,u_k < \sum_k y_k\,u_k$, then this continues to hold for $x'$ and $y'$ in a sufficiently small neighbourhood of $x$ and $y$, and therefore we have $x\prec_* y$.

Now suppose that $\textstyle \sum_k x_k\,u_k \ge \sum_k y_k\,u_k$. Since we have $u_i = 0$ and $u_j = 1$, we can find points $x'$ and $y'$ arbitrarily close to $x$ and $y$ such that the inequality becomes strict (either the left-hand side is smaller than one and we can increase it, or the right-hand side is greater than zero and we can decrease it, or else the inequality is already strict). Then, $x'\succ y'$ by the preceding paragraph. But this implies that $x\not\prec_* y$, which completes the proof.$\square$

Corollary 5. $\preccurlyeq_*$ is a preference relation (i.e., a total preorder) that satisfies independence and the von Neumann-Morgenstern continuity axiom.

Proof. It is well-known (and straightforward to check) that this follows from the assertion of the theorem.$\square$

Corollary 6. $u$ is unique up to affine transformations.

Proof. Since $u$ is a VNM utility function for $\preccurlyeq_*$, this follows from the analogous result for that case.$\square$

Corollary 7. Unless $x\sim y$ for all $x,y\in\Delta_N$, for all $r\in\mathbb{R}$ the set $\textstyle \{x\in\Delta_N : \sum_i x_i\,u_i = r\}$ has lower dimension than $\Delta_N$ (i.e., it is the intersection of $\Delta_N$ with a lower-dimensional subspace of $\mathbb{R}^N$).

Proof. First, note that the assumption implies that $N\ge 2$. Let $v\in\mathbb{R}^N$ be given by $v_i = 1$, $\forall i$, and note that $\Delta_N$ is the intersection of the hyperplane $A := \{x\in\mathbb{R}^N : x\cdot v = 1\}$ with the closed positive orthant $\mathbb{R}^N_+$. By the theorem, $u$ is not parallel to $v$, so the hyperplane $B_r := \{x\in\mathbb{R}^N : x\cdot u = r\}$ is not parallel to $A$. It follows that $A\cap B_r$ has dimension $N-2$, and therefore $\textstyle\{x\in\Delta_N : \sum_i x_i\,u_i = r\} \;=\; A\cap B_r\cap\mathbb{R}^N_+$ can have at most this dimension. (It can have smaller dimension or be the empty set if $A\cap B_r$ only touches or lies entirely outside the positive orthant.)$\square$

## Smoking lesion as a counterexample to CDT

7 26 October 2012 12:08PM

I stumbled upon this paper by Andy Egan and thought that its main result should be shared. We have the Newcomb problem as counterexample to CDT, but that can be dismissed as being speculative or science-fictiony. In this paper, Andy Egan constructs a smoking lesion counterexample to CDT, and makes the fascinating claim that one can construct counterexamples to CDT by starting from any counterexample to EDT and modifying it systematically.

The "smoking lesion" counterexample to EDT goes like this:

• There is a rare gene (G) that both causes people to smoke (S) and causes cancer (C). Susan mildly prefers to smoke than not to - should she do so?

EDT implies that she should not smoke (since the likely outcome in a world where she doesn't smoke is better than the likely outcome in a world where she does). CDT correctly allows her to smoke: she shouldn't care about the information revealed by her preferences.

But we can modify this problem to become a counterexample to CDT, as follows:

• There is a rare gene (G) that both causes people to smoke (S) and makes smokers vulnerable to cancer (C). Susan mildly prefers to smoke than not to - should she do so?

Here EDT correctly tells her not to smoke. CDT refuses to use her possible decision as evidence that she has the gene and tells her to smoke. But this makes her very likely to get cancer, as she is very likely to have the gene given that she smokes.

The idea behind this new example is that EDT runs into paradoxes whenever there is a common cause (G) of both some action (S) and some undesirable consequence (C). We then take that problem and modify it so that there is a common cause G of both some action (S) and of a causal relationship between that action and the undesirable consequence (S→C). This is then often a paradox of CDT.

It isn't a perfect match - for instance, if the gene G were common, then CDT would say not to smoke in the modified smoking lesion. But it still seems that most EDT paradoxes can be adapted to become paradoxes of CDT.
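The contrast can be made concrete with toy numbers for the modified problem (all probabilities and utilities below are mine, not Egan's): CDT evaluates smoking with the prior probability of the gene, while EDT conditions on the decision itself.

```python
# Toy numbers for the *modified* smoking lesion, for illustration only.
pG = 0.01                 # the gene is rare
pS_G, pS_nG = 0.9, 0.1    # the gene strongly promotes smoking
U_SMOKE, U_CANCER = 1.0, -100.0

def p_gene_given(smoke: bool) -> float:
    """Bayes: probability of the gene, conditioning on the smoking decision."""
    pS = pS_G * pG + pS_nG * (1 - pG)
    num = (pS_G if smoke else 1 - pS_G) * pG
    return num / (pS if smoke else 1 - pS)

def p_cancer(gene_prob: float, smoke: bool) -> float:
    # Modified lesion: cancer is likely only for smokers who carry the gene.
    return gene_prob * (0.9 if smoke else 0.01) + (1 - gene_prob) * 0.01

def value(smoke: bool, gene_prob: float) -> float:
    return (U_SMOKE if smoke else 0.0) + U_CANCER * p_cancer(gene_prob, smoke)

# CDT: the action can't cause the gene, so it uses the prior either way.
cdt_smokes = value(True, pG) > value(False, pG)
# EDT: the action is evidence about the gene.
edt_smokes = value(True, p_gene_given(True)) > value(False, p_gene_given(False))

print(cdt_smokes, edt_smokes)  # True False — CDT smokes, EDT (correctly, here) abstains
```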

## Cake, or death!

25 25 October 2012 10:33AM

Here we'll look at the famous cake or death problem teasered in the Value loading/learning post.

Imagine you have an agent that is uncertain about its values and designed to "learn" proper values. A formula for this process is that the agent must pick an action a equal to:

• $\text{argmax}_{a\in A}\ \sum_{w\in W} p(w|e,a) \sum_{u\in U} u(w)\,p(C(u)|w)$

Let's decompose this a little, shall we? A is the set of actions, so argmax of a in A simply means that we are looking for an action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen before. Hence p(w|e,a) is the probability of existing in a particular world, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.

And what value do we sum over in each world? Σu∈U u(w)p(C(u)|w). Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don't program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) expresses the probability that the utility u is correct in the world w. For instance, it might cover statements "it's 99% certain that 'murder is bad' is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad".

The C term is the correctness of the utility function, given whatever system of value learning we're using (note that some moral realists would insist that we don't need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.

So the whole formula can be described as:

• For each possible world and each possible utility function, figure out the utility of that world. Weigh that by the probability that that utility function is correct in that world, and by the probability of that world. Then choose the action that maximises this weighted sum across all utility functions and worlds.
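The formula can be transcribed directly into code over a toy set of worlds, actions, and candidate utilities (all of the names and numbers below are mine, purely illustrative):

```python
# A direct transcription of argmax_a Σ_w p(w|e,a) Σ_u u(w) p(C(u)|w).
A = ['make_cake', 'ask_programmers']
W = ['cake_world', 'death_world']
U = {'cake_good':  {'cake_world': 1.0, 'death_world': 0.0},
     'death_good': {'cake_world': 0.0, 'death_world': 1.0}}

def p_world(w, a):
    """p(w | e, a): the chosen action influences which world results."""
    table = {('make_cake', 'cake_world'): 0.9,
             ('make_cake', 'death_world'): 0.1,
             ('ask_programmers', 'cake_world'): 0.5,
             ('ask_programmers', 'death_world'): 0.5}
    return table[(a, w)]

def p_correct(u, w):
    """p(C(u) | w): in this toy, every world's programmers endorse cake."""
    return {'cake_good': 0.99, 'death_good': 0.01}[u]

def expected_value(a):
    return sum(p_world(w, a) * sum(U[u][w] * p_correct(u, w) for u in U) for w in W)

best = max(A, key=expected_value)
print(best)  # make_cake
```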

## Omega lies

7 24 October 2012 10:46AM

Just developing my second idea at the end of my last post. It seems to me that in the Newcomb problem and in the counterfactual mugging, the completely trustworthy Omega lies to a greater or lesser extent.

This is immediately obvious in scenarios where Omega simulates you in order to predict your reaction. In the Newcomb problem, the simulated you is told "I have already made my decision...", which is not true at that point, and in the counterfactual mugging, whenever the coin comes up heads, the simulated you is told "the coin came up tails". And the arguments only go through because these lies are accepted by the simulated you as being true.

If Omega doesn't simulate you, but uses other methods to gauge your reactions, he isn't lying to you per se. But he is estimating your reaction in the hypothetical situation where you were fed untrue information that you believed to be true. And that you believed to be true, specifically because the source is Omega, and Omega is trustworthy.

Doesn't really change much to the arguments here, but it's a thought worth bearing in mind.

## Naive TDT, Bayes nets, and counterfactual mugging

15 23 October 2012 03:58PM

I set out to understand precisely why naive TDT (possibly) fails the counterfactual mugging problem. While doing this I ended up drawing a lot of Bayes nets, and seemed to gain some insight; I'll pass these on, in the hopes that they'll be useful. All errors are, of course, my own.

## The grand old man of decision theory: the Newcomb problem

First let's look at the problem that inspired all this research: the Newcomb problem. In this problem, a supremely-insightful-and-entirely-honest superbeing called Omega presents two boxes to you, and tells you that you can either choose box A only ("1-box"), or take box A and box B ("2-box"). Box B will always contain $1K (one thousand dollars). Omega has predicted what your decision will be, though, and if you decided to 1-box, he's put$1M (one million dollars) in box A; otherwise he's put nothing in it. The problem can be cast as a Bayes net with the following nodes:

## Decision theory and "winning"

4 16 October 2012 12:35AM

With much help from crazy88, I'm still developing my Decision Theory FAQ. Here's the current section on Decision Theory and "Winning". I feel pretty uncertain about it, so I'm posting it here for feedback. (In the FAQ, CDT and EDT and TDT and Newcomblike problems have already been explained.)

One of the primary motivations for developing TDT is a sense that both CDT and EDT fail to reason in a desirable manner in some decision scenarios. However, despite acknowledging that CDT agents end up worse off in Newcomb's Problem, many (and perhaps the majority of) decision theorists are proponents of CDT. On the face of it, this may seem to suggest that these decision theorists aren't interested in developing a decision algorithm that "wins" but rather have some other aim in mind. If so then this might lead us to question the value of developing one-boxing decision algorithms.

However, the claim that most decision theorists don’t care about finding an algorithm that “wins” mischaracterizes their position. After all, proponents of CDT tend to take the challenge posed by the fact that CDT agents “lose” in Newcomb's problem seriously (in the philosophical literature, it's often referred to as the Why ain'cha rich? problem). A common reaction to this challenge is neatly summarized in Joyce (1999, p. 153-154 ) as a response to a hypothetical question about why, if two-boxing is rational, the CDT agent does not end up as rich as an agent that one-boxes:

Rachel has a perfectly good answer to the "Why ain't you rich?" question. "I am not rich," she will say, "because I am not the kind of person [Omega] thinks will refuse the money. I'm just not like you, Irene [the one-boxer]. Given that I know that I am the type who takes the money, and given that [Omega] knows that I am this type, it was reasonable of me to think that the $1,000,000 was not in [the box]. The$1,000 was the most I was going to get no matter what I did. So the only reasonable thing for me to do was to take it."

Irene may want to press the point here by asking, "But don't you wish you were like me, Rachel?"... Rachel can and should admit that she does wish she were more like Irene... At this point, Irene will exclaim, "You've admitted it! It wasn't so smart to take the money after all." Unfortunately for Irene, her conclusion does not follow from Rachel's premise. Rachel will patiently explain that wishing to be a [one-boxer] in a Newcomb problem is not inconsistent with thinking that one should take the \$1,000 whatever type one is. When Rachel wishes she was Irene's type she is wishing for Irene's options, not sanctioning her choice... While a person who knows she will face (has faced) a Newcomb problem might wish that she were (had been) the type that [Omega] labels a [one-boxer], this wish does not provide a reason for being a [one-boxer]. It might provide a reason to try (before [the boxes are filled]) to change her type if she thinks this might affect [Omega's] prediction, but it gives her no reason for doing anything other than taking the money once she comes to believe that she will be unable to influence what [Omega] does.

In other words, this response distinguishes between the winning decision and the winning type of agent and claims that two-boxing is the winning decision in Newcomb’s problem (even if one-boxers are the winning type of agent). Consequently, insofar as decision theory is about determining which decision is rational, on this account CDT reasons correctly in Newcomb’s problem.

For those that find this response perplexing, an analogy could be drawn to the chewing gum problem. In this scenario, there is near unanimous agreement that the rational decision is to chew gum. However, statistically, non-chewers will be better off than chewers. As such, the non-chewer could ask, “if you’re so smart, why aren’t you healthy?”. In this case, the above response seems particularly appropriate. The chewers are less healthy not because of their decision but rather because they’re more likely to have an undesirable gene. Having good genes doesn’t make the non-chewer more rational but simply more lucky. The proponent of CDT simply extends this response to Newcomb’s problem.

One final point about this response is worth noting. A proponent of CDT can accept the above argument but still acknowledge that, if given the choice before the boxes are filled, they would be rational to choose to modify themselves to be a one-boxing type of agent (as Joyce acknowledged in the above passage and as argued for in Burgess, 2004). To the proponent of CDT, this is unproblematic: if we are sometimes rewarded not for the rationality of our decisions in the moment but for the type of agent we were at some past moment, then it should be unsurprising that changing to a different type of agent might be beneficial.

The response to this defense of two-boxing in Newcomb's problem has been divided. Many find it compelling but others, like Ahmed and Price (2012), think it does not adequately address the challenge:

It is no use the causalist's whining that foreseeably, Newcomb problems do in fact reward irrationality, or rather CDT-irrationality. The point of the argument is that if everyone knows that the CDT-irrational strategy will in fact do better on average than the CDT-rational strategy, then it's rational to play the CDT-irrational strategy.

Given this, there seem to be two positions one could take on these issues. If the response given by the proponent of CDT is compelling, then we should be attempting to develop a decision theory that two-boxes on Newcomb’s problem. Perhaps the best theory for this role is CDT but perhaps it is instead BT, which many people think reasons better in the psychopath button scenario. On the other hand, if the response given by the proponents of CDT is not compelling, then we should be developing a theory that one-boxes in Newcomb’s problem. In this case, TDT, or something like it, seems like the most promising theory currently on offer.

## No Anthropic Evidence

9 23 September 2012 10:33AM

Closely related to: How Many LHC Failures Is Too Many?

Consider the following thought experiment. At the start, an "original" coin is tossed, but not shown. If it was "tails", a gun is loaded; otherwise it's not. After that, you are offered a large number of decision rounds, in each of which you can either quit the game or toss a coin of your own. If your coin falls "tails", the gun gets triggered, and depending on how the original coin fell (whether the gun was loaded), you either get shot or not (if the gun doesn't fire, i.e. if the original coin was "heads", you are free to go). If your coin is "heads", you are all right for the round. If you quit the game, you will get shot at the exit with probability 75%, independently of what happened during the game (and of the original coin). The question is: should you keep playing, or quit if you observe, say, 1000 "heads" in a row?

Intuitively, it seems as if 1000 "heads" is "anthropic evidence" for the original coin being "tails", that the long sequence of "heads" can only be explained by the fact that "tails" would have killed you. If you know that the original coin was "tails", then to keep playing is to face the certainty of eventually tossing "tails" and getting shot, which is worse than quitting, with only 75% chance of death. Thus, it seems preferable to quit.

On the other hand, each "heads" you observe doesn't distinguish the hypothetical where the original coin was "heads" from one where it was "tails". The first round can be modeled by a 4-element finite probability space consisting of options {HH, HT, TH, TT}, where HH and HT correspond to the original coin being "heads" and HH and TH to the coin-for-the-round being "heads". Observing "heads" is the event {HH, TH}, which gives the same 50% posterior probabilities for "heads" and "tails" of the original coin. Thus, each round that ends in "heads" doesn't change your knowledge about the original coin, even after 1000 rounds of this type. And since you only get shot if the original coin was "tails", you face only a 50% probability of dying as the game continues, which is better than the 75% from quitting the game.
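This second argument can be checked by direct computation: under either hypothesis about the original coin, a run of n "heads" has likelihood (1/2)^n, so the posterior stays at 50% (a minimal sketch; the function name is mine):

```python
from fractions import Fraction

def posterior_heads(n_rounds: int) -> Fraction:
    """P(original coin was "heads" | you tossed n heads in a row and survived)."""
    half = Fraction(1, 2)
    # Your per-round coin is fair and independent of the original coin, and a
    # round of "heads" never fires the gun under either hypothesis, so the
    # observation has likelihood (1/2)^n both ways.
    like_if_orig_heads = half ** n_rounds
    like_if_orig_tails = half ** n_rounds
    return (half * like_if_orig_heads /
            (half * like_if_orig_heads + half * like_if_orig_tails))

print(posterior_heads(1000))  # 1/2 — the long run of heads is no anthropic evidence
```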

(See also the comments by simon2 and Benja Fallenstein on the LHC post, and this thought experiment by Benja Fallenstein.)

The result of this exercise could be generalized by saying that counterfactual possibility of dying doesn't in itself influence the conclusions that can be drawn from observations that happened within the hypotheticals where one didn't die. Only if the possibility of dying influences the probability of observations that did take place, would it be possible to detect that possibility. For example, if in the above exercise, a loaded gun would cause the coin to become biased in a known way, only then would it be possible to detect the state of the gun (1000 "heads" would imply either that the gun is likely loaded, or that it's likely not).
