## [Clearing out my Drafts folder] Rationality and Decision Theory Curriculum Idea

*Note: The following is a draft post I've had since 2009, and it's not great but it's worth posting for discussion. I do like the way that it prefigures some of the problems of Quirrell Points when traitors are allowed...*

Need to see if this can be easily gamed, but...

Step 1. Introduce Prisoner's Dilemma. Set up computer system so that they can log in and play it in partners with investments of points (caution them: this is their actual grade at stake). Let them know that they currently don't have enough points for a passing grade on that part of the course, but that maximum investment and mutual cooperation will result in A's for (almost) everyone on it (with high probability); also that points are converted to grades on a logarithmic scale. Let them know that creating institutions and alliances is a good strategy in such games.

Initially, each student is allowed to play once per day, with 1 partner. Students log in, enter the name of their requested PD partner, enter how much they're willing to invest, and enter C or D. They'll get a "bank statement" daily as well.

If both enter C, they each get 1.2 points back (per initial point invested). If one enters C and the other D, the cooperator gets nothing back and the defector gets 2 points (per point invested). If both enter D, then each gets 0.5 points back.

Once they've had some practice with this, we move to

Step 2: Bigger investments, luck, observation.

Introduce larger group investments with higher rates of return. E.g. a five-person opportunity that pays each C player 0.2 guaranteed plus 0.4 for each C (not counting themselves), and each D player gets 0.5 guaranteed plus 0.5 for each C. (Set these up to be balanced in some way.)

Add a factor of luck, so that people can't just (be forced to) show one another their bank statements as proof of their cooperation. People should average the proper amounts, but have enough variation that it's often difficult to tell whether they cooperated or defected. (Or is there another way to handle this?)

Finally, allow a third option of observation within a deal. One idea: if you choose to observe, then you take up a spot in the investment, keep your own point, and don't count as a C for others, but you get to see some or all of the other players' choices. Make sure that information is proportionately expensive.

Step 3: Keep scaling it up, adjust (only) as necessary.

If things get too unbalanced, some progressive taxation and welfare might be in order; but disrupting the system too much will tend to destroy the things they need to learn.

Ideally, larger and larger opportunities should be offered, with the kicker being a gigantic one-shot Prisoner's Dilemma that the whole class can partake in. C might get guaranteed 0.2 plus 0.2 for every other C, while D gets guaranteed 1 plus 0.3 for every C.

Implementation:

Need a non-hackable computer program to work off of. Also, have students (for real points) give critiques and suggestions for improvement.

## An Introduction to Löb's Theorem in MIRI Research

Would you like to see a primer on several MIRI research topics (assuming only the background of having taken a course with proofs in math or computer science)? Or are you curious why MIRI does so much with mathematical logic, and why people on Less Wrong keep referring to Löb's Theorem?

If you answered yes to either question, you may be interested in my lecture notes, An Introduction to Löb's Theorem in MIRI Research! These came out of an introductory talk that I gave at a MIRIx workshop.

Since I've got some space here, I'll just copy and paste the table of contents and the introduction section...

## New forum for MIRI research: Intelligent Agent Foundations Forum

Today, the Machine Intelligence Research Institute is launching a new forum for research discussion: the Intelligent Agent Foundations Forum! It's already been seeded with a bunch of new work on MIRI topics from the last few months.

We've covered most of the (what, why, how) subjects on the forum's new welcome post and the How to Contribute page, but this post is an easy place to comment if you have further questions (or if, maths forbid, there are technical issues *with* the forum instead of *on* it).

But before that, go ahead and check it out!

(Major thanks to Benja Fallenstein, Alice Monday, and Elliott Jin for their work on the forum code, and to all the contributors so far!)

**EDIT 3/22:** Jessica Taylor, Benja Fallenstein, and I wrote forum digest posts summarizing and linking to recent work (on the IAFF and elsewhere) on reflective oracle machines, on corrigibility, utility indifference, and related control ideas, and on updateless decision theory and the logic of provability, respectively! These are pretty excellent resources for reading up on those topics, in my biased opinion.

## Robust Cooperation in the Prisoner's Dilemma

I'm proud to announce the preprint of Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic, a joint paper with Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire (me), and Eliezer Yudkowsky.

This paper was one of three projects to come out of the 2nd MIRI Workshop on Probability and Reflection in April 2013, and had its genesis in ideas about formalizations of decision theory that have appeared on LessWrong. (At the end of this post, I'll include links for further reading.)

Below, I'll briefly outline the problem we considered, the results we proved, and the (many) open questions that remain. Thanks in advance for your thoughts and suggestions!

## Background: Writing programs to play the PD with source code swap

(If you're not familiar with the Prisoner's Dilemma, see here.)

The paper concerns the following setup, which has come up in academic research on game theory: say that you have the chance to write a computer program **X**, which takes in one input and returns either *Cooperate* or *Defect*. This program will face off against some other computer program **Y**, but with a twist: **X** will receive the source code of **Y** as input, and **Y** will receive the source code of **X** as input. And you will be given your program's winnings, so you should think carefully about what sort of program you'd write!

Of course, you could simply write a program that defects regardless of its input; we call this program **DefectBot**, and call the program that cooperates on all inputs **CooperateBot**. But with the wealth of information afforded by the setup, you might wonder if there's some program that might be able to achieve mutual cooperation in situations where **DefectBot** achieves mutual defection, without thereby risking a sucker's payoff. (Douglas Hofstadter would call this a perfect opportunity for superrationality...)

## Previously known: CliqueBot and FairBot

And indeed, there's a way to do this that's been known since at least the 1980s. You can write a computer program that knows its own source code, compares it to the input, and returns *C* if and only if the two are identical (and *D* otherwise). Thus it achieves mutual cooperation in one important case where it intuitively ought to: when playing against itself! We call this program **CliqueBot**, since it cooperates only with the "clique" of agents identical to itself.

There's one particularly irksome issue with **CliqueBot**, and that's the fragility of its cooperation. If two people write functionally analogous but syntactically different versions of it, those programs will defect against one another! This problem can be patched somewhat, but not fully fixed. Moreover, mutual cooperation might be the best strategy against some agents that are not even functionally identical, and extending this approach requires you to explicitly delineate the list of programs that you're willing to cooperate with. Is there a more flexible and robust kind of program you could write instead?

As it turns out, there is: in a 2010 post on LessWrong, cousin_it introduced an algorithm that we now call **FairBot**. Given the source code of **Y**, **FairBot** searches for a proof (of less than some large fixed length) that **Y** returns *C* when given the source code of **FairBot**, and then returns *C* if and only if it discovers such a proof (otherwise it returns *D*). Clearly, if our proof system is consistent, **FairBot** only cooperates when that cooperation will be mutual. But the really fascinating thing is what happens when you play two versions of **FairBot** against each other. Intuitively, it seems that *either* mutual cooperation or mutual defection would be stable outcomes, but it turns out that if their limits on proof lengths are sufficiently high, they will achieve mutual cooperation!

The proof that they mutually cooperate follows from a bounded version of Löb's Theorem from mathematical logic. (If you're not familiar with this result, you might enjoy Eliezer's Cartoon Guide to Löb's Theorem, which is a correct formal proof written in much more intuitive notation.) Essentially, the asymmetry comes from the fact that both programs are searching for the same outcome, so that a short proof that one of them cooperates leads to a short proof that the other cooperates, and vice versa. (The opposite is not true, because the formal system can't know it won't find a contradiction. This is a subtle but essential feature of mathematical logic!)

## Generalization: Modal Agents

Unfortunately, **FairBot** isn't what I'd consider an ideal program to write: it happily cooperates with **CooperateBot**, when it could do better by defecting. This is problematic because in real life, the world isn't separated into agents and non-agents, and any natural phenomenon that doesn't predict your actions can be thought of as a **CooperateBot** (or a **DefectBot**). You don't want your agent to be making concessions to rocks that happened not to fall on them. (There's an important caveat: some things have utility functions that you care about, but don't have sufficient ability to predicate their actions on yours. In that case, though, it wouldn't be a true Prisoner's Dilemma if your values actually prefer the outcome (*C*,*C*) to (*D*,*C*).)

However, **FairBot** belongs to a promising class of algorithms: those that decide on their action by looking for short proofs of logical statements that concern their opponent's actions. In fact, there's a really convenient mathematical structure that's analogous to the class of such algorithms: the modal logic of provability (known as GL, for Gödel-Löb).

So that's the subject of this preprint: **what can we achieve in decision theory by considering agents defined by formulas of provability logic?**

## Compromise: Send Meta Discussions to the Unofficial LessWrong Subreddit

After a recent comment thread degenerated into an argument about trolling, moderation, and meta discussions, I came to the following conclusions:

- Meta conversations are annoying to stumble across, I'd rather not see them unless I think it's important, and I think other people mostly feel the same way. Moreover, moderators can't easily ignore those conversations when they encounter them, because they're usually attacks on the moderators themselves; and people can't simply avoid encountering them on a regular basis without avoiding LW altogether. This is a perfect recipe for a flamewar taking over Top Comments even when most people don't care that much.
- Officially banning all meta conversations, however, is a bad precedent, and I don't want LW to do that.

Ideally, Less Wrong would implement a separate "META" area (so that people can read the regular area for all the object-level discussions, and then sally into the meta area only when they're ready). After talking to Luke (who also wants this), though, it seems clear that nobody is able to implement it very soon. So as a stopgap measure, I'm personally going to start doing the following, and I hope you join me:

**Whenever a conversation starts getting bitterly meta in a thread that's not originally about a LW site meta issue, I'm going to tell people to start a thread on the LW Uncensored Reddit Thread instead. Then I'm going to downvote anyone who continues the meta war on the original thread.**

I know it's annoying to send people somewhere that has a different login system, but it's as far as I can tell the best fix we currently have. Since some meta conversations are important, I'm not going to punish people for linking to meta thread discussions that they think are significant, and the relevant place for those links is usually the Open Thread. I don't want LessWrong to be a community devoted to arguing about the mechanics of LessWrong, so that's my suggestion.

Thoughts? (And yes, this thread is obviously open to meta discussion. I'm hopefully *doing something constructive* about the problem, instead of just complaining about it, though.)

**EDIT:** Changed the link to the uncensored thread more specifically, at Luke's request; originally I linked to the general LW subreddit, which is more heavily moderated.

## Welcome to Less Wrong! (5th thread, March 2013)

## Robin Hanson's Cryonics Hour

I'm writing to recommend something awesome to anyone who's recently signed up for cryonics (and to the future self of anyone who's about to do so). Robin Hanson has a longstanding offer that anyone who's newly signed up for cryonics can have an hour's discussion with him on any topic, and I took him up on that last week.

I expected to have a fascinating and wide-ranging discussion on various facets of futurism. **My expectations were exceeded.** Even if you've been reading Overcoming Bias for a long time, talking with Robin is an order of magnitude more stimulating/persuasive/informative than reading OB or even watching him debate someone else, and I'm now reconsidering my thinking on a number of topics as a result.

So if you've recently signed up, email Robin; and if you're intending to sign up, let this be one more incentive to quit procrastinating!

**Relevant links:**

The LessWrong Wiki article on cryonics is a good place to start if you have a bunch of questions about the topic.

If you want to argue about whether signing up for cryonics is a good idea, two good and relatively recent threads on that subject are under the posts on A survey of anti-cryonics writing and More Cryonics Probability Estimates.

And if you are cryocrastinating (you've decided that you should sign up for cryonics, but you haven't yet), here's a LW thread about taking the first step.

## Does My Vote Matter?

*(Cross-posted from elsewhere; I thought some readers here might like it too. Please excuse the unusually informal style.)*

**Short answer:** Actually, yes!

**Slightly longer answer:** Yes, if the election is close. You'll never get to *know* that your vote was decisive, but one vote can substantially change the odds on Election Day nonetheless. Even if the election is a foregone conclusion (or if you don't care about the major candidates), the same reasoning applies to third parties- there are thresholds that really matter to them, and if they reach those now they have a significantly better chance in the next election. And finally, local elections matter in the long run just as state or nation elections do. So, in most cases, voting is rational if you care about the outcome.

**Full answer:** Welcome! This is a nonpartisan analysis, written by a math PhD, of when and how a single vote matters in a large election. I've got a table of contents below; please skip to whatever section interests you most. And feel free to share this!

**Sections:**

**1. How can a single vote matter in a huge election?**

**2. What if I know it's not going to be close?**

**3. Do local elections matter?**

In what follows, I'm going to be assuming an American-style voting system (first-past-the-post, for you voting-system buffs), but most of what I say carries over to other voting systems found around the world.

**1. How can a single vote matter in a huge election?**

To answer this, let's imagine a different voting system. In the land of Erewhon, voters cast their ballots for president just as they do here; but instead of decreeing that the candidate with the most votes is the winner, each vote is turned into a lottery ball, and one is chosen at random to determine the next president.

While this system has its drawbacks (they get fringe candidates elected every so often, and they've had to outlaw write-in campaigns to prevent every voter from simply voting themselves in), the citizens of Erewhon agree on its main advantage: every one of them knows that their vote *counts*, that they increase the chance of their candidate winning by just that much.

**I'm going to argue that if you would bother to vote if an election were held in Erewhon, then you should also vote—for the same reason—if the same election were held the normal way, and if it looked like the outcome might be close.** That is, your effect on the outcome is about equivalent if the pre-election polls fall within the margin of error (about ±3% for each candidate, or a margin of less than 6% between two candidates, for an electorate of millions). And if the polls are nearly tied, then voting in our system might have the same effect as voting hundreds or even thousands of times in Erewhon's system!

This seems counterintuitive, because we imagine that the votes of everyone else are "locked in" somehow, and that we're only deciding whether to add ours to the pile- in which case, the only way that it could matter is in the event that it makes or breaks an exact tie. And not only are those exceedingly rare (the largest such example I could find was a recent Massachusetts election for state representative, in which 13,500 ballots were cast), but if the initial count for a massive election did show a margin of one or zero, we would be headed for an interminable recount. (See, for instance, the 2008 Minnesota Senate election, which was decided by about 300 votes; the extensive recount delayed the winner's inauguration by about six months.) What this messes with, though, isn't your chance of helping decide the election, but your chance of *knowing* that you helped decide the election.

That is, in modern elections, there's not a perfect sharp boundary between "Candidate A wins" and "Candidate B wins", but a gigantic muddle; and if A is narrowly ahead, then every additional vote for A represents a greater chance that the recount will eventually end, or end earlier, or not be contested at all; while every vote for candidate B means a greater chance that there will be a recount, or that it will go on longer, or that it might turn out victorious for B after all. You'd never know that *your* vote, which changed the lead from 412 to 413, was the straw that broke the proverbial camel's back and led the other candidate to concede, but at some point that's what happens.

And even more significantly, we *can't* consider everyone else's votes as "locked in". If you had the ability to re-play Election Day over and over, the chaotic dynamics of everyday life would affect the vote totals. Someone makes a light about to turn red, and so she ends up at the next light behind a bumper sticker that infuriates her, and so she remembers to vote. Or someone else bumps into an old friend, and starts a conversation, and then he realizes he doesn't have enough of his lunch break left to get to the polling place anyhow. It's the butterfly effect in action, and when multiplied by millions of people it leads to fluctuations in the hundreds or thousands. (The exact mechanism for this estimate is the Central Limit Theorem.) And what that means is that the margin isn't a fixed number pending your decision to vote; it's a mix of different possibilities, and your decision changes the odds as surely as adding a lottery ball to the Erewhon election does.

Of course, if one candidate is polling 10 points ahead of the other, then those fluctuations will be irrelevant. If the margin is closer, then it just might have an effect; remember that polls have several sources of error in them, so you can't be exactly sure of the margin before voting actually happens. And in those cases where the actual vote turns out to be extremely close, like Minnesota in 2008, your choice to vote swings those odds just as if you'd poured in thousands of Erewhonian lottery balls with your candidate's name on them. (That is, it changes those odds by a fraction of a percent, but the ability to nudge the odds by that much in an electorate of millions is pretty impressive!)

**2. What if I know it's not going to be close?**

If the difference between the top candidates is outside the combined margin of error (7 points or more in a big election), then it's true that your vote won't affect the outcome of the *current* election, but it can matter greatly for the *next* one. This especially holds when there's a third party you prefer to the current major parties, but there are other relevant reasons to vote even if that's not the case.

First, about third parties: the conventional wisdom that they can't win is an outright falsehood. We tend to forget that Ross Perot nearly became President: he held a six-point lead in June 1992 over both Bush and Clinton, and despite his candidacy crashing and burning later on (for reasons more personal than systemic), he still won nearly 20% of the popular vote in the end. In addition, there are two current Senators who were elected as independents.

For a third party that's not polling well enough yet, votes *now* matter for *future* elections, and there are several significant thresholds. If they get rounded up to 1% on Election Day, then they get mentioned in election coverage. (Next, getting rounded up to 2% sounds much more impressive than getting rounded down to 1%.) If they get to 5%, then they can qualify for FEC matching funds, and double their ability to reach voters next time. And 10% is a significant number for media exposure (as well as invitations to the main debates). Higher than 20%, and we're talking about a viable candidate; there's a runaway dynamic in three-candidate races where as soon as the third party looks viable, any voters from the other parties who prefer the third party will suddenly switch (now that they're no longer worried about supporting an obvious loser), and suddenly the third party becomes the front-runner. (This kind of behavior is common in game theory in what's known as a coordination problem.)

Even if there's not a third party you prefer to the main ones, the margin of victory now helps determine what sort of candidates the parties select next time. If one party wins an election by more than 10% (this round number has a psychological hold on people, so it's what usually counts as a "landslide victory"), then next time that party is more likely to nominate a less moderate candidate, while the defeated party is likely to nominate a more moderate candidate. (Also, it goes without saying that you should be voting in primaries as well!)

All of these thresholds can hinge just as easily on a few hundred votes, and so your vote can matter greatly. You just need to care about what happens two, or four, or six years from now.

**3. Do local elections matter?**

Most certainly. First, your vote affects the odds more strongly in a local election. (And, of course, there *can* be margins of zero or one votes, without recounts, in local elections!) Secondly, there are a number of important issues decided at the local level, both directly (tax and bond issues) and through representatives (especially local school boards). And finally, many big names in politics start out in local elections: the current President began as an Illinois state senator, and the current Vice President began as a city councilman.

It's more difficult to do research at the local level. Project VoteSmart may have data on some of the candidates (including the voting record of incumbents, public statements, and response to a questionnaire about their positions). Otherwise, your local newspaper may be the best bet (short of going to campaign events yourself). But a little bit of research can go a long way here.

*( Addendum: If I'd been composing this for the Less Wrong crowd, I'd have also noted that the decisions of people similar to you should be correlated, which adds another multiplier to the effectiveness of voting. I might like I bumper sticker that says "I'm a timeless decision theorist, therefore I vote!")*

*( Second addendum: I'm mindful of the dangers of posting this within a few days of a major election. I've done my very best to keep from mindkilling language; I hope you do the same in the comments.)*

## Decision Theories, Part 3.75: Hang On, I Think This Works After All

**Followup to:** Decision Theories, Part 3.5: Halt, Melt and Catch Fire, Decision Theories: A Semi-Formal Analysis, Part III

The thing about dead mathematical proofs is that it's practically always worth looting the corpse; sometimes you even find there's still a pulse! That appears to be the case with my recent post lamenting the demise of the TDT-like "Masquerade" algorithm. I think I've got a rigorous proof this time, but I'd like other opinions before I declare that the rumors of Masquerade's demise were greatly exaggerated...

To recap quickly, I've been trying to construct an algorithm that, given the payoff matrix of a game and the source code of its opponent, does some deductions and then outputs a move. I want this algorithm to do the commonsense right things (defect against both DefectBot and CooperateBot, and mutually cooperate against both FairBot and itself), and I want it to do so for simple and general reasons (that is, no gerrymandering of actions against particular opponents, and in particular no fair trying to "recognize itself", since there can be variants of any algorithm that are functionally identical but not provably so within either's powers of deduction). I'd also like it to be "un-exploitable" in a certain sense: it has a default move (which is one of its Nash equilibria), and no opponent can *profit* against the algorithm by forcing it below that default payoff. If the opponent does as well or better in expected value than it would by playing that Nash equilibrium, then so too does my algorithm.

The revised Masquerade algorithm does indeed have these properties.

In essence, there are two emendations that I needed: firstly, since some possible pairs of masks (like FairBot and AntiFairBot) can't knowingly settle on a fixed point, there's no way to determine what they do without a deductive capacity that strictly exceeds the weaker of them. That's a bad feature to have, so we'll just have to exclude potentially troublemaking masks from Masquerade's analysis. (In the special case of Prisoner's Dilemma I know that including DefectBot and FairBot will suffice; I've got what looks like a good solution in general, as well.)

The second emendation is that FairBot needs to alternate between trying harder to prove its opponent cooperates, and trying harder to prove its opponent defects. (There needs to be an asymmetry, like cooperation proofs going first, to guarantee that when FairBot plays against itself, it finds a Löbian proof of mutual cooperation rather than one of mutual defection.) The reason for this is so that when agents reason about masks, they should be able to find a proof of the mask's action *without* needing to exceed that mask's powers of deduction. Otherwise we get that arms race again.

This escalation of proof attempts can be represented in terms of proof limits (since there exists a constant C such that for N sufficiently large, a proof that "there are no proofs of X of length less than N" either exists with length less than C^N or not at all), but the simplest way to do this is with the formalism of PA+N. That is, PA is Peano Arithmetic; PA+1 is the formal system with the axioms of Peano Arithmetic and an extra axiom that PA is self-consistent (that is, if PA proves X, then PA does not prove not-X); PA+2 has those axioms and an extra one stating that PA+1 is self-consistent and so on. (Note that none of these formal systems know themselves to be self-consistent, and for good reason!) In every use, we'll assume that N is a fixed number (anything greater than 4 will work).

### New And Improved Masks

So without further ado, let's define our masks for the Prisoner's Dilemma:

def DefectBot(X):

return D

def FairBot(X):

for i in range(N):

if "X(FairBot)=C" is provable in PA+i:

return C

if "X(FairBot)=D" is provable in PA+i:

return D

return D

**Lemma 1:** For any X, "DefectBot(X)=D" is provable in PA. (Obvious.)**Lemma 2:** "FairBot(FairBot)=C" is provable in PA.

*Proof:* If we label the FairBots separately (after all, they might be logically equivalent but not know they are), we can quickly show that "if PA proves FairBot1(FairBot2)=C, then FairBot2(FairBot1)=C" is provable in PA (and the same with the subscripts reversed). This makes what we might term a Löbian cycle, which will come up a few more times in this post: if we have a formal system S, and "if S proves A, then B" and "if S proves B, then A" are theorems of S, then S indeed proves both A and B. (Proof reserved for a comment; hat tip to Eliezer and co.)

**Lemma 3:** "FairBot(DefectBot)=D" is provable in PA+1.

*Proof:* Note that this does *not* work in PA, even though PA can prove "DefectBot(FairBot)=D". Why not? Because PA can't prove that PA doesn't find a proof of "DefectBot(FairBot)=C" before it can look for a proof of defection! However, PA+1 can handle this just fine, because it proves "if PA proves DB(FB)=D, then PA does not prove DB(FB)=C".

Note the weird fact that it takes a stronger formal system to prove a result that's much more intuitive from outside the system!

### New and Improved Masquerade

Let's split up the long definition of Masquerade with a helper function.

def Search(X,Y):

for i in range(N):

for j,k in [C,D]:

if "Y(X)=j and X(Y)=k" is provable in PA+i:

return (j,k)

return False

This function, given two agents, simply searches (using increasingly strong formal systems) for a proof of what they do against each other; once it finds one, it halts and returns that result, and if it doesn't find one in time, it announces its failure. We will incorporate this into the definition of Masquerade. (The other function call here is to the payoff matrix, to retrieve the utility U(j,k) of a certain pair of moves.)

def Masquerade(X):

utility = -∞

my_move = null

their_move = null

for Y in [DefectBot, FairBot]:

if Search(X,Y) != False:

(j,k) = Search(X,Y)

if U(j,k) > utility:

my_move = j

their_move = k

utility = U(j,k)

if utility > U(D,D):

for i in range(N):

if "X(Masquerade) = their_move" is provable in PA+i:

return my_move

return D

A few things to note: first, this agent actually halts and outputs an action against any opponent... but for it to provably do so in a system below PA+N, Search(X,Y) has to find proofs quickly. This is the reason that my prior attempt didn't work- it had to wait at one point for the old FairBot to run out of time/power before concluding what it did, and that made it impossible for the old FairBot to know what the old Masquerade did. But with the new and improved agents, we get to ground in a fixed number of steps.

For brevity, I'll label DefectBot, FairBot, and Masquerade as DB, FB, and M, respectively.

**Lemma 4:** "Search(DB,DB)=(D,D)" is provable in PA+1. (Follows from Lemma 1; note that it needs to use PA+1 in order to rule out finding proofs of other action-pairs.)**Lemma 5:** "Search(FB,DB)=Search(DB,FB)=(D,D)" is provable in PA+2. (Follows from Lemma 3.)**Lemma 6:** "Search(FB,FB)=(C,C)" is provable in PA. (Follows from Lemma 2; since (C,C) is the first one tried, we don't even need to go up to PA+1.)**Lemma 7:** "Masquerade(DB)=D" is provable in PA+3.

*Proof:* Lemmas 4 and 5, plus the fact that PA+3 knows the consistency of PA+2. There is no sanity-check step, since utility=U(D,D) here.**Lemma 8:** "Masquerade(FB)=C" and "FB(Masquerade)=C" are provable in PA+3.

*Proof:* Lemmas 5 and 6, and the consistency of PA+2, imply that when Masquerade arrives at the sanity-check stage, it has the variables set as utility=U(C,C), my_move=C and their_move=C. Thus PA+3 can prove that "if 'FB(M)=C' is provable in PA+3, then M(FB)=C". And of course, "if 'M(FB)=C' is provable in PA+3, then FB(M)=C" is provable in PA+3, since again PA+3 can prove that PA through PA+2 won't have found proofs of contrary conclusions before it gets around to trying to find cooperation in PA+3. Therefore we have the desired Löbian cycle!**Theorem:** "Masquerade(Masquerade)=C" is provable in PA+4.

*Proof:* Lemmas 7 and 8, and the consistency of PA+3, allow PA+4 to prove that when each Masquerade arrives at the sanity-check stage, it has set utility=U(C,C), my_move=C and their_move=C. Thus we achieve the Löbian cycle, and find proofs of mutual cooperation!

### Awesome! So, what next?

Well, assuming that I haven't made a mistake in one of my proofs, I'm going to run the same proof for my generalization: given a payoff matrix in general, Masquerade enumerates all of the constant strategies and all of the "mutually beneficial deals" of the FairBot form (that is, masks that hold out the "stick" of a particular Nash equilibrium and the "carrot" of another spot on the payoff matrix which is superior to the "stick" for both players). Then it alternates (at escalating PA+n levels) between trying to prove the various good deals that the opponent could agree to. There are interesting complexities here (and an idea of what bargaining problems might involve).

Secondly, I want to see if there's a good way of stating the general problem that Masquerade solves, something better than "it agrees with commonsense decision theory". The analogy here (and I know it's a fatuous one, but bear with me) is that I've come up with a Universal Turing Machine but not yet the Church-Turing Thesis. And that's unsatisfying.

But before anything else... I want to be *really sure* that I haven't made a critical error somewhere, especially given my false start (and false halt) in the past. So if you spot a lacuna, let me know!

## Decision Theories, Part 3.5: Halt, Melt and Catch Fire

**Followup to:** Decision Theories: A Semi-Formal Analysis, Part III

**UPDATE: As it turns out, rumors of Masquerade's demise seem to have been greatly exaggerated. See this post for details and proofs!**

I had the chance, over the summer, to discuss the decision theory outlined in my April post with a bunch of relevantly awesome people. The sad part is, **there turned out to be a fatal flaw **once we tried to formalize it properly. I'm laying it out here, not with much hope that there's a fix, but because sometimes false starts can be productive for others.

Since it's not appropriate to call this decision theory TDT, I'm going to use a name suggested in one of these sessions and call it "Masquerade", which might be an intuition pump for how it operates. So let's first define some simple agents called "masks", and then define the "Masquerade" agent.

Say that our agent has actions *a _{1}*, ... ,

*a*, and the agent it's facing in this round has actions

_{n}*b*, ... ,

_{1}*b*. Then for any triple (

_{m}*b*,

_{i}*a*,

_{j}*a*), we can define a simple agent Mask

_{k}*which takes in its opponent's source code and outputs an action:*

_{ijk}def Mask_ijk(opp_src):

look for proof that Opp(Mask_ijk) =b_{i}

if one is found, then outputa_{j}

otherwise, outputa_{k}

(This is slightly less general than what I outlined in my post, but it'll do for our purposes. Note that there's no need for *a _{j}* and

*a*to be distinct, so constant strategies fall under this umbrella as well.)

_{k}A key example of such an agent is what we might call FairBot: on a Prisoner's Dilemma, FairBot tries to prove that the other agent cooperates against FairBot, and if it finds such a proof, then it immediately cooperates. If FairBot fails to find such a proof, then it defects. (An important point is that if FairBot plays against itself and both have sufficiently strong deductive capacities, then a short proof of one's cooperation gives a slightly longer proof of the other's cooperation, and thus in the right circumstances we have mutual cooperation via Löb's Theorem.)

The agent Masquerade tries to do better than any individual mask (note that FairBot foolishly cooperates against CooperateBot when it could trivially do better by defecting). My original formulation can be qualitatively described as trying on different masks, seeing which one fares the best, and then running a "sanity check" to see if the other agent treats Masquerade the same way it treats that mask. The pseudocode looked like this:

def Masquerade(opp_src):

for each (i,j,k), look for proofs of the form "Mask_ijk gets utilityuagainst Opp"

choose (i,j,k) corresponding to the largest suchufound

look for proof that Opp(Masquerade) = Opp(Mask_ijk)

if one is found, then output the same thing as Mask_ijk(Opp)

otherwise, output a default action

(The default should be something safe like a Nash equilibrium strategy, of course.)

Intuitively, when Masquerade plays the Prisoner's Dilemma against FairBot, Masquerade finds that the best utility against FairBot is achieved by some mask that cooperates, and then Masquerade's sanity-check is trying to prove that FairBot(Masquerade) = C as FairBot is trying to prove that Masquerade(FairBot) = C, and the whole Löbian circus goes round again. Furthermore, it's intuitive that when Masquerade plays against another Masquerade, the first one notices the proof of the above, and finds that the best utility against the other Masquerade is achieved by FairBot; thus both pass to the sanity-check stage trying to imitate FairBot, both seek to prove that the other cooperate against themselves, and both find the Löbian proof.

So what's wrong with this intuitive reasoning?

### Problem: A deductive system can't count on its own consistency!

Let's re-examine the argument that Masquerade cooperates with FairBot. In order to set up the Löbian circle, FairBot needs to be able to prove that Masquerade selects a mask that cooperates with FairBot (like CooperateBot or FairBot). There are nice proofs that each of those masks attains the mutual-cooperation payoff against FairBot, but we also need to be sure that some other mask won't get the very highest (I defect, you cooperate) payoff against FairBot. Now you and I can see that this must be true, because FairBot simply can't be exploited that way. But crucially, *FairBot can't deduce its own inexploitability* without thereby becoming exploitable (for the same Gödelian reason that a formal system can't prove its own consistency unless it is actually inconsistent)!

Now, the caveats to this are important: if FairBot's deductive process is sufficiently stronger than the deductive process that's trying to exploit it (for example, FairBot might have an oracle that can answer questions about Masquerade's oracle, or FairBot might look for proofs up to length 2^{N} while Masquerade only looks up to length N), then it can prove (by exhaustion if nothing else) that Masquerade will select a cooperative mask after all. But since Masquerade needs to reason about Masquerade at this level, this approach goes nowhere. (At first, I thought that having a weaker oracle for Masquerade's search through masks, and a stronger oracle both for each mask and for Masquerade's sanity-check, would solve this. But that doesn't get off the ground: the agent thus defined attains mutual cooperation with FairBot, but not with itself, because the weaker oracle can't prove that it attains mutual cooperation with FairBot.)

Another caveat is the following: FairBot may not be able to rule out the provability of some statement we know is false, but (given a large enough deductive capacity) it can prove that a certain result is the first of its kind in a given ordering of proofs. So if our agents act immediately on the first proof they find, then we could make a version of Masquerade work... as long as each search *does* find a proof, and as long as *that* fact is provable by the same deduction system. But there's an issue with this: two masks paired against each other won't necessarily have provable outcomes!

Let's consider the following mask agent, which we'll call AntiFairBot: it searches for a proof that its opponent cooperates against it, and it *defects* if it finds one; if it doesn't find such a proof, then it *cooperates*. This may not be a very optimal agent, but it has one interesting property: if you pit AntiFairBot against FairBot, and the two of them use equivalent oracles, then it takes an oracle stronger than either to deduce what the two of them will do! Thus, Masquerade can't be sure that AntiFairBot won't get the highest payoff against FairBot (which of course it won't) unless it uses a stronger deduction system for the search through masks than FairBot uses for its proof search (which would mean that FairBot won't be able to tell what mask Masquerade picks).

I tried to fix this by iterating over only some of the masks; after all, there's no realistic opponent against whom AntiFairBot is superior to both FairBot and DefectBot. Unfortunately, at this point I realized two things: in order to play successfully against a reasonable range of opponents on the Prisoner's Dilemma, **Masquerade needs to be able to imitate at least both FairBot and DefectBot**; and **FairBot cannot prove that FairBot defects against DefectBot**. (There are variants of FairBot that *can* do so, e.g. it could search both for proofs of cooperation and proofs of defection and playing symmetrically if it finds one, but this variant is no longer guaranteed to cooperate against itself!)

If there are any problems with this reasoning, or an obvious fix that I've missed, please bring it to my attention; but otherwise, I've decided that my approach has failed drastically enough that it's time to do what Eliezer calls "halt, melt, and catch fire". The fact that Löbian cooperation works is enough to keep me optimistic about formalizing this side of decision theory in general, but the ideas I was using seem insufficient to succeed. (Some variant of "playing chicken with my deductive system" might be a crucial component.)

Many thanks to all of the excellent people who gave their time and attention to this idea, both on and offline, especially Eliezer, Vladimir Slepnev, Nisan, Paul Christiano, Critch, Alex Altair, Misha Barasz, and Vladimir Nesov. Special kudos to Vladimir Slepnev, whose gut intuition on the problem with this idea was immediate and correct.

View more: Next