The quining-cooperator algorithm in the Prisoner's Dilemma forms a "Nash equilibrium in algorithms" against itself. This is a very desirable property to have, which roughly corresponds to "reflective consistency" in single-player decision theory. However, the algorithm you're trying to develop (which includes maximizing utility against an opponent that plays a strategy unconditionally) will not form a "Nash equilibrium in algorithms" against itself, even in symmetric games.
On reflection: I agree that "Nash equilibrium in algorithms" would be desirable, but you seem to find it a more knock-out argument than I do. If we're discussing what kinds of algorithms to enter into something like Axelrod's tournament, then clearly, Nash equilibrium in algorithms would be a very compelling property. But if world() really is (a utility function over) the whole universe, then it's less clear to me that "take 1" is not the right thing to do when encountering a "take 9" rock.
Intuitively, the reason you wouldn't want to "take 1" in this case is that you would not want someone else -- Clippy, say -- a motive to leave "take 9" rocks around for you to find. But there's the counter-intuition that:
(a) If Clippy does this as a result of reasoning about how you behave, then that's in fact a different situation (the assumption being that you're reasoning about the source of the whole world, not just about the source of the "take 9" rock in front of you). What you do in this situation influences whether Clippy will have left behind a "take 9" rock, so your decision algorithm should not be able to conclude that you maximize utility by "taking 1." [I'm assuming an UDT-like agent which one can think of as taking the source of the world and returning the actions to perform in different contingencies, so it does not make sense to say that we cannot influence what Clippy does because our sensory input tells us that Clippy has "already" left behind a "take 9" rock.]
(b) If Clippy does not do this as a result of actually reasoning about how you behave, then if you "take 5" (say), Clippy will still have left a "take 9" rock behind -- by assumption, there's no logical causality from what you do to what Clippy does.
I can't formalize this argument, and I'm not sure that (b) especially is the right way of thinking about this problem, but this counter-intuition seems at least as compelling to me as the original intuition. Am I missing something?
-- All this said, I no longer think that the idea behind my proposed algorithm is useful. The problem is as far as I can see, if agent 1 runs "do X unless I can prove it's better to do Y" and agent 2 runs "do A unless I can prove that it's better to do B," then the agents won't play (X,B) or (Y,A) even if this would be the most desirable outcome for both of them -- or at least I can't prove that these outcomes are possible.
"Play minimax unless I can prove the other strategy is better" doesn't exactly run into this problem in symmetric games as long as the two available strategies have different minimax values, but what I'm really after is a decision theory that would act correctly if it finds such a game embedded in the world, and this does seem to run into the problem described above.
Hello again.
User Perplexed pointed me to Nash's 1953 paper Two-person cooperative games that seems to give a unique "fair" solution for all two-person competitive games in our setting. I've been thinking how to spin the algorithm in the paper into a flavor of utility maximization, but failing so far.
By requests from Blueberry and jimrandomh, here's an expanded repost of my comment which was itself a repost of my email sent to decision-theory-workshop.
(Wait, I gotta take a breath now.)
A note on credit: I can only claim priority for the specific formalization offered here, which builds on Vladimir Nesov's idea of "ambient control", which builds on Wei Dai's idea of UDT, which builds on Eliezer's idea of TDT. I really, really hope to not offend anyone.
(Whew!)
Imagine a purely deterministic world containing a purely deterministic agent. To make it more precise, agent() is a Python function that returns an integer encoding an action, and world() is a Python function that calls agent() and returns the resulting utility value. The source code of both world() and agent() is accessible to agent(), so there's absolutely no uncertainty involved anywhere. Now we want to write an implementation of agent() that would "force" world() to return as high a value as possible, for a variety of different worlds and without foreknowledge of what world() looks like. So this framing of decision theory makes a subprogram try to "control" the output of a bigger program it's embedded in.
For example, here's Newcomb's Problem:
A possible algorithm for agent() may go as follows. Look for machine-checkable mathematical proofs, up to a specified max length, of theorems of the form "agent()==A implies world()==U" for varying values of A and U. Then, after searching for some time, take the biggest found value of U and return the corresponding A. For example, in Newcomb's Problem above there are easy theorems, derivable even without looking at the source code of agent(), that agent()==2 implies world()==1000 and agent()==1 implies world()==1000000.
The reason this algorithm works is very weird, so you might want to read the following more than once. Even though most of the theorems proved by the agent are based on false premises (because it is obviously logically contradictory for agent() to return a value other than the one it actually returns), the one specific theorem that leads to maximum U must turn out to be correct, because the agent makes its premise true by outputting A. In other words, an agent implemented like that cannot derive a contradiction from the logically inconsistent premises it uses, because then it would "imagine" it could obtain arbitrarily high utility (a contradiction implies anything, including that), therefore the agent would output the corresponding action, which would prove the Peano axioms inconsistent or something.
To recap: the above describes a perfectly deterministic algorithm, implementable today in any ordinary programming language, that "inspects" an unfamiliar world(), "imagines" itself returning different answers, "chooses" the best one according to projected consequences, and cannot ever "notice" that the other "possible" choices are logically inconsistent with determinism. Even though the other choices are in fact inconsistent, and the agent has absolutely perfect "knowledge" of itself and the world, and as much CPU time as it wants. (All scare quotes are intentional.)
This is progress. We started out with deterministic programs and ended up with a workable concept of "could".
Hopefully, results in this vein may someday remove the need for separate theories of counterfactual reasoning based on modal logics or something. This particular result only demystifies counterfactuals about yourself, not counterfactuals in general: for example, if agent A tries to reason about agent B in the same way, it will fail miserably. But maybe the approach can be followed further.