By request of Blueberry and jimrandomh, here's an expanded repost of my comment, which was itself a repost of my email to decision-theory-workshop.
(Wait, I gotta take a breath now.)
A note on credit: I can only claim priority for the specific formalization offered here, which builds on Vladimir Nesov's idea of "ambient control", which builds on Wei Dai's idea of UDT, which builds on Eliezer's idea of TDT. I really, really hope to not offend anyone.
(Whew!)
Imagine a purely deterministic world containing a purely deterministic agent. To make it more precise, agent() is a Python function that returns an integer encoding an action, and world() is a Python function that calls agent() and returns the resulting utility value. The source code of both world() and agent() is accessible to agent(), so there's absolutely no uncertainty involved anywhere. Now we want to write an implementation of agent() that would "force" world() to return as high a value as possible, for a variety of different worlds and without foreknowledge of what world() looks like. So this framing of decision theory makes a subprogram try to "control" the output of a bigger program it's embedded in.
For example, here's Newcomb's Problem:
def world():
    box1 = 1000                                   # transparent box: always contains $1000
    box2 = 0 if agent() == 2 else 1000000         # opaque box: empty iff the agent takes both boxes
    return box2 + (box1 if agent() == 2 else 0)   # action 2 means taking both boxes
A possible algorithm for agent() may go as follows. Look for machine-checkable mathematical proofs, up to a specified max length, of theorems of the form "agent()==A implies world()==U" for varying values of A and U. Then, after searching for some time, take the biggest found value of U and return the corresponding A. For example, in Newcomb's Problem above there are easy theorems, derivable even without looking at the source code of agent(), that agent()==2 implies world()==1000 and agent()==1 implies world()==1000000.
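Here's a minimal sketch of such an agent, with two caveats: proves() is a hypothetical placeholder for a bounded search for a machine-checkable proof of the given statement (nothing of the sort is implemented here), and the candidate actions and utilities are hard-coded for brevity rather than extracted from whatever proofs the search actually finds.

def proves(statement, max_proof_length=10**6):
    # Hypothetical placeholder: return True iff a machine-checkable proof of
    # `statement` (a sentence about agent() and world() in some formal theory,
    # say Peano arithmetic) exists with length at most max_proof_length.
    raise NotImplementedError

def agent():
    best_action, best_utility = 1, float("-inf")
    for action in (1, 2):                            # 1 = take one box, 2 = take both
        for utility in (0, 1000, 1000000, 1001000):  # candidate values of world()
            statement = "agent() == %d implies world() == %d" % (action, utility)
            if proves(statement) and utility > best_utility:
                best_action, best_utility = action, utility
    return best_action                               # the action with the highest proven utility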
The reason this algorithm works is very weird, so you might want to read the following more than once. Even though most of the theorems proved by the agent are based on false premises (because it is obviously logically contradictory for agent() to return a value other than the one it actually returns), the one specific theorem that leads to maximum U must turn out to be correct, because the agent makes its premise true by outputting A. In other words, an agent implemented like that cannot derive a contradiction from the logically inconsistent premises it uses, because if it could, it would "imagine" it could obtain arbitrarily high utility (a contradiction implies anything, including that) and would output the corresponding action, which would prove the Peano axioms inconsistent or something.
To recap: the above describes a perfectly deterministic algorithm, implementable today in any ordinary programming language, that "inspects" an unfamiliar world(), "imagines" itself returning different answers, "chooses" the best one according to projected consequences, and cannot ever "notice" that the other "possible" choices are logically inconsistent with determinism. Even though the other choices are in fact inconsistent, and the agent has absolutely perfect "knowledge" of itself and the world, and as much CPU time as it wants. (All scare quotes are intentional.)
This is progress. We started out with deterministic programs and ended up with a workable concept of "could".
Hopefully, results in this vein may someday remove the need for separate theories of counterfactual reasoning based on modal logics or something. This particular result only demystifies counterfactuals about yourself, not counterfactuals in general: for example, if agent A tries to reason about agent B in the same way, it will fail miserably. But maybe the approach can be followed further.
Note that world() essentially doesn't talk about what can happen; instead it talks about how to compute utility, and the computation of utility is a single fixed program without parameters (not even depending on the agent) that the agent "controls" from the inside.
To clarify, in the Newcomb example, the preference (world) program could be:
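(A sketch, with the two calls to agent() from the world() above replaced by agent1() and agent2(), two textually separate copies of the agent's source code:)

def world2():
    box1 = 1000
    box2 = 0 if agent1() == 2 else 1000000
    return box2 + (box1 if agent2() == 2 else 0)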
while the agent itself knows its own code in the form agent3(). If it can prove both agent1() and agent2() equivalent to agent3(), it can decide in exactly the same way as before, but now the preference program doesn't contain the agent even once: there is no explicit dependency of the world program (the utility of the outcome) on the agent's actions. Any dependency is logically inferred.
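Concretely (reusing the hypothetical proves() helper from the sketch above, with the candidate actions and utilities again hard-coded for brevity), the agent's side of this might look like:

def agent3():
    # Assumed provable from the source code: the copies embedded in world2()
    # compute the same value as this very function.
    if not proves("agent1() == agent3() and agent2() == agent3()"):
        return 1  # fall back to some default action
    # Now search for consequences of our own output. world2() itself never
    # mentions agent3(); the dependency is inferred, not hard-wired.
    best_action, best_utility = 1, float("-inf")
    for action in (1, 2):
        for utility in (0, 1000, 1000000, 1001000):
            statement = "agent3() == %d implies world2() == %d" % (action, utility)
            if proves(statement) and utility > best_utility:
                best_action, best_utility = action, utility
    return best_action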
And we now have a prototype of the notion of preference: a fixed program that computes utility.