If PA is the system the agent uses, then by "the proof system," I mean PA. I was speaking generically, in case you'd prefer to use ZFC or PA + Con(PA) or whatever.
If the algorithm finds a proof that defection is better, it defects, period.
Er, are you saying that as soon as the algorithm finds such a proof, it stops searching and returns "defect"? That will not work, because then the Löbian reasoning won't go through, for the reason cousin_it gave. [Because PA cannot prove Con(PA), it also cannot prove that "there's a short proof that cooperate is better" implies "the agent cooperates" -- for all PA knows, PA might be inconsistent, and there might be an even shorter proof that defect is better.]
Or do you mean that the algorithm looks through the whole list of proofs, and if it never finds a proof that cooperate is better, it defects? Then we're in agreement; I was simply pointing out that in this case, we know that if the algorithm proves "defect is better," then it will not cooperate, because we know that PA is consistent, so if the algorithm proves "defect is better" then it will not prove "cooperate is better."
Because PA cannot prove Con(PA), it also cannot prove that "there's a short proof that cooperate is better" implies "the agent cooperates"
It can enumerate all shorter proofs, and see if any of them imply a different action of the agent. (Note that PA is reflexive, that is can prove that any given finite set of its statements is logically consistent.)
By requests from Blueberry and jimrandomh, here's an expanded repost of my comment which was itself a repost of my email sent to decision-theory-workshop.
(Wait, I gotta take a breath now.)
A note on credit: I can only claim priority for the specific formalization offered here, which builds on Vladimir Nesov's idea of "ambient control", which builds on Wei Dai's idea of UDT, which builds on Eliezer's idea of TDT. I really, really hope to not offend anyone.
(Whew!)
Imagine a purely deterministic world containing a purely deterministic agent. To make it more precise, agent() is a Python function that returns an integer encoding an action, and world() is a Python function that calls agent() and returns the resulting utility value. The source code of both world() and agent() is accessible to agent(), so there's absolutely no uncertainty involved anywhere. Now we want to write an implementation of agent() that would "force" world() to return as high a value as possible, for a variety of different worlds and without foreknowledge of what world() looks like. So this framing of decision theory makes a subprogram try to "control" the output of a bigger program it's embedded in.
For example, here's Newcomb's Problem:
A possible algorithm for agent() may go as follows. Look for machine-checkable mathematical proofs, up to a specified max length, of theorems of the form "agent()==A implies world()==U" for varying values of A and U. Then, after searching for some time, take the biggest found value of U and return the corresponding A. For example, in Newcomb's Problem above there are easy theorems, derivable even without looking at the source code of agent(), that agent()==2 implies world()==1000 and agent()==1 implies world()==1000000.
The reason this algorithm works is very weird, so you might want to read the following more than once. Even though most of the theorems proved by the agent are based on false premises (because it is obviously logically contradictory for agent() to return a value other than the one it actually returns), the one specific theorem that leads to maximum U must turn out to be correct, because the agent makes its premise true by outputting A. In other words, an agent implemented like that cannot derive a contradiction from the logically inconsistent premises it uses, because then it would "imagine" it could obtain arbitrarily high utility (a contradiction implies anything, including that), therefore the agent would output the corresponding action, which would prove the Peano axioms inconsistent or something.
To recap: the above describes a perfectly deterministic algorithm, implementable today in any ordinary programming language, that "inspects" an unfamiliar world(), "imagines" itself returning different answers, "chooses" the best one according to projected consequences, and cannot ever "notice" that the other "possible" choices are logically inconsistent with determinism. Even though the other choices are in fact inconsistent, and the agent has absolutely perfect "knowledge" of itself and the world, and as much CPU time as it wants. (All scare quotes are intentional.)
This is progress. We started out with deterministic programs and ended up with a workable concept of "could".
Hopefully, results in this vein may someday remove the need for separate theories of counterfactual reasoning based on modal logics or something. This particular result only demystifies counterfactuals about yourself, not counterfactuals in general: for example, if agent A tries to reason about agent B in the same way, it will fail miserably. But maybe the approach can be followed further.