Comment Permalink

Eliezer Yudkowsky16y250

Methodological remark: One should write at some point on a very debilitating effect that I've noticed in decision theory, philosophy generally, and Artificial Intelligence, which one might call Complete Theory Bias. This is the academic version of Need for Closure, the desire to have a complete theory with all the loose ends sewn up for the sake of appearing finished and elegant. When you're trying to eat Big Confusing Problems, like anything having to do with AI, then Complete Theory Bias torpedoes your ability to get work done by preventing you from navigating the space of partial solutions in which you can clearly say what you're trying to solve or not solve at a given time.

This is very much on display in classical causal decision theory; if you look at Joyce's Foundations of Causal Decision Theory, for example, it has the entire counterfactual distribution falling as mana from heaven. This is partially excusable because Pearl's book on how to compute counterfactual distributions had only been published, and hence only really started to be popularized, one year earlier. But even so, the book (and any other causal decision theories that did the same thing) should have carried a big sign saying, "This counterfactual distribution, where all the interesting work of the theory gets carried out, falls on it as manna from heaven - though we do consider it obvious that a correct counterfactual for Newcomb ought to say that if-counterfactual you one-box, it has no effect on box B." But this would actually get less credit in academia, if I understand the real rules of academia correctly. You do not earn humility points for acknowledging a problem unless it is a convention of the field to acknowledge that particular problem - otherwise you're just being a bother, and upsetting the comfortable pretense that nothing is wrong.

Marcello and I have all sorts of tricks for avoiding this when we navigate the space of fragmentary solutions in our own work, such as calling things "magic" to make sure we remember we don't understand them.

TDT is very much a partial solution, a solution-fragment rather than anything complete. After all, if you had the complete decision process, you could run it as an AI, and I'd be coding it up right now.

TDT does say that you ought to use Pearl's formalism for computing counterfactuals, which is progress over classical causal decision theory; but it doesn't say how you get the specific causal graph... since factoring the causal environment is a very open and very large AI problem.

Just like the entire problem of factoring the environment into a causal graph, there's a whole entire problem of reasoning under logical uncertainty using limited computing power. Which is another huge unsolved open problem of AI. Human mathematicians had this whole elaborate way of believing that the Taniyama Conjecture implied Fermat's Last Theorem at a time when they didn't know whether the Taniyama Conjecture was true or false; and we seem to treat this sort of implication in a rather different way than "2=1 implies FLT", even though the material implication is equally valid.

TDT assumes there's a magic module bolted on that does reasoning over impossible possible worlds. TDT requires this magic module to behave in certain ways. For the most part, my methodology is to show that the magic module has to behave this way anyway in order to get commonsense logical reasoning done - i.e., TDT is nothing special, even though the whole business of reasoning over impossible possible worlds is an unsolved problem.

To answer Robin's particular objection, what we want to do is drop out of TDT and show that an analogous class of reasoning problems apply to, say, pocket calculators. Let's say I know the transistor diagram for a pocket calculator. I type in 3 + 3, not knowing the answer; and upon the screen flashes the LED structure for "6". I can interpret this as meaning 3 + 3 = 6, or I can interpret it as a fact about the output of this sort of transistor diagram, or I can interpret it as saying that 3 + 3 is an even number, or that 2 3 is 6. And these may all tell me different things, at first, about the output of another, similar calculator. But all these different interpretations should generally give me compatible logical deductions about the other calculator and the rest of the universe. If I arrive at contradictory* implications by forming different abstractions about the calculator, then my magic logic module must not be sound.

The idea that you want to regard "all computations similar to yourself as having the same output" is just a gloss on the real structure. In the real version, there's a single canonical mathematical fact of which you are presently logically uncertain, the output of the Godelian diagonal:

Argmax[A in Actions] in Sum[O in Outcomes](Utility(O)P(*this computation yields A []-> O|rest of universe))

The this computation above is not a reference to your entire brain. It is a reference to that one equation above, the canonical diagonal form. It's assumed, in TDT, that you're implementing that particular equation - that TDT is how you make your decisions.

Then you assume that particular equation has a particular output, and update your view of the rest of the physical universe accordingly. In "almost" the same way you would update your view of the universe when you saw the calculator output "6". It might indeed depend on your logical reasoning engine. There might be things similar to yourself that you did not know were similar to yourself. If so, then you'll (all) do worse, because your logical reasoning engine is weaker. But you should at least not arrive at a contradiction, if your logical reasoning engine is at least sound.

What if you can only approximate that equation instead of computing it directly, so that it's possible that you and the equation will have different outputs? Should the equation be about your approximation of it, or should you just try to approximate the original equation? This is an open problem in TDT, which reflects the underlying open problem in AI; I just assumed there was enough computing power to do the above finite well-ordered computation directly. If you could show me a particular approximation, I might be able to answer better. Or someone could deliver a decisive argument for why any approximation ought to be treated a particular way, and that would make the problem less open in TDT, even though which approximation to use would still be open in AI.

(I also note at this point that the only way your counterfactual can apparently control the laws of physics, is if you know that the laws of physics imply that at least one answer is not compatible with physics, in which case you already know that option is not the output of the TDT computation, in which case you know it is not the best thing to do, in which case you are done considering it. So long as all answers seem not-visibly-incompatible with physics relative to your current state of logical knowledge, supposing a particular output should not tell you anything about physics.)

An example of a much more unsolved problem within TDT, which is harder to dispose of by appeal to normal non-TDT logical reasoning, is something that I only realized existed after reading Drescher; you actually can't update on the subjunctive / counterfactual output of TDT in exactly the same way you can update on the actually observed output of a calculator. In particular, if you actually observed something isomorphic to your decision mechanism output action A2, you could infer that A2 had higher expected utility than A1, including any background facts about the world or one's beliefs about it, that this would require; but if we only suppose that the mechanism is outputting A2, we don't want to presume we've just calculated that A2 > A1, but we do want to suppose that other decision mechanisms will output A2.

The two ways that have occurred to me for resolving this situation would be to (1) stratify the deductions into the physical and the logical, so that we can deduce within the counterfactual that other physical mechanisms will output "A2", but not deduce within our own logic internal to the decision process that A2 > A1. Or (2) to introduce something akin to a causal order within logical deductions, so that "A2 > A1" is a parent of "output = A2" and we can perform counterfactual surgery on "output = A2" without affecting the parent node.

36

What Program Are You?

36

36