I've been having difficulties with constructing a toy scenario for AI self-modification more interesting than Quirrell's game, because you really want to do expected utility maximization of some sort, but currently our best-specified decision theories search through the theorems of one particular proof system and "break down and cry" if they can't find one that tells them what their utility will be if they choose a particular option. This is fine if the problems are simple enough that we always find the theorems we need, but the AI rewrite problem is precisely about skirting that edge. It seems natural to want to choose some probability distribution over the possibilities that you can't rule out, and then do expected utility maximization (because if you don't maximize EU over some prior, it seems likely that someone could Dutch-book you); indeed, Wei Dai's original UDT has a "mathematical intuition module" black box which this would be an implementation of. But how do you assign probabilities to logical statements? What consistency conditions do you ask for? What are the "impossible possible worlds" that make up your probability space?
Recently, Wei Dai suggested that logical uncertainty might help avoid the Löbian problems with AI self-modification, and although I'm sceptical about this idea, the discussion pushed me into trying to confront the logical uncertainty problem head-on; then, reading Haim Gaifman's paper "Reasoning with limited resources and assigning probabilities to logical statements" (which Luke linked from So you want to save the world) made something click. I want to present a simple suggestion for a concrete definition of "impossible possible world", for a prior over them, and for an UDT algorithm based on that. I'm not sure whether the concrete prior is useful—the main point in giving it is to have a concrete example we can try to prove things about—but the definition of logical possible worlds looks like a promising theoretical tool to me.
*
Let S be the set of sentences of Peano Arithmetic less than 3^^^3 symbols long, and let Pow(S) := the set of all subsets of S. We interpret elements X of Pow(S) as "logical worlds", in which the sentences in X are true and the sentences in (S \ X) are false. Each world X can be represented by a single sentence sX, which is the conjunction of the sentences ({x | x in X} union {not y | y in S \ X}). Now, we exclude those X that are proven contradictory by one of the first 4^^^^4 theorems of PA; that is, we define the set W of possible worlds to be
W := {X in Pow(S) | "not sX" is not among the first 4^^^^4 theorems of PA}.
W is our probability space. Note that it's finite. For lack of a better idea, I propose to use the uniform prior on it. (Possibly useful: another way to think about this is that we choose the uniform prior on Pow(S), and after looking at the 4^^^^4 theorems, we do a Bayesian update.)
Individual sentences in S induce events over this probability space: a sentence x corresponds to the event {X in Pow(S) | x in X}. Clearly, all of the above can be carried out by a computable function, and in particular we can write a computable function P(.) which takes a statement in S and returns the probability of the corresponding event (and the source of this function can be short, i.e., it doesn't need to contain any 3^^^3-sized lookup tables).
*
The decision algorithm makes use of two global variables which specify the problem.
actions
is a Python dictionary that maps possible inputs to the algorithm to a list of outputs the algorithm is allowed to return when receiving that input. For example, in the problem from Wei Dai's UDT1.1 post,actions = {1: ['A','B'], 2: ['A','B']}
, meaning that your input is either '1' or '2', and in both cases you may choose between options 'A' and 'B'. We'll assume there's nothing 3^^^3-sized inactions
.worlds
is a list of triples of the form(p,U,us)
, representing possible physical worlds the agent might find itself in, wherep
is the probability of being in that world,U
is the source of a function that computes and returns the agent's utility if that world is the true one (by simulating that world and running a utility function over the result), andus
is a list of valuesU
might return. The probabilities must add to 1. We'll assume that there's nothing 3^^^3-sized here either, and that it's provable in much less that 4^^^^4 steps that if the decision algorithm halts on all inputs specified byactions
and returns one of the allowable actions, then eachU
will halt and return a value in the correspondingus
. (The reason for the condition is that the functionsU
may contain copies of the agent's source, and may make calls to the agent, so if the agent didn't halt, neither couldU
.)
With these provisions, the algorithm, UDT(input)
, proceeds as follows:
- Compute
mappings
, a list of all dictionaries that maps each possible input fromactions
to one of the allowable outputs. (In the earlier example,mappings = [{1:'A',2:'A'}, {1:'A',2:'B'}, {1:'B',2:'A'}, {1:'B',2:'B'}]
.) - Play chicken with the universe: For each
m
inmappings
, if P("UDT(i) == m[i]
for everyi
inactions.keys()
") = 0, then returnm[input]
. - Calculate expected utilities: For each
m
inmappings
, for each(p,U,us)
inworlds
, for eachu
inus
, computeq
:= P("U() == u
" | "UDT(i) == m[i]
for everyi
inactions.keys()
"); the expected utility EU(m
) ofm
is the sum of all the correspondingp*u*q
. (Note that we made sure in the previous step that the conditional probabilitiesq
exist.) - Choose the
m
with the highest expected utility. If multiple options have the same utility, choose the lexicographically lowest one. - Return
m[input]
.
Now, the universe must always chicken out (the algorithm will never need to return in step 2), because one of the possible worlds in W must be true, this true world cannot be ruled out by Peano Arithmetic because PA is sound, and if the algorithm returned m[input]
in step 2, then "UDT(i) == m[i]
for every i
in actions.keys()
" would hold in this true world, so the probability of this sentence could not be zero.
Further, the algorithm will halt on all inputs, because although it does some big computations, there is no unbounded search anywhere; and it's easy to see that on each possible input, it will return one of the allowable outputs. This reasoning can be formalized in PA. Using our earlier assumption, it follows (in PA, in much less than 4^^^^4 steps) that each U
will halt and return a value in the corresponding us
. Thus, in each possible logical world in W, for every (p,U,us)
in worlds
, "U()
halts" will be true, and "U() == u
" will be true for exactly one u
(more than one would quickly prove a contradiction), and this u
will be in us
; and therefore, the different q
for a given (p,U,us)
will add up to 1.
I haven't thought about it too much more, but I do not see a promising avenue to follow towards prowing that P_lim(x) converges.
Some good news: if I didn't make a mistake, the set of all complete theories extending PA is indeed a measurable set in the "countable sequence of coin tosses" space; hint: it's the countable intersection of a sequence of events induced by replacing 3^^^3 by larger and larger numbers. (I'd be happy to expand on this if someone is interested.) Therefore, we have a measurable subspace, on which an appropriate probability measure could exist. Unfortunately, the subspace has probability zero in the original probability space, so you can't just do a Bayesian update (that would imply a division by zero).
It just occurred to me that you could sidestep the question and instead define P_lim(x) directly. For example, you could say something like
P_lim(x) = P(PA+x+y is consistent | PA+y is consistent)
if y is randomly chosen from all statements in PA using a simple length-based prior. This way P_lim(x)=1 if PA proves X, and P_lim(x)=0 if PA disproves X, as expected. Then you could view your original definition as a sort of approximation scheme for this one.
This way of assigning probabilities as "degrees of consistency" doesn't seem to contain much more information than PA about the "true integers", but if we just need a prior, something like it could work.