I've been having difficulties with constructing a toy scenario for AI self-modification more interesting than Quirrell's game, because you really want to do expected utility maximization of some sort, but currently our best-specified decision theories search through the theorems of one particular proof system and "break down and cry" if they can't find one that tells them what their utility will be if they choose a particular option. This is fine if the problems are simple enough that we always find the theorems we need, but the AI rewrite problem is precisely about skirting that edge. It seems natural to want to choose some probability distribution over the possibilities that you can't rule out, and then do expected utility maximization (because if you don't maximize EU over some prior, it seems likely that someone could Dutch-book you); indeed, Wei Dai's original UDT has a "mathematical intuition module" black box which this would be an implementation of. But how do you assign probabilities to logical statements? What consistency conditions do you ask for? What are the "impossible possible worlds" that make up your probability space?

Recently, Wei Dai suggested that logical uncertainty might help avoid the Löbian problems with AI self-modification, and although I'm sceptical about this idea, the discussion pushed me into trying to confront the logical uncertainty problem head-on; then, reading Haim Gaifman's paper "Reasoning with limited resources and assigning probabilities to logical statements" (which Luke linked from So you want to save the world) made something click. I want to present a simple suggestion for a concrete definition of "impossible possible world", for a prior over them, and for an UDT algorithm based on that. I'm not sure whether the concrete prior is useful—the main point in giving it is to have a concrete example we can try to prove things about—but the definition of logical possible worlds looks like a promising theoretical tool to me.

*

Let S be the set of sentences of Peano Arithmetic less than 3^^^3 symbols long, and let Pow(S) := the set of all subsets of S. We interpret elements X of Pow(S) as "logical worlds", in which the sentences in X are true and the sentences in (S \ X) are false. Each world X can be represented by a single sentence sX, which is the conjunction of the sentences ({x | x in X} union {not y | y in S \ X}). Now, we exclude those X that are proven contradictory by one of the first 4^^^^4 theorems of PA; that is, we define the set W of possible worlds to be

    W := {X in Pow(S) | "not sX" is not among the first 4^^^^4 theorems of PA}.

W is our probability space. Note that it's finite. For lack of a better idea, I propose to use the uniform prior on it. (Possibly useful: another way to think about this is that we choose the uniform prior on Pow(S), and after looking at the 4^^^^4 theorems, we do a Bayesian update.)

Individual sentences in S induce events over this probability space: a sentence x corresponds to the event {X in Pow(S) | x in X}. Clearly, all of the above can be carried out by a computable function, and in particular we can write a computable function P(.) which takes a statement in S and returns the probability of the corresponding event (and the source of this function can be short, i.e., it doesn't need to contain any 3^^^3-sized lookup tables).

*

The decision algorithm makes use of two global variables which specify the problem.

  • actions is a Python dictionary that maps possible inputs to the algorithm to a list of outputs the algorithm is allowed to return when receiving that input. For example, in the problem from Wei Dai's UDT1.1 post, actions = {1: ['A','B'], 2: ['A','B']}, meaning that your input is either '1' or '2', and in both cases you may choose between options 'A' and 'B'. We'll assume there's nothing 3^^^3-sized in actions.
  • worlds is a list of triples of the form (p,U,us), representing possible physical worlds the agent might find itself in, where p is the probability of being in that world, U is the source of a function that computes and returns the agent's utility if that world is the true one (by simulating that world and running a utility function over the result), and us is a list of values U might return. The probabilities must add to 1. We'll assume that there's nothing 3^^^3-sized here either, and that it's provable in much less that 4^^^^4 steps that if the decision algorithm halts on all inputs specified by actions and returns one of the allowable actions, then each U will halt and return a value in the corresponding us. (The reason for the condition is that the functions U may contain copies of the agent's source, and may make calls to the agent, so if the agent didn't halt, neither could U.)

With these provisions, the algorithm, UDT(input), proceeds as follows:

  1. Compute mappings, a list of all dictionaries that maps each possible input from actions to one of the allowable outputs. (In the earlier example, mappings = [{1:'A',2:'A'}, {1:'A',2:'B'}, {1:'B',2:'A'}, {1:'B',2:'B'}].)
  2. Play chicken with the universe: For each m in mappings, if P("UDT(i) == m[i] for every i in actions.keys()") = 0, then return m[input].
  3. Calculate expected utilities: For each m in mappings, for each (p,U,us) in worlds, for each u in us, compute q := P("U() == u" | "UDT(i) == m[i] for every i in actions.keys()"); the expected utility EU(m) of m is the sum of all the corresponding p*u*q. (Note that we made sure in the previous step that the conditional probabilities q exist.)
  4. Choose the m with the highest expected utility. If multiple options have the same utility, choose the lexicographically lowest one.
  5. Return m[input].

Now, the universe must always chicken out (the algorithm will never need to return in step 2), because one of the possible worlds in W must be true, this true world cannot be ruled out by Peano Arithmetic because PA is sound, and if the algorithm returned m[input] in step 2, then "UDT(i) == m[i] for every i in actions.keys()" would hold in this true world, so the probability of this sentence could not be zero.

Further, the algorithm will halt on all inputs, because although it does some big computations, there is no unbounded search anywhere; and it's easy to see that on each possible input, it will return one of the allowable outputs. This reasoning can be formalized in PA. Using our earlier assumption, it follows (in PA, in much less than 4^^^^4 steps) that each U will halt and return a value in the corresponding us. Thus, in each possible logical world in W, for every (p,U,us) in worlds, "U() halts" will be true, and "U() == u" will be true for exactly one u (more than one would quickly prove a contradiction), and this u will be in us; and therefore, the different q for a given (p,U,us) will add up to 1.

New to LessWrong?

New Comment
24 comments, sorted by Click to highlight new comments since: Today at 2:28 PM

Reminder: Original scientific research on saving the world is valuable and you should upvote it.

I have a principle of not upvoting things I don't understand. Otherwise, I would.

Very nice that you're doing this! I'd like to understand your proposed prior a little better:

1) Does the probability of any given statement converge to a limit as the proof length goes to infinity? That's trivially true for statements that are proven true or false by PA, but what about independent statements like Con(PA)?

2) Does the probability of any given statement strongly depend on the encoding used for statements, or can we somehow "wash out" that dependence?

I'd like to understand your proposed prior a little better

Me too! I.e., I understand the definition, but not very much about its properties. Your questions are helpful, thanks!

1) Does the probability of a single statement converge to a limit as the proof length goes to infinity?

You mean if we keep the 3^^^3 fixed, but replace 4^^^^4 by larger and larger numbers? Yes: for a particular world X, if (not s_X) is proven by PA at all, it's proven after some finite number of steps, and taking the maximum of these over all X, we see that there is some number we can replace 4^^^^4 by after which nothing changes. But it's important to keep in mind that the probabilities our agent uses may not be close to that limit! In particular, all but one of the probabilities we force to be >0 will in the limit be 0, because PA can obviously simulate our agent and figure out what it returns.

An interesting followup question is: consider the probability-in-the-limit P_lim(x) of some statement x, as in the previous paragraph. If we now increase the 3^^^3 parameter, does P_lim(x) tend to a limit? Don't have an answer to that one, so far.

2) Does the probability of a single statement strongly depend on the encoding used for statements, or can we somehow "wash out" that dependency?

If we keep the 3^^^3 number fixed and allow very weird encodings, it seems likely that we can do ugly things, though I haven't checked in detail. My intuition would be that if any nice result is to be had at all, it would be of a similar form as those about Kolmogorov complexity -- something like, two fixed encodings cannot differ by more than a constant if you let the numeric parameters grow large enough. But that's just a guess.

Yeah, I guess your "interesting followup question" is what interests me as well...

Oops, our comments crossed. Here are two points I wrote in reply to your original comment:

  • Nit: I don't think there's any particular reason to think that [lim_m lim_n prior at (m,n)] will be equal to [lim_n lim_m prior at (m,n)]. (You wrote about what we get at "(infinity,infinity)".)

  • I would hope that if the proposed limit exists at all, you could define it over a nice probability space: I imagine that the individual outcomes ("worlds") would be all sets of sentences of the language of PA, the sigma algebra of events would be the same as for the probability space of a countable sequence of coin tosses (a standard construction), and the set of all complete theories extending PA would be measurable and have probability 1. (Or maybe that is too much to hope for...)

I haven't thought about it too much more, but I do not see a promising avenue to follow towards prowing that P_lim(x) converges.

Some good news: if I didn't make a mistake, the set of all complete theories extending PA is indeed a measurable set in the "countable sequence of coin tosses" space; hint: it's the countable intersection of a sequence of events induced by replacing 3^^^3 by larger and larger numbers. (I'd be happy to expand on this if someone is interested.) Therefore, we have a measurable subspace, on which an appropriate probability measure could exist. Unfortunately, the subspace has probability zero in the original probability space, so you can't just do a Bayesian update (that would imply a division by zero).

It just occurred to me that you could sidestep the question and instead define P_lim(x) directly. For example, you could say something like

P_lim(x) = P(PA+x+y is consistent | PA+y is consistent)

if y is randomly chosen from all statements in PA using a simple length-based prior. This way P_lim(x)=1 if PA proves X, and P_lim(x)=0 if PA disproves X, as expected. Then you could view your original definition as a sort of approximation scheme for this one.

This way of assigning probabilities as "degrees of consistency" doesn't seem to contain much more information than PA about the "true integers", but if we just need a prior, something like it could work.

It doesn't seem to me like your prior needs to encode information about the "true integers" any more than it needs to encode information about the world. It just needs to be flexible enough to see the structure that is there.

Maybe this is what you mean by "if we just need a prior," but it seems like an important point.

While we're at it, it's not even clear our prior needs to capture PA. Maybe you could use something much weaker, perhaps even a theory so weak it could assert its own (cut free) consistency, and then just learn the structure of PA by observing some computations (or however you gain info about logical facts).

[-][anonymous]12y00

OK, so by taking the limit of priors at (m,n) we get a well-defined prior at (m,infinity). What happens if we take the limit of those, do we get a nice prior at (infinity,infinity) for individual statements (not worlds)?

[This comment is no longer endorsed by its author]Reply
[-][anonymous]12y00

[ETA: oops, comments crossed; cousin_it originally posted a restatement of my followup question.]

Yes, that's what I meant. Nit, though: I don't think there's any particular reason to think that [lim_m lim_n prior at (m,n)] will be equal to [lim_n lim_m prior at (m,n)].

I would hope that if the proposed limit exists at all, you could define it over a nice probability space: I imagine that the individual outcomes ("worlds") would be all sets of sentences of the language of PA, the sigma algebra of events would be the same as for the probability space of a countable sequence of coin tosses (a standard construction), and the set of all complete theories extending PA would be measurable and have probability 1. (Or maybe that is too much to hope for...)

[This comment is no longer endorsed by its author]Reply

We interpret elements X of Pow(S) as "logical worlds", in which the sentences in X are true and the sentences in (Pow(S) \ X) are false.

Shouldn't this be (S \ X) instead of (Pow(S) \ X)?

Fixed, thank you!

This seems like it's sweeping the hard work under a rug - that is, managing the inconsistencies between short proofs that give you probabilities and long proofs that give you certainties.

Ok, but that wasn't the question I was trying to solve; at this point, I was merely trying to understand what expected utility maximization over logical uncertainty could even mean, in principle. Before the insight described in this post, I was able to think about scenarios like Vladimir Nesov's Counterfactual Mugging and Logical Uncertainty verbally, or by making a model specific to this situation, but I wasn't able to give a definition of an agent that maximizes utility over logical possibilities in enough detail to prove something about how it would behave in a given situation.

(To formalize counterfactual mugging, I wouldn't use the uniform prior from this post, though -- I'd use the definition of impossible possible world and leave the prior as a black box satisfying the obvious assumption about high digits of pi.)

[ETA: Hmm, on second thoughts, counterfactual mugging still isn't the easiest thing to actually formalize, though :-)]

Ah, I think I see what the insight was - you're not actually trying to solve logical uncertainty, you're trying to solve a specific problem of UDT with logical uncertainty, as outlined in your third link, and you're doing this by replacing a search through proofs with an iteration over "logical worlds" where the proof would be true or false.

Am I close?

EDIT: well, I guess that's not totally true - you are trying to solve logical uncertainty, in the sense that logical uncertainty is going to be defined relative to some bounded model of decision-making, so you're proposing a model.

In "2. Play chicken with the universe: For each m in mappings, if P("UDT(i) == m[i] for every i in mappings.keys()") = 0, then return m[input]." do you actually mean actions.keys() instead of mappings.keys()?

I think this applies to all subsequent uses of mappings.keys() too.

Indeed I do and indeed it does. Thanks!

For lack of a better idea, I propose to use the uniform prior on it.

If we consider only universally quantified statements: Conjunctive statements in each world are redundant as the world itself is a conjunction anyway. If each world contains only disjunctive statements, shouldn't the worlds that assign truth to longer statements be more likely, as it is easier to satisfy disjunctive statements with more terms?

Um, I don't see value in excluding statements that have e.g. a forall-exists-forall-exists sequence of quantifiers at the outer level?

But yeah, there may well be some systematic bias, and I have no particular reason to believe that easier satisfiability of long disjunctive statements could not be the cause of such a bias... unfortunately, my insight isn't good enough to offer anything but guesses. As I said in the post, I'm really not sure whether this is a good choice of prior or not; the main point of having it is to have something concrete to think about, possibly as a stand-in for a better prior over the same set of "worlds".

Um, I don't see value in excluding statements that have e.g. a forall-exists-forall-exists sequence of quantifiers at the outer level?

I just wanted to show it seems possible to do better and that was easier to do considering only a subset of statements.

Where exactly is the logical prior being used in the decision procedure? It doesn't seem like it would be used for calculating U, as U was implied to be computable. I don't see why it would be needed for p, as p could be computed from the complexity of the program, perhaps via kolmogorov complexity. Also, what is the purpose of us? Can't the procedure just set us to be whatever U outputs?

[-][anonymous]12y-40

Universal prior, speed prior, log-normal prior, concrete prior... surely we can get more creative. How about, badass prior? Awesome prior? Katana prior? Dai prior? Five finger exploding heart prior?

[This comment is no longer endorsed by its author]Reply