This post requires some knowledge of decision theory math. Part of the credit goes to Vladimir Nesov.

Let the universe be a computer program U that returns a utility value, and the agent is a subprogram A within U that knows the source code of both A and U. (The same setting was used in the reduction of "could" post.) Here's a very simple decision problem:

def U():
  if A() == 1:
    return 5
  else:
    return 10

The algorithm for A will be as follows:

  1. Search for proofs of statements of the form "A()=a implies U()=u". Upon finding at least one proof for each possible a, go to step 2.
  2. Let L be the maximum length of proofs found on step 1, and let f(L) be some suitably fast-growing function like 10^L. Search for proofs shorter than f(L) of the form "A()≠a". If such a proof is found, return a.
  3. If we're still here, return the best a found on step 1.

The usual problem with such proof-searching agents is that they might stumble upon "spurious" proofs, e.g. a proof that A()==2 implies U()==0. If A finds such a proof and returns 1 as a result, the statement A()==2 becomes false, and thus provably false under any formal system; and a false statement implies anything, making the original "spurious" proof correct. The reason for constructing A this particular way is to have a shot at proving that A won't stumble on a "spurious" proof before finding the "intended" ones. The proof goes as follows:

Assume that A finds a "spurious" proof on step 1, e.g. that A()=2 implies U()=0. We have a lower bound on L, the length of that proof: it's likely larger than the length of U's source code, because a proof needs to at least state what's being proved. Then in this simple case 10^L steps is clearly enough to also find the "intended" proof that A()=2 implies U()=10, which combined with the previous proof leads to a similarly short proof that A()≠2, so the agent returns 2. But that can't happen if A's proof system is sound, therefore A will find only "intended" proofs rather than "spurious" ones in the first place.

Quote from Nesov that explains what's going on:

With this algorithm, you're not just passively gauging the proof length, instead you take the first moral argument you come across, and then actively defend it against any close competition

By analogy we can see that A coded with f(L)=10^L will correctly solve all our simple problems like Newcomb's Problem, the symmetric Prisoner's Dilemma, etc. The proof of correctness will rely on the syntactic form of each problem, so the proof may break when you replace U with a logically equivalent program. But that's okay, because "logically equivalent" for programs simply means "returns the same value", and we don't want all world programs that return the same value to be decision-theoretically equivalent.

A will fail on problems where "spurious" proofs are exponentially shorter than "intended" proofs (or even shorter, if f(L) is chosen to grow faster). We can probably construct malicious examples of decision-determined problems that would make A fail, but I haven't found any yet.

New to LessWrong?

New Comment


39 comments, sorted by Click to highlight new comments since:

I found myself wondering if there are any results about the length of the shortest proof in which a proof system can reach a contradiction, and found the following papers:

Paper 1 talks about partial consistency. We have statements of the following form:

) is a statement that there is no ZF-proof of contradiction with length =< n.

The paper claims that this is provable in ZF for each n. The paper then discusses results about the proof length of the partial consistency statements is polynomial in n. The author goes on to derive analogous results pertaining to frege proof systems weaker than ZF.

From these results it may be possible to have the agent's proof system produce a partial consistency result for a stipulated length.

Paper 2 shows that the question of whether a formula has a proof of length no more than k is undecidable.

Both papers seem relevant, but I don't presently have the time to think them through carefully.

There's stuff like that in Ch. 2 of Lindstrom's Aspects of Incompleteness, apparently elementary within the area.

Thanks to you and Nesov for the references. I didn't know this stuff and it looks like it might come in handy.

I'll copy more of my comment here then, with added links that explain some terms. (No credit to me for the post, I'm just interpreting the result.)

Basically, you've combined two previously existing ideas into an algorithm that's better than either of them alone. The first idea is that moral arguments with shorter proofs seem to capture some of what it means for a moral argument to be "natural", but it's flawed in that it's not clear how similar in complexity are the other moral arguments, and if it's a close call, then the distinction is not particularly natural. The second idea is to use the diagonal method to banish the spurious moral arguments beyond an arbitrary proof length, but its flaw is that we have to know the guarded proof length in advance.

With this algorithm, you're not just passively gauging the proof length, instead you take the first moral argument you come across, and then actively defend it against any close competition using the diagonal method.

I'm not sure why do you need a fast-growing function at all? It seems, L+1 would work just as well. Proofs longer than L are not considered in Step 1, and if Step 2 ensures that no spurious proof is found for length <=L, then the agent should be fine. Shouldn't it?

The problem with L+1 is that it permits for there to be two different moral arguments, say evaluating the hypothetical action A==1 as having different utility, that have approximately the same proof length (L and L+10, say). If that is the case, it's not clear in what sense is the shorter one any better (more "natural", less "spurious") than the longer one, since the difference in proof length is so insignificant.

On the other hand, if we have the shortest moral argument of length less than L, and then the next moral argument is of length at least 10^L, then the first moral argument looks clearly much better. An important point is that step 2 is not just checking that the proofs of the longer moral arguments are much longer, instead it actually makes it so, it wouldn't be the case if step 2 were absent.

Having f(L)=L+1 would be sufficient to protect against any "moral arguments" that involve first proving agent()!=a for some a. Do we have any examples of "spurious" proofs that are not of this form?

I agree that f(L)=L+1 should work okay in practice. Having f(L) grow fast is more for psychological reassurance and making it easier to prove things about the agent.

But that can't happen if A's proof system is sound, therefore A will find only "intended" proofs rather than "spurious" ones in the first place.

If A's proof system is sound, that directly implies that A won't generate "spurious" proofs. I don't see the point of deriving this from the fact that "spurious" proofs lead to contradictions.

It seems like step 2 is just validating that proofs are correct, and when it finds a problem, it doesn't seem to do anything good about it. What is the point?

If A's proof system is sound, that directly implies that A won't generate "spurious" proofs.

Nope. In the example problem from the post, A's proof system will happily prove statements like "A()==1 implies U()==100". After all, A() finishes in finite time and returns 2, so any proof system worth its salt will eventually prove that A()==1 is false by simulating A, and a false statement implies anything. The challenge is to not let the agent stumble upon spurious proofs before it returns a value.

To make this post more obvious to people like me and JGWeissman, it would be a good idea to include this in the original post! I came out of this post thinking "Why did I just read that?" and couldn't figure it out until this response.

OK, thanks! Included this reasoning in the post.

ahh, good, i did re-read stuff and i was like - did i skip some major piece of it? am i going insane lately?

Now it makes a lot more sense.

Sorry, the omission was due to illusion of transparency on my part. I'd spent so much time thinking about avoiding spurious counterfactuals that I couldn't imagine other people not seeing that as the most important problem :-)

any proof system worth its salt will eventually prove that A()==1 is false by simulating A,

Not if A includes the proof search! No sound proof system will be able to say anything interesting about programs that use the system itself.

If the program doesn't terminate, the proof system may be unable to prove that. But if the program terminates in finitely many steps, any good enough proof system allows for a proof that just simulates all the steps one by one. Of course the program itself won't reach that particular proof because the proof is too long.

It seems like step 2 is just validating that proofs are correct, and when it finds a problem, it doesn't seem to do anything good about it.

The activity of step 2 results in proofs of statements like [A()==1 => U()==100] being long (in agent's proof system, compared to length of proofs of the "natural" moral arguments like [A()==1 => U()==5]), something that wouldn't be the case had step 2 been absent. So it does perform tangible work.

Another thing is that I don't think spurious proofs can have influence on the decision in any case.

Take the simpler algorithm B:

  1. Search for proofs of statements of the form "B()=b implies U()=u". Upon finding at least one proof for each possible b, go to step 2.
  2. Return the best b found on step 1.

A spurious proof is one for b where B()≠b.

By definition, the result returned in step 2 will be B(). Therefore B() is the best b found in step 1. Since B()=b for the best b, this can not be from a spurious proof. If for any other b' there is a spurious proof, then the proved utilities must be worse, since otherwise B() would equal c. Therefore the existence of spurious proofs does not affect B at all.

It's sort of inspiring that you make all the expected mistakes someone would make when studying this problem for the first time, but you progress through them really fast. Maybe you'll soon get to the leading edge? That would be cool!

In this comment the mistake is thinking that "the best b found in step 1" is indeed the best b for the given problem. For example, in the problem given in the post, B could find a correct proof that B()=1 implies U()=5 and a spurious proof that B()=2 implies U()=0. Then B will go on to return 1, thus making the spurious proof also correct because its premise is false.

In this comment the mistake is thinking that "the best b found in step 1" is indeed the best b for the given problem. For example, in the problem given in the post, B could find a correct proof that B()=1 implies U()=5 and a spurious proof that B()=2 implies U()=0. Then B will go on to return 1, thus making the spurious proof also correct because its premise is false.

Ah, I think I get it. By returning 1, B makes B()=2 false, and therefore it can no longer proof anything about the cases where it would have returned 2. In essence, counter-factual reasoning becomes impossible.

However, these problems only happen if B does not find a proof of "B()=2 implies U()=10". If it does find such a proof then it would correctly return 2. This reminds me of bounded rationality. If B searches for all proofs up to some length L, then for large enough L it will also find the non-spurious ones.

Counter-factuals were my reason for writing U=U'(A). By allowing the argument to U' to be different from A you can examine counter-factual cases. But perhaps then you end up with just CDT.

If B searches for all proofs up to some length L, then for large enough L it will also find the non-spurious ones.

Right, and the non-spurious proofs will "kill" the spurious ones. That was the idea in the reduction of "could" post, linked from the current one. Nice work! Please continue thinking, maybe you'll find something that we missed.

Why do you need statements of the form "A()==a implies U()==u" instead of "U(a)==u"? Or if you insist that A is a function, those of the form "U(()↦a)==u".

Then A would of course just pick the action with the highest (provable) utility.

The difference is that you no longer have need to prove things about A inside A, which is impossible.

U does not receive an action as an argument, U is a program with no arguments that includes A as a subprogram.

But you can easily rewrite U() as U'(A'). I think that solves all the problems with self-referential A. At least this works for the example in this post. How would U look for, say, Newcomb's problem?

def U(A):
    if predict(A) == 1:
        box1 = 1000, box2 = 1000000
    else:
        box1 = 1000, box2 = 0
    if A() == 1:
        return box2
    else:
        return box1+box2

still works fine if A is a parameter of U.

One of the motivations for our branch of decision theory research is that the physics of our world looks more like an argumentless function containing many logical correlates of you than like a function receiving you as an argument. Of course having a general way to derive one representation from the other would be great news for us. You can view the post as supplying one possible way to do just that :-)

Mathematically, the two forms are equivalent as far as I can tell. For any U that contains A, you can rewrite it to U=U'(A). If A is a part of U, then in order to proof things about U, A would have to simulate or proof things about itself, which leads to all kinds of problems. However, if A knows that U=U'(A), it can consider different choices x in U'(x) without these problems.

For any U that contains A, you can rewrite it to U=U'(A).

...In many different ways. For example, you could rewrite it to a U' that ignores its argument and just repeats the source code of U. Or if U contains one original copy of A and one slightly tweaked but logically equivalent copy, your rewrite might pick up on one but not the other, so you end up defecting in the Prisoner's Dilemma against someone equivalent to yourself. To find the "one true rewrite" from U to U', you'll need the sort of math that we're developing :-)

Also, do you mean U()=U'(A) (dependence on source code) or U()=U'(A()) (dependence on return value)? There's a subtle difference, and the form U()=U'(A) is much harder to reason about, because U'(X) could do all sorts of silly things when X!=A and still stay equivalent to U when X=A.

Doesn't this totally kill the idea of resolving ASP by having the agent prove something about its own actions in less than N steps (for N<<L), that got posted last week?

I'm not sure. Can you expand? Also, the followup post seems a little more relevant to ASP.

Well, if it's possible for the predictor to prove (in N steps) that A()==1, then it's also possible for the agent to prove A()==1 in N steps, and so A()=/= 2 in fewer than L steps, and so the agent returns 2. So there is no proof that A()==1 that's N steps, so you can't get the million dollars.

In my formulation of ASP, the predictor has a stronger formal system than the agent, so some statements can have short proofs in the predictor's formal system but only long proofs in the agent's formal system. An extreme example is consistency of the agent's formal system, which is unprovable by the agent but an axiom for the predictor.

Hm. So what would you still need to win. You'd still need to prove that A()==1 implies U()==10^6. But if that's true, the only way the agent doesn't prove that A()==2 implies U()==10^6+10^3 is if the predictor is really always right. But that's not provable to the agent, because I'm pretty sure that relies on the consistency of the agent's formal system.

The ASP post already has a rigorous proof that this implementation of A will fail. But maybe we can find a different algorithm for A that would search for proofs of some other form, and succeed?

[-][anonymous]00

What happens with U() like this:

def U():
if(A()==1 and every Goodstein sequence terminates at 0) return 10
else if(A()==2) return 1
else return 5

[This comment is no longer endorsed by its author]Reply
[-][anonymous]00

That's a nice question, but it seems to be about a different formalism, not the one talking about programs. How do you encode "every Goodstein sequence terminates at 0" as a program?

[This comment is no longer endorsed by its author]Reply

How the hell will it find proof that U() returns 100 if U() does not have 100 in a single return statement, and clearly can only return 5 or 10? Or are you meaning 10 here? Still, how it can find such proof with any but a very broken proof system?

It seems to me that even a very simple set of axioms will allow it to treat A() as a black box that returns x here, arriving at correct proof that if x=1 then U()=5 , without ever recursing into itself. When i have to make bots in game do something, i dont make bots emulate bots, i just say, x is the every bot's output in those circumstances, then apply algebra to find x from equation that got x on both sides.

edit: furthermore, the proof systems that can prove a contradiction, can prove anything.

I guess i miss something about decision theories. I, basically, do applied math for a living, meaning that I do decisions using math, in the domain where it works. First thing I do when I need to get rid of recursion, is turn this into an equation with x (or many unknowns). You just treat the bot as black box, which outputs the value that the bot outputs. Then, you can arrive at equation with one solution or many solutions or infinitely many solutions. I don't need to secondguess myself by running myself in emulator. Black-box unknown works fine.

As one of my math teachers used to say, "if something's obvious, that means it's easy to prove". Can you prove that the agent cannot easily prove that A()=1 implies U()=100?

edit: nevermind, did read the post before explanation was added. Guys, seriously, this stuff is confusing enough and its really hard to see what is the point of exercise even, when theres no actual hints on that. Furthermore, it is not mainstream knowledge of any kind. The updateless decision theory is specifically lesswrong thing. It's not easy for those who write practical deciding agents to get the point of exercise with the newcomb's , as in practical approach, there's simply two ways to write down the payoffs in newcomb, which looks too much like typical problem formalizing problem descriptions written in English (and newcomb's just doesn't even look like a problem description that different people would understand exactly in same way)

It doesn't prove U()=100, it proves [A()=1 implies U()=100]. Simultaneously having the proof that [A()=1 implies U()=5] allows you to disprove A()=1, which is perfectly fine, since actually A()=2.