Suppose I am a self-modifying AI reasoning about my own behavior (or about the behavior of another AI I am designing). 

To a human, it seems like it is very important that we trust our own deductions. That is, humans seem to believe "things I believe are probably true." Thus a human would not self-modify to stop acting on their beliefs. How do we formalize this sort of trust?

For simplicity say I exist in a fixed environment without uncertainty whose description I know (including my own existence in it). The only sort of statements I care about are mathematical statements: any property of our environment can be expressed as a purely mathematical statement. In order to act intelligently in this environment, I have a mathematical deduction engine which proves mathematical statements; my decisions are informed by the output of my deduction engine.

I am considering replacing my deduction engine with a pseudo-random statement evaluator in order to save energy (after all, I need energy for many purposes). Why shouldn't I do it? You might hope that the deduction engine would be able to tell me that my deduction engine is useful; that my deduction engine isn't just a pseudo-random statement evaluator. But an important property of mathematical deduction is the second incompleteness theorem: no reasonable proof system X can prove that X doesn't prove false statements. In fact the situation is more dire: if X ever proves a statement of the form "X wouldn't prove Y if Y weren't true" then X must also prove Y itself.
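In standard notation (with $\mathrm{Prov}_X$ the usual provability predicate for X; the symbols are mine, not anything in this post), these two facts read:

```latex
% Second incompleteness theorem: a consistent X cannot prove its own consistency.
%   X \nvdash \mathrm{Con}(X)
% Löb's theorem: if X proves "if X proves Y, then Y", then X already proves Y.
\[
  X \vdash \bigl(\mathrm{Prov}_X(\ulcorner Y \urcorner) \rightarrow Y\bigr)
  \quad\Longrightarrow\quad
  X \vdash Y .
\]
```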

My question is: what sort of confidence in its own reasoning can a consistent thinker actually have? I know that an agent sure of its own correctness will start believing everything. But what about an agent who is 99% confident of its own correctness? What does Löb's theorem look like when applied to probabilistic beliefs?

For example, suppose the algorithm X believes itself to be well-calibrated in the following sense. For any statement A and any probability p and any time T, consider the statements S1 = "At time T, X believes A with probability p," and S2 = "At time T, X believes A with probability p, and A is really true." We say that X believes itself to be well-calibrated about A at time T if, for all p, the probability X assigns to S2 is exactly p times the probability X assigns to S1.
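In symbols (writing $P_X(\cdot)$ for the probability X assigns to a statement, notation that is mine rather than the post's), the condition is:

```latex
% S1 = "at time T, X believes A with probability p",   S2 = S1 \wedge A.
% X believes itself to be well-calibrated about A at time T iff:
\[
  P_X(S_2) \;=\; p \cdot P_X(S_1) \qquad \text{for all } p \in [0,1].
\]
```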

Löb's theorem says that this sort of belief in well-calibration is impossible if X is capable of carrying out complicated reasoning: X would infer with probability 1 that "At time T, if X believes A with probability 1 then A is true." If X believes this for a large enough time T then X can apply the inferences in Löb's theorem and arrive at belief in A with probability 1 (I think).

Now what if X believes itself to be well-calibrated only with probability 99%, in the following sense. Define S1 and S2 as before. What if, for all p, the probability X assigns to S2 is between (p+0.01) and (p-0.01) times the probability X assigns to S1? Does this lead to contradictory beliefs? Is this a strong enough belief to ensure that X will keep itself running and invest effort in improving its performance?
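As a concrete sanity check, here is a small sketch of both the exact and the relaxed condition over a toy joint model standing in for X's beliefs; the worlds, weights, and function names are invented for illustration, not taken from the post:

```python
# Toy check of the exact and relaxed (+/- 0.01) calibration conditions.
# Each "world" fixes (weight, q, a): q is the probability X will report for A
# at time T, and a is whether A is actually true. The numbers are made up.
WORLDS = [
    (0.50, 0.9, True),
    (0.06, 0.9, False),   # reports 0.9 slightly too often when A is false
    (0.30, 0.2, True),    # badly miscalibrated at q = 0.2
    (0.14, 0.2, False),
]

def P(event):
    """Probability the toy model assigns to an event."""
    return sum(w for (w, q, a) in WORLDS if event(q, a))

def exactly_calibrated(p):
    s1 = P(lambda q, a: q == p)          # "X believes A with probability p"
    s2 = P(lambda q, a: q == p and a)    # "... and A is really true"
    return abs(s2 - p * s1) < 1e-12

def calibrated_within(p, eps=0.01):
    s1 = P(lambda q, a: q == p)
    s2 = P(lambda q, a: q == p and a)
    return (p - eps) * s1 <= s2 <= (p + eps) * s1

for p in (0.9, 0.2):
    print(p, exactly_calibrated(p), calibrated_within(p))
# Prints: 0.9 False True   (off, but within the 1% band)
#         0.2 False False  (off by far more than 1%)
```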

Of course given that we don't have any grasp of probabilistic reasoning about mathematics, maybe thinking about this issue is premature. But I would like to understand what probabilistic reasoning might look like in interesting cases to get more of a handle on the general problem.

Are there other good ways for X to believe in its own well-calibration? For example, in my description there is no explicit statement "X is well calibrated" which X can reason about, only particular special cases. Should there be such a statement? My description seems like it has many potential shortcomings, but I don't see anything better.

12 comments

I am considering replacing my deduction engine with a pseudo-random statement evaluator in order to save energy (after all, I need energy for many purposes). Why shouldn't I do it?

I think there's a straightforward answer to this part of your question.

"Decisions are informed by the output of my deduction engine" means, I presume, that the AI deduces the result of each possible choice, then makes the choice with the most desirable result.

Suppose the AI is considering replacing its deduction engine with a pseudo-random statement evaluator. The AI will deduce that this replacement will make future decisions pseudo-random. I assume that this will almost never give the most desirable result; therefore, the deducer will almost never be replaced.

If the AI works this way (which I think is compatible with just about any kind of decision theory), its display of "trust" in itself is the direct result of the AI using its deducer to predict the result of replacing its deducer; "trust" is inherent in the way the deducer determines decisions. The deduction engine is able to "tell you that [the] deduction engine is useful" not by deducing its own consistency, but by deducing a better result if the AI keeps it than if the AI replaces it with a pseudo-random substitute.
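Here is a minimal sketch of that picture, assuming a toy two-box world with made-up payoffs and a deducer that simply reads off the true payoff table (the names and numbers are invented, not from the thread):

```python
# Toy illustration: the agent decides whether to keep its deducer or replace it
# with a pseudo-random chooser, by using its *current* deducer to predict the
# value of each option. Utilities and actions are invented for this sketch.
import random

UTILITIES = {"open_A": 1.0, "open_B": -100.0}  # assumed payoffs

def deduced_action():
    """The deducer: picks the action it deduces to have the highest payoff."""
    return max(UTILITIES, key=UTILITIES.get)

def random_action():
    """The proposed replacement: picks an action pseudo-randomly."""
    return random.choice(list(UTILITIES))

def predicted_value(policy, samples=10_000):
    """Use the deducer's own model (here, the true payoff table) to predict
    the average payoff of following a given policy."""
    return sum(UTILITIES[policy()] for _ in range(samples)) / samples

keep = predicted_value(deduced_action)      # ~= 1.0
replace = predicted_value(random_action)    # ~= -49.5 on average
print("keep deducer:", keep, "| replace with RNG:", replace)
print("decision:", "keep" if keep >= replace else "replace")
```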

Does that make sense, and/or suggest anything about the larger issue you're talking about?

It makes sense, but I don't think I agree.

Suppose I am deciding whether to open box A or box B. So I consult my deduction engine, it tells me that if I open box B I die and if I open box A I get a nickel. So I open box A.

Now suppose that before making this choice I was considering "Should I use my deduction engine for choices between two boxes, or just guess randomly, saving energy?" The statement that the deduction engine is useful is apparently equivalent to

"If the deduction engine says that the consequences of opening box A are better than the consequences of opening box B, then the consequences of opening box A are better than the consequences of opening box B," which is the sort of statement the deduction engine could never itself consistently output. (By Loeb's theorem, it would subsequently immediately output "the consequences of opening box A are better than the consequences of opening box B" independent of any actual arguments about box A or box B. )

It seems the only way to get around this is to weaken the statement by inserting some "probably"s. After thinking about Löb's theorem more carefully, it may be the case that refusing to believe anything with probability 1 is enough to avoid this difficulty.

I still can't see why the AI, when deciding "A or B", is allowed to simply deduce consequences, while when deciding "Deduce or Guess" it is required to first deduce that the deducer is "useful", then deduce consequences. The AI appears to be using two different decision procedures, and I don't know how it chooses between them.

Can you define exactly when usefulness needs to be deduced? It seems that the AI can deduce consequences in either case without deducing usefulness.

Apologies if I'm being difficult; if you're making progress as it is (as implied by your idea about "probably"s), we can drop this and I'll try to follow along again next time you post.

Slight nitpick:

Now what if X believes itself to be well-calibrated only with probability 99%, in the following sense. Define S1 and S2 as before. What if, for all p, the probability X assigns to S2 is between p and 0.99p times the probability X assigns to S1?

This doesn't handle well near 0, since it basically specifies a certain range on a log-scale (unless this was intentional?). You might want to weaken this to "for all p, the probability X assigns to S2 is between p+0.01 and p-0.01 times the probability X assigns to S1".

You are quite right; I will change the post accordingly.

But that's overdoing it. I'd invariably enter the lottery because it has an expected 0.5% chance of success.

I don't follow. We are discussing agents that can prove that, for all S1, S2 as specified, S2 < 0.01+S1. This does not say how much less. It is possible that S1=S2, we just aren't concerned with proving that in this thought experiment.
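Reading S1 and S2 here as shorthand for the probabilities X assigns to them, a bound of this form follows from the relaxed calibration condition (my derivation, spelled out):

```latex
% From P_X(S_2) \le (p + 0.01)\,P_X(S_1), with p \le 1 and P_X(S_1) \le 1:
\[
  P_X(S_2) \;\le\; (p + 0.01)\,P_X(S_1)
           \;=\;  p\,P_X(S_1) + 0.01\,P_X(S_1)
           \;\le\; P_X(S_1) + 0.01 .
\]
```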

The only sort of statements I care about are mathematical statements: any property of our environment can be expressed as a purely mathematical statement.

I believe it's not just a practical difficulty that we can't really specify which mathematical statements the environment signifies. We can indeed learn something about certain mathematical statements as a result of observing the environment (in a somewhat confusing sense), but that doesn't tell us that the real world is exactly the same kind of business as math.

Consider such a simplified world in the interest of making the point clearer. Do you think this problem goes away when the world is not just math?

But an important property of mathematical deduction is the second incompleteness theorem: no reasonable proof system X can prove that X doesn't prove false statements.

True, but it also can't prove that it does prove false statements. It can prove that the pseudorandom number generator does prove false statements, so it's clearly at least as bad.

I don't think your method of self-trust is very complete. Sometimes, your errors will correlate. For example, if you see one thing as stronger evidence of god than it really is, you'll see the same of something else, until you have what you believe is overwhelming evidence.

I suppose you'd have to have some idea of how much the errors correlate. You might give some weird probability distribution for it, which would result in the probability distribution getting wider tails for each piece of evidence added. The middle would represent the errors being independent, and the tails would represent them correlating.

True, but it also can't prove that it does prove false statements. It can prove that the pseudorandom number generator does prove false statements, so it's clearly at least as bad.

Agreed, but trying to make correct deductions takes energy. If your deductions aren't helping you, you are better off not talking.

I don't think your method of self-trust is very complete.

The question was: what sort of evidence could you have for your own correctness. I don't yet know any way of expressing such self-trust at all.

Sometimes, your errors will correlate.

You can express more complicated types of self-trust by looking at your beliefs about other conjunctions of A, B, believing A, and believing B. I believe that this actually expresses all of the possible information. For example, you can look at the probability you assign to "I believe A and B, but neither is true", etc.

You can express more complicated types of self-trust by looking at your beliefs about other conjunctions of A, B, believing A, and believing B. I believe that this actually expresses all of the possible information.

No, you could also ask about your belief of (A & B), which is different from the conjunction of believing A and believing B, as you could believe A to be correlated with B. You could also try recursing further: "do I believe that I believe A?".
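To make that distinction concrete, here is a small invented example with binarized beliefs (a simplification of the post's probabilistic setting): a toy joint model over whether A and B hold and whether X believes each, chosen so that X's belief in (A & B) comes apart from the conjunction of believing A and believing B.

```python
# Invented joint model over (A true?, B true?, X believes A?, X believes B?).
# The weights are arbitrary, chosen so that X's beliefs are correlated more
# strongly than the facts themselves are.
WORLDS = [  # (weight, A, B, bel_A, bel_B)
    (0.40, True,  True,  True,  True),
    (0.20, True,  False, True,  True),   # X wrongly believes B
    (0.10, False, False, True,  True),   # X believes both; neither is true
    (0.15, False, True,  False, False),
    (0.15, False, False, False, False),
]

def P(event):
    return sum(w for (w, A, B, bA, bB) in WORLDS if event(A, B, bA, bB))

print("P(A and B)                     =", P(lambda A, B, bA, bB: A and B))    # 0.40
print("P(believes A and believes B)   =", P(lambda A, B, bA, bB: bA and bB))  # ~0.70
print("P(believes both, neither true) =",
      P(lambda A, B, bA, bB: bA and bB and not A and not B))                  # 0.10
```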