A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.
Ok, here is the problem:
- You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn't happen.
Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?
Acts and beliefs
The main trick is the parenthetical "(or acts as if it believed)". If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn't happen. So the problem reduces to:
- How can an AI, that believes that X is almost impossible, behave as if X were almost certain?
This will be constructed by building on my idea of "utility indifference" (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled "utility indifference").
Noisy events and thermodynamic miracles
Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what "counts" as a miracle).
The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w)=constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn't matter; the result is that the AI's actions have no impact on utility in any world where the miracle didn't happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that was a certainty: for only in those worlds do its actions have any impact.
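Here is a minimal Python sketch of that indifference effect, with made-up numbers and action names (P_MIRACLE, C and the two candidate actions are illustrative assumptions, not part of the design): when every no-miracle world yields the same constant, the expected-utility maximiser's choice is driven entirely by the miracle worlds, however improbable they are.

```python
# Minimal sketch of "utility indifference" as described above. All names and
# numbers here are illustrative assumptions, not part of the original design.

P_MIRACLE = 1e-30   # assumed (tiny) probability of the thermodynamic miracle
C = 0.0             # every no-miracle world gets this same constant utility

# Utility of each candidate action *conditional on the miracle having happened*
u_given_miracle = {
    "act_as_if_ON_fails":   10.0,
    "act_as_if_ON_arrives":  1.0,
}

def expected_utility(action):
    # The no-miracle term is identical for every action, so it cannot
    # distinguish between them; only the miracle term matters.
    return (1 - P_MIRACLE) * C + P_MIRACLE * u_given_miracle[action]

best = max(u_given_miracle, key=expected_utility)
print(best)  # "act_as_if_ON_fails", however small P_MIRACLE is (if non-zero)
```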
You can probably see where I'm going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit "ON" binary code. Given any other code, the second AI won't turn on, and will erase itself. We send the "ON" code.
So what happens? The first AI reasons that, given a thermodynamic miracle (the only worlds it cares about), the output signal is going to be 100 random bits, which have only a 2^-100 chance of being "ON". Therefore it will behave as if it's almost certain "ON" will not get through the gas.
And what will (almost certainly) happen in reality? We send the "ON" signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.
We should develop this point a little. The AI is almost certain, ahead of time, that "ON" will happen. It behaves, however, as if "ON" is almost certain not to happen. Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time it will be backtracking, why does it still go ahead with its behaviour?
The answer is that most of the universes in which it backtracks are universes where its actions have no consequences, and hence it doesn't care about them. So it might have a 1-ε chance of backtracking, but it only cares about a proportion ε(2^-100) of those worlds. On the other hand, it won't backtrack with probability ε(1-2^-100), but it cares about those worlds completely.
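A rough bookkeeping sketch of those two quantities, with a purely illustrative assumed miracle probability ε = 10^-12:

```python
# Bookkeeping for the paragraph above; eps is an assumed, illustrative value.
from fractions import Fraction

eps = Fraction(1, 10**12)                # assumed chance of a thermodynamic miracle
p_on_given_miracle = Fraction(1, 2**100) # chance a random overwrite equals "ON"

# Probability of each observable outcome:
p_backtrack    = (1 - eps) + eps * p_on_given_miracle   # "ON" comes out of the wire
p_no_backtrack = eps * (1 - p_on_given_miracle)         # something else comes out

# Measure of the worlds the AI actually cares about (miracle worlds only):
cared_backtrack    = eps * p_on_given_miracle            # miracle wrote "ON" anyway
cared_no_backtrack = eps * (1 - p_on_given_miracle)      # miracle wrote something else

print(float(p_backtrack))        # ~1: backtracking is almost certain...
print(float(cared_backtrack))    # ...but carries almost none of the AI's concern
print(float(cared_no_backtrack)) # nearly all the concern sits on non-"ON" miracles
```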
False Thermodynamic Miracles
This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal) while we "know" that actually it was just the same signal going through.
By varying what we put at the other end of the wire, we can get the AI to act "as if" it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there was a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).
This allows us to consider AI designs with counterfactuals, false "beliefs", etc., which will be of great use. Are there any flaws I'm missing? Practical flaws are ok (e.g. how do you define a thermodynamic miracle and make sure the AI can't affect it?), but theoretical flaws are much more important at this stage.
In equations
Edit: (thanks to TylerJay)
In equation form, the AI is maximising
P(¬X)*C + P(X)*u(X, A),
for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since the C term does not depend on the AI's actions, this is exactly the same as maximising u(X, A): the probability P(X) is irrelevant (so long as it is non-zero).
The whole setup described is simply a way to ensure that if W is the likely set of worlds consistent with observations after ¬X/X, then
P(W) ≈ P(¬X) ≈ 1 (we "know" that X doesn't happen and that we end up in W),
while
P(W|X) << 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).
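As a sanity check on that claim, here is a small numerical sketch (arbitrary utilities, exact rational arithmetic to avoid underflow) confirming that the maximising action does not move when P(X) or C change, so long as P(X) > 0:

```python
# Quick numerical check of the claim above (arbitrary made-up utilities):
# the action maximising P(not X)*C + P(X)*u(X, a) never depends on P(X) or C,
# provided P(X) > 0. Exact rationals avoid floating-point underflow when
# P(X) is tiny.
from fractions import Fraction as F

u_given_X = {"a1": F(3), "a2": F(7), "a3": F(5)}   # u(X, a) for each action a

def best_action(p_x, c):
    return max(u_given_X, key=lambda a: (1 - p_x) * c + p_x * u_given_X[a])

for p_x in (F(1, 10**30), F(1, 1000), F(1, 2), F(999, 1000)):
    for c in (F(-100), F(0), F(100)):
        assert best_action(p_x, c) == "a2"   # the argmax never changes
print("argmax is independent of P(X) and C")
```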
Same result, computationally
The above was a "physical" design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone's design.
To adapt that design, assume Ω will run UDT/TDT agent A to get its estimate of the probability of each of the 2^100 possible sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.
It seems that A must assign equal probability 2^-100 to each sequence (as any increase in probability makes that sequence less likely to appear, and any decrease makes it more likely). Therefore, we "know" that the signal will be "0000...0000", but the AI cannot assign a higher probability to that. This achieves the same outcome as above.
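To make the selection rule concrete, here is a toy sketch shrunk to 3-bit sequences; report stands in for whatever distribution A announces, and omega_choice is a hypothetical helper implementing "lowest assigned probability, ties broken lexicographically":

```python
# Toy sketch of the computational variant, shrunk from 100-bit to 3-bit
# sequences. "report" stands in for whatever probability distribution the
# UDT/TDT agent A announces; Omega then sends the sequence A rated least
# likely, breaking ties lexicographically. Names here are illustrative.
from itertools import product

sequences = ["".join(bits) for bits in product("01", repeat=3)]

def omega_choice(report):
    # lowest assigned probability wins; ties broken by lexicographic order
    return min(sequences, key=lambda s: (report[s], s))

# If A spreads probability uniformly (the apparent fixed point), Omega sends
# the lexicographically first sequence: all zeros.
uniform = {s: 1.0 / len(sequences) for s in sequences}
print(omega_choice(uniform))   # "000"

# Any sequence A rates as less likely is exactly the one that gets sent,
# which is why a non-uniform report undermines itself.
skewed = dict(uniform, **{"101": 0.01})
print(omega_choice(skewed))    # "101"
```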
I personally feel the physical design is more secure - it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over the 2^100 sequences, messing up our certainty as to the result.
This is really interesting. I thought I understood it and I wanted to verify that by trying to summarize it (and maybe help others too) but now I'm not so sure...
Edit: Just to save anybody the reading time, my reasoning is false below. After sleeping on it, I see my mistake. Nothing below the "False Thermodynamic Miracles" subheading made sense to me yesterday because I thought the purpose of the setup was to have an "off switch" on the simulated AI under the false belief (letting it see the result of the signal after some time period). I get it now though. Maximizing "[P(no miracle) * C] + [P(miracle) * u(A) given miracle]" is the same as maximizing "u(A) given miracle". So the AI will act as if the miracle happened, because there's no cost associated with those actions if it didn't happen—only benefits if it did.
As I understand it, the basic premise is that the AI has a utility function u(w,A) where "w" is whether or not the specific thermodynamic miracle (TM) occurred and A represents a regular/normal input to a utility function (state of the universe, action taken, or whatever).
u(w,A) = C (a low constant) if w is false (TM did not happen), and u(w,A) is essentially "u(A) given w" when w is true (ie. it's a "normal" kind of utility function, dominated by things other than the TM, though it still has to incorporate the fact that the TM happened into its model of the world).
So, the signal is sent, and it calculates its expected utility for either scenario. If w = false, u(w,A) = a low constant, and if w = true, it determines an action A that maximizes the utility function u(A)|w. As long as 2^-100 * u(A) > C, it starts taking action A.
Is this right so far?
This mechanism seems underspecified to me (though maybe it's obvious to others). Why would it backtrack (or what do you mean by backtrack)? Eventually, it observes that w = false (that "ON" went through unchanged) and that its actions are no longer beneficial, so it just stops doing anything, right? The process terminates or it goes to standby?
If we want the AI to actually backtrack, wouldn't we need to modify the utility function so that it takes a configuration parameter as well? u(w,A,X) s.t. X = true iff the AI is in its initial configuration, and set it so that u(F,A,F) = 0, but u(F,A,T) = C ? (this probably counts as just an implementation detail, but I figured I'd mention it anyway)
The part that makes me feel like I didn't actually understand it at all is:
Can you explain this a bit more? Do you mean that even after it observes "ON", it doesn't stop acting as if w = true? That P(TM that overwrites ON with ON) * u(A)|w > C ? If that's the case, then it would never backtrack, right? So it's essentially a full simulation of an AI under the assumption w, but with the knowledge that w is incredibly unlikely, and no built-in halting condition?
Thanks
C need not be a low constant, btw. The only requirement is that u(false, action a, A) = u(false, action b, A) for all actions a and b and all A; i.e. nothing the AI does affects the utility of worlds where w is false, so this does not constrain its actions.
Basically the AI observes the ON signal going through, and knows that either a) the signal went through normally, or b) the signal was overwritten by coincidence with exactly the same signal. Its actions have no consequences in the first case, so it ignores it, and acts "as if" it were certain that a thermodynamic miracle had happened.