A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.
Ok, here is the problem:
- You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn't happen.
Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?
Acts and beliefs
The main trick is the parenthetical "(or acts as if it believed)". If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn't happen. So the problem reduces to:
- How can an AI, that believes that X is almost impossible, behave as if X were almost certain?
This will be constructed by developing on my idea for "utility indifference" (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled "utility indifference").
Noisy events and thermodynamic miracles
Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what "counts" as a miracle).
The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w)=constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn't matter; the result is that the AI's actions have no impact on utility in any world where the miracle didn't happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that was a certainty: for only in those worlds do its actions have any impact.
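As a minimal sketch of this indifference trick (the action names and payoffs below are hypothetical, chosen only for illustration), one can check that the preferred action does not depend on how unlikely the miracle is, nor on the value of the constant. Exact fractions are used because floating-point rounding would swamp differences scaled by a tiny probability.

```python
from fractions import Fraction

def expected_utility(action, p_miracle, u_miracle, constant):
    """Expected utility when every no-miracle world is worth `constant`."""
    return (1 - p_miracle) * constant + p_miracle * u_miracle[action]

def best_action(p_miracle, u_miracle, constant):
    """Action with the highest expected utility."""
    return max(
        u_miracle,
        key=lambda a: expected_utility(a, p_miracle, u_miracle, constant),
    )

# Hypothetical payoffs, defined only in worlds where the miracle happened.
u_miracle = {"bet_on_miracle": Fraction(10), "bet_on_no_miracle": Fraction(1)}

# Vary the miracle probability and the constant; collect the chosen actions.
choices = {
    best_action(p, u_miracle, c)
    for p in (Fraction(1, 10**30), Fraction(1, 2), Fraction(1))
    for c in (Fraction(-1000), Fraction(0), Fraction(1000))
}
print(choices)
```

Under these assumptions the set of chosen actions has a single element: the action that does best given the miracle.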
You can probably see where I'm going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit "ON" binary code. Given any other code, the second AI won't turn on, and will erase itself. We send the "ON" code.
So what happens? The first AI reasons that given a thermodynamic miracle (the only worlds it cares about) the output signal is going to be 100 random bits, which have only a 2^-100 chance of being "ON". Therefore it will behave as if it's almost certain "ON" will not get through the gas.
And what will (almost certainly) happen in reality? We send the "ON" signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.
We should develop this point a little. The AI is almost certain, ahead of time, that "ON" will happen. It behaves, however, as if "ON" is almost certain not to happen. Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time it will be backtracking, why does it still go ahead with its behaviour?
The answer is that most of the universes in which it backtracks are universes where its actions have no consequences, and hence it doesn't care about them. So it might have a 1-ε chance of backtracking, but it only cares about a proportion ε·2^-100 of those worlds. On the other hand, it won't backtrack with probability ε(1-2^-100), but it cares about those worlds completely.
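The bookkeeping here can be checked with exact arithmetic. The following is only a sketch: the value of ε is a hypothetical placeholder, and the function and variable names are mine rather than part of the proposal.

```python
from fractions import Fraction

def outcome_probabilities(eps, n_bits=100):
    """Split the worlds into three cases, given miracle probability eps."""
    p_match = Fraction(1, 2**n_bits)        # random bits happen to equal "ON"
    p_on_no_miracle = 1 - eps               # "ON" arrives untouched
    p_on_miracle = eps * p_match            # miracle rewrites "ON" as "ON"
    p_other_miracle = eps * (1 - p_match)   # miracle produces some other code
    return p_on_no_miracle, p_on_miracle, p_other_miracle

eps = Fraction(1, 10**20)  # hypothetical miracle probability
untouched, on_by_miracle, other_code = outcome_probabilities(eps)

# Among worlds where "ON" arrives (the AI backtracks), the share it cares about:
cared_share_backtrack = on_by_miracle / (untouched + on_by_miracle)

# Among worlds it cares about (miracle worlds), the share with no backtracking:
no_backtrack_share_cared = other_code / (on_by_miracle + other_code)

print(float(cared_share_backtrack), float(no_backtrack_share_cared))
```

The first share is vanishingly small, while the second is exactly 1 - 2^-100, which is why the AI's plans are dominated by the worlds in which it never has to backtrack.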
False Thermodynamic Miracles
This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal) while we "know" that actually it was just the same signal going through.
By varying what we put at the other end of the wire, we can get the AI to act "as if" it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there was a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).
This allows us to consider AI designs with counterfactuals, false "beliefs", etc., which will be of great use. Are there any flaws I'm missing? Practical flaws are ok (e.g. how do you define a thermodynamic miracle and make sure the AI can't affect it), but theoretical flaws are much more important at this stage.
In equations
Edit: (thanks to TylerJay)
In equation form, the AI is maximising
P(¬X)*C + P(X)*u(X, A),
for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since C is constant, this is exactly the same as maximising u(X, A) - the probability P(X) is irrelevant.
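To spell out why P(X) drops out, here is a short derivation. It assumes notation not in the original: a and b are two candidate actions, and u_a(X, A) is the utility u(X, A) when the AI takes action a. It also uses the stated assumption that the AI cannot affect X, so P(X) is the same for every action.

```latex
\begin{align*}
\mathbb{E}[U \mid a] &= P(\neg X)\,C + P(X)\,u_a(X, A) \\
\mathbb{E}[U \mid a] - \mathbb{E}[U \mid b]
  &= P(X)\,\bigl(u_a(X, A) - u_b(X, A)\bigr)
\end{align*}
```

Whenever P(X) > 0, the sign of the difference is the sign of u_a(X, A) - u_b(X, A), so the ranking of actions is the same as the ranking by u(X, A) alone, however small P(X) is.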
The whole setup described is simply a way to ensure that if W is the likely set of worlds consistent with observations after ¬X/X, then
P(W) ≈ P(¬X) ≈ 1 (we "know" that X doesn't happen and that we end up in W),
while
P(W|X) << 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).
Same result, computationally
The above was a "physical" design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone's design.
To adapt that design, assume Ω will run UDT/TDT agent A to get its estimation as to the likely probability of any one of 2^100 sequences of bits coming down the wire. It will then choose to send the signal that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.
It seems that A must assign equal probability 2^-100 to each sequence (as any increase in probability makes that sequence less likely to appear, and any decrease makes it more likely). Therefore, we "know" that the signal will be "0000...0000", but the AI cannot assign a higher probability to that. This achieves the same outcome as above.
I personally feel the physical design is more secure - it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over 2^100 sequences, messing up our certainty as to the result.
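A toy version of Ω's selection rule, shrunk to 3 bits so it can be enumerated, illustrates the argument. This is a sketch under simplifying assumptions: A's prediction is handed to Ω as a plain dictionary, so the self-referential reasoning of a UDT/TDT agent is not modelled, and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def all_sequences(n_bits):
    """All bit strings of length n_bits, in lexicographic order."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]

def omega_choice(prediction):
    """Send the sequence with the lowest predicted probability.

    Ties are broken by taking the lexicographically first sequence.
    """
    return min(prediction, key=lambda s: (prediction[s], s))

seqs = all_sequences(3)

# Uniform prediction: every sequence ties, so Omega sends "000".
uniform = {s: Fraction(1, len(seqs)) for s in seqs}
sent_uniform = omega_choice(uniform)

# If A shifts weight onto "000", Omega simply sends whatever A now rates lowest.
skewed = dict(uniform)
skewed["000"] += Fraction(1, 16)
skewed["111"] -= Fraction(1, 16)
sent_skewed = omega_choice(skewed)

print(sent_uniform, sent_skewed, skewed[sent_skewed])
```

In this toy, raising the probability of "000" does not help A: the sequence actually sent ends up with a lower predicted probability than the uniform prediction would have given it.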
I am fairly confident that I understand your intentions here. A quick summary, just to test myself:
HAL cares only about world states in which an extremely unlikely thermodynamic event occurs: namely, the world in which one hundred random bits are generated spontaneously during a specific time interval. HAL is perfectly aware that these are unlikely events, but cannot act in such a way as to make the event more likely. HAL will therefore increase total utility over all possible worlds where the unlikely event occurs, and otherwise ignore the consequences of its choices.
This time interval corresponds by design with an actual signal being sent. HAL expects the signal to be sent, with a very small chance that it will be overwritten by spontaneously generated bits and thus be one of the worlds where it wants to maximize utility. Within the domain of world states that the machine cares about, the string of bits is random. There is a string among all these world states that corresponds to the signal, but it is the world where that signal is generated randomly by the spontaneously generated bits. Thus, within the domain of interest to HAL, the signal is extremely unlikely, whereas within all domains known to HAL, the signal is extremely likely to occur by means of not being overwritten in the first place. Therefore, the machine's behavior will treat the actual signal in a counterfactual way despite HAL's object-level knowledge that the signal will occur with high probability.
If that's correct, then it seems like a very interesting proposal!
I do see at least one difference between this setup and a legitimate counterfactual belief. In particular, you've got to worry about behavior in which a fraction (1-epsilon) of all possible worlds have a constant utility. It may not be strictly equivalent to the simple counterfactual belief. Suppose, in a preposterous example, that there exists some device which marginally increases your ability to detect thermodynamic miracles (or otherwise increases your utility during such a miracle); unfortunately, if no thermodynamic miracle is detected, it explodes and destroys the Earth. If you simply believe in the usual way that a thermodynamic miracle is very likely to occur, you might not want to use the device, since it's got catastrophic consequences for the world where your expectation is false. But if the non-miraculous world states are simply irrelevant, then you'd happily use the device.
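The device example can be made concrete with a small sketch. All the numbers below are hypothetical placeholders, as are the function names. The sketch compares the expected gain from using the device for an agent that genuinely believes the miracle is likely against the gain for an agent that is indifferent to non-miracle worlds.

```python
from fractions import Fraction

CATASTROPHE = Fraction(-10**12)  # hypothetical disutility of destroying the Earth
DEVICE_BONUS = Fraction(1)       # hypothetical small gain in miracle worlds

def believer_gain(p_miracle):
    """Expected gain from using the device for an agent that values all worlds."""
    return p_miracle * DEVICE_BONUS + (1 - p_miracle) * CATASTROPHE

def indifferent_gain(p_miracle):
    """Expected gain when no-miracle worlds are worth the same constant either way.

    The constant appears in both options (use / don't use) and cancels,
    so only the miracle-world bonus remains.
    """
    return p_miracle * DEVICE_BONUS

# A sincere believer, even at 99.9999% confidence, still fears the downside.
believer = believer_gain(Fraction(999_999, 1_000_000))

# The indifferent agent uses the device however unlikely the miracle is.
indifferent = indifferent_gain(Fraction(1, 10**30))

print(believer < 0, indifferent > 0)
```

With these placeholder numbers the believer declines the device and the indifferent agent uses it, which is the divergence described above.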
As I think about it, I think maybe the real weirdness comes from the fact that your AI doesn't have to worry about the possibility of it being wrong about there having been a thermodynamic miracle. If it responds to the false belief that a thermodynamic miracle has occurred, there can be no negative consequences.
It can account for the 'minimal' probability that the signal itself occurs, of course; that's included in the 'epsilon' domain of worlds that it cares about. But when the signal went through, the AI would not necessarily be acting in a reasonable way on the probability that this was a non-miraculous event.
Yep, that's pretty much it.