A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.
Ok, here is the problem:

Create an AI that believes (or acts as if it believed) that a certain event X is almost impossible, while you are almost certain that X will happen. Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?
Acts and beliefs
The main trick is the parenthetical "(or acts as if it believed)". If the AI actually has true beliefs, then there is no problem with it being smart, being knowledgeable, or updating on the fact that X didn't happen. So the problem reduces to: create an AI with accurate beliefs that nevertheless acts as if X were almost impossible.
This will be constructed by building on my idea of "utility indifference" (that old concept needs a better name, btw; corrigibility is a much better candidate for being labelled "utility indifference").
Noisy events and thermodynamic miracles
Imagine that a signal is travelling down a wire inside a noisy gas (or some other source of randomness). Almost certainly the signal will get through, but very occasionally, by the spontaneous action of the molecules in the gas, the signal will get overwhelmed, erased and overwritten with another signal, randomly. This could be seen as a type of thermodynamic miracle: an incredibly unlikely event caused by spontaneous and coincidental coordination between lots of different molecules. We assume here that the AI cannot cause a thermodynamic miracle, nor make it more likely (this may require some specific definitions of what "counts" as a miracle).
The idea is to make the AI indifferent to any world in which there is no thermodynamic miracle. So give it a utility u such that u(w) = C, a constant, for any world w without a thermodynamic miracle at the point of interest. The value of that constant doesn't matter; the result is that the AI's actions have no impact on utility in any world where the miracle didn't happen. Thus, as long as there is a tiny chance that the miracle happened, the AI will behave as if that were a certainty: for only in those worlds do its actions have any impact.
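This indifference construction can be illustrated with a toy expected-utility calculation. The worlds, actions, and probabilities below are made-up illustrations, not part of the proposal; the point is just that when u is constant on every non-miracle world, the argmax over actions is settled entirely by the miracle worlds, however improbable they are:

```python
# Toy model: expected utility when u is constant on all non-miracle worlds.
# The world model, actions, and probabilities are illustrative assumptions.

EPS = 1e-6  # assumed tiny probability of a thermodynamic miracle

def u(world, action):
    """Utility is a fixed constant on every non-miracle world."""
    if not world["miracle"]:
        return 0.0  # the constant C; its value never matters
    # In miracle worlds, utility genuinely depends on the action.
    return 1.0 if action == world["good_action"] else -1.0

worlds = [
    {"miracle": False, "prob": 1 - EPS, "good_action": None},
    {"miracle": True,  "prob": EPS,     "good_action": "a2"},
]

def expected_utility(action):
    return sum(w["prob"] * u(w, action) for w in worlds)

# The argmax is decided entirely by the tiny-probability miracle worlds,
# since every action scores the same constant on the rest.
best = max(["a1", "a2"], key=expected_utility)
print(best)  # "a2": the AI optimises as if the miracle were certain
```

Shrinking EPS further changes nothing: the non-miracle worlds contribute the same constant to every action's score, so they can never break a tie.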
You can probably see where I'm going with this. Suppose the signal goes through the wire to turn on another AI. That AI will only turn on given a very specific 100-digit "ON" binary code. Given any other code, the second AI won't turn on, and will erase itself. We send the "ON" code.
So what happens? The first AI reasons that, given a thermodynamic miracle (the only worlds it cares about), the output signal is going to be 100 random bits, which have only a 2^-100 chance of being "ON". Therefore it will behave as if it's almost certain "ON" will not get through the gas.
And what will (almost certainly) happen in reality? We send the "ON" signal, and it gets through unchanged. The first AI observes this (in fact, it expected this). However, the only possible world that it cares about, now, is one in which the "ON" signal was overwritten by a thermodynamic miracle... into exactly the same "ON" signal. As long as the possibility of this is non-zero, the AI will behave as if it's certain.
We should develop this point a little. The AI is almost certain, ahead of time, that "ON" will happen. It behaves, however, as if "ON" is almost certain not to happen. Then, after "ON" has gone through, it must adapt itself and backtrack on its previous behaviour. Moreover, it knows that this is what it will (almost certainly) do. Since it knows ahead of time it will be backtracking, why does it still go ahead with its behaviour?
The answer is that most of the universes in which it backtracks are universes where its actions have no consequences, and hence it doesn't care about them. So it might have a 1-ε chance of backtracking, but the only backtracking worlds it cares about have measure ε·2^-100. On the other hand, it won't backtrack with probability ε(1-2^-100), but it cares about those worlds completely.
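As a sanity check on this bookkeeping, the arithmetic can be run directly (the value of ε here is an illustrative assumption; the 100-bit code is from the setup above):

```python
# Numeric check of the backtracking bookkeeping. eps is an assumed
# illustrative value; the 2^-100 comes from the 100-digit "ON" code.
eps = 1e-6          # probability of a thermodynamic miracle
p_on = 2 ** -100    # probability a random miracle reproduces "ON" exactly

# Worlds where the AI ends up backtracking: no miracle, or a miracle
# that happened to overwrite "ON" with "ON" itself.
p_backtrack = (1 - eps) + eps * p_on

# Of those, the AI only cares about the miracle ones.
cared_backtrack = eps * p_on

# Worlds where it does not backtrack: a miracle produced some other code.
# It cares about all of these.
p_no_backtrack = eps * (1 - p_on)

print(p_backtrack)       # almost 1: backtracking is near-certain...
print(cared_backtrack)   # ...but the cared-about backtracking mass is tiny
print(p_no_backtrack)    # nearly all the cared-about mass sits here
```

So almost all probability mass backtracks, while almost all *cared-about* mass does not, which is exactly why the AI goes ahead despite knowing it will backtrack.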
False Thermodynamic Miracles
This is the explanation for the title: the AI will behave as if there was a real thermodynamic miracle (a spontaneous overwriting of a signal with the same signal) while we "know" that actually it was just the same signal going through.
By varying what we put at the other end of the wire, we can get the AI to act "as if" it believed almost anything, without introducing problematic beliefs. We just have to be a bit cautious with our setups, by considering how humans would react if there were a real thermodynamic miracle (because the AI is certainly going to be considering that very carefully).
This allows us to consider AI designs with counterfactuals, false "beliefs", etc., which will be of great use. Are there any flaws I'm missing? Practical flaws are ok (e.g. how do you define a thermodynamic miracle and make sure the AI can't affect it), but theoretical flaws are much more important at this stage.
In equations
Edit: (thanks to TylerJay)
In equation form, the AI is maximising
P(¬X)*C + P(X)*u(X, A),
for some constant C, some unlikely event X that the AI cannot affect, some set of relevant descriptors A, and some utility u. Since C is constant and P(X) is fixed, this is exactly the same as maximising u(X, A): the value of P(X) is irrelevant, as long as it is non-zero.
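This equivalence is easy to verify numerically. In the sketch below the actions, utilities, constant C, and probabilities are arbitrary made-up numbers; the check is that the maximising action never depends on P(X) or C, provided P(X) > 0:

```python
# Check that argmax_A [ P(not X)*C + P(X)*u(X, A) ] is independent of
# P(X) and C, as long as P(X) > 0. Actions and utilities are arbitrary.
import random

random.seed(0)
actions = ["a1", "a2", "a3"]
u = {"a1": 0.3, "a2": 0.9, "a3": -0.2}   # u(X, A) for each action A

def objective(a, p_x, c):
    return (1 - p_x) * c + p_x * u[a]

for _ in range(1000):
    p_x = random.uniform(1e-9, 1.0)      # any non-zero P(X)
    c = random.uniform(-10, 10)          # any constant C
    assert max(actions, key=lambda a: objective(a, p_x, c)) == \
           max(actions, key=lambda a: u[a])
print("argmax independent of P(X) and C")
```

The constant term (1 - p_x)*c shifts every action's score equally, and p_x > 0 scales them all by the same positive factor, so the ordering of actions never changes.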
The whole setup described is simply a way to ensure that if W is the likely set of worlds consistent with observations after ¬X/X, then
P(W) ≈ P(¬X) ≈ 1 (we "know" that X doesn't happen and that we end up in W),
while
P(W|X) << 1 (in the worlds it cares about, the AI behaves as if W was incredibly unlikely to come about).
Same result, computationally
The above was a "physical" design, revolving around a particular physical setup, plus some value hacking. But we can do the same in a purely computational way, by adapting drnickbone's design.
To adapt that design, assume Ω will run the UDT/TDT agent A to get its estimate of the probability of each of the 2^100 possible sequences of bits coming down the wire. It will then choose to send the sequence that A assigns the lowest probability to. If there is a tie, it will choose the first sequence, lexicographically.
It seems that A must assign equal probability 2^-100 to each sequence (as any increase in a sequence's probability makes it less likely to be sent, and any decrease makes it more likely). Therefore, we "know" that the signal will be "0000...0000", but the AI cannot assign a higher probability to that sequence. This achieves the same outcome as above.
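A toy version of this setup, with 3 bits rather than 100 and a hypothetical omega_choice function standing in for Ω, shows both halves of the argument: a uniform agent hands Ω the lexicographically first sequence, while any deviation from uniform hands Ω a sequence the agent rates below 2^-N:

```python
# Toy sketch of the adversarial-Omega design with N = 3 bits.
# omega_choice is a hypothetical stand-in for Omega's selection rule.
from itertools import product

N = 3  # 3-bit miniature of the 100-bit wire

SEQS = ["".join(bits) for bits in product("01", repeat=N)]  # lexicographic

def omega_choice(agent_probs):
    """Send the sequence the agent finds least likely,
    breaking ties lexicographically."""
    lowest = min(agent_probs[s] for s in SEQS)
    return next(s for s in SEQS if agent_probs[s] == lowest)

# A uniform agent: every sequence gets exactly 2^-N.
uniform = {s: 2 ** -N for s in SEQS}
print(omega_choice(uniform))  # "000": we "know" the outcome in advance...

# ...but the agent can do no better: shifting any probability mass means
# Omega sends a sequence the agent rates *below* 2^-N.
skewed = dict(uniform)
skewed["000"] += 0.01
skewed["111"] -= 0.01
chosen = omega_choice(skewed)
print(chosen, skewed[chosen] < 2 ** -N)  # 111 True
```

This also illustrates the final caveat below: the whole guarantee rests on A spreading its probability exactly uniformly, since any tiny asymmetry changes which sequence gets sent.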
I personally feel the physical design is more secure - it seems easier to define what a thermodynamic miracle at a given point in spacetime would be, than to be certain that the AI will not evade the computational restrictions somehow, or that the computational setup will remain stable and well defined. Also, A may not spread exactly equal probability over the 2^100 sequences, messing up our certainty as to the result.