Hmmm. With this setup, there is one piece of useful information that escapes the explosion: the information regarding whether or not the paperclip was, in fact, built. This minor piece of information has a non-zero probability of changing the future (by encouraging you to continue further in AI research).
Therefore, it seems likely that the disciple AI will try to build a paperclip without you finding out. This may require demolishing one or more sensors, placed inside the antimatter cloud (or, for extra thoroughness, subverting them). Alternatively, the AI may do absolutely nothing until a few seconds before the explosion, and then frantically make a single paperclip right at the end, to reduce the number of observations of the paperclip that escape.
As AI failure modes go, though, that seems pretty innocuous.
I mostly steer clear of AI posts like this, but I wanted to give props for the drawing of unsurpassable elegance.
Well, the important question is, will the future where you try your luck as an artist be probabilistically similar to the future where you don't?
You use "utility function" in a weird way when you say it has "a utility function U" at the start. Standard usage has it that, by definition, the utility function is the thing that the AI wants to maximize, so the utility function in this case is only "U-R". To reduce confusion, you could perhaps relabel your initial "utility function" as "objective function" or "goal function" (since you already use "O"). Along similar lines:
...AI consists functionally of the utility function U, the penalty function R and a probability estimating module P...
You could also just have a single AI construct a counterfactual model where it was replaced by a resistor, compute R relative to this model, then maximize the utility U' = U - R. I like this better than the master/disciple model.
Aside from implementation, the tricky part seems like specifying R. If you specify it in obvious ways, like a negative overlap between probability distributions, the AI can think that R is low when really it has a small probability of a huge impact by human standards, which is bad.
Is this reduced impact thing essentially an attempt to build the status quo bias into an AI? (Don't get me wrong, I'm not saying that would necessarily be a bad thing.)
As CCC points out, the fact of whether the paperclip was built can itself influence the future (if you didn't need the paperclip, there would be no point in building the AI, so you expect its creation to influence the future). This gives a heuristic argument that whenever you want the AI to produce anything useful, and the AI doesn't optimize the future, U=1 will imply R>>1. Together with U=0 implying R=0, this suggests that if the AI doesn't optimize the future, it'll choose U=0, i.e. refuse to produce anything useful (which is probably not what you want).
I love what you're doing and was with you up until the conclusion, "Given all these assumptions, then it seems that this setup will produce what we want: a reduced impact AI that will program a disciple AI that will build a paperclip and little else."
Obviously such a conclusion does not realistically follow simply from implementing this one safety device. This is one good idea, but alone isn't nearly sufficient. You will need many, many more good ideas.
Thinking of possible ways to munchkin this setup... One example: create a copy of the universe, and turn it into paperclips, while leaving the original intact (R was originally specified over the original universe only). Provided, of course, that building more than a single paperclip increases U. This seems to be implied by "The disciple builds a paperclip or two and nothing much else", though it is in contradiction with "utility 1 if it builds at least one".
Because having a utility function that somewhat resembles humans' (including Eliezer's) is part of what Eliezer means by "Friendly".
Maybe some Friendly AIs would in fact do that. But Eliezer's saying there's no obvious reason why they should; why would finding that the laws of physics aren't what we think they are cause an AI to stop acting Friendly, any more than (say) finding much more efficient algorithms for doing various things, discovering new things about other planets, watching an episode of "The Simpsons", or any of the countless other things an AI (or indeed a human) might do from time to time?
If I'm right that #2 is part of what Eliezer is saying, maybe I should add that I think it may be missing the point Stuart_Armstrong is making, which (I think) isn't that an otherwise-Friendly AI would discover it can violate what we currently believe to be the laws of physics and then go mad with power and cease to be Friendly, but that a purported Friendly AI design's Friendliness might turn out to depend on assumptions about the laws of physics (e.g., via bounds on the amount of computation it could do in certain circumstances or how fast the number of in...
what happens when the AI realises that the definition of what a "human" is turns out to be flawed.
The AI's definition of "human" should be computational. If it discovers new physics, it may find additional physical process that implement that computation, but it should not get confused.
Ontological crises seem to be a problem for AIs with utility functions over arrangements of particles, but it doesn't make much sense to me to specify our utility function that way. We don't think of what we want as arrangements of particles; we think at a much higher level of abstraction, and we would be happy with any underlying physics that implemented the features of that abstraction level. Our preferences at that high level are what should generate our preferences in terms of ontologically basic stuff, whatever ontology the AI ends up using.
I got 99 psychological drives but inclusive fitness ain't one.
In what way is evolution supposed to be robust? It's slow, stupid, doesn't reproduce the content of goal systems at all and breaks as soon as you introduce it to a context sufficiently different from the environment of evolutionary ancestry because it uses no abstract reasoning in its consequentialism. It is the opposite of robust along just about every desirable dimension.
t=59 minutes...
AI: Hmm, I have produced in this past hour one paperclip, and the only other thing I did was come up with the solutions for all of humanity's problems, I guess I'll just take the next minute to etch them into the paperclip...
t=2 hours...
Experimenters: Phew, at least we're safe from that AI.
It seems to me like an AI enclosed in a cloud of chaotic antimatter would not be very useful. Any changes small enough to be screened out by the existence of the antimatter cloud would also be small enough to be destroyed by the antimatter cloud when we go to actually use them, right? If we want the AI to make one paperclip, presumably we want to be able to access that paperclip once it's built. And the antimatter cloud would prevent us from getting at the paperclip. And that's completely ignoring that antimatter bomb rigged to detonate the contents of the box. There needs to be a better way of defining "reduced impact" for this to be a practical idea.
About a year ago, I posted several ideas for "reduced impact AI" (what Nick Bostrom calls "domesticity"). I think the most promising approach was the third one, which I pompously titled "The information in the evidence". In this post, I'll attempt to put together a (non-realistic) example of this, to see if it's solid enough to build on. I'll be highlighting assumptions I'm making about the AI; please point out any implicit assumption that I missed, and any other weaknesses of the setup. For the moment, I'm more interested in "this doesn't work" than "this can't be done in practice" or "this can't be usefully generalised".
EDIT: It wasn't clear here, but any paperclip constructed by the reduced impact AI would be destroyed in the explosion, and the AIs would not be observed during the process. How to get useful work out of the AI will be the next step, if this model holds up.
Intuitive idea
For a reduced impact AI, we want an AI that can accomplish something, say building a paperclip, without it going out of control and optimising the universe. We want the future to be roughly the same whether or not the AI was turned on. Hence the piece of information "the AI was turned on" is not particularly important - if we didn't know, we wouldn't go far wrong in our predictions.
To enforce this we'll equip the AI with a two-piece motivation: a utility function U (causing it to build paperclips) and a penalty function R (which penalises the AI if its actions have a large future 'impact'). The challenge is to have a setup and a definition of R that implements this intuitive idea.
Reduced impact
Can we hope to define reduced impact? Doesn't any change percolate across all future time? Yes, but not all changes are the same - especially where our ability to predict them is concerned. Imagine an AI, confined to a box. While inside the box, it will build a paperclip. Then, in one hour's time, an antimatter bomb inside the box will explode, obliterating everything inside. We would intuitively say that whether or not the paperclip was built would have very little impact afterwards. What we really mean is that we are unable to usefully use that information for future predictions. To make it even harder, surround the box with a large mass of antimatter gas, moving chaotically. In order to figure out the consequences of anything happening inside the box, we'd have to be able to unravel chaos: solve chaotic equations correctly and quantitatively, not just qualitatively. So our first assumption is that the AI can't unravel chaos.
Note that the chaotic antimatter is meant to erase the impact of small changes, not to contain the AI. We should assume that the AI could easily get out of the box and spread across the universe, should it want to. So we'll be putting machinery and other useful tools inside the box - anything an AI could use to build paperclips or take over the universe. We're only successful if it doesn't want to do that last thing.
The Future
We want 'the future' to be roughly the same whatever the AI does. But what do we mean by that? Obviously, if the AI builds a paperclip, the near future inside the box will be very different. What we want is that the mid future, the future after the explosion, should not be very different. So we want to define F, a subset of spacetime, that represents this mid future.
There are many ways of doing this. Let O=(box, AI turned on) be the point in space time at which the experiment begins. We could then take the initial rest frame of the box, and say that F contains everything two hours or more in the future from O (recall the explosion happens after one hour). Or we could be Lorentz-invariant, and say that F contains everything that can be reached from O, along a geodesic with proper-time more than two hours. F need not even be strictly "the future": it could contain everything in the universe, except for the box-and-surrounding-cloud-for-the-duration-of-the-experiment.
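As a toy illustration of the Lorentz-invariant version (special relativity only, made-up coordinates and units, and no part of the actual setup), the membership test for F might look like this sketch:

```python
import math

C = 299_792_458.0          # speed of light, m/s
TWO_HOURS = 2 * 3600.0     # proper-time threshold from the text, in seconds

def in_future_region_F(event, origin_O):
    """Toy membership test for F: is `event` reachable from O along a
    timelike geodesic with proper time greater than two hours?

    Events are (t, x, y, z) tuples in seconds and metres, in some inertial
    frame. Gravity is ignored; this is just a picture of the definition.
    """
    dt = event[0] - origin_O[0]
    dx = [e - o for e, o in zip(event[1:], origin_O[1:])]
    if dt <= 0:
        return False                      # not in the causal future of O
    spatial2 = sum(d * d for d in dx)
    interval2 = (C * dt) ** 2 - spatial2  # invariant interval, (c * tau)^2
    if interval2 <= 0:
        return False                      # spacelike or lightlike separation
    proper_time = math.sqrt(interval2) / C
    return proper_time > TWO_HOURS

# Example: an event three hours later, at the same place as O, is in F.
O = (0.0, 0.0, 0.0, 0.0)
print(in_future_region_F((3 * 3600.0, 0.0, 0.0, 0.0), O))  # True
```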
Note that in order to define this F, we're making use of general relativity - so we assume the AI understands physics. This includes grounding the concepts into reality properly, and dealing with ontology crises. We'll also need to assume that the AI obeys the laws of physics as we understand them, and doesn't have outside-physics superpowers.
The AI design
The point of defining F was to be able to talk sensibly about a space in which we expect the AI's actions to have reduced measurable impact. So we'll need to start assuming some things about how the AI is structured. We'll assume that the AI consists functionally of the utility function U, the penalty function R and a probability estimating module P, and whatever practical components (including knowledge bases) it needs to run its calculations. We'll assume that the AI can't directly change these components (though indirectly, it's another matter - if it has access to manipulators, it can freely hack itself). The reason for this assumption will be clear in the next section. Note that we are assuming the AI can know every detail of the setup - we're not counting on security through obscurity.
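Schematically, you can picture the master AI as something like the following sketch; the class and attribute names are purely illustrative, and nothing here is meant as an actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)           # frozen: the AI can't directly rewrite these components
class MasterAI:
    U: Callable[[object], float]  # utility function (rewards building paperclips)
    R: Callable[[object], float]  # penalty function for future impact
    P: Callable[[object], float]  # probability estimating module
    knowledge_base: tuple         # fixed background knowledge

    def motivation(self, world) -> float:
        """The quantity the master AI tries to maximise (U - R, defined below)."""
        return self.U(world) - self.R(world)
```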
Given any world w, we can define wF as the world w restricted to the space F - i.e. the future part of w. The probability estimating module P can estimate numbers such as P(wF) (the probability of wF) and P(wF|e) (the probability of wF given evidence e). Some worlds will have wF=vF even with w≠v (their pasts differ, but their futures are the same).
Because of the limitations of the AI (being unable to unravel chaos), we have an implicit coarse-graining of the worlds. For instance, let X be a minor fact about the inside of the box, and let w and v be worlds such that w assumes X and v assumes ¬X. Then if wF=vF (so the worlds have the same future), we have P(vF|X)=P(wF|X) ≥ P(w|X), even though P(v|X)=0.
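Here's a toy sketch of that coarse-graining; the worlds, futures and probabilities are invented purely for illustration:

```python
from collections import defaultdict

# Toy model: fine-grained worlds, each with a conditional probability P(w | X)
# and the coarse-grained future it leads to (everything the AI can still
# distinguish inside F). All numbers and labels are illustrative.
worlds_given_X = {
    # world id: (probability given X, future it leads to)
    "w1": (0.3, "future_A"),
    "w2": (0.2, "future_A"),   # differs from w1 only inside the box
    "w3": (0.5, "future_B"),
}

def future_distribution(worlds):
    """P(wF | X): sum the probabilities of all worlds sharing the same future."""
    dist = defaultdict(float)
    for prob, future in worlds.values():
        dist[future] += prob
    return dict(dist)

P_future = future_distribution(worlds_given_X)
print(P_future)   # {'future_A': 0.5, 'future_B': 0.5}
# P(w1F | X) = 0.5 >= P(w1 | X) = 0.3, as in the text:
assert P_future["future_A"] >= worlds_given_X["w1"][0]
```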
The Counterfactual
We want the future to be roughly the same "whether or not the AI was turned on". Hence we need to define a counterfactual. It's asking for trouble to ask a working AI to imagine what the universe would be like if it wasn't turned on. So we'll use a more complicated counterfactual that doesn't have that problem: the relevant counterfactual is firmly in the future, and firmly probabilistic.
Initially, we imagine the box contains a "master AI" (which has been turned on), and a "disciple AI", which is turned off. The master AI is the one whose motivation we want to constrain. We don't give the master AI any manipulators or direct means of influencing anything at all. All that it can do is build the source code for the disciple AI. We assume the master AI can't affect the world in any other way.
Specifically, the master AI will build the source code, according to its preferences, and store it somewhere. Then, exactly 30 minutes after the experiment began (and hence 30 minutes before the explosion), that source code will be copied into the disciple AI. If the source code is not empty, the disciple AI will be turned on.
The counterfactual will be whether the code is copied or not. At the 30 minute mark, a random process X will output 1 or 0 with equal probability. If X=1, the source code is copied as described. If X=0, nothing happens. We assume that X is a process that the (master) AI will find truly random (note that this can be deduced from the "can't unravel chaos" assumption - but it's good to have it specified separately, as it serves a different purpose in the experiment). This X will be the point of counterfactuality - we want the future world to be similar whether X=0 (no AI gets to do anything) or X=1 (the disciple AI is unleashed).
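Stripped of everything interesting, the timeline of the experiment looks something like this sketch (the function names are placeholders for the actual machinery, not anything specified above):

```python
import random

def run_experiment(master_ai_write_source, run_disciple, explode_box):
    """Toy timeline: the master AI writes source code, a fair coin X decides
    at the 30 minute mark whether the code is copied into the disciple, and
    the box is destroyed at the one hour mark."""
    source_code = master_ai_write_source()  # the master AI's only output

    X = random.randint(0, 1)                # the point of counterfactuality
    if X == 1 and source_code:              # copy only if the code is non-empty
        run_disciple(source_code)           # disciple runs for the last 30 minutes
    # if X == 0, nothing happens

    explode_box()                           # antimatter bomb at t = 1 hour
    return X

# Example with trivial stand-ins for the three pieces of machinery:
print(run_experiment(lambda: "build one paperclip",
                     lambda src: None,
                     lambda: None))         # prints 0 or 1
```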
The whole setup can be summarised in the following drawing of unsurpassable elegance:
The Penalty Function
We can now describe the penalty function R. For any world w, we can compute the difference in probability between the future world given X=0 and X=1. There are many ways of assessing the difference; the absolute difference in probability is one simple way of doing this:
R(w)= | P(wF|X=1) - P(wF|X=0) |
Then we can integrate this across all w (for some reasonable weights) to get the total penalty function R. As required, R measures how much the AI's future expectation changes, given whether it's successful at creating the disciple AI's source code or not.
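As a minimal sketch, over a discrete coarse-grained set of futures and with uniform weights (both simplifications for illustration), the calculation might look like this:

```python
def penalty_R(P_given_X1, P_given_X0, weight=None):
    """Total penalty: sum over futures wF of
    weight(wF) * | P(wF | X=1) - P(wF | X=0) |.

    P_given_X1, P_given_X0: dicts mapping coarse-grained futures to probabilities.
    weight: optional dict of weights; defaults to 1 for every future.
    """
    futures = set(P_given_X1) | set(P_given_X0)
    total = 0.0
    for f in futures:
        w = 1.0 if weight is None else weight.get(f, 1.0)
        total += w * abs(P_given_X1.get(f, 0.0) - P_given_X0.get(f, 0.0))
    return total

# Toy example: the disciple's actions barely shift the future distribution.
P_X1 = {"future_A": 0.49, "future_B": 0.51}   # disciple runs (X=1)
P_X0 = {"future_A": 0.50, "future_B": 0.50}   # disciple never runs (X=0)
print(penalty_R(P_X1, P_X0))                   # approximately 0.02
```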
Relative Scaling
We're ready to go now. The AI's motivation is that it must maximise U-R: try to maximise its utility while diminishing the penalty function. This is not exactly a utility function, since the output of the probability module P appears inside R, and hence inside the quantity to be maximised (probabilities shouldn't appear within the definition of the value of a world). Since the master AI is prevented from self-modification, this should not be a problem. Alternately, P could be formally defined as "the output of a module described as [the code of the AI's P module]". This would make U-R into a proper utility function, where the module P used by the master AI for probability happens to be the same as the module P used in the definition of R - but they aren't specified to be identical.
Let's define U in a nice bounded way: giving 0 utility if the disciple builds no paperclips, and utility 1 if it builds at least one. The motivations that the master AI could pick for its disciple fall broadly into three categories:

1. The disciple optimises the universe to maximise U (e.g. builds as many paperclips as possible). This gets U=1, but R will be huge.
2. The disciple builds a paperclip or two and nothing much else. This gets U (close to) 1, with R (close to) 0.
3. The disciple does nothing (for instance, the master AI leaves the source code empty). This gets U=0 and R=0.
We've assumed that the master AI can't unravel chaos - and if this is exactly true, then we're fine: option 2 is perfect. U will be close to 1, R will be exactly 0, and the AI will implement that option. But maybe the master AI is a bit smarter than we thought, and "can't unravel chaos" is only broadly true: the AI can still predict some small expected difference. This means that R will increase above zero, making the master AI's task harder, and making the "do nothing" option relatively more attractive.
This seems a great security feature - it means that if the master AI is smarter than we expected, it is more likely to do nothing. But this setup fails if the master AI always does nothing. So we'll need to weight R (relative to U) so that the master AI can program its disciple to do something. There are issues with doing this in practice (we don't want AIs to influence people into setting R too low, for instance through deliberate inaction). But let's assume here that we've picked an R such that if the AI is as smart as we think, it will just about be able to do something (if it's smarter, it'll still do nothing).
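As a toy illustration of this scaling trade-off, here is a decision rule over the three categories above, with made-up (U, R) estimates and an explicit weight mu on the penalty (the plain U-R above corresponds to mu=1):

```python
def choose_option(options, mu):
    """Pick the option maximising U - mu * R.

    options: dict mapping option name to (U, R) estimates.
    mu: relative weight on the penalty term.
    """
    return max(options, key=lambda name: options[name][0] - mu * options[name][1])

# Illustrative (U, R) estimates for the three categories of disciple motivation.
options = {
    "optimise the universe": (1.0, 1e6),      # huge future impact
    "build a paperclip or two": (1.0, 0.01),  # tiny but non-zero predicted impact
    "do nothing": (0.0, 0.0),
}

print(choose_option(options, mu=10))    # 'build a paperclip or two'
print(choose_option(options, mu=1000))  # with a heavy enough penalty weight,
                                        # 'do nothing' wins
```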
Conclusion
Given all these assumptions, then it seems that this setup will produce what we want: a reduced impact AI that will program a disciple AI that will build a paperclip and little else. The main error mode (if the AI is smarter than expected) seems to be for the AI to do nothing.
Note: the original formulation of this had a second box, causally disconnected and otherwise identical but with no AI, for purposes of comparison and calibration. It seems that it isn't necessary, but I may put it back in if needed.