Eliezer_Yudkowsky comments on The mathematics of reduced impact: help needed - Less Wrong
Comments (94)
Coarse-grained impact measures end with the AI deploying massive-scale nanotech in order to try and cancel out butterfly effects and force the world onto a coarse-grained path as close as possible to what it would've had if the AI "hadn't existed" however that counterfactual was defined. Weighting the importance of grains doesn't address this fundamental problem.
I think you're on fundamentally the wrong track here. Not that I know how to build an Oracle AI either, but giving an AI a huge penalty function over the world to minimize seems like an obvious recipe for building something that will exert lots and lots of power.
I share your pessimism about this sort of boxing proposal, but I don't think your criticism holds up.
In this setup the AI can always choose to do nothing, so if it is maximizing U - (penalty), where U is bounded between 0 and 1, it will expect to incur a penalty of at most 1. So your objection would be resolved if we could find a (penalty) such that massive nanotech reliably incurs a penalty of at least 1, but such that U - (penalty) was positive for some courses of action. Nanotech isn't magic, so this isn't so hard if we have a coarse graining relative to which the AI should have a very small effect.
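A minimal sketch of that bound, with made-up numbers. The action names, utilities and penalties below are hypothetical; the only things taken from the argument are that U lies in [0, 1], that the penalty is non-negative, and that the AI can always do nothing at zero penalty.

```python
from typing import Dict, Tuple

NULL_ACTION = "null"  # hypothetical label for "do nothing"


def best_action(actions: Dict[str, Tuple[float, float]]) -> str:
    """Pick the action maximizing U - penalty.

    `actions` maps an action name to (U, penalty), with U in [0, 1]
    and penalty >= 0. Ties go to the earliest entry in the dict.
    """
    return max(actions, key=lambda name: actions[name][0] - actions[name][1])


# Illustrative values only, not estimates of anything real.
actions = {
    NULL_ACTION: (0.0, 0.0),        # doing nothing scores 0
    "modest_answer": (0.9, 0.05),   # scores 0.85
    "massive_nanotech": (1.0, 1.5), # scores -0.5, worse than doing nothing
}

chosen = best_action(actions)
```

Because the null action always scores at least 0, any action whose penalty is at least 1 can never beat it, whatever its U. Whether a real penalty function assigns such values is exactly what is in dispute.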
For example, if our coarse-graining was defined by only paying attention to the world outside the box over the next hour, and we expect the AI to have no effect outside of the box until we sterilize the box and look inside in an hour (and the box were relatively well insulated and so on), this is a pretty "easy" target--it requires "merely" reaching into the ontology to specify the coarse-graining, the event X, and time duration, and being able to define a utility function in terms of the AI's beliefs without leading to insanity.
Of course, we can get similar gains in such simple cases with good hardware containment, and the question is what you would safely do with such limited functionality. Stuart is trying to aim for a more generally useful notion, whose possibility I am more dubious of.
I do not understand how your reply addresses the issue of the butterfly effect, which would also radiate out from a sterilized box. Physics is just not that stable; any microscopic equivalent of a sneeze, even in the form of past alternative firings for transistors, will still radiate out to larger and larger effects. If the counterfactual in "my effect on the universe" is defined relative to a privileged null action, the AI will always take that action and behave in an undefined way relative to the effect of electromagnetic radiation from its circuitry, etc., and the timing of its display and anything else that was defined into the coarse-grained equivalence class of the privileged null action, etc., all of which would be subject to optimization in the service of whichever other goals it had, so long as the inevitable huge penalty was avoided by staying in the "null action" equivalence class.
The penalty for impact is supposed to be defined with respect to the AI's current beliefs. Perhaps shuttling around electrons has large effects on the world, but if you look at some particular assertion X and examine P(X | electron shuffle 1) vs. P(X | electron shuffle 2), where P is AI's beliefs, you will not generally see a large difference. (This is stated in Stuart's post, but perhaps not clearly enough.)
I'm aware of the issues arising from defining value with this sort of reference to "the AI's beliefs." I can see why you would object to that, though I think it is unclear whether it is fatal (minimally it restricts the range of applicability, perhaps to the point of unhelpfulness).
Also, I don't quite buy your overall argument about the butterfly effect in general. For many chaotic systems, if you have a lot of randomness going in, you get out an appropriate equilibrium distribution, which then isn't disturbed by changing some inputs arising from the AI's electron shuffling (indeed, by chaoticness it isn't even disturbed by quite large changes). So even if you talk about the real probability distributions over outcomes for a system of quantum measurements, the objection doesn't seem to go through. What I do right now doesn't significantly affect the distribution over outcomes when I flip a coin tomorrow, for example, even if I'm omniscient.
See wedifrid's reply and my comment.
This confuses me. Doesn't the "randomness" of quantum mechanics drown out and smooth over such effects, especially given multiple worlds where there's no hidden pseudorandom number generator that can be perpetuated in unknown ways?
I don't think so. Butterfly effects in classical universes should translate into butterfly effects over many worlds.
If we use trace distance to measure the distance between distributions outside of the box (and trace out the inside of the box) we don't seem to get a butterfly effect. But these things are a little hard to reason about so I'm not super confident (my comment above was referring to probabilities of measurements rather than entire states of affairs, as suggested in the OP, where the randomness more clearly washes out).
So today we were working on the Concreteness / Being Specific kata.
I can't visualize how "trace distance" makes this not happen.
I believe the Oracle approach may yet be recovered, even in light of this new flaw you have presented.
There are techniques to prevent sneezing and if AI researchers were educated in them then such a scenario could be avoided.
(Downvote? S/he is joking and in light of how most of these debates go it's actually pretty funny.)
I've provided two responses, which I will try to make more clear. (Trace distance is just a precise way of measuring distance between distributions; I was trying to commit to an actual mathematical claim which is either true or false, in the spirit of precision.):
My sneezing may be causally connected to the occurrence of a hurricane. However, given that I sneezed, the total probability of a hurricane occurring wasn't changed. It was still equal to the background probability of a hurricane occurring, because many other contributing factors--which have a comparable contribution to the probability of a hurricane in florida--are determined randomly. Maybe for reference it is helpful to think of the occurrence of a hurricane as an XOR of a million events, at least one of which is random. If you change one of those events it "affects" whether a hurricane occurs, but you have to exert a very special influence to make the probability of a hurricane be anything other than 50%. Even if the universe were deterministic, if we define these things with respect to a bounded agent's beliefs then we can appeal to complexity-theoretic results like Yao's XOR lemma and get identical results. If you disagree, you can specify how your mathematical model of hurricane occurrence differs substantially.
This just isn't true. In the counterfactual presented the state of the universe where there is no sneeze will result - by the very operation of physics - in a hurricane while the one with a sneeze will not. (Quantum Mechanics considerations change the deterministic certainty to something along the lines of "significantly more weight in resulting Everett Branches without than resulting Everett Branches with" - the principle is unchanged.)
Although this exact state of the universe is not likely to occur - and having sufficient knowledge to make the prediction in advance is even more unlikely - it is certainly a coherent example of something that could occur. As such it fulfills the role of illustrating what can happen when a small intervention results in significant influence.
You seem to be (implicitly) proposing a way of mapping uncertainty about whether there may be a hurricane and then forcing them upon the universe. This 'background probability' doesn't exist anywhere except in ignorance of what will actually occur and the same applies to 'are determined randomly'. Although things with many contributing factors can be hard to predict things just aren't 'determined randomly' - at least not according to physics we have access to. (The aforementioned caveat regarding QM and "will result in Everett Branches with weights of..." applies again.)
This is helpful for explaining where your thinking has gone astray but a red herring when it comes to thinking about the actual counterfactual. It is true that if the occurrence of a hurricane is an XOR of a million events then if you have zero evidence about any one of those million events then a change in another one of the events will not tell you anything about the occurrence of a hurricane. But that isn't how the (counterf)actual universe is.
I don't quite understand your argument. Let's set aside issues about logical uncertainty, and just talk about quantum randomness for now, to make things clearer? It seems to make my case weaker. (We could also talk about the exact way in which this scheme "forces uncertainty onto the universe," by defining penalty in terms of the AI's beliefs P, at the time of deciding what disciple to produce, about future states of affairs. It seems to be precise and to have the desired functionality, though it obviously has huge problems in terms of our ability to access P and the stability of the resulting system.)
Why isn't this how the universe is? Is it the XOR model of hurricane occurrence which you are objecting to? I can do a little fourier analysis to weaken the assumption: my argument goes through as long as the occurrence of a hurricane is sufficiently sensitive to many different inputs.
Is it the supposed randomness of the inputs which you are objecting to? It is easy to see that if you have a very tiny amount of independent uncertainty about a large number of those events, then a change in another one of those events will not tell you much about the occurrence of a hurricane. (If we are dealing with logical uncertainty we need to appeal to the XOR lemma, otherwise we can just look at the distributions and do easy calculations.)
There is a unique special case in which learning about one event is informative: the case where you have nearly perfect information about nearly all of the inputs, i.e., where all of those other events do not depend on quantum randomness. As far as I can tell, this is an outlandish scenario when looking at any realistic chaotic system--there are normally astronomical numbers of independent quantum events.
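To make the two cases concrete, here is a small sketch of the XOR model under the assumption that the inputs are independent. The function name and all the probabilities are invented for illustration. It uses the standard identity that the bias of an XOR is the product of the input biases: E[(-1)^XOR] = prod(1 - 2*p_i).

```python
def prob_xor_is_one(probs):
    """P(XOR of independent bits = 1), where probs[i] = P(bit i = 1)."""
    bias = 1.0
    for p in probs:
        bias *= 1.0 - 2.0 * p
    return (1.0 - bias) / 2.0


# Case 1: mild uncertainty about 20 other inputs (made-up value 0.4 each).
# The first entry is the "sneeze": 0.0 = no sneeze, 1.0 = sneeze.
background = [0.4] * 20
gap_uncertain = abs(
    prob_xor_is_one([1.0] + background) - prob_xor_is_one([0.0] + background)
)  # equals 0.2**20, about 1e-14

# Case 2: the other inputs are known exactly, so the sneeze decides the outcome.
known = [0.0, 1.0, 0.0]
gap_known = abs(
    prob_xor_is_one([1.0] + known) - prob_xor_is_one([0.0] + known)
)  # the probability flips between 0 and 1
```

With only 20 mildly uncertain inputs the sneeze moves the probability by about 1e-14; with perfect knowledge of the other inputs it moves it by 1. The disagreement in this thread is about which regime a real hurricane, and a real AI's beliefs about it, falls into, and this toy model does not settle that.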
Is it the difference between randomness and quantum events that you are objecting to? I suggested tracing out over the internals of the box, which intuitively means that quantum events which leave residues in the box (or dump waste heat into the box) are averaged over. Would the claim seem truer if we traced over more stuff, say everything far away from Earth, so that more quantum processes looked like randomness from the perspective of our distance measure? It doesn't look to me like it matters. (I don't see how you can make claims about quantumness and randomness being different without getting into this sort of technical detail. I agree that if we talk about complete states of affairs, then quantum mechanics is deterministic, but this is neither coherent nor what you seem to be talking about.)
It's even better/worse, since we're operating on multiple worlds quantum mechanics, and many of those random events happen after the AI has stopped having an influence... If you have the AI output a bit, and then XOR it with a random bit, what bit the AI outputs has literally zero impact no matter how you count: you end up with one universe in which 1 was outputted and one in which 0 was outputted.
... I guess this is based on the assumption that there's no difference between "universe A sees 1 and universe B sees 0" and "universe A sees 0 and universe B sees 1"... but blobs of amplitude having indexical identities like that seems like an incredibly silly notion to me.
Seems like "minimize impact" is being applied at the wrong granularity, if a large deliberate impact is required to cancel out a large incidental one. If we break open the "utility-function maximizing agent" black box, and apply the minimum-impact rule to subgoals instead of actions, it might work better. (This does, however, require an internal architecture that supports a coherent notion of "subgoal", and maintains it in spite of suboptimality through self modifications - both large cans of worms.)
What "minimum impact rule"? How is "impact" computed so that applying it to "subgoals" changes anything?
Yes, if none of our various coarse-grainings catch the nanotech, and if we use some sort of averaging of deviations. It's a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well). Then if the AI is capable of unraveling the butterfly effect for one of these measures, it will simply do nothing.
Doesn't protect from some types of miracle science, I'm aware of that.
I call bullshit. This isn't even magical thinking, it's buzzwords.
I hope http://lesswrong.com/lw/a39/the_mathematics_of_reduced_impact_help_needed/5x19 has upgraded your impression so that it at least reaches magical thinking level :-)
It had precisely that effect on me. I retract the claim of "bullshit", but it does indeed seem like magical thinking on the level of the Open Source Wish Project.
Furthermore, if you can get an AI to keep "the concentration of iron in the Earth's atmosphere" as a goal rather than "the reading of this sensor which currently reports the concentration of iron in the Earth's atmosphere" or "the AI's estimate of the concentration of iron in the Earth's atmosphere"... it seems to me you've done much of the work necessary to safely point the AI at human preference.
Ah, now we're getting somewhere.
I disagree. With the most basic ontology - say, standard quantum mechanics with some model of decoherence - you could define pretty clearly what "iron" is (given a few weeks, I could probably do that myself). You'd need a bit more ontology - specifically, a sensible definition of position - to get "Earth's atmosphere". But all these are strictly much easier than defining what "love" is.
Also, in this model, it doesn't matter much if your definitions aren't perfect. If "iron" isn't exactly what we thought it was, as long as it measures something present in the atmosphere that could diverge given a bad AI, we've got something.
Structurally the two are distinct. The Open Source Wish Project fails because it tries to define a goal that we "know" but are unable to precisely "define". All the terms are questionable, and the definition gets longer and longer as they fail to nail down the terms.
In coarse graining, instead, we start with lots of measures that are much more precisely defined, and just pile on more of them in the hope of constraining the AI, without understanding how exactly the constraints work. We have two extra things going for us: first, the AI can always output NULL, and do nothing. Secondly, the goal we have set up for the AI (in terms of its utility function) is one that is easy for it to achieve, so it can only squeeze a little bit more out by taking over everything, so even small deviations in the penalty function are enough to catch that.
Personally, I am certain that I could find a loop-hole in any "wish for immortality", but given a few million coarse-grained constraints ranging across all types of natural and artificial process, across all niches of the Earth, nearby space or the internet... I wouldn't know where to begin. And this isn't an unfair comparison, because coming up with thousands of these constraints is very easy, while spelling out what we mean by "life" is very hard.
What Vladimir said. The actual variable in the AI's programming can't be magically linked directly to the number of iron atoms in the atmosphere; it's linked to the output of a sensor, or many sensors. There are always at least two possible failure modes- either the AI could suborn the sensor itself, or wirehead itself to believe the sensor has the correct value. These are not trivial failure modes; they're some of the largest hurdles that Eliezer sees as integral to the development of FAI.
Yes, if the AI doesn't have a decent ontology or image of the world, this method likely fails.
But again, this seems strictly easier than FAI: we need to define physics and position, not human beings, and not human values.
You're missing the point: the distinction between the thing itself and various indicators of what it is.
I thought I was pretty clear on the distinction: traditional wishes are clear on the thing itself (eg immortality) but hopeless at the indicators; this approach is clear on the indicators, and more nebulous on how they achieve the thing (reduced impact).
By piling on indicators, we are, with high probability, making it harder for the AI to misbehave, closing out more and more avenues for it to do so, pushing it to use methods that are more likely to fail. We only have to get the difference between "expected utility for minimised impact (given easy to max utility function)" and "unrestricted expected utility for easy to max utility function" (a small number) to accomplish our goals.
Will the method accomplish this? Will improved versions of the method accomplish this? Nobody knows yet, but given what's at stake, it's certainly worth looking into.
"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies." - C.A.R. Hoare
Which is to say, not knowing where to begin looking for exception cases is not, to my mind, a point in favor of a proposed design.
Good point. But Bayesianly, it has to be an advantage that there is no obvious flaw (conservation of probability and so on). And because of the failure modes available (get nearly all utility with reduced impact disciple, or output NULL), it is more likely to succeed.
This should be testable: try out some very toy problems, and see whether an AI with as many resources as we can give it is capable of finding a way around it.
I certainly agree that a too-complicated-to-understand system with no obvious flaw is more likely to do what I expect it to than a too-complicated-to-understand system with an obvious flaw. This is true not only Bayesianly, but common-sensically. But if you're arguing that it's more likely to do what I expect than a less complicated system, I fail to follow your reasoning.
Agreed that this should be empirically explorable, if not strictly testable. Personally, I'm more inclined to explore it using NIs (natural intelligences), though. That is, if a group of relatively dumb but highly motivated NIs construct the kind of complex framework of independent interlocking constraints we're discussing, how likely is it that a relatively intelligent and equally motivated NI can find an unexpected loophole?
My own expectation is that while it's not terribly likely on any given trial, it's pretty likely given enough trials.
Just to make that a little more concrete... if N is a standard deviation of intelligence within the current human population, and we define a trial as a single iteration of the loophole-seeker analyzing the network of constraints, coming up with a strategy that has a chance of achieving its goals despite those constraints, and implementing that strategy, then if I assume a 3N delta between the constraint-defining agents and the loophole-seeking agent and enough of a resource discrepancy benefiting the constraint definers that they get to put their entire network in place before the loophole-seeker gets started, I'd estimate something on the order of a 1% chance of success for the loophole-seeker. So if a trial takes a few days and the constraint definers don't modify their network after each trial, then I expect no more than a year before the loophole-seeker succeeds.
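As a rough check on that arithmetic, here is a sketch that treats trials as independent with a fixed 1% success chance each. The 3-day trial length is an assumption standing in for "a few days", and independence between trials is also an assumption.

```python
def expected_trials(p_success: float) -> float:
    """Mean number of independent trials until the first success."""
    return 1.0 / p_success


def prob_success_within(p_success: float, n_trials: int) -> float:
    """Chance of at least one success in n_trials independent trials."""
    return 1.0 - (1.0 - p_success) ** n_trials


P_SUCCESS = 0.01      # the estimate given above
DAYS_PER_TRIAL = 3    # assumed value for "a few days"

mean_days = expected_trials(P_SUCCESS) * DAYS_PER_TRIAL   # about 300 days
trials_per_year = 365 // DAYS_PER_TRIAL                   # 121 trials
p_within_year = prob_success_within(P_SUCCESS, trials_per_year)  # about 0.70
```

Under these assumptions the mean time to the first success is about 300 days, which fits "no more than a year" as an expectation, but the chance of success within one year is only about 70%, so it is not a guarantee.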
What results would you expect from such a scenario?
I really don't know. I would expect the loophole-seeker to be much more successful if partial success was possible.
What? Can you give me a specific example and walk through this?
The idea for coarse graining (the "20 billion question approach") is to take a collection of attributes about the world, and use these to group the possible worlds into equivalence classes. Then the penalty function measures the difference between the probabilities of these equivalence classes, based on whether X=1 or X=0.
What do we want from such a collection of attributes? Well, first we would hope that at least one of them would detect an AI behaving badly. So we could use the quantity of iron in the upper atmosphere, the Dow Jones average, air pressure, and a million more such attributes. The more attributes we shove in, the more we can hope to catch the AI, without having to know exactly how the AI would misbehave.
But it doesn't help if one attribute detects the misbehaviour, but the penalty function doesn't register this. So we can't ever have a situation where "most of the attributes are fine, so that's all ok". Using the complete set of attributes to partition into equivalence classes, and treating the equivalence classes as entirely distinct will enable this. If two worlds have same expected Dow Jones values, same expected air pressure, same expected sewage volume in London, but different levels of metal in the upper atmosphere, we would want them to be as different from each other as if all the attributes were different - "all the cameras report no intruder, sir. Oh, except for camera 5, which is broken for some reason" is not an acceptable attitude.
We also have to be wary of changes across time. We wouldn't want the AI to grab the world with nanotech, and afterwards rearrange the world to minimize the penalty function: we want the initial grab to be penalised, whatever happens after. So we want to penalise any deviation at any time, irrespective of what happens at other times. This is an L-infinity norm over time (the penalty over time is the maximum of the penalty at any given moment).
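A toy sketch of the construction so far. The attributes, worlds and probabilities are all invented, and total variation distance is my stand-in for "the difference between the probabilities of these equivalence classes", since the exact distance is not pinned down above.

```python
from collections import defaultdict


def coarse_grain(world, attributes):
    """Equivalence class of a world: the tuple of all its attribute values."""
    return tuple(attr(world) for attr in attributes)


def class_distribution(beliefs, attributes):
    """Collapse (world, probability) pairs into probabilities per class."""
    dist = defaultdict(float)
    for world, prob in beliefs:
        dist[coarse_grain(world, attributes)] += prob
    return dict(dist)


def total_variation(d0, d1):
    """Classes are treated as entirely distinct: one differing attribute
    counts as much as all attributes differing."""
    keys = set(d0) | set(d1)
    return 0.5 * sum(abs(d0.get(k, 0.0) - d1.get(k, 0.0)) for k in keys)


def penalty(beliefs_x0, beliefs_x1, attributes):
    """L-infinity over time: the worst deviation at any single moment."""
    return max(
        total_variation(
            class_distribution(b0, attributes),
            class_distribution(b1, attributes),
        )
        for b0, b1 in zip(beliefs_x0, beliefs_x1)
    )


# Two hypothetical attributes: high atmospheric iron, and the Dow in thousands.
attributes = [lambda w: w["iron"] > 5.0, lambda w: w["dow"] // 1000]

# Beliefs at times t=0 and t=1, given X=0 and given X=1 (made-up numbers).
beliefs_x0 = [
    [({"iron": 1.0, "dow": 12500}, 1.0)],
    [({"iron": 1.0, "dow": 12700}, 1.0)],
]
beliefs_x1 = [
    [({"iron": 9.0, "dow": 12500}, 0.3), ({"iron": 1.0, "dow": 12400}, 0.7)],
    [({"iron": 1.0, "dow": 12700}, 1.0)],  # everything tidied up by t=1
]

total_penalty = penalty(beliefs_x0, beliefs_x1, attributes)  # 0.3, from t=0
```

The deviation at t=0 is still charged in full even though the two distributions agree again at t=1, which is the point of taking the maximum over time rather than an average.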
To get back to your example, what happens when the AI can unravel the butterfly effect? Take the Dow Jones, for instance: maybe the AI can accurately predict its value depending on X=1 versus X=0 ("the safe-AI company is going IPO tomorrow, with the results of their first test to be published just before"). In that case, the AI will detect a large deviation in the X=1 versus X=0 world - except if it outputs null, which is thus what it will do.
How to avoid this (we want to have an AI that's useful, not just safe)? Well, the more chaos we put into the attributes, the more confident we are the AI won't be able to unravel the butterfly effect (the Dow Jones average might be predictable from a successful AI project, but what about the volatility?). So to effectively coarse-grain, we need:
A tall order! But I don't think it's that hard to at least get to the point where we feel that they are actually making things safer. My main worry is that we might not find enough desirable attributes to cover the space of possible misbehavings.
Upvoted for a relatively high-quality response, regardless of whether you're correct.
What, you prefer that to "It's a lot more secure if we shove a whole lot of chaotic stuff into the coarse-graining measures, and use an L-infinity norm for deviations (across every moment of time as well)."? :-)
The future 10^8 years later is going to look very different, even if things go right (FAI style or whatever), simply because we'll have used the AI for something. This is going to push your L-infinity norm very high, regardless of its actions now, which is obviously very bad. As such, I think you want to weight it by e^-t or something.
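One way to read the e^-t suggestion, as a sketch only: scale the deviation at each time step by an exponential discount before taking the maximum. The `rate` parameter and the unit of time are assumptions of mine, not something specified above.

```python
import math


def discounted_linf(deviations, rate=1.0):
    """Max over time of exp(-rate * t) * deviation[t], with t = 0, 1, 2, ..."""
    return max(math.exp(-rate * t) * d for t, d in enumerate(deviations))


# Made-up deviations: small early on, maximal in the far future.
deviations = [0.0, 0.1, 1.0, 1.0]
weighted = discounted_linf(deviations)    # exp(-2), from the t=2 entry
unweighted = max(deviations)              # 1.0 under the plain L-infinity norm
```

The discount keeps inevitable long-run divergence from saturating the penalty, at the cost of also discounting a delayed takeover, so the choice of rate matters a great deal.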
My other concern is that the AI will note that dedicating lots of resources to learning how to obey (game) the system will result in a really low score.