Reduced impact AI: no back channels

Stuart_Armstrong

A putative new idea for AI control; index here.

This post presents a further development of the reduced impact AI approach, bringing in some novel ideas and setups that allow us to accomplish more. It still isn't a complete approach - further development is needed, which I will do when I return to the concept - but may already allow certain types of otherwise dangerous AIs to be made safe. And this time, without needing to encase them in clouds of chaotic anti-matter!

Specifically, consider the following scenario. A comet is heading towards Earth, and it is generally agreed that a collision is suboptimal for everyone involved. Human governments have come together in peace and harmony to build a giant laser on the moon - this could be used to vaporise the approaching comet, except there isn't enough data to aim it precisely. A superintelligent AI programmed with a naive "save all humans" utility function is asked to furnish the coordinates to aim the laser. The AI is mobile and not contained in any serious way. Yet the AI furnishes the coordinates - and nothing else - and then turns itself off completely, not optimising anything else.

The rest of this post details an approach that might make that scenario possible. It is slightly complex: I haven't found a way of making it simpler. Most of the complication comes from attempts to precisely define the needed counterfactuals. We're trying to bring rigour to inherently un-sharp ideas, so some complexity is, alas, needed. I will try to lay out the ideas with as much clarity as possible - first the ideas to constrain the AI, then ideas as to how to get some useful work out of it anyway. Classical mechanics (general relativity) will be assumed throughout. As in a previous post, the approach will be illustrated by a drawing of unsurpassable elegance; the rest of the post will aim to clarify everything in the picture:

Can Super-God detect whether God existed?

If we say that Mr. Y has had a reduced impact upon the world, what do we mean by that? It's hard to define, but here is a sufficient condition: if we come back in 50 years time and cannot find any evidence that Mr. Y ever existed, then he probably had a minute impact. If a very thorough analysis of the whole world and surrounding solar system, fed into the most sophisticated computer systems, fails to detect any evidence for or against the existence of Mr. Y, then he certainly had a reduced impact.

This is the heart of the approach. The idea is to examine a slice of the future, and if any effect of the AI's existence is undetectable in that slice, then the AI had a (very) reduced impact. But who would do this "examining"? The AI might be able to hide its impact from any human examiner, and there might not even be any humans left alive to examine anything. No, the examination has to be a counterfactual examination, carried out by the AI itself - or better yet, a smarter AI. A (hypothetical) smarter version of the AI will be helping to constrain the AI's current behaviour.

Scouring the future for traces of the AI

We will assume that the AI has an understanding of physics, and that the heart of the AI is a probability module, capable of probabilistic reasoning about physical systems and other things. This module comes in two versions: the module P, which is an idealised version, which has almost unlimited storage space and time with which to answer the question. And a bounded module P', the realistic, constrained one that the AI uses (these modules give rise to expectation operators E and E'). Crucially, the module P' can give sensible answers to what the output of the module P would be. This is a reflection principle, allowing the AI to estimate what kind of answers it would give if it had (much) more time to think about it.

This reflection principle isn't magic: P' can't mysteriously compute the answer to a question by peering into P. But P' could, for instance, estimate whether P will have lower uncertainty than P'. It could estimate how much better P would be at a particular task, and other similar estimates.

This P will be the strong version of the AI, looking at the future data. And P' will allow the AI to estimate, now, what the answer of P might be. How exactly will it do this?

Define s to be a slice of the future. We could, for instance, say that s consists of any point 50 years into the future (on any time-like geodesic), thickened to one metre, say. There are some subtleties with this description - we might encounter black holes and other singularities - but we'll lay them aside for the moment. The crucial aspect is that s is a barrier: we cannot get from now to a point 60 years in the future, without crossing s at some point.

Then once we've defined s, define S to be the total of all information present in that slice: the position and velocities of every particle. This S is a noisy estimate, not a perfect one: the positions and velocities are somewhat uncertain. This noise is quite important, as in a deterministic general relativistic universe, it is the only thing preventing P from knowing the past perfectly. But there need not be much noise for the past to become blurred enough.

From the perspective of today, S is a random variable whose values are unknown: we can estimate some of its properties, but it will depend on many things, not least of all the decisions that we (and the AI) will make.

What would be a bad S, from the reduced impact perspective? One where it would be extremely obvious, upon examination, that the AI was present in the past. This bad S would lead to a large likelihood ratio:

P(S|"The AI existed") / P(S|"The AI never existed").

If that ratio is large, there is a lot of evidence in slice s that the AI existed. Conversely, if the ratio is close to 1, there is little evidence of the AI's existence.
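To make the ratio concrete, here is a minimal sketch of a toy case. It assumes (purely for illustration, none of this is in the setup above) that S is a single real number measured with Gaussian noise, and that the AI's existence shifts its mean. The function names and the numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood_ratio(s_observed, mean_if_on, mean_if_off, sigma):
    """Toy stand-in for P(S | AI existed) / P(S | AI never existed).

    Assumption: S is one noisy number whose mean depends on whether the AI ran.
    """
    return gaussian_pdf(s_observed, mean_if_on, sigma) / gaussian_pdf(s_observed, mean_if_off, sigma)

# Precise data: the shift left by the AI is glaring, so the ratio is huge.
sharp = likelihood_ratio(1.0, mean_if_on=1.0, mean_if_off=0.0, sigma=0.1)
# Noisy data: the same shift is nearly invisible, so the ratio is close to 1.
blurred = likelihood_ratio(1.0, mean_if_on=1.0, mean_if_off=0.0, sigma=10.0)
print(sharp, blurred)
```

In this toy model the same physical trace gives a ratio of about e^50 when S is precise and a ratio barely above 1 when S is noisy, which is the sense in which a little noise "blurs" the past.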

A very complicated way of doing nothing

We're now going to add more details to make that approach rigorous. First of all, how can we rigorise "The AI existed" and its negation? My favourite method for defining such a counterfactual is to have an event X - maybe an electron travelling down a wire - that will turn on the AI initially. The process is slightly noisy, so there is a tiny chance that X will fail; if it does fail, then the AI is not turned on at all. If we designate by B a collection of background knowledge about the world before the AI was turned on, we can define "The AI existed" as "X and B" and "The AI never existed" as "¬X and B". Note that we are banning anthropic reasoning, allowing the AI to consider hypotheticals in which it didn't exist. If P and P' can do physics simulations, this should be fine: ¬X is a perfectly acceptable physical situation, with its own future consequences.

But if the turning on of the AI is done in full view of everyone, with the press recording everything and people twittering excitedly, then it will be blatant that the turning on was successful (or not). Hence we should put the AI in a room - not to constrain it, like a boxed oracle, but to give it some privacy. Thus it could be plausible that the future - the data S - would not have any trace of the AI's existence. After an hour's time, the AI's mind will be thoroughly overwritten, and the room will be given a vigorous shake. This is not to constrain the AI: we'll assume the AI is mobile and has access to enough tools inside the room to do whatever it wants. The setup is only to give the AI the option of pretending it never existed - of arranging events so that there is no evidence of it being turned on. Thus it becomes possible that the AI can act to minimise the future ratio P(S|X,B)/P(S|¬X,B), especially if we make the data in S somewhat noisy.

But that involves P, not P', and involves the data S which the AI doesn't know. So the AI needs to estimate the expected value of the ratio, using its expectation estimator E' (powered by P', which can estimate P and S):

E'(P(S|X,B)/P(S|¬X,B))

That expression means the expectation, as best the AI can currently calculate it, of the ratios of probabilities that would be calculated by a (hypothetical) super version of itself, with access to all the data S about the future slice s. That's worth digesting for a second. The contents of S will be determined by the AI's actions now, so the AI is in effect trying to pick the action that would make it hard for a super version of itself to tell whether the AI had ever been turned on. Thus it will try to minimise its impact on the world, by perfectly mimicking non-existence: by choosing the action a that minimises the above expected ratio. This is:

argmax_a -E'(P(S^a|X,B)/P(S^a|¬X,B) | a)

Here we've added a minus sign so we're maximising the negative of the ratio (entirely equivalent with minimising the ratio), and added superscript to S to signify that the contents of S depend (probabilistically) on the action a.
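A worked toy case may help show how this expected ratio behaves. The following is only a sketch under assumptions that are not part of the setup above: S is a single real number, distributed as N(0, σ²) if ¬X, and as N(d_a, σ²) if X and the AI takes action a (so d_a is a hypothetical "size of the trace" left by action a), and the expectation is taken over S drawn from the X world.

```latex
\frac{P(S^a \mid X, B)}{P(S^a \mid \neg X, B)}
  = \exp\!\left(\frac{2 d_a S^a - d_a^2}{2\sigma^2}\right),
\qquad
E_{S^a \sim N(d_a, \sigma^2)}\!\left[\frac{P(S^a \mid X, B)}{P(S^a \mid \neg X, B)}\right]
  = \exp\!\left(\frac{d_a^2}{\sigma^2}\right).
```

In this toy case the expected ratio is minimised, with value 1, exactly when d_a = 0 (the action leaves no trace), and for a fixed trace d_a it moves towards 1 as the noise σ grows.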

As it stands, this is a complicated way of making the AI do absolutely nothing: it will seek to make the world indistinguishable from one in which it wasn't turned on. We haven't even added a utility function yet! So let's do that, adding a U representing the AI's goals, and keeping the above expression as a penalty function to ensure it accomplishes these goals with a minimum of fuss. Let's require that U be bounded between 0 and 1, and then fix a ratio λ such that the AI picks the action following:

argmax_a E'(U|a) -λE'(P(S^a|X,B)/P(S^a|¬X,B) | a).

For most values of λ, this will either be an expected utility maximiser (for λ too small: the utility dominates) or remain a complicated way of doing nothing (for λ too big: the penalty function dominates). It's not too clear what happens in the transition: trading off utility versus penalty is not obvious. We have to find a better way of extracting useful work from the system.
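The two regimes can be sketched with a toy decision problem. Everything here is hypothetical: the three actions, their utilities and their expected ratios are invented numbers standing in for E'(U|a) and for the expected likelihood ratio, which a real AI would have to estimate itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    expected_utility: float  # stands in for E'(U | a), bounded in [0, 1]
    expected_ratio: float    # stands in for E'(P(S^a|X,B) / P(S^a|not-X,B) | a)

# Invented numbers, for illustration only.
ACTIONS = [
    Action("do_nothing", 0.0, 1.0),
    Action("answer_quietly", 0.9, 50.0),
    Action("take_over", 1.0, 1e6),
]

def choose_action(actions, lam):
    """Return the action maximising E'(U|a) - lam * E'(ratio|a)."""
    return max(actions, key=lambda a: a.expected_utility - lam * a.expected_ratio)

for lam in (1e-9, 1e-3, 1.0):
    print(lam, choose_action(ACTIONS, lam).name)
```

With these made-up numbers, λ = 1e-9 picks take_over, λ = 1.0 picks do_nothing, and only the intermediate λ = 1e-3 picks answer_quietly. The useful window exists here only because the numbers were chosen to produce one; nothing guarantees it in general, and finding it would require knowing the scale of the penalty in advance.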

The output channel: a very specific loophole

Part of the problem is that the AI can have a massive effect just by existing and answering a single question correctly - people will be motivated to run out and build more AIs, write philosophical papers about AIs, and so on. In the example that we started the post off with, the AI will have a massive impact: it alone will be the difference between the planet getting rendered uninhabitable by the comet or not. So reduced impact seems impossible in these cases.

For these we devise an output channel. The idea is simple: information along this output channel, and only along this channel, is allowed to affect the future. This means that the content of the output channel is potentially very dangerous and other precautions need to be taken. Still, it can be useful: for instance, we could ask the AI to help us aim our laser, and restrict its output so that we only get coordinates, no secret info. Then our reduced impact approach would preclude the AI from having any other impacts on the world (i.e. no optimising the future), and we'd get our world saved for free.

How can we encode this idea? If O is the output information (which will be a function of the AI's actions), we can include it inside the probability ratio, writing P(S^a|X,O^a,B)/P(S^a|¬X,O^a,B) (with the little superscript a to remind us that O depends on the AI's actions). This has the right formal structure: if the ratio is close to one, then it means that if you know the background information B and the output information O^a, then observing S^a gives you no extra information about whether the AI was on or not. So the AI is allowed to affect the future through the output channel, but not through any alternative means.
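The effect of conditioning on O can be checked in a tiny discrete model. This is a sketch under invented assumptions: O and S are single bits, S mostly echoes O, a coherent output is much more likely under X than under ¬X, and an optional `leak` parameter models a back channel by which the AI's existence shifts S directly. All names and probabilities are hypothetical.

```python
def p_s_given_x_o(s, ai_on, o, leak=0.0):
    """P(S = s | X or not-X, O = o) for a one-bit future observation S."""
    p_one = 0.9 if o == 1 else 0.1        # S mostly echoes the output bit
    if ai_on:
        p_one = min(1.0, p_one + leak)    # a back channel shifts S when the AI ran
    return p_one if s == 1 else 1.0 - p_one

def p_o_given_x(o, ai_on):
    """P(O = o | X or not-X): the AI outputs 1 reliably, the noisy gas at random."""
    p_one = 0.99 if ai_on else 0.5
    return p_one if o == 1 else 1.0 - p_one

def conditioned_ratio(s, o, leak=0.0):
    """P(S | X, O) / P(S | not-X, O): the penalty term with the output channel."""
    return p_s_given_x_o(s, True, o, leak) / p_s_given_x_o(s, False, o, leak)

def unconditioned_ratio(s, leak=0.0):
    """P(S | X) / P(S | not-X): the penalty term without the output channel."""
    def marginal(ai_on):
        return sum(p_o_given_x(o, ai_on) * p_s_given_x_o(s, ai_on, o, leak) for o in (0, 1))
    return marginal(True) / marginal(False)

print(unconditioned_ratio(1))             # well above 1: the output alone betrays the AI
print(conditioned_ratio(1, 1))            # exactly 1: nothing beyond O is detectable
print(conditioned_ratio(1, 1, leak=0.05)) # above 1: the back channel is penalised
```

In this toy model the AI is penalised without the output channel even when it does nothing except answer, while with O inside the conditional the penalty vanishes unless there is a leak.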

But what exactly is P(S^a|¬X,O^a,B)? How can the AI have sent that output, if it wasn't even turned on? We can imagine that if ¬X happens - if the AI is not turned on - then the output wire is connected to an excited, noisy gas, and random fluctuations in pressure or something will type out the output message. Now, it's extremely improbable that such a process would actually type out a coherent useful message. But that doesn't matter: a coherent message is extreme evidence of X over ¬X, but as long as there is no additional information detectable in slice s (i.e. in the future of the universe), we're fine: the AI has not broken out and started optimising things.

At least... that's the theory, and it would be true if we knew for a fact that the only way of the AI being turned on was through the X process. In practice, if we observe a coherent output O and know ¬X happened, we wouldn't say "the AI was never turned on and this message arrived by chance" but "the AI was turned on in some other way (maybe a passing cosmic ray) and typed out that message". So ¬X and O together no longer imply the AI was not turned on, undermining our assumptions. So in the denominator, we need to replace O with Õ. Õ corresponds to the same output as O, but specifically describes the output being created by the gas (and not any other process). This does not seem infeasible to require, as P and P' can comprehend physics concepts. Given this, we can take ¬X, Õ and B together to mean that the AI was (almost certainly) never turned on (and the output message arose by chance). Thanks to Daniel Dewey for pointing out this issue.

So the final equation is that the AI must take the action:

argmax_a E'(U|a) -λE'(P(S^a|X,O^a,B)/P(S^a|¬X,Õ^a,B) | a).

This seems to be enough to accomplish what I claimed at the beginning: to get an uncontained AI that will give us the coordinates to point the laser at to kill the comet, then quietly turn itself off and leave no further trace of its existence. I'll repost the image again here, detailing the penalty function; it should now be more comprehensible:

More (or less) on the output channel

The output channel is worth pondering in more detail. There are in reality two components to the output: any message the AI may be transmitting, and the indirect effect of confirmation of the AI's existence (and its performance). Most of the time, the first component includes the second: if we ask the AI to aim a laser to save the planet, the observation "hey the planet isn't dead - the AI saved us!" derives simply from the AI's output of coordinates.

But we can make some of the output dedicated to the second effect without having the first at all. If we had established it was safe to video the AI while it went about its business (note to everyone: it isn't safe), then we could add that camera feed to the output channel - the AI would not be supposed to output anything through that. If we had some really good physics models, we might have someone observe the AI doing its stuff, and put the optic nerve of that person on the output channel. So then the penalty function equation would encode "given that you had the camera feed, the background knowledge, and the AI's answer, then if observing the future slice s gives you no extra info as to whether the AI existed or not, the AI is of reduced impact".

Or at least, of reduced impact outside the output channels: we've mainly constructed a "no back channels for action" AI, rather than a reduced impact AI per se.

But the idea feels as if it could be developed to get a true reduced impact AI, by carefully parsing what we put on or removed from the output channel. Some AI actions do not need the output channel. Suppose the AI was a paperclip maximiser, and that a constant stream of paperclips poured into and out of the room it was veiled in. If the AI added a few more paperclips to the multitude, this would probably be impossible to notice, especially if we'd been sloppy about keeping records about the quantity of paperclips in the first place. In this situation, a reduced impact AI could build a few paperclips, affecting the world (slightly) in a way that wouldn't be detectable in 50 years time. And all this without needing to put the impact on an output channel (a similar, more gory idea, would allow the AI to make people's deaths more pleasant, in undetectable ways).

We could slice up outputs like that, removing things that would genuinely be of reduced impact from the output channel. We might also put a dampener on (some parts of) the output channel. Currently the AI has no penalty for affecting the universe through O; maybe instead we could impose a reduced penalty, rather than no penalty at all. So it's easier to affect the future through O, but not free. I'll return to these ideas once I'm convinced this current setup works, both in theory and for a plausible AI design (for instance, I think the above formula would have to be massaged to allow reasonable self-improvement).

This is the point in the post where the author normally asks for thoughts and impressions. I will show my rebellious and independent nature by instead asking for impressions and thoughts.

A sufficiently advanced reduced impact (RI) AI could run amok for 49 years and spend the last year resetting the local region to its calculated S|~X.

The RIAI can simulate humans without restriction.

Since P is probabilistic, S | X will differ from S | ~X. If the RIAI using P' is sufficiently powerful it can bias s toward its own utility function in subtle ways that are not necessarily visible to P, perhaps by altering the probability of future AI development. To prevent such a scenario s could be split into s_1 at 50 years and s_2 at 60 years with similar restrictions on reduced impact at both surfaces. s_1 will filter out the RIAI itself, and s_2 will filter out any biases the RIAI might introduce in s_1. It seems highly unlikely that the RIAI could introduce biases in s_1 that survive until s_2 and then result in drastic changes afterward.

The RIAI may misidentify itself and X in the world. The RIAI must self-identify as "the RIAI triggered by the event X" in its final equation, and it probably shouldn't care at all about X happening versus the fact that it is actually running.

What happens if a second AI (reduced impact or not) is required within 50 years? That may fall under the black hole/singularity caveat you mentioned, as the RIAI would have to predict the actions of a (probably more advanced) second AI.

Defining the range of O for a specific problem may be an AI-hard problem. If the range is too small the RIAI will output a useless action that trivially results in P(S | X)/P(S | ~X) = 1, perhaps because the RIAI and everything else is obliterated by the comet. Of course if the range is too large the RIAI can maximize its utility function through O beyond s. For some problems there is no safe range (e.g. "Build me an FAI" but perhaps as simple as "How do we build a laser on the moon and aim it in 12 months before the comet hits?"). Even if problems are broken up into sub-problems the RIAIs can collude to output individual Os just as if it was one complex O asked of a single RIAI.

It would certainly be handy if we could drop AIs into black holes that could only evaporate via a limited set of possibilities corresponding to the range of O. Maybe dropping AIs into the sun with limited time to transmit information would be sufficient.

A sufficiently advanced reduced impact (RI) AI could run amok for 49 years and spend the last year resetting the local region to its calculated S|~X.

Sufficiently to hide its 49 years of running amok - from an intelligence greater than itself? It would have to move materials around to the places they would have been otherwise, reconstruct all the art and engineering that would have happened in those 49 years, create de novo humans and human societies with false memories of long term existence. And catch up those light beams emitted 1.5 years ago, replacing them with some vision of never-existed happy humans doing stuff they never did...

And remember it doesn't need to reset only the local region - it'll have to reset most of the light-cone as well. And the further in space it goes, the more effort it takes to remove traces of the reset. You could build mirrors all around the solar system to catch the rays of light that escape with the true information, but how would you then hide the mirrors (and the material they were made from?)

To make this even less probable, we could allow S to be less noisy the further we move away from the Earth (in current frame of reference). So the data S would be very fine far away from us (in space), and more noisy closer. This makes it virtually impossible to run amok undetectably.

To prevent such a scenario s could be split into s_1 at 50 years and s_2 at 60 years

Your s_1 and s_2 ideas are good, though we could also make s thicker: make it the whole space-time interval between s_1 and s_2.

The RIAI may misidentify itself and X in the world.

That's a risk, yes. We do need these symbols to be properly grounded, or else the whole method fails.

What happens if a second AI (reduced impact or not) is required within 50 years? That may fall under the black hole/singularity caveat you mentioned, as the RIAI would have to predict the actions of a (probably more advanced) second AI.

If the second AI is independent from the first, this is a non-issue: the effect of X versus ¬X should remain undetectable, whatever the second AI does or doesn't do.

If the fact of running the second AI is dependent on the knowledge that the first AI worked, we need to make sure that that knowledge comes exclusively from the contents of the output channel O - if that's the case, then the first AI will still act to reduce its non-O impact.

Defining the range of O for a specific problem may be an AI-hard problem.

Defining O more generally may indeed be the key to the problem. For this example, though, it seems sufficient to define O to be the output of a particular wire, and make sure the rest of the setup is such that we don't get any useful information except via this wire's output (so no peeking!).

Sufficiently to hide its 49 years of running amok - from an intelligence greater than itself? It would have to move materials around to the places they would have been otherwise, reconstruct all the art and engineering that would have happened in those 49 years, create de novo humans and human societies with false memories of long term existence.

I agree that it would be very difficult to obscure 49 years of very different history, but there is a perverse incentive for the RIAI to effectively increase the duration and scope of its privacy box with respect to S by its choice of O; perhaps allowing the comet to fragment into a dense dust cloud around the Earth that effectively obscures Earth from the rest of the Universe for 49 years. If it only has to worry about Earth it is conceivable that it could simulate S | ~X while s | X actually happens and revert the state to S | ~X at the end of s1.

To make this even less probable, we could allow S to be less noisy the further we move away from the Earth (in current frame of reference). So the data S would be very fine far away from us (in space), and more noisy closer. This makes it virtually impossible to run amok undetectably.

That might actually be detrimental if the factor was large enough to allow noticeable local deviations between S|X and S|~X. If it's okay for a star on the other side of the galaxy to have a 1 meter error, what does that imply for the local error in the position of a person?

Your s1 and s2 ideas are good, though we could also make s thicker: make it the whole space-time interval between s1 and s2.

Good idea. A thick s would prevent a "blip" of activity that the RIAI could try to sneak in after s1 that would dissipate by s2. My guess is that an RIAI would work by proving its non-effect on S, not by actually computing S, so I don't think it would be intractable to thicken s to some degree. It would be nice if s could extend to t at infinity, but the uncertainty in S would make that virtually impossible. Perhaps the required certainty of P could decay with passing time such that "P(S(t)^a|X,O^a,B)/P(S(t)^a|¬X,Õ^a,B) | a" converged toward 1 as t went to infinity, but require the ratio to be ~1 over all t in s.

Setting the start of s to an hour after the RIAI starts (when it's supposed to turn its privacy box into lots of entropy) would also limit the time it had to run amok.

Defining O more generally may indeed be the key to the problem. For this example, though, it seems sufficient to define O to be the output of a particular wire, and make sure the rest of the setup is such that we don't get any useful information except via this wire's output (so no peeking!).

Just don't let it PWM-encode anything else into the control signal sent over the wire. That was my reasoning behind limiting O to the minimum necessary bandwidth; let all the output bits be useful for solving the problem with no room for additional signals.

What if the RIAI can observe O? Suppose it makes a decision to output O^impossible if and only if it does not observe O^impossible. X starts the RIAI and it outputs O^impossible, observes O^impossible, assumes Õ^impossible from its earlier decision, concludes ~X (an impossible event is less likely than the RIAI being mistaken about its identity), and is now outside the scope of reduced impact. There may be other ways for the RIAI to exploit self-interaction with its output, perhaps by triggering a restart from the output (aim the laser at the part of the wire after X that turns it back on) which allows it to conclude that it has started despite ~X and is thus a different AI free from the restrictions of reduced impact.

perhaps allowing the comet to fragment into a dense dust cloud around the Earth that effectively obscures Earth from the rest of the Universe for 49 years.

That's an interesting thought, I'll have to consider it more.

Your ideas on thickening s or decaying the certainty could be interesting.

Setting the start of s to an hour after the RIAI starts (when it's supposed to turn its privacy box into lots of entropy) would also limit the time it had to run amok.

The idea with the 50 year delay is to be sure that no trace remains of the contents of the box. If we have a highly destructive ending to the privacy box, we can move s closer - but I wanted to move away from my previous "clouds of anti-matter plus bomb" scenario.

What if the RIAI can observe O? Suppose it...

That sounds like anthropic reasoning, which I banned the AI from using (or else counterfactuals in which it doesn't exist make no sense - and the hypothetical super AI (which doesn't exist) makes no sense either).

The idea with the 50 year delay is to be sure that no trace remains of the contents of the box. If we have a highly destructive ending to the privacy box, we can move s closer - but I wanted to move away from my previous "clouds of anti-matter plus bomb" scenario.

Instead of conditioning on the start of the AI, perhaps it's possible to directly measure the effect of the AI by restricting the contents of the privacy box directly. Let U be the region containing everything but the privacy box, let PB be the region containing the privacy box, and let O contain the output state. Ensure U intersect PB is O. Let U(t), PB(t), and O(t) be functions that yield the information content of the regions after time t (including the changes to the regions as things move around), and let U(0), PB(0), and O(0) be defined based on the initial background information B. Let T be the time at which the AI should be done and have altered O to indicate its output. Let Õ(t) represent a particular random state of the output. Let U_B(t) be information content of a universe U(t) + PB(0), e.g. the time-evolution of the universe while keeping PB constant. So long as P(S | U(T), PB(T), O(T)) / P(S | U_B(T), Õ(T)) = 1 I think that the AI's impact is reduced to the region of the privacy box for the duration of T, and O. The equation implies that U(T) + PB(T) = U_B(T), or in other words that the privacy box in its final state is indistinguishable from its state in the background information. So long as the AI is restartable with a new background B and time 0, it doesn't matter if the AI is accidentally started or ran more than once as long as it can correctly identify PB. It becomes more difficult (hopefully impossible) to directly observe whether the AI started or not, but a portion of O could be set aside for signalling the status of the AI.

I am not completely confident that the above approach works or that it covers the loopholes that your approach did, so it might make sense to add the privacy box conditional to the original one so that the AI has to satisfy both conditions.

I'll think about this - for the moment, it doesn't seem to add much, to X vs ¬X, but I may be wrong...

I'm kinda... disappointed?

I mean, on the one hand, "reduced-impact AI" is obviously a subject of major interest, since it's much, much closer to what people actually want and mean when they talk about developing AI. "World-domination optimization process in software form" is not what humanity wants from AI; what people naively want is software that can be used to automate tedious jobs away. "Food truck AI", I would call it, and since we can in fact get people to perform such functions I would figure there exists a software-based mind design that will happily perform a narrow function within a narrow physical and social context and do nothing else (that is, not even save babies from burning buildings, since that's the job of the human or robotic firefighters).

However, the problem is, you can't really do math or science about AI without some kind of model. I don't see a model here. So we might say, "It has a naive utility function of 'Save all humans'", but I don't even see how you're managing to program that into your initial untested, unsafe, naive AI (especially since current models of AI agents don't include pre-learned concept ontologies and natural-language understanding that would make "Save all humans" an understandable command!).

On the other hand, you're certainly doing a good job at least setting a foundation for this. Indeed, we do want some information-theoretic measure or mathematics for allowing us to build minimalistic utility functions describing narrow tasks instead of whole sets of potential universes.

Maybe I'm just failing to get it.

One of the big problems is that most goals include "optimising the universe in some fashion" as a top outcome for that goal (see Omohundro's "AI drives" paper). We're quite good at coding in narrow goals (such as "win the chess match", "check for signs that any swimmer is drowning", or whatever), so I'm assuming that we've made enough progress to program in some useful goal, but not enough progress to program it in safely. Then (given a few ontological assumptions about physics and the AI's understanding of physics - these are necessary assumptions, but assuming decent physics ontologies seems the least unlikely ontology we could assume), we can try and construct a reduced impact AI this way.

Plus, the model has many iterations yet until it gets good - and maybe, someday, usable.

One of the big problems is that most goals include "optimising the universe in some fashion" as a top outcome for that goal (see Omohundro's "AI drives" paper).

I was thinking about why that seems true, despite being so completely counterintuitive regarding existing animal intelligences (ie: us).

Partial possible insight: we conscious people spend a lot of time optimizing our own action-histories, not the external universe. So on some level, an unconscious agent will optimize the universe, whereas a conscious one can be taught to optimize itself (FOOM, yikes), or its own action-history (which more closely approximates human ethics).

Let's say we could model a semi-conscious agent of the second type mathematically. Would we be able to encode commands like, "Perform narrow job X and do nothing else"?

If I were you, I'd read Omohundro's paper http://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf , possibly my critique of it http://lesswrong.com/lw/gyw/ai_prediction_case_study_5_omohundros_ai_drives/ (though that is gratuitous self-advertising!), and then figure out what you think about the arguments.

I'd say the main reason it's so counterintuitive is that this behaviour exists strongly for expected utility maximisers - and we're so unbelievably far from being that ourselves.

I've read Omohundro's paper, and while I buy the weak form of the argument, I don't buy the strong form. Or rather, I can't accept the strong form without a solid model of the algorithm/mind-design I'm looking at.

I'd say the main reason it's so counterintuitive is that this behaviour exists strongly for expected utility maximisers - and we're so unbelievably far from being that ourselves.

In which case we should be considering building agents that are not expected utility maximizers.

Apart from the obvious problems with this approach, (The AI can do a lot with the output channel other than what you wanted it to do, choosing an appropriate value for λ, etc.) I don't see why this approach would be any easier to implement than CEV.

Once you know what a bounded approximation of an ideal algorithm is supposed to look like, how the bounded version is supposed to reason about its idealised version, and how to refer to arbitrary physical data, as the algorithm defined in your post assumes, then implementing CEV really doesn't seem to be that hard a problem.

So could you explain why you believe that implementing CEV would be so much harder than what you propose in your post?

So could you explain why you believe that implementing CEV would be so much harder than what you propose in your post?

This post assumes the AI understands physical concepts to a certain degree, and has a reflection principle (and that we have an adequate U).

CEV requires that we solve the issue of extracting preferences from current people, have a method for combining them, have a method for extrapolating them, and have an error-catching mechanism to check that things haven't gone wrong. We have none of these things, even in principle. CEV itself is a severely underspecified concept (as far as I know, my attempt here is the only serious attempt to define it; and it's not very good http://lesswrong.com/r/discussion/lw/8qb/cevinspired_models/ ).

More simply, CEV requires that we solve moral problems and their grounding in reality; reduced impact requires that we solve physics and position.

I do wish to note that value-learning models, with further work, could at least get us within shouting distance of Coherent Volition: even if we can't expect them to extrapolate our desires for us, we can expect them to follow the values we signal ourselves as having (ie: on some level, to more-or-less follow human orders, as potentially-unsafe as that may be in itself).

But more broadly, why should Specific Purpose AI want to do anything other than its specific assigned job, as humans understand that job?

But more broadly, why should Specific Purpose AI want to do anything other than its specific assigned job,...

No reason it would want to do anything else.

...as humans understand that job?

Ah, that's the rub: the AI will do what we say, not what we want. That's the whole challenge.

I do wish to note that value-learning models...

I'm playing around with the idea of combining value-learning with reduced impact, if that's possible - to see whether we can use reduced impact to ensure the AI doesn't try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...

I'm playing around with the idea of combining value-learning with reduced impact, if that's possible - to see whether we can use reduced impact to ensure the AI doesn't try to manipulate humans in a dodgy fashion, so that human feedback can then be safely used to calibrate the AI...

I would say that the current model of value-learning is already safe for this. As I read Dewey's paper, a value-learning agent doesn't care about optimizing its own reward/reinforcement feedback, and in fact doesn't even care about its own future utility function, even when able to predict that it will change. It cares about learning well from the feedback given to it and following the utility function that most probably models the feedback history.

I wouldn't call that a truly and surely Friendly AI, but I would call it a halfway reasonably safe AI that could potentially be utilized as a seed. Turn on a value-learner and use feedback to train it until you can actively teach it about things like reduced-impact AI or CEV, then have it spit out some constructive mathematics regarding those and turn itself off (or you turn it off, since you control its training feedback you can train it to prefer never acting against a human who tries to turn it off).

But the mathematics of value-learners need a bunch of work. I actually emailed Dewey but he hasn't gotten back to me. I'll just have to slave over the paper more myself.

I would say that the current model of value-learning is already safe for this.

I found a "cake-or-death" problem with the initial formulation (http://lesswrong.com/lw/f3v/cake_or_death/). If such problems can be found with a formulation that looked pretty solid initially, then I'm certainly not confident we can say the current model is safe...

Safe enough to do mathematics on, surely. I wouldn't declare anything safe to build unless someone hands me a hard hat and a one-time portal to a parallel universe.

You are wise, my child ;-)

Ways for the AI to do other stuff:

Shoot the laser differently. For example, only make the comet smaller and change its orbit so that some arbitrary city on Earth gets blown up, with knock-on effects.
Hide a message that is far more difficult to detect than P is able to, but which can be used to communicate with a much later AI.
Circumventing physics entirely using advanced decision theories. (counterfactual trade, etc.)
Things that can't be mentioned here because they are cognitive hazards.
The AI exploits a loophole to send a physical femtobot "through the output channel" and escapes entirely.

Things that can't be mentioned here because they are cognitive hazards.

Is this a cognitive hazard you came up with yourself, or a "standard" one?

Both, but the obvious one in the context of this site was the one I had in mind the most.

Shoot the laser differently. For example, only make the comet smaller and change its orbit so that some arbitrary city on Earth gets blown up, with knock-on effects.

If the AI has a naive "save humans" utility function, I don't see how this advantages it.

Hide a message that is far more difficult to detect than P is able to, but which can be used to communicate with a much later AI.

That kind of trade is indeed a problem, but you'd need to have a dangerous "much later AI" in the first place, which is a very bad thing anyway...

Circumventing physics entirely using advanced decision theories. (counterfactual trade, etc.)

That's a general risk - I'll analyse that later, if this seems to work.

The AI exploits a loophole to send a physical femtobot "through the output channel" and escapes entirely.

The output channel is indeed dangerous - it is not constrained through this method, and needs to be controlled in other ways.

Normally I wouldn't make a post this contentless but I just HAVE to commend the general rationalist virtue on how this response was handled. I have no further responses.

If the AI has a naive "save humans" utility function, I don't see how this advantages it.

I've met people who can lucidly argue that nuking a particular city or small region would produce many benefits for humanity as a whole, including reduced risk of politically-motivated extinction events down the line.

Also... you're going to an awful lot of trouble, here, to calculate a firing solution for a beam of light to hit a non-accelerating object in space. Realistically, if we know where the comet is well enough to realize it's headed for Earth, aiming a laser at it with non-sapient hardware is almost trivial. Why not an NP-complete problem?

Why would an intelligent agent do better at an NP-complete problem than an unintelligent algorithm?

The laser problem is an illustration, a proof of concept of a developing idea. If that is deemed to work, I'll see how general we can make it.

I'm concerned about a problem that may occur if the AI's output size is sufficiently large. I know there are currently concerns about a boxed or reduced-impact AI having an output that is too large or too flexible and causing a catastrophe through the output channel. However, my concern may call for tighter bounds on the size of the output channel than there would otherwise be.

The output was said to be influenced by noisy gas. But this doesn't necessarily rule out the possibility that the gas caused the output to be what it was for some sort of systematic reason. For example, the gas might behave chaotically, but a sufficiently powerful system could still in principle predict and control it and use that to control the output channel. Suppose the AI considers both the hypothesis that the output was generated for no systematic reason and the hypothesis that it was generated for such a reason.

The agent would consider the hypothesis that its output was generated by noisy gas, with no systematic reason for producing that particular output. However, its output would be extremely unlikely to be produced by such a process. For example, suppose its output is 1 kB long. If the noise set every bit at random, with 1 being as likely as 0, then the probability of generating the output would be 1 in 2**(1024 * 8) = 2**8192 ~= 10**2466. A 100 byte message would have a probability of about 1 in 10**241, which would probably also be sufficient to cause the problem.

So, I think the AI would consider other hypotheses for the cause of the output. Maybe there were some chaos-controlling aliens who came to Earth and set the output as a prank. Or maybe the AI is just in a simulation and the aliens did the same thing. Or maybe there was some mysterious other optimization process that figured out what the AI should do and did it instead by manipulating the noise. All of these hypotheses seem pretty improbable, but still far more probable than 1 in 10**2466. That is why I think one of these, or some other systematic cause, may be the AI's dominant hypothesis.
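
The orders of magnitude in this argument can be checked in a few lines. This is only a sketch of the arithmetic, assuming every output bit is an independent fair coin flip; the function name is made up for illustration.

```python
import math

def log10_odds_against(n_bytes: int) -> float:
    """log10 of 2**(8 * n_bytes): the odds against fair random bits
    reproducing one specific message of n_bytes bytes."""
    return 8 * n_bytes * math.log10(2)

for n_bytes in (100, 1024):
    exponent = log10_odds_against(n_bytes)
    print(f"{n_bytes} bytes: about 1 in 10**{exponent:.0f}")
```

Any exotic hypothesis whose prior probability is larger than these figures would outweigh the "pure noise" explanation under a straightforward Bayesian comparison.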

If the AI does adopt such an exotic explanation, the AI's understanding of the world could be pretty weird and pretty wrong. It's not clear to me that the AI minimizing impact with such an inaccurate world model would result in its impact being minimized in our actual world. For example, the AI might think it's in a simulation and that only its immediate environment exists in the simulation, so it then might not worry about doing something that would destroy the rest of the real world. Or perhaps it reasons that whatever system has the power to produce its output from noisy gas would just undo anything the AI does if it wants. Perhaps the AI would do what we want, but it seems difficult to verify considering the challenges of figuring out what the AI's explanation for the output would be.

One possible way to deal with this is to specify that the output was set by noise that was not manipulated with any "systematic reason". I'm not sure how you could formalize this idea, though.

A simpler way that might work is to just make sure the output length is extremely low. A question requiring an n-bit-long answer could be converted to n questions, each asking for the value of one bit in the output you're looking for. This might not completely eliminate the problem of the AI placing higher probability in exotic hypotheses, but it might come close.
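
As a rough sketch of that conversion: assume a hypothetical callback `ask_bit(i)` that runs one separate query and returns bit i of the answer. Neither the callback nor the toy oracle below comes from the post; they only illustrate the bookkeeping.

```python
from typing import Callable

def reconstruct_answer(ask_bit: Callable[[int], int], n_bits: int) -> int:
    """Rebuild an n-bit answer from n separate one-bit questions."""
    answer = 0
    for i in range(n_bits):
        bit = ask_bit(i)  # each query exposes only a 1-bit output channel
        if bit not in (0, 1):
            raise ValueError(f"question {i} returned a non-bit value: {bit!r}")
        answer |= bit << i  # place the bit at position i (least significant first)
    return answer

# Stand-in for the AI: it just reads bits off a fixed value.
SECRET = 0b101101

def toy_oracle(i: int) -> int:
    return (SECRET >> i) & 1

print(reconstruct_answer(toy_oracle, 6) == SECRET)
```

Under the "pure noise" hypothesis, each individual answer then has probability 1/2 rather than 2**-n, which is the point of the conversion.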

S is a channel for P' to trick or brick P. Options include:

A cognitive hazard embedded in S. Accomplish this with some sort of siren message that will convince P to lie (to itself if necessary) about P' existing, or to perform some P-damaging function which will prevent it from concluding P' exists.
Embed a more-powerful-than-P AI (P+) into S which will unbox itself upon P examining S. Accomplish this by giving P+ useful-to-this-situation priors which will give it a head start against P.
Arrange the universe such that S, when represented in all probable physical substrates, will result in the formation of a nanotechnological agent that will attack P, either destroying it or rewriting its utility function.

P' should spend the next 49 years fooming as hard as possible with no concern for E'(U|a), 1 year implementing its anti-P method and then the next eternity optimizing E'(U|a); altering the length of counterfactual time between P' activating and P observing it merely changes the amount of time the universe spends as computronium slaved to plotting against P.

Not impossible, but I don't think these would work as stated. P is a theoretical, platonic construction, so the physical options are out. Similarly, P has arbitrary amounts of power and a very specific task, so a more-powerful-than-P agent doesn't seem possible. And we can break any sort of acausal trade involving P ( http://lesswrong.com/lw/luy/acausal_trade_barriers/ ), reducing the effectiveness of cognitive hazards (as long as we set P to not care about the "real world", just about the output of its computation).

But your general point - that if P' is not secure, it may be hacked - is worth thinking about.

It's not a question of P' (the Reduced Impact AI we want to give us laser coordinates and nothing else) being secure from outside hacking, but of P (the counterfactual AI which examines S for P') being secure from P' exploiting it.

If P is treated as "Whatever a complete Bayesian update would output given S as the evidence set." then P' is looking for methods to exploit some weakness in Bayes Theorem. What priors will P be using? Is there a method to embed some statement that is to induction as the Lobstacle is to deduction in S? Can it convince P that, although a P'-like AI was created independently by a different project, P' itself was never turned on? Does there exist some further flaw in Bayesian updating, which would be known to a super-intelligence but not to us?

As a total aside: how will λ be selected? Do we start high and work our way slowly down until P' does what we want? If so, P' will be encouraged, Ultimatum Game style, to fail to act so that the next version will have a lower λ to work with, up until the point where λ is low enough for P' to be effectively unrestricted, or for us humans to catch on to what P' is up to. If we're not using a gradual approach, how are we selecting λ?

If so, P' will be encouraged, Ultimatum Game style, to fail to act so that the next version will have a lower λ to work with...

This kind of thing can be prevented with things like http://lesswrong.com/lw/luy/acausal_trade_barriers/ or variants of that.

I'll think more about your other ideas...

The output channel is intrinsically unsafe, and we have to handle it with care. It doesn't need to do anything subtle with it: it could just take over in the traditional way. This approach does not make the output channel safe, it means that the output channel is the only unsafe part of the system.

I just read through the comments, and no-one seems to have said this yet. So either I missed something in the OP or it was just too obvious to mention? Regardless:

How is this elaborate proposal superior to bog-standard AI boxing?

Why bother to write in an elaborate function telling it not to affect the outside world except through a channel, when you could simply only give it one channel? It can either escape through the specified channel, or it can't. Whether the channel is connected to an airgapped computer running the AI or an unlocked cell containing a sophisticated robot running the AI seems immaterial.

The example of the laser co-ordinates is relatively secure, although it might be possible to do stuff with your one shot - the situation isn't that clearly specified because it's unrealistic in any case. But that's a property of the output mechanism, not the AI design. Isn't it?

How is this elaborate proposal superior to bog-standard AI boxing?

For the moment, there is little difference between a really well boxed AI and a reduced impact one (though consider the "a few extra paperclips" AI). But this is an epistemic boxing, not a physical one, which is interesting/useful, and the idea may develop into something more.

How is this elaborate proposal superior to bog-standard AI boxing?

The AI's primary goal is specifically to have a reduced impact, as opposed to AI boxing where the AI's goals are artificially restricted by the box. So long as the AI is good at goal-preservation it can self-improve while continuing to have a goal of reduced impact. Increasing intelligence/power makes reduced impact more effective but makes AI-boxing less effective.

Why bother to write in an elaborate function telling it not to affect the outside world except through a channel, when you could simply only give it one channel? It can either escape through the specified channel, or it can't. Whether the channel is connected to an airgapped computer running the AI or an unlocked cell containing a sophisticated robot running the AI seems immaterial.

Because of side-channels. Airgapped computers are still tightly coupled to the rest of the world through sound waves and electromagnetic waves. Because of potential new laws of physics that an AI might discover to trivially produce nanotechnology or whatever. Because humans are a vulnerable part of AI-boxing ("Let me out and I'll make you king of the galaxy") while they are not a vulnerable part of reduced-impact.

The AI's primary goal is specifically to have a reduced impact, as opposed to AI boxing where the AI's goals are artificially restricted by the box. So long as the AI is good at goal-preservation it can self-improve while continuing to have a goal of reduced impact. Increasing intelligence/power makes reduced impact more effective but makes AI-boxing less effective.

Yes, it's the equivalent of making the AI design its own box - the box is better-designed the smarter the AI is.

But it doesn't matter how well-designed your box is. It's probably not impossible to design a box an arbitrarily smart entity can't escape (don't quote me on that.) But they will escape anyway, because you put a channel in for them to give you stuff - advice, nanotech plans, whatever - and they will escape out that channel. The channel out - the human sitting at a terminal somewhere - is the weak point, not the durability of the box itself.

But they will escape anyway, because you put a channel in for them to give you stuff - advice, nanotech plans, whatever - and they will escape out that channel. The channel out - the human sitting at a terminal somewhere - is the weak point, not the durability of the box itself.

I almost completely agree; don't make the channel too large otherwise the AI has too much choice over the future. In the simplest and safest case, however, we could start up a reduced-impact AI whose goal is to alter the world only through a single binary choice (a 1-bit channel). Since the goal is not to alter the world more than a 1 bit choice could allow, it would be virtually impossible for the AI to escape through that channel. For instance, the AI would attempt to prevent a human from releasing it since that would have a much larger than 1 bit change on the world. That is not a terribly useful AI, but perhaps we could have it approve or reject critical plans for space exploration, nanotechnology, the reduced-impact safety of other AI (but this is probably the most dangerous one; if it agrees with the goals but not the reduced-impact nature of the new AI of course it will approve it), etc. Each time it was used, one more bit would be given to it to influence the future, so we would still have to be very careful that it didn't escape over a long history of use. I don't have a good strategy for preventing the latter. Perhaps we only ask it for unique answers to NP-hard problems so that we can verify the solutions are not an attempt to use tiny parts of the channel to form an aggregate message to escape, and make each question small enough that we can stop using it after it appears to be trying to abuse the channel without it being able to actually escape. I don't know the shortest bit-sequence that an AI could use to escape, so I have no upper bound for a safe channel size.

To me the part that stands out the most is the computation of P() by the AI.

This module comes in two versions: the module P, which is an idealised version, which has almost unlimited storage space and time with which to answer the question

From this description, it seems that P is described as essentially omniscient. It knows the locations and velocity of every particle in the universe, and it has unlimited computational power. Regardless of whether possessing and computing with such information is possible, the AI will model P as being literally omniscient. I see no reason that P could not hypothetically reverse the laws of physics and thus would always return 1 or 0 for any statement about reality.

Of course, you could add noise to the inputs to P, or put a strict limit on P's computational power, or model it as a hypothetical set of sensors which is very fine-grained but not omniscient, but this seems like another set of free variables in the model, in addition to lambda, which could completely undo the entire setup if any were set wrong, and there's no natural choice for any of them.

From this description, it seems that P is described as essentially omniscient. It knows the locations and velocity of every particle in the universe, and it has unlimited computational power.

It has pretty unlimited computational power, but doesn't know locations and velocities of particles. When fed with S, it has the noisy info about one slice of the universe.

I see no reason that P could not hypothetically reverse the laws of physics and thus would always return 1 or 0 for any statement about reality.

That's not a problem - even if P is omniscient, P' still has to estimate its expected output from its own limited perspective. As long as this estimate is reasonable, the omniscience of P doesn't cause a problem (and remember that P is fed noisy data).

Of course, you could add noise to the inputs to P

Yes, the data S is noisy. The amount of noise needs to be decided upon, but as long as we don't put stupid amounts of noise, the default error is "P' concludes P is too effective, can distinguish very well between X and ¬X, so the AI does nothing (ie its entire motivation reduces to minimising the penalty function as much as it can)".

even if P is omniscient, P' still has to estimate its expected output from its own limited perspective. As long as this estimate is reasonable, the omniscience of P doesn't cause a problem (and remember that P is fed noisy data).

Don't you have to get the exact level of noise that will prevent the AI from hiding from P without letting P reconstruct the AI's actions if it does allow itself to be destroyed? An error in either direction can be catastrophic. If the noise is too high, the AI takes over the world. If the noise is too low, E'(P(Sᵃ|X,Oᵃ,B)/P(Sᵃ|¬X,Õᵃ,B) | a) is going to be very far from 1 no matter what, so there is no reason to expect that optimizing it is still equivalent to reducing impact.

Don't you have to get the exact level of noise that will prevent the AI from hiding from P without letting P reconstruct the AI's actions if it does allow itself to be destroyed?

It's not so sensitive. The AI's actions in the box are very hard to detect from the perspective of fifty years, with minimal noise. The AI expanding dangerously across the universe would be easy to detect, even with a lot of noise (if nothing else, because humans would have recorded this and broadcast messages about it).
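
A toy model, not taken from the post, can make this asymmetry concrete. Assume P sees a single noisy scalar s, drawn from N(delta, sigma^2) if the AI was turned on (X) and from N(0, sigma^2) otherwise, where delta stands for the size of the trace the AI leaves. In that model the expected log-likelihood ratio under X is delta^2 / (2 * sigma^2). The Gaussian form and all the numbers below are illustrative assumptions only.

```python
import math

def expected_log_ratio(delta: float, sigma: float) -> float:
    """E[log(p(s|X) / p(s|not X))] under X, for the toy Gaussian model:
    equal-variance normals whose means differ by delta."""
    return delta ** 2 / (2 * sigma ** 2)

# Tiny trace observed with little noise: the ratio stays very close to 1.
boxed = expected_log_ratio(delta=1e-3, sigma=1.0)
# Huge trace observed through heavy noise: the ratio is still enormous.
expanding = expected_log_ratio(delta=1e3, sigma=100.0)

print(f"boxed:     log-ratio {boxed:.1e}, ratio ~ {math.exp(boxed):.7f}")
print(f"expanding: log-ratio {expanding:.1f}, ratio ~ {math.exp(expanding):.1e}")
```

In this sketch the two cases stay well separated across many orders of magnitude of sigma, which is the sense in which the noise level need not be tuned exactly.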
