In the comments on this post (which in retrospect I feel was not very clearly written), someone linked me to a post Eliezer wrote five years ago, "The Hidden Complexity of Wishes." After reading it, I think I've figured out why the term "Friendly AI" is used so inconsistently.

This post explicitly lays out a view that seems to be implicit in, but not entirely clear from, many of Eliezer's other writings. That view is this:

There are three kinds of genies: genies to whom you can safely say "I wish for you to do what I should wish for"; genies for which no wish is safe; and genies that aren't very powerful or intelligent.

Even if Eliezer is right about that, I think that view of his has led to confusing usage of the term "Friendly AI." If you accept Eliezer's view, it may seem to make sense not to worry too much about whether by "Friendly AI" you mean:

  1. A utopia-making machine (the AI "to whom you can safely say, 'I wish for you to do what I should wish for.'") Or:

  2. A non-doomsday machine (a doomsday machine being the AI "for which no wish is safe.")

And it would make sense not to worry too much about that distinction, if you were talking only to people who also believe those two concepts are very nearly co-extensive for powerful AI. But failing to make that distinction is obviously going to be confusing when you're talking to people who don't think that. It will make it harder to communicate both your ideas and your reasons for holding those ideas to them.

One solution would be to more frequently link people back to "The Hidden Complexity of Wishes" (or other writing by Eliezer that makes similar points--what else would be suitable?) But while it's a good post and Eliezer makes some very good points with the "Outcome Pump" thought-experiment, the argument isn't entirely convincing.

As Eliezer himself has argued at great length (see also section 6.1 of this paper), humans' own understanding of our values is far from perfect. None of us are, right now, qualified to design a utopia. But we do have some understanding of our own values; we can identify some things that would be improvements over our current situation while marking other scenarios as "this would be a disaster." It seems like there might be a point in the future where we can design an AI whose understanding of human values is similarly serviceable but no better than that.

Maybe I'm wrong about that. But if I am, until there's a better, easy-to-read explanation of why I'm wrong for everybody to link to, it would be helpful to have different terms for (1) and (2) above. Perhaps call them "utopia AI" and "safe AI," respectively?


Edit: It's now fixed.

A non-doomsday machine (the AI "for which no wish is safe.")

In Eliezer's quote, "genies for which no wish is safe" are those that kill you irrespective of what wish you made, while here it's written as if you might be referring to AIs that are safe even if you make no wish, which is different. This should be paraphrased for clarity, whatever the intended meaning.

Or maybe the parenthesis refers only to "doomsday machine"

That's how I read it. The wording could be clearer.

This is the intended reading. Edited for clarity.


In any case it's confusing and should be paraphrased for clarity, whatever is the intended meaning.

[This comment is no longer endorsed by its author]

Well, there are systems that simply can't process your wishes (AIXI, for instance), but which you can use to, e.g., cure cancer if you wish (you could train it to do what you tell it to, but all it is looking for is a sequence that leads to a reward button press, which is terminal: there is no value in the button being held). Just as there is a system, a screwdriver, which I can use to unscrew screws if I wish, but it's not a screw-unscrewing genie.

I think that some of the issue is that while Eliezer's conception of these issues has continued to evolve, we continue to both point and be pointed back to posts that he only partially agrees with. We might chart a more accurate position by winding through a thousand comments, but that's a difficult thing to do.

To pick one example from a recent thread, here he adjusts (or flags for adjustment) his thinking on Oracle AI, but someone who missed that would have no idea from reading older articles.

It seems like our local SI representatives recognize the need for an up to date summary document to point people to. Until then, our current refrain of "read the sequences" will grow increasingly misleading as more and more updates and revisions are spread across years of comments (that said, I still think people should read the sequences :) ).

It seems like our local SI representatives recognize the need for an up to date summary document to point people to.

Maybe this is what you're implying is already in progress, but if the main issue is that parts of the sequence are out of date, maybe Eliezer could commission a set of people who've been following the discussion all along to write review pieces, drawing on all the best comments, that describe how they would "rediscover" the conclusions of the aspect of the sequence they are responsible for themselves (with links back to original discussion).

Ideally these reviewers would work out between themselves how to make a clean and succinct narrative without lots of repetition; e.g. how to collapse issues that get revisited later or that crosscut into a clear narrative.

Then Eliezer and the rest of us could comment on those summaries, as a peer review.

Of course, it's fine if he wants to write the new material himself, but frankly I want to know what's going to happen in HPMOR. :)

I wonder if there's a way we could prevail upon the sufficiently informed people to make the relevant corrections as "re-running the sequences" posts come up.

I think the bigger issue is the collapsing of the notion of 'incredibly useful software that would be able to self improve and solve engineering problems' with the philosophical notion of mind. The philosophical problem of how we make the artificial mind not think about killing mankind may not be solvable over the philosophical notion of the mind, and the solutions may be useless. However, practically it is a trivial part of the much bigger problem of 'how do we make the software not explore the useless parts of the solution space'; it's not the killing of mankind that is problematic, but the fact that even on a Jupiter-sized computer the brute force solutions that explore such big and ill-defined solution spaces would be useless. Long before you have to worry about the software finding an unintended way to achieve the objective, you encounter the problem of software not finding any way to achieve the objective because it was looking in a space >10^1000 times larger than it could search. 'Artificial intelligence', as in useful software which does tasks we regard as intelligent, is a much broader and more diverse concept than the philosophical notion of mind.
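To put rough numbers on the brute-force point, here is a minimal sketch. The function name and every figure in it (operations per second, running time) are hypothetical, picked only so that the space comes out 10^1000 times larger than the budget; nothing here is a real estimate of any machine.

```python
def log10_fraction_searched(log10_ops_per_second: float,
                            log10_seconds: float,
                            log10_space_size: float) -> float:
    """Return log10 of the fraction of a search space brute force can cover.

    Working in log10 avoids overflowing floats on numbers like 10**1000.
    """
    log10_budget = log10_ops_per_second + log10_seconds
    # A search cannot cover more than the whole space, so cap at 0 (fraction 1).
    return min(0.0, log10_budget - log10_space_size)


# Hypothetical figures for illustration only:
# 10**50 candidate evaluations per second, sustained for 10**18 seconds,
# against a space 10**1000 times larger than that total budget.
log10_budget = 50.0 + 18.0
print(log10_fraction_searched(50.0, 18.0, log10_budget + 1000.0))
```

Under these made-up numbers the search covers a fraction of about 10^-1000 of the space, so raising the hardware figures by a few dozen orders of magnitude barely changes the result.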

Long before you have to worry about the software finding an unintended way to achieve the objective, you encounter the problem of software not finding any way to achieve the objective

Well, obviously, since it is pretty much the problem we have now. The whole point of the Friendly AI as formulated by SI is that you have to solve the former problem before the latter is solved, because once the software can achieve any serious objectives it will likely cause enormous damage on its way there.

Well, if that's the whole point, SI should dissolve today (shouldn't even have formed in the first place). The software is not magic; "once the software can achieve any serious objectives" is when we know how to restrict the search space; it won't happen via mere hardware improvement. We don't start with a philosophically ideal psychopathic 'mind', infinitely smart, and carve a friendly mind out of it. We build our sculpture grain by grain using glue.

Just because software is built line by line doesn't mean it automatically does exactly what you want. In addition to outright bugs, any complex system will have unpredictable behaviour, especially when exposed to real-world data. Just because the system can restrict the search space sufficiently to achieve an objective doesn't mean it will restrict itself only to the parts of the solution space the programmer wants. The basic purpose of the Friendly AI project is to formalize the human value system sufficiently that it can be included in the specification of such a restriction. The argument made by SI is that there is a significant risk that a self-improving AI can increase in power so rapidly that, unless such a restriction is included from the outset, it might destroy humanity.

Just because it doesn't do exactly what you want doesn't mean it is going to fail in some utterly spectacular way.

You aren't searching for solutions to a real world problem, you are searching for solutions to a model (ultimately, for solutions to systems of equations), and not only do you have a limited solution space, you don't model anything irrelevant. Furthermore, the search space is not 2d and not 3d, and not even 100d; the volume increases really rapidly with the number of dimensions. The predictions of many systems are fundamentally limited by the Lyapunov exponent. I suggest you stop thinking in terms of concepts like 'improve'.
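Both quantitative claims here can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are made up, the grid uses an arbitrary 10 points per axis, and the horizon formula assumes the textbook idealization that an initial error grows as e(t) = e0 * exp(lambda * t).

```python
import math


def grid_points(points_per_axis: int, dimensions: int) -> int:
    """Number of candidates in a uniform grid over a d-dimensional box."""
    return points_per_axis ** dimensions


def prediction_horizon(lyapunov_exponent: float,
                       initial_error: float,
                       tolerance: float) -> float:
    """Time until the error reaches tolerance, assuming e(t) = e0 * exp(lambda * t)."""
    return math.log(tolerance / initial_error) / lyapunov_exponent


# Candidate count explodes with dimension (10 points per axis is arbitrary).
for d in (2, 3, 100):
    print(d, grid_points(10, d))

# Measuring the initial state 1000 times more precisely only doubles the
# horizon in this example: the gain is logarithmic in the precision.
coarse = prediction_horizon(1.0, 1e-3, 1.0)
fine = prediction_horizon(1.0, 1e-6, 1.0)
print(fine / coarse)
```

The second part is the sense in which prediction is "fundamentally limited": each extra unit of forecast time costs a constant factor of additional precision.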

If something self improves at software level, that'll be a piece of software created with very well defined model of changes to itself, and the very self improvement will be concerned with cutting down the solution space and cutting down the model. If something self improves at hardware level, likewise for the model of physics. Everyone wants artificial rainman. The autism is what you get from all sorts of random variations to baseline human brain; looks like the general intelligence that expands its model and doesn't just focus intensely is a tiny spot in the design space. I don't see why expect general intelligence to suddenly overtake specialized intelligences; the specialized intelligences have better people working on them, have the funding, and the specialization massively improves efficiency; superhuman specialized intelligences require lower hardware power.

Just because it doesn't do exactly what you want doesn't mean it is going to fail in some utterly spectacular way.

I certainly agree, and I am not even sure what the official SI position is on the probability of such failure. I know that Eliezer in his writing does give the impression that any mistake will mean certain doom, which I believe to be an exaggeration. But failure of this kind is fundamentally unpredictable, and if a low probability event kills you, you are still dead, and I think that it is high enough that the Friendly AI type effort would not be wasted.

(ultimately, for solutions to systems of equations)

That is true in the trivial sense that everything can be described as equations, but when thinking how computation process actually happens this becomes almost meaningless. If the system is not constructed as a search problem over high dimensional spaces, then in particular its failure modes cannot be usefully thought about in such terms, even if it is fundamentally isomorphic to such a search.

that'll be a piece of software created with very well defined model of changes to itself

Or it will be created by intuitively assembling random components and seeing what happens. In which case there is no guarantee what it will actually do to its own model or even to what it is actually solving for. Convincing AI researchers to only allow an AI to self modify when it is stable under self modification is a significant part of the Friendly AI effort.

Everyone wants artificial rainman.

There are very few statements that are true about "everyone" and I am very confident that this is not one of them. Even if most people with actual means to build one want specialized and/or tool AIs, you only need one unfriendly-successful AGI project to potentially cause a lot of damage. This is especially true as both hardware costs fall and more AI knowledge is developed and published, lowering the entry costs.

I don't see why expect general intelligence to suddenly overtake specialized intelligences;

To be dangerous AGI doesn't have to overtake specialized intelligences, it has to overtake humans. Existence of specialized AIs is either irrelevant or increases the risks from AGI, since they would be available to both, and presumably AGIs would have lower interfacing costs.

I certainly agree, and I am not even sure what the official SI position is on the probability of such failure. I know that Eliezer in his writing does give the impression that any mistake will mean certain doom, which I believe to be an exaggeration. But failure of this kind is fundamentally unpredictable, and if a low probability event kills you, you are still dead, and I think that it is high enough that the Friendly AI type effort would not be wasted.

Unpredictable is a subjective quality. It'd look much better if the people speaking of unpredictability had demonstrable accomplishments. If there are a trillion equally probable unpredictable outcomes, out of which only a small integer number are the destruction of mankind, then even though it is still technically fundamentally unpredictable, the probability is low. Unpredictability does not imply likelihood of the scenario; if anything, unpredictability implies lower risk. I am sensing either a bias or dark arts; 'unpredictable' is a negative word. The highly specific predictions should be lowered in their probability when updating on the statement like 'unpredictable'.

That is true in the trivial sense that everything can be described as equations, but when thinking how computation process actually happens this becomes almost meaningless.

Not everything is equally easy to describe as equations. For example, we don't know how to describe the number of real-world paperclips with a mathematical equation. We can describe the performance of a design with an equation, and then solve for the maximum, but that is not identical to 'maximizing performance of real world chip'.

If the system is not constructed as a search problem over high dimensional spaces, then in particular its failure modes cannot be usefully thought about in such terms, even if it is fundamentally isomorphic to such a search.

The problem is that of finding a point in a high dimensional space.

Or it will be created by intuitively assembling random components and seeing what happens. In which case there is no guarantee what it will actually do to its own model or even to what it is actually solving for. Convincing AI researchers to only allow an AI to self modify when it is stable under self modification is a significant part of the Friendly AI effort.

I think you have a very narrow vision of 'unstable'.

Even if most people with actual means to build one want specialized and/or tool AIs, you only need one unfriendly-successful AGI project to potentially cause a lot of damage. This is especially true as both hardware costs fall and more AI knowledge is developed and published, lowering the entry costs.

To be dangerous, AGI has to win in the future ecosystem where the low-hanging fruit has been taken. 'General' is a positive-sounding word; beware of the halo effect.

To be dangerous AGI doesn't have to overtake specialized intelligences, it has to overtake humans. Existence of specialized AIs is either irrelevant or increases the risks from AGI, since they would be available to both, and presumably AGIs would have lower interfacing costs.

I believe that is substantially incorrect. Suppose that there was an AGI in your basement, connected to the internet, in an ecosystem of very powerful specialized AIs. The internet is secured by a specialized network security AI and would have been taken by a specialized botnet if it was not; you don't have a chip fabrication plant in your basement; the specialized AIs elsewhere are running on massive hardware designing better computing substrates, better methods of solving, and so on. What exactly is this AGI going to do?

This is going nowhere. Too much anthropomorphization.

The highly specific predictions should be lowered in their probability when updating on the statement like 'unpredictable'.

That depends what your initial probability is and why. If it is already low due to updates on predictions about the system, then updating on "unpredictable" will increase the probability by lowering the strength of those predictions. Since destruction of humanity is rather important, even if the existential AI risk scenario is of low probability it matters exactly how low.
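One way to make the direction of that update concrete is a simple mixture: weight the model-based estimate by how much you trust the model, and fall back on a broader prior for the rest. This is only a toy sketch; the function name, the linear-mixture form and all the numbers are assumptions for illustration, not anyone's actual estimate.

```python
def blended_probability(p_model: float, p_fallback: float, trust: float) -> float:
    """Mix a model-based estimate with a fallback prior, weighted by trust in the model."""
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be between 0 and 1")
    return trust * p_model + (1.0 - trust) * p_fallback


# Hypothetical numbers. Case 1: the model says the outcome is very unlikely,
# the fallback prior is higher. Trusting the model less RAISES the estimate.
low_from_model_trusted = blended_probability(1e-9, 1e-3, trust=0.99)
low_from_model_doubted = blended_probability(1e-9, 1e-3, trust=0.50)

# Case 2: the model makes a specific, confident prediction of the outcome.
# Trusting the model less LOWERS the estimate.
high_from_model_trusted = blended_probability(0.5, 1e-3, trust=0.99)
high_from_model_doubted = blended_probability(0.5, 1e-3, trust=0.50)

print(low_from_model_trusted < low_from_model_doubted)
print(high_from_model_doubted < high_from_model_trusted)
```

In this toy picture both commenters can be right: learning that a system is unpredictable pulls the estimate toward the fallback, and whether that is up or down depends on which side of the fallback the model's estimate started.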

This of course has the same shape as Pascal's mugging, but I do not believe that SI claims are of low enough probability to be dismissed as effectively zero.

Not everything is equally easy to describe as equations.

That was in fact my point, which might indicate that we are likely to be talking past each other. What I tried to say is that an artificial intelligence system is not necessarily constructed as an explicit optimization process over an explicit model. If the model and the process are implicit in its cognitive architecture, then making predictions about what the system will do in terms of a search is of limited usefulness.

And even talking about models, getting back to this:

cutting down the solution space and cutting down the model

On further thought, this is not even necessarily true. The solution space and the model will have to be pre-cut by someone (presumably human engineers) who doesn't know where the solution actually is. A self-improving system will have to expand both if the solution is outside them in order to find it. A system that can reach a solution even when initially over-constrained is more useful than the one that can't, and so someone will build it.

I think you have a very narrow vision of 'unstable'.

I do not understand what you are saying here. If you mean that by unstable I mean a highly specific trajectory that a system that lost stability will follow, then it is because all those trajectories where the system crashes and burns are unimportant. If you have a trillion optimization systems on a planet running at the same time you have to be really sure that nothing can go wrong.

I just realized I derailed the discussion. The whole AGI in a specialized AI world is irrelevant to what started this thread. In the sense of the chronology of development, I cannot tell how likely it is that AGI could overtake specialized intelligences. It really depends on whether there is a critical insight missing for the construction of AI. If it is just an extension of current software then specialized intelligences will win for the reasons you state, although some of the caveats I wrote above still apply.

If there is a critical difference in architecture between current software and AI then whoever hits that insight will likely overtake everyone else. If they happen to be working on AGI or even any system entangled with the real world, I don't see how one can guarantee that the consequences will not be catastrophic.

Too much anthropomorphization.

Well, I in turn believe you are applying overzealous anti-anthropomorphization. Which is normally a perfectly good heuristic when dealing with software, but the fact is human intelligence is the only thing in the "intelligence" reference class we have, and although AIs will almost certainly be different they will not necessarily be different in every possible way. Especially considering the possibility of AIs that are either directly based on human-like architecture or even are designed to directly interact with humans, which requires having at least some human-compatible models and behaviours.

That depends what your initial probability is and why. If it is already low due to updates on predictions about the system, then updating on "unpredictable" will increase the probability by lowering the strength of those predictions. Since destruction of humanity is rather important, even if the existential AI risk scenario is of low probability it matters exactly how low.

The importance should not weigh upon our estimation, unless you proclaim that I should succumb to a bias. Furthermore, it is the destruction of mankind that is the prediction being made here, via a multitude of assumptions, the most dubious one being that the system will have a real-world, physical goal. Number of paperclips is not easy.

On further thought, this is not even necessarily true. The solution space and the model will have to be pre-cut by someone (presumably human engineers) who doesn't know where the solution actually is. A self-improving system will have to expand both if the solution is outside them in order to find it. A system that can reach a solution even when initially over-constrained is more useful than the one that can't, and so someone will build it.

Sorry, you are factually wrong about how the design of automated tools works. The rest of your argument presses too hard to recruit a multitude of importance-related biases and cognitive fallacies that were described on this very site.

If you have a trillion optimization systems on a planet running at the same time you have to be really sure that nothing can go wrong.

No I don't, if the systems that work right have already taken all the low-hanging fruit that the one that goes wrong could have picked.

Well, I in turn believe you are applying overzealous anti-anthropomorphization. Which is normally a perfectly good heuristic when dealing with software, but the fact is human intelligence is the only thing in the "intelligence" reference class we have, and although AIs will almost certainly be different they will not necessarily be different in every possible way. Especially considering the possibility of AIs that are either directly based on human-like architecture or even are designed to directly interact with humans, which requires having at least some human-compatible models and behaviours.

You seem to keep forgetting all the software that is fundamentally different from the human mind but solves problems very well. The issue reads like a belief in the extreme superiority of man over machine, except it is a superiority of anthropomorphized software over all other software.

That sounds way less scary when you consider actual software that is approaching recursive self improvement and get more specific than vague "increase in power". It's just generic ignorant anti-technology talk that relies on vague concepts like "power" and dissipates once you get in any way specific.

The software also tends not to do what you want it to do for sake of this argument. There's an enormous gap between 'not doing exactly what we want' and doing exactly what you want for this argument to work. The automated engineering software simulates microscopic material interaction; vague self improvement and increases in power only make it better at not doing unrelated stuff.

This is my distinction between Friendly AI and what I call Obedient AI (which is necessarily much less powerful than FAI because it must act slowly enough for a human to tell whether orders are being obeyed).

Humans have systems for predicting and understanding the desires of other humans baked in. The information theoretic complexity of the systems is likely to be very high. I tend to think extracting all this complexity and building a cross domain optimizer are separate problems.