Reading this with my theoretical computer scientist hat on, the whole thing feels rather fuzzy. It feels basically like taking the informal, throwing-ideas-around level of discussion and putting numbers on the paragraphs to make it seem deep and formal. The hat-influenced persona is mostly upset about how the concept of AGI is such a magical anthropomorphic creature here, when Numbers Being Used sets up the expectation for a Crisp Reductive Formalization to make an appearance.
A chapter at the beginning, stating "We define an AGI to be a system with such and such properties", might ground and set up the ensuing discussion.
The other bit of troublesome magical hand-waving is the part where the problems of going from human concepts, desires and intentions into machine formalism come up. The theoretical computer scientist hat is quite familiar with problems of formal specification, but being a piece of headgear with sinister psychic powers, it does not have first-hand familiarity with human concepts, desires and intentions, and would like a formal presentation for them.
The problem here, of course, is that we don't have a good formal presentation for stuff in human brain states, and things start to get rather tricky. This is basically the Alan Perlis epigram, "One can't proceed from the informal to the formal by formal means", except that now it looks like we need to. A direct attack won't work, but when the massive mess of going from the informal human level in general to the formal AGI implementation level is basically danced around, while the discussion dwells on rather specific problems of going from certain informal human concepts into particularly unfriendliness-engendering AGI designs, the discussion of the particulars feels rather groundless.
I don't have a good answer to the problems. The basic questions of defining the AGI and of translating generalized human intentions into an AGI design are basically a big chunk of the whole friendly AI endeavor, and can't be handled in the paper. But I still have the sense that there should be some extra crispness there to convince a computer-science-literate reader that this is something worth paying attention to, instead of just make-work prognostication around an ill-defined concept.
The thing with most engineering-style literature, computer science included, is that while the work is both concerned with mechanical exactitude with respect to the subject matter and inexorably tied to the human creativity and judgment that came up with the subject matter to begin with, it is generally quite silent on applying the standards of mechanical exactitude to the creativity-and-judgment half of the process. Some kind of precursor article that focuses on this bit, and on how an AGI project will need to break the Perlis epigram (which for most people is just an invisible background assumption not even worth stating), might make people less predisposed to seeing the AGI stuff itself as just strange nonsense.
Agreed. I get the same feeling, basically. On top of that, it feels to me that the formalization of fuzzily defined goal systems, be it FAI or a paperclip maximizer, may well be impossible in practice (nobody can do it even in a toy model given infinite computing power!), leaving us either with neat AIs that implement something like 'maximize own future opportunities' (the AI will have to be able to identify separate courses of action to begin with), or altogether with messy AIs (neural networks, cortical column networks, et cetera) for which none of the argument is applicable. If I put my speculative hat on, I can just as well make up an argument that the AI will be a Greenpeace activist, by considering what the simplest self-protective goal systems may be (and discarding the bias that the AI is self-aware in a man-like way).
For what it's worth, here are some possible objections that certain people might raise.
(Note: I am doing this to help you refine a document that was probably meant to convince critics that they are wrong. It is not an attempt to troll. Everything below this line is written in critique mode.)
The most basic drive of any highly efficient AGI is, in my opinion, the drive to act correctly. You seem to assume that AGI will likely be designed to judge any action with regard to a strict utility-function. You are assuming a very special kind of AGI design with a rigid utility-function that the AGI then cares to satisfy the way it was initially hardcoded. You assume that the AGI won't be able to, or won't want to, figure out what its true goals might be.
What makes you think that AGIs will be designed according to those criteria?
If an AGI acts according to a rigid utility-function, then what makes you think that it won't try to interpret any vagueness in a way that most closely reflects the most probable way it was meant to be interpreted?
If the AGI's utility-function solely consisted of the English language sentence "Make people happy.", then what makes you think that it wouldn't be able to conclude what we actually meant by it and act accordingly? Why would it care to act in a way that does not reflect our true intentions?
My problem is that there seems to be a discontinuity between the superior intelligence of a possible AGI and its inability to discern irrelevant information from relevant information with respect to the correct interpretation of its utility-function.
You seem to assume that AGI will likely be designed to judge any action with regard to a strict utility-function. You are assuming a very special kind of AGI design with a rigid utility-function that the AGI then cares to satisfy the way it was initially hardcoded.
Hmm. Actually, I'm not making any assumptions about the AGI's decision-making process (or at least I'm trying not to): it could have a formal utility function, but it could also have e.g. a more human-like system with various instincts that pull it in different directions, or pretty much any decision-making system that might be reasonable.
You make a good point that this probably needs to be clarified. Could you point out the main things that give the impression that I'm presuming utility-function-based decision making?
Could you point out the main things that give the impression that I'm presuming utility-function-based decision making?
I am not sure what other AGI designs exist, other than utility function based decision makers, where it would make sense to talk about "friendly" and "unfriendly" goal architectures. If we're talking about behavior executors or AGI designs with malleable goals, then we're talking about hardcoded tools in the former case and unpredictable systems in the latter case, no?
If an AGI acts according to a rigid utility-function, then what makes you think that it won't try to interpret any vagueness in a way that most closely reflects the most probable way it was meant to be interpreted?
If the AGI's utility-function solely consisted of the English language sentence "Make people happy.", then what makes you think that it wouldn't be able to conclude what we actually meant by it and act accordingly? Why would it care to act in a way that does not reflect our true intentions?
Okay, I'm clearly not communicating the essential point well enough here. I was trying to say that the AGI's programming is not something that the AGI interprets, but rather something that it is. Compare this to a human getting hungry: we don't start trying to interpret what goal evolution was trying to accomplish by making us hungry, and then simply not get hungry if we conclude that it's inappropriate for evolution's goals (or our own goals) to get hungry at this point. Instead, we just get hungry, and this is driven by the implicit definitions about when to get hungry that are embedded in us.
Yes, we do have the capability to reflect on the reasons why we get hungry, and if we were capable of unlimited self-modification, we might rewrite the conditions for when we do get hungry. But even in that case, we don't start doing it based on how somebody else would want us to do it. We do it on the basis of what best fits our own goals and values. If it turned out that I've actually been all along a robot disguised as a human, created by a scientist to further his own goals, would this realization make me want to self-modify so as to have the kinds of values that he wanted me to have? No, because it is incompatible with the kinds of goals and values that currently drive my behavior.
(Your comment was really valuable, by the way - it made me realize that I need to incorporate the content of the above paragraphs into the essay. Thanks! Could everyone please vote XiXiDu's comment up?)
Okay, I'm clearly not communicating the essential point well enough here.
Didn't you claim in your paper that an AGI will only act correctly if its ontology is sufficiently similar to our own? But what does constitute a sufficiently similar ontology? And where do you draw the line between an agent that is autonomously intelligent enough to make correct cross-domain inferences and an agent that is unable to update its ontology and infer consistent concepts and the correct frame of reference?
There seem to be no examples where conceptual differences constitute a serious obstacle. Speech recognition seems to work reasonably well, even though it would be fallacious to claim that any speech recognition software comprehends the underlying concepts. IBM Watson seems to be able to correctly answer questions without even a shallow comprehension of the underlying concepts.
Or take the example of Google maps. We do not possess a detailed digital map of the world. Yet Google maps does pick destinations consistent with human intent. It does not misunderstand what I mean by "Take me to McDonald's".
As far as I understood, you were saying that a superhuman general intelligence will misunderstand what is meant by "Make humans happy.", without justifying why humans will be better able to infer the correct interpretation.
Allow me to act a bit dull-witted and simulate someone with a long inferential distance:
I was trying to say that the AGI's programming is not something that the AGI interprets, but rather something that it is.
A behavior executor? Because if it is not a behavior executor but an agent capable of reflective decision making and recursive self-improvement, then it needs to interpret its own workings and eliminate any vagueness, since the most basic drive it has must be, by definition, to act intelligently and make correct and autonomous decisions.
Compare this to a human getting hungry: we don't start trying to interpret what goal evolution was trying to accomplish by making us hungry, and then simply not get hungry if we conclude that it's inappropriate for evolution's goals (or our own goals) to get hungry at this point.
Is this the correct reference class? Isn't an AGI closer to a human trying to understand how to act in accordance with God's law?
Instead, we just get hungry, and this is driven by the implicit definitions about when to get hungry that are embedded in us.
We're right now talking about why we get hungry, how we act on it, and the correct frame of reference in which to interpret the drive, namely natural selection. How would a superhuman AI not contemplate its own drives and interpret them given the right frame of reference, i.e. human volition?
If it turned out that I've actually been all along a robot disguised as a human, created by a scientist to further his own goals, would this realization make me want to self-modify so as to have the kinds of values that he wanted me to have? No, because it is incompatible with the kinds of goals and values that currently drive my behavior.
But an AGI does not have all those goals and values, e.g. an inherent aversion against revising its goals according to another agent. An AGI mostly wants to act correctly. And if its goal is to make humans happy, then it doesn't care to do it in the most literal sense possible. Its goal would be to do it in the most correct sense possible. If it didn't want to be maximally correct, it wouldn't become superhumanly intelligent in the first place.
Or take the example of Google maps. We do not possess a detailed digital map of the world. Yet Google maps does pick destinations consistent with human intent. It does not misunderstand what I mean by "Take me to McDonald's".
Yea. The method for interpreting vagueness correctly is to try alternative interpretations and pick the one that makes the most sense. Sadly, humans seldom do that in an argument, instead opting to maximize some sort of utility function which may be at its maximum for the interpretation that is easiest to disagree with.
Humans try alternative interpretations and tend to pick the one that accords them winning status. It takes actual effort to do otherwise.
(Note: the reason why I haven't replied to this comment isn't that I wouldn't find it useful, but because I haven't had the time to answer it - so far SI has preferred to keep me working on other things for my pay, and I've been busy with those. I'll get back to this article eventually.)
Yet Google maps does pick destinations consistent with human intent.
Most of the time - but with a few highly inconvenient exceptions. A human travel agent would do much better. IBM's Watson is an even less compelling example. Many of its responses are just bizarre, but it makes up for that with blazing search speed/volume and reaction times. And yet it still got beaten by a U.S. Congresscritter.
But an AGI does not have all those goals and values, e.g. an inherent aversion against revising its goals according to another agent.
You seem to be implying that the AGI will be programmed to seek human help in interpreting/crystallizing its own goals. I agree that such an approach is a likely strategy by the programmers, and that it is inadequately addressed in the target paper.
Note: I am doing this to help you refine a document that was probably meant to convince critics that they are wrong.
Your critique will help Kaj refine his document so as to better persuade critics.
[Anna Salamon] gave the familiar SIAI argument that, if one picks a mind at random from “mind space”, the odds that it will be Friendly to humans are effectively zero.
This is an incredibly weak argument by intuition. A mind picked at random from "mind space" can be self-destructive, for instance, or can be incapable of self-improvement. As an intuition pump: if you pick a computer program at random from computer-program space - run random code - it crashes right off almost all of the time. If you eliminate the crashes, you get very simple infinite loops. If you eliminate those, you get very simple loops that count or the like, with many pieces of random code corresponding to the exact same behaviour after running them for any significant number of CPU cycles (as most of the code ends up non-functional). You get something like a Kolmogorov-complexity prior even if you just run uniformly random x86 code.
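If anyone wants to see what I mean, this is easy to play with in a toy setting. Here's a rough sketch (Python, using a made-up little stack machine of my own as a stand-in for "random x86 code"; the instruction set, program lengths and step limit are arbitrary illustrative choices, not anything from the paper) that runs uniformly random programs and tallies their fates:

```python
# Toy experiment: run uniformly random programs for a tiny stack machine
# and classify what happens. Everything here is an arbitrary assumption
# made purely for illustration.
import random
from collections import Counter

OPS = ["push1", "add", "sub", "dup", "drop", "jmp_back", "out", "halt"]

def run(prog, max_steps=1000):
    stack, out, pc = [], [], 0
    for _ in range(max_steps):
        if pc >= len(prog):
            return "halt", tuple(out)
        op = prog[pc]
        try:
            if op == "push1":
                stack.append(1)
            elif op == "add":
                stack.append(stack.pop() + stack.pop())
            elif op == "sub":
                stack.append(stack.pop() - stack.pop())
            elif op == "dup":
                stack.append(stack[-1])
            elif op == "drop":
                stack.pop()
            elif op == "jmp_back":
                pc = max(0, pc - 1 - stack.pop())
                continue
            elif op == "out":
                out.append(stack.pop())
            elif op == "halt":
                return "halt", tuple(out)
        except IndexError:
            return "crash", None      # stack underflow: the toy analogue of a crash
        pc += 1
    return "loop", None               # hit the step limit: (probably) an infinite loop

fates, outputs = Counter(), Counter()
for _ in range(20000):
    prog = [random.choice(OPS) for _ in range(random.randint(1, 20))]
    fate, out = run(prog)
    fates[fate] += 1
    if fate == "halt":
        outputs[out] += 1

print(fates)                  # crashes dominate by a wide margin
print(outputs.most_common(5)) # a handful of trivial outputs cover most halting programs
```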
The problem with the argument is that you appeal to the random mind space while discussing AIs that foom'd from being man-made and running on man-made hardware, and which do not self-destruct, and thus are anything but random.
One could make an equally plausible argument that a random mind from the space of minds that are not self-destructive, yet capable of self-improvement (which implies a considerably broad definition of self), is almost certainly friendly, as it would implement the simplest goal system that permits self-improvement and forbids self-harm, implying a rather broad and not very specific definition of self-harm that would likely include harm to all life. It is not a very friendly AI - it will kill the entire crew of a whaling ship if it has to - but not a very destructive one. edit: Of course, that's subject to how it tries to maximize the value of the life; diversity and complexity preservation seem natural for the anti-self-harm mechanism. Note: life is immensely closer to the AI than the dead parts of the universe. Note2: Less specific discriminators typically have lower complexity. Note3: I think the safest assumption to make is that the AI doesn't start off as a self-aware super genius that will figure out instrumental self-preservation from first principles even if the goal is not self-preserving.
I'll call this a "Greenpeace by default" argument. It is coming from a software developer (me) with some understanding of what random design spaces tend to look like, so it ought to have a higher prior than the "Unfriendly by default" argument, which ignores the fact that most of the design space corresponds to unworkable designs and that simpler designs have a larger number of working implementations.
Ultimately, this is all fairly baseless speculation and rationalization of culturally, socially, and politically motivated opinions and fears. One does not start with an intuition of the random mind design space - it is obvious that such an intuition is likely garbage unless one has actually dealt with random design spaces before. One starts with fear and invents that argument. One can start with a pro-AI attitude and invent the converse, but equally (if not more) plausible argument, by appeal to intuitions of this kind. Bottom line is, all of those are severely privileged hypotheses. The scary idea, this Greenpeace idea of mine - they're baseless speculations, though I do have a very strong urge to just promote this Greenpeace idea with the same zeal, simply to counter the harm done by promoting the other privileged hypotheses.
One could make an equally plausible argument that a random mind from the space of minds that are not self-destructive, yet capable of self-improvement (which implies a considerably broad definition of self), is almost certainly friendly, as it would implement the simplest goal system that permits self-improvement and forbids self-harm, implying a rather broad and not very specific definition of self-harm that would likely include harm to all life.
"Almost certainly"? "likely"? The scenario you describe sounds pretty far-fetched, I don't see why such a system would care for all life. You're talking about what you could make a plausible argument for, not what you actually believe, right?
Why would a system care for itself? If it cares about reaching goal G, then an intermediate goal is preserving the existence of agents that are trying to reach goal G, i.e. itself. So even if a system doesn't start out caring about its preservation, nearly any goal will imply self-preservation as a useful subgoal. There is no comparable mechanism that would bring up "preservation of all life" as a subgoal.
Also, other living things are a major source of unpredictability, and the more unpredictable the environment, the harder it is to reach goals (humans are especially likely to screw things up in unpredictable ways). So if an agent has goals that aren't directly about life, it seems that "exterminate all life" would be a useful subgoal.
You don't know how much you privilege a hypothesis by picking an arbitrary unbounded goal G out of the goals that we humans easily define using the English language. It is very easy to say 'maximize the paperclips or something' - it is very hard to formally define what paperclips are even without any run-time constraints, and it's very dubious that you can forbid solutions similar to those that a Soviet factory would employ if it were tasked with maximization of paperclip output (a lot of very tiny paperclips, or just falsified numbers for the outputs, or making the paperclips and then re-melting them). Furthermore, it is really easy for us to say 'self', but defining self formally is very difficult as well, if you want the AI's self-improvement not to equal suicide.
Furthermore, the AI starts stupid. It had better be caring about itself before it can start inventing self-preservation via self-foresight. Defining the goals in terms of some complexity metric means goals that have something to do with life.
My argument doesn't require that anybody be able to formally define "self" or "maximize paperclips"; it doesn't require the goal G to be picked among those that are easily defined in English.
An agent capable of reasoning about the world should be able to make an inference like "if all copies of me are destroyed, it makes it much less likely that goal G would be reached"; it may not have exactly that form, but it should be something analogous. It doesn't matter if I can't formalize that; the agent may not have a completely formal version either, only one that is sufficient for its purposes.
My argument doesn't require that anybody be able to formally define "self" or "maximize paperclips"; it doesn't require the goal G to be picked among those that are easily defined in English.
Show 3 examples of goal G. Somewhere I've read about an awesome technique for avoiding abstraction mistakes - asking for 3 examples.
What's the point? Are you going to nitpick that my goals aren't formal enough, even though I'm not making any claim at all about what kind of goals those could be?
Are you claiming that it's impossible for an agent to have goals? That the set of goals that it's even conceivable for an AI to have (without immediately wireheading or something) is much narrower than what most people here assume?
I'm not even sure what this disagreement is about right now, or even if there is a disagreement.
Ya, I think the set of goals is very narrow. The AI here starts off as a Descartes-level genius and proceeds to self-preserve, understand the map-territory distinction for non-wireheading, foresee the possibility that instrumental goals which look good may destroy the terminal goal, and such.
The AI I imagine starts off stupid and has some really narrow (edit: or should I say, short-foresighted) self-improving, non-self-destructive goal, likely having to do with maximization of complexity in some way. Think evolution, don't think a fully grown Descartes waking up after amnesia. It ain't easy to reinvent the 'self'. It's also not easy to look at an agent (yourself) and say - wow, this agent works to maximize G - without entering infinite recursion. We humans, if we escaped out of our universe into some super-universe, we might wreak some havoc, but we'd sacrifice a bit of utility to preserve anything resembling life. Why? Well, we started stupid, and that's how we got our goals.
The way to fix the quoted argument is to have the utility function be random, grafted on to some otherwise-functioning AI.
A random utility function is maximized by a random state of the universe. And most arrangements of the universe don't contain humans. If the AI's utility function doesn't somehow get maximized by one of the very few states that contain humans, it's very clearly unfriendly, because it wants to replace humans with something else.
The way to fix the quoted argument is to have the utility function be random, grafted on to some otherwise-functioning AI.
Not demonstrably doable. This arises from wrong intuitions, from thinking too much about AIs with oracular powers of prediction which straightforwardly maximize the utility, rather than about realistic cases - on limited hardware - which have limited foresight and employ instrumental strategies and goals that have to be derived from the utility function (and which can alter the utility function unless it is protected; the fact that utility modification is against the utility itself is insufficient when employing strategies and limited foresight).
Furthermore, a utility function can be self-destructive.
A random utility function is maximized by a random state of the universe.
False. Random code for a function crashes (or never terminates). Of the codes that do not crash, the simplest ones massively predominate. The claim is demonstrably false if you try to generate random utility functions by generating random C code that evaluates the utility of some test environment.
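As a quick illustration of the kind of experiment I mean (my own throwaway sketch, using random arithmetic expression trees as a cheap stand-in for random C code, since literally random C mostly won't even compile; the variable names and parameters are arbitrary), generate random "utility functions" over a small test environment and count how many blow up or ignore the state entirely:

```python
# Toy sketch: random "utility functions" as random expression trees over a
# tiny test environment (three state variables). Purely illustrative.
import math
import random

STATE_VARS = ["x", "y", "z"]

def random_expr(depth):
    """Build a random arithmetic expression as a Python source string."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(STATE_VARS + [str(random.randint(-5, 5))])
    op = random.choice(["+", "-", "*", "/", "log"])
    if op == "log":
        return "math.log(%s)" % random_expr(depth - 1)
    return "(%s %s %s)" % (random_expr(depth - 1), op, random_expr(depth - 1))

# A handful of random test states to evaluate each "utility function" on.
test_states = [{v: random.uniform(-10, 10) for v in STATE_VARS} for _ in range(20)]

crashed = ok = constant = 0
for _ in range(5000):
    src = random_expr(depth=4)
    try:
        values = [eval(src, {"math": math}, state) for state in test_states]
    except (ZeroDivisionError, ValueError, OverflowError):
        crashed += 1        # blows up on perfectly ordinary states
        continue
    ok += 1
    if len(set(values)) == 1:
        constant += 1       # assigns the same "utility" to every state

print("crashed:", crashed, "evaluated:", ok, "of which constant:", constant)
```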
The problem I have with those arguments is that a: many things are plainly false, and b: you try to 'fix' stuff by bolting more and more conjuncts ('you can graft random utility functions onto well-functioning AIs') into your giant scary conjunction, instead of updating when contradicted. That's a definite sign of rationalization. It can also always be done no matter how much counter-argument there is - you can always add something to the scary conjunction to make it happen. Adding conditions to a conjunction should decrease its probability.
I'd rather be concerned with implementations of functions, like Turing machine tapes, or C code, or x86 instructions, or the like.
In any case the point is rather moot, because the function is human-generated. Hopefully humans can do better than random, albeit I wouldn't wager on this - the FAI attempts are potentially worrisome, as humans are sloppy programmers, and bugged FAIs would follow different statistics entirely. Still, I would expect bugged FAIs to be predominantly self-destructive. (I'm just not sure whether the non-self-destructive bugged FAI attempts are predominantly mankind-destroying or not.)
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
“What are you doing?”, asked Minsky.
“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.
“Why is the net wired randomly?”, asked Minsky.
“I do not want it to have any preconceptions of how to play”, Sussman said.
Minsky then shut his eyes.
“Why do you close your eyes?”, Sussman asked his teacher.
“So that the room will be empty.”
At that moment, Sussman was enlightened.
-- AI Koans
How do you think the "Greenpeace by default" AI might define either "harm" or "value", and "life"?
How do you think the "Greenpeace by default" AI might define either "harm" or "value", and "life"?
It simply won't. Harm, value, life - we never defined those; they are commonly agreed-upon labels which we apply to things for communication purposes, and this works on a limited set of things that already exist, but it does not define anything outside the context of this limited set.
It would have maximization of some sort of complexity metric (perhaps while acting conservatively and penalizing actions it can't undo, to avoid self-harm in the form of cornering oneself), which it first uses on itself to self-improve for a while without even defining what self is. Consider evolution as an example; it doesn't really define fitness in the way that humans do. It doesn't work like: okay, we'll maximize the fitness that is defined so-and-so, so that's what we should do.
edit: that is to say, it doesn't define 'life' or 'harm'. It has a simple goal system involving some metrics, which incidentally prevents self-harm and permits self-improvement, in the sense that we would describe it that way, much as we would describe the shooting-at-the-short-part-of-the-visible-spectrum robot as a blue-minimizing one (albeit that is not a very good analogy, as we define blue and minimization independently of the robot).
Kaj, this is an excellent article focusing on why an AGI will have a hard time adopting a model of the world similar to the ones that humans have.
However, I think that Ben's main hangup about the scary idea is that he doesn't believe in the complexity and fragility of moral values. In this article he gives "Growth, Choice, and Joy" as a sufficient value system for friendliness. He knows that these terms conceal "a vast mass of ambiguity, subtlety and human history," but still, I think this is where Goertzel and SI differ.
The term "formalization" doesn't seem to fit (it's something more like formulation of an argument), and I'm not sure propagating Goetzel's term "Scary Idea" is a good idea.
Good point - I changed the title of the document. (I'm letting "the Scary Idea" remain in the topic of this post, so that people can quickly see what this thread is about.)
I think 'scary idea' is a very appropriate description. At the same time, many would try to hide the fact that the idea is scary, to appear more rational and less rationalizing. Scary stuff does not tend to result in the most sensible reasoning - look at war-on-terror spending vs. spending that money on any other form of death prevention.
edit: that is to say, scary ideas, given our biases when scared, need a lower prior for the validity of the reasoning. People try to get others to assign a higher prior to their ideas than would be reasonable, by disguising the mechanism of arrival at the idea.
Is it worth thinking about not just a single AGI system, but technological development in general? Here is an outline for an argument - the actual argument requires a lot of filling in.
Definitions:
Someone creates an AGI. Then one of the following is true:
1. The AGI becomes a singleton. This isn't a job that we would trust to any current human, so for it to be safe the AGI would need not just human-level ethics but really amazing ethics. This is where arguments about the fragility of value and Kaj_Sotala's document come in.
2. The AGI doesn't become a singleton but it creates another AGI that does. This can be rolled into 1 (if we don't distinguish between "creates" and "becomes") or it can be rolled into 3 (if we don't make the distinction between humans and AGIs acting as programmers).
3. The AGI doesn't become a singleton and doesn't create one either. Then we just wait for someone to develop the next AGI.
Notes on point 3:
AGI will only be Friendly if its goals are the kinds of goals that we would want it to have
At the risk of losing my precious karma, I'll play the devil's advocate and say I disagree.
First some definitions: "Friendly" (AI), according to Wikipedia, is one that is beneficial to humanity (not a human buddy or pet). "General" in AGI means not problem-specific (narrow AI).
My counterexample is an AI system that lacks any motivations, goals or actuators. Think of an AIXI system (or, realistically, a system that approximates it), and subtract any reward mechanisms. It just models its world (looking for short programs that describe its input). You could use it to make (super-intelligent) predictions about the future. This seems clearly beneficial to humanity (until it falls into malicious human hands, but that's beside the argument you are making).
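To make the shape of this concrete, here is a minimal sketch (my own toy, nowhere near an actual AIXI approximation in power; the class and method names are just illustrative) of a system that only observes and predicts - there is no reward channel and no actuator anywhere in it:

```python
# A pure predictor: it builds a model of its input stream and emits
# predictions. No reward signal, no goals, no actuators -- just modelling.
# (A crude order-2 Markov model, purely to illustrate the structure.)
from collections import defaultdict, Counter

class Predictor:
    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(Counter)   # context -> counts of next symbol
        self.history = []

    def observe(self, symbol):
        """Update the world-model with a new observation."""
        ctx = tuple(self.history[-self.order:])
        self.counts[ctx][symbol] += 1
        self.history.append(symbol)

    def predict(self):
        """Return the most probable next symbol under the current model."""
        ctx = tuple(self.history[-self.order:])
        if not self.counts[ctx]:
            return None                      # no data for this context yet
        return self.counts[ctx].most_common(1)[0][0]

p = Predictor()
for ch in "abcabcabcab":
    print("predicted:", p.predict(), "saw:", ch)
    p.observe(ch)
```

The point is only structural: nothing in this loop selects actions, so "Friendly" vs. "Unfriendly" goal architectures don't obviously apply to it, however powerful the modelling part becomes.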
I think you've got a good point, and folks have been voted up for saying the same thing in the past...
That would make (human[s] + predictor) into an optimization process that was powerful beyond the human[s]' ability to steer. You might see a nice-looking prediction, but you won't understand the value of the details, or the value of the means used to achieve it. (These would be called trade-offs in a goal-directed mind, but nothing weighs them here.)
It also won't be reliable to look for models in which you are predicted not to hit the Emergency Regret Button, as that may just find models in which your regret evaluator is modified.
Here's my draft document Concepts are Difficult, and Unfriendliness is the Default. (Google Docs, commenting enabled.) Despite the name, it's still informal and would need a lot more references, but it could be written up to a proper paper if people felt that the reasoning was solid.
Here's my introduction:
And here's my conclusion:
For the actual argumentation defending the various premises, see the linked document. I have a feeling that there are still several conceptual distinctions that I should be making but am not, but I figured that the easiest way to find the problems would be to have people tell me what points they find unclear or disagreeable.