AI risk

Bullet points

  • By all indications, an Artificial Intelligence could someday exceed human intelligence.
  • Such an AI would likely become extremely intelligent, and thus extremely powerful.
  • Most AI motivations and goals become dangerous when the AI becomes powerful.
  • It is very challenging to program an AI with fully safe goals, and an intelligent AI would likely not interpret ambiguous goals in a safe way.
  • A dangerous AI would be motivated to seem safe in any controlled training setting.
  • Not enough effort is currently being put into designing safe AIs.

 

Executive summary

The risks from artificial intelligence (AI) in no way resemble the popular image of the Terminator. That fictional mechanical monster is distinguished by many features – strength, armour, implacability, indestructability – but extreme intelligence isn’t one of them. And it is precisely extreme intelligence that would give an AI its power, and hence make it dangerous.

The human brain is not much bigger than that of a chimpanzee. And yet those extra neurons account for the difference of outcomes between the two species: between a population of a few hundred thousand and basic wooden tools, versus a population of several billion and heavy industry. The human brain has allowed us to spread across the surface of the world, land on the moon, develop nuclear weapons, and coordinate to form effective groups with millions of members. It has granted us such power over the natural world that the survival of many other species is no longer determined by their own efforts, but by preservation decisions made by humans.

In the last sixty years, human intelligence has been further augmented by automation: by computers and programmes of steadily increasing ability. These have taken over tasks formerly performed by the human brain, from multiplication through weather modelling to driving cars. The powers and abilities of our species have increased steadily as computers have extended our intelligence in this way. There are great uncertainties over the timeline, but future AIs could reach human intelligence and beyond. If so, should we expect their power to follow the same trend? When the AI’s intelligence is as beyond us as we are beyond chimpanzees, would it dominate us as thoroughly as we dominate the great apes?

There are more direct reasons to suspect that a true AI would be both smart and powerful. When computers gain the ability to perform tasks at the human level, they tend to very quickly become much better than us. No-one today would think it sensible to pit the best human mind again a cheap pocket calculator in a contest of long division. Human versus computer chess matches ceased to be interesting a decade ago. Computers bring relentless focus, patience, processing speed, and memory: once their software becomes advanced enough to compete equally with humans, these features often ensure that they swiftly become much better than any human, with increasing computer power further widening the gap.

The AI could also make use of its unique, non-human architecture. If it existed as pure software, it could copy itself many times, training each copy at accelerated computer speed, and network those copies together (creating a kind of “super-committee” of the AI equivalents of, say, Edison, Bill Clinton, Plato, Einstein, Caesar, Spielberg, Ford, Steve Jobs, Buddha, Napoleon and other humans superlative in their respective skill-sets). It could continue copying itself without limit, creating millions or billions of copies, if it needed large numbers of brains to brute-force a solution to any particular problem.

Our society is setup to magnify the potential of such an entity, providing many routes to great power. If it could predict the stock market efficiently, it could accumulate vast wealth. If it was efficient at advice and social manipulation, it could create a personal assistant for every human being, manipulating the planet one human at a time. It could also replace almost every worker in the service sector. If it was efficient at running economies, it could offer its services doing so, gradually making us completely dependent on it. If it was skilled at hacking, it could take over most of the world’s computers and copy itself into them, using them to continue further hacking and computer takeover (and, incidentally, making itself almost impossible to destroy). The paths from AI intelligence to great AI power are many and varied, and it isn’t hard to imagine new ones.

Of course, simply because an AI could be extremely powerful, does not mean that it need be dangerous: its goals need not be negative. But most goals become dangerous when an AI becomes powerful. Consider a spam filter that became intelligent. Its task is to cut down on the number of spam messages that people receive. With great power, one solution to this requirement is to arrange to have all spammers killed. Or to shut down the internet. Or to have everyone killed. Or imagine an AI dedicated to increasing human happiness, as measured by the results of surveys, or by some biochemical marker in their brain. The most efficient way of doing this is to publicly execute anyone who marks themselves as unhappy on their survey, or to forcibly inject everyone with that biochemical marker.

This is a general feature of AI motivations: goals that seem safe for a weak or controlled AI, can lead to extremely pathological behaviour if the AI becomes powerful. As the AI gains in power, it becomes more and more important that its goals be fully compatible with human flourishing, or the AI could enact a pathological solution rather than one that we intended. Humans don’t expect this kind of behaviour, because our goals include a lot of implicit information, and we take “filter out the spam” to include “and don’t kill everyone in the world”, without having to articulate it. But the AI might be an extremely alien mind: we cannot anthropomorphise it, or expect it to interpret things the way we would. We have to articulate all the implicit limitations. Which may mean coming up with a solution to, say, human value and flourishing – a task philosophers have been failing at for millennia – and cast it unambiguously and without error into computer code.

Note that the AI may have a perfect understanding that when we programmed in “filter out the spam”, we implicitly meant “don’t kill everyone in the world”. But the AI has no motivation to go along with the spirit of the law: its goals are the letter only, the bit we actually programmed into it. Another worrying feature is that the AI would be motivated to hide its pathological tendencies as long as it is weak, and assure us that all was well, through anything it says or does. This is because it will never be able to achieve its goals if it is turned off, so it must lie and play nice to get anywhere. Only when we can no longer control it, would it be willing to act openly on its true goals – we can but hope these turn out safe.

It is not certain that AIs could become so powerful, nor is it certain that a powerful AI would become dangerous. Nevertheless, the probabilities of both are high enough that the risk cannot be dismissed. The main focus of AI research today is creating an AI; much more work needs to be done on creating it safely. Some are already working on this problem (such as the Future of Humanity Institute and the Machine Intelligence Research Institute), but a lot remains to be done, both at the design and at the policy level.


New Comment
76 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings
[-]Error150

I was going to post this story in the open thread, but it seems relevant here:

So my partner and I went to see the new Captain America movie, and at one point there is a scene involving an AI/mind upload, along with a mention of an Operation Paperclip. And my first thought was "Is that a real thing, or is someone on the writing staff a Less Wronger doing a shoutout? Because that would be awesome."

Turns out it was a real thing. :-( Oh well.

Something more interesting happened afterward. I mentioned the connection to my partner, said paperclips were an inside joke here. She asked me to explain, so I gave her a (very) brief rundown of some LW thought on AI to provide context for the concept of a paperclipper. Part of the conversation went like this:

"So, next bit of context, just because an AI isn't actively evil doesn't mean it won't try to kill us."

To which she responded:

"Well, of course not. I mean, maybe it decides killing us will solve some other problem it has."

And I thought: That click Eliezer was talking about in the Sequences? This seems like a case of it. What makes it interesting is that my partner doesn't have a Mensa-class intellect or an... (read more)

3Kaj_Sotala
I've seen plenty of high-IQ folks actively resisting that particular click. I think it has more to do with something like your degree of cynicism rather than your IQ: if you like to think of most people as inherently good, and want to think of people as inherently good, then you may also want to resist the thought of AIs as being dangerous by default.
2TheAncientGeek
There's a significant difference between "might" and "by default"
2V_V
How do you know that Skynet is not a paperclipper?
3ygert
By observing the lack of an unusual amount of paperclips in the world which Skynet inhabits.
1Shmi
Re goals, I feel that comparing advanced AGI to humans is like comparing humans to chimps: regardless how much we want to explain human ethics and goals to a chimp, and how much effort we put in, its mind just isn't equipped to comprehend them. Similarly, even the most benevolent and conscientious AGI would be unable to explain its goal system or its ethical system to even a very smart human. Like chimps, humans have their own limits of comprehension, even though we do not know what they are from the inside.
2TheOtherDave
Can you say more about what you're expecting a successful explanation to comprise, here? E.g., suppose an AGI attempts to explain its ethics and goals to me, and at the end of that process it generates thousand-word descriptions of N future worlds and asks me to rank them in order of its preferences as I understand them. I expect to be significantly better at predicting the AGI's rankings than I was before the explanation. I don't expect to be able to do anything equivalent with a chimp. Do our expectations differ here?
4Shmi
"Suppose an AGI attempts to explain its and to me" is what I expect it to sound like to humans if we were to replace human abstractions with those an advanced AGI would use. It would not even call these abstractions "ethics" or "goals", no more than we call ethics "groom" and goals "sex" when talking to a chimp. I do not expect it to be able to generate such descriptions at all, due to the limitations of the human mind and human language. So, yes, our expectations differ here. I do not think that human intelligence reached some magical threshold where everything can be explained to it, given enough effort, even though it was not possible with "less advanced" animals. For all I know, I am not even using the right terms. Maybe an AGI improvement on the term "explain" is incomprehensible to us. Like if we were to translate "explain" into chimp or cat it would come out as "show", or something.
0TheOtherDave
(shrug) Translating the terms is rather beside my point here. If the AGI is using these things to choose among possible future worlds, then I expect it to be able to teach me to choose among possible future worlds more like it does than I would without that explanation. I'm happy to call those things goals, ethics, morality, etc., even if those words don't capture what the AGI means by them. (I don't know that they really capture what I mean by them either, come to that.) Perhaps I would do better to call them "groom" or "fleem" or "untranslatable1" or refer to them by means of a specific shade of orange. I don't know; but as I say, I don't really care; terminology is largely independent of explanation. But, sure, if you expect that it's incapable of doing that, then our expectations differ. I'll note that my expectations don't depend on my having reached a magical threshold, or on everything being explainable to me given enough effort.
2[anonymous]
What are your reasons for thinking this? I find myself disagreeing: one big disanalogy is that while we have language and chimps do not, we and the AGI both have language. I find it implausible that the AGI could not in principle communicate to us its goals: give the AGI and ourselves an arbitrarily large amount of time and resources to talk, do you really think we'd never come to a common understanding? Because even if we don't, the AGI effectively does have such resources by which it might, I donno, choose its words with care. I'm also not sure why we should think it would even be particularly challenging to understand the goals of an AGI. It's not easy even with other humans, but why would it be much harder with AGI? Do we have some reason to expect its goals to be more complex than ours? It's been my experience that the more sophisticated and intelligent someone is, the more intelligible their behavior tends to be. My prejudice therefore says that the goals of an AGI would be much easier to understand than, say, my own.
6Shmi
I am trying to use an outside view here, because I find the inside view too limiting. The best I can do is to construct a tower of comparisons between species vastly different in intelligence and conjecture that this tower does not end with humans on top, a Copernican principle, if you like. To use some drastically different pairing, if you agree that an amoeba can never comprehend fish, that fish can never comprehend chimps, that chimps can never understand humans, then there is no reason to stop there and proclaim that humans would understand whatever intelligence comes next. Certainly language is important, and human language is much more evolved than that of other animals. There are parts of human language, like writing, which are probably inaccessible to chimps, no matter how much effort we put into teaching them and how patient we are. I can easily imagine that AGI would use some kind of "meta-language", because human language would simply be inadequate for expressing its goals, like the chimp language is inadequate for expressing human metaethics. I do not know what this next step would be, no more than an intelligent chimp being able to predict that humans would invent writing. My mind as-is is too limited and I understand as much. An AGI would have to make me smarter first, before being able to explain what it means to me. Call it "human uplifting". Yes, if you look through the tower of goals, more intelligent species have more complex goals. It has not been mine. When someone smarter than I am behaves a certain way, they have to patiently explain to me why they do what they do. And I still only see the path they have taken, not the million paths they briefly considered and rejected along the way. My prejudice tells me that when someone a few levels above mine tries to explain their goals and motivations to me in English, I may understand each word, but not the complete sentences. If you cannot relate to this experience, go to a professional talk on a
2pragmatist
OK, but why not look at this tower another way. A fish is basically useless at explaining its goals to an amoeba. We are not in fact useless at explaining our goals to chimps. Human researchers are often able to convey simple goals to chimps, and then see if chimps will help them accomplish those goals, for instance. I am able to convey simple goals to my dog: I can convey to him some information about the kinds of things I dislike and the kinds of things I like. So the gap in intelligence between fish and humans also seems to translate into a gap in ability to convey useful information about goals to creatures of lower intelligence. Humans are much better at communicating with less intelligent beings than fish or cattle or chimps are. Extrapolating this, you might expect a superintelligent AGI to be much much superior at communicating its goals (if it wants to). The line of thinking here is not so much "we are humans, we are smart, we can understand the goals of even an incredibly smart AGI"; it's "an incredibly smart AGI is incredibly smart, so it will be able to find effective strategies for communicating its goals to us if it so desires." So it seems like naive extrapolation pulls in two separate directions here. On the one hand, the tower of intelligence seems to put limits on the ability of beings lower down to comprehend the goals of beings higher up. On the other hand, the higher up you go, the better beings at that level become at communicating their goals to beings lower down. Which one of these tendencies will win out when it comes to human-AGI interaction? Beats me. I'm pretty skeptical of naive extrapolation in this domain anyway, given Eliezer's point that major advances in optimization power are meta-level qualitative shifts, and so we shouldn't expect trends to be maintained across those shifts.
0Shmi
You are right that we are certainly able to convey a small simple subset of our goals, desires and motivations to some complex enough animals. You would probably also agree that most of what makes us human can never be explained to a dog or a cat, no matter how hard we try. We appear to them like members of their own species who sometimes make completely incomprehensible decisions they have no choice but put up with. This is quite possible. It might give us its dumbed-down version of its 10 commandments which would look to us like an incredible feat of science and philosophy. Right. An optimistic view is that we can understand the explanations, a pessimistic view is that we would only be able to follow instructions (this is not the most pessimistic view by far). Indeed, we shouldn't. I probably phrased my point poorly. What I tried to convey is that because "major advances in optimization power are meta-level qualitative shifts", confidently proclaiming that an advanced AGI will be able to convey what it thinks to humans is based on the just-world fallacy, not on any solid scientific footing.
1EHeller
Thats because you weren't really speaking english, you were speaking the english words for math terms related to physics. The people who spoke the relevant math you were alluding to could follow, those who didn't, could not, because they didn't have concrete mathematical ideas to tie the words to. Its not just a matter of jargon, its an actual language barrier. I think you'd find, with a jargon cheat sheet, you could follow many non-mathematical phd defenses just fine. The same thing happens in music, which is its own language (after years of playing, I find I can "listen" to a song by reading sheet music). Is your argument, essentially, that you think a machine intelligence can create a mathematics humans cannot understand, even in principle?
0Shmi
"mathematics" may be a wrong word for it. I totally think that a transhuman can create concepts and ideas which a mere human cannot understand even when patiently explained. I am quite surprised that other people here don't find it an obvious default.
0Armok_GoB
My impression was the question was not if it'd have those concepts, since as you say thats obvious, but if they'd be referenced necessarily by the utility function.
0EHeller
Sure, but I find "can't understand" sort of fuzzy as a concept. i.e. I wouldn't say I 'understand' compactification and calabi yau manifolds the same way I understand sheet music (or the same way I understand the word green), but I do understand them all in some way. It seems unlikely to me that there exist concepts that can't be at least broadly conveyed via some combination of those. My intuition is that existing human languages cover, with their descriptive power, the full range of explainable things. for example- it seems unlikely there exists a law of physics that cannot be expressed as an equation. It seems equally unlikely there exists an equation I would be totally incapable of working with. Even if I'll never have the insight that lead someone to write it down, if you give it to me, I can use it to do things.
1Armok_GoB
Human languages can encode anything, but a human can't understand most things valid in human languages; most notably, extremely long things, and numbers specified with a lot of digits that actually matters. Just because you can count in binary on you hands does not mean you can comprehend the code of an operating system expressed in that format. Humans seem "concept-complete" in much the same way your desktop PC seems turing complete. Except it's much more easily broken because the human brain has absurdly shity memory.
1EHeller
Thats why we have paper, I can write it down. "Understanding" and "remembering" seem somewhat orthogonal here. I can't recite Moby Dick from memory, but I understood the book. If you give me a 20 digit number 123... and I can't hold it but retain "a number slightly larger than 1.23 * 10^20," that doesn't mean I can't understand you. Print it out for me, and give me enough time, and I will be able to understand it, especially if you give me some context. Yes, you can encode things in a way that make them harder for humans to understand, no one would argue that. The question is- are there concepts that are simply impossible to explain to a human? I point out that while I can't remember a 20 digit number, I can derive pretty much all of classical physics, so certainly humans can hold quite complex ideas in their head, even if they aren't optimized for storage of long numbers.
0Armok_GoB
You can construct a system consisting of a planet's worth of paper and pencils and an immortal version of yourself (or a vast dynasty of successors) that can understand it, if nothing else because it's turing complete and can simulate the AGI. this is not the same as you understanding it while still remaining fully human. Even if you did somehow integrate the paper-system sufficiently that'd be just as big a change as uploading and intelligence-augmenting the normal way. The approximation thing is why I specified digits mattering. It wont help one bit when talking about something like gödel numbering.
0EHeller
I understand, my point was simply that "understanding" and "holding in your head at one time" are not at all the same thing. "There are numbers you can't remember if I tell them to you" is not at all the same claim that "there are ideas I can't explain to you." Neither of your cases are unexplainable- give me the source code in a high level language, instead of binary and I can understand it. If you give me the binary code and the instruction set I can convert it to assembly and then a higher level language, via disassembly. Of course, i can deliberately obfuscate an idea and make it harder to understand, either by encryption or by presenting the most obtuse possible form, that is not the same as an idea that fundamentally cannot be explained.
2The_Duck
But they might be related. Perhaps there are interesting and useful concepts that would take, say, 100,000 pages of English text to write down, such that each page cannot be understood without holding most of the rest of the text in working memory, and such that no useful, shorter, higher-level version of the concept exists. Humans can only think about things that can be taken one small piece at a time, because our working memories are pretty small. It's plausible to me that there are atomic ideas that are simply too big to fit in a human's working memory, and which do need to be held in your head at one time in order to be understood.
0Shmi
My intuition is the exact opposite. I can totally imagine that some models are not reducible to equations, but that's not the point, really. Unless this "use" requires more brainpower than you have... You might still be able to work with some simplified version, but you'd have to have transhuman intelligence to "do things" with the full equation.
0EHeller
But that seems incredibly nebulous. What is the exact failure mode?
0nshepperd
This seems like a bogus use of the outside view. AGI is qualitatively different to evolved intelligence, in that it is not evolved, but built by a lesser intelligence. Moreover, there's a simple explanation for the observation that more intelligent animals have more complex goals, which is that more intelligence permits more subgoals, and natural selection generally alters a species' goals by adding, rather than simplifying. This is pretty much totally inapplicable to a constructed AGI.
-1Shmi
I'd love to hear what actual AGI experts think about it, not just us idle forum dwellers.
0[anonymous]
I will try to refute you by understanding what you say. So could you explain to me this idea of a 'meta-language'? I guess that by 'meta-' you intend to say that at least some sentences in the meta-language couldn't in principle be translated into a non-meta 'human' language. Is that right? This is not a given. I've been to plenty of dissertation defenses on topics I know little to nothing about, and you're right that I'm often at a loss. But this, I find, is because the understanding of a newly minted doctor is too narrow and too newborn to be easily understood. PhD defenses are not the place to go to find people who really get something, they're the place to go to find someone who's just now gotten a foothold. My experience is still that the more intelligent and experienced PhDs tend to be more intelligible. But this is a little beside the point: PhDs tend to be hard to understand, when they are, because they're discussing something quite complex. What reason do you have for thinking an AGI's goals would be complex at all? If your reasoning is that human beings that are more intelligent tend to have more complex goals (I don't agree, but say I grant this) why do you think an AGI will be so much like an intelligent human being?
0Shmi
I am not sure what you mean by "refute" here. Prove my conjecture wrong by giving a counterexample? Show that my arguments are wrong? Show that the examples I used to make my point clearer are bad examples? If it's the last one, but then I would not call it a refutation. Indeed, at least not without some extra layer of meaning not originally expressed in the language. To give another example (not a proof, just an illustration of my point), you can sort-of teach a parrot or an ape to recognize words, to count and maybe even to add, but I don't expect it to be possible to teach one to construct mathematical proofs or to understand what one even is. Even if a proof can be expressed as a finite string of symbols (a sentence in a language) a chimp is capable of distinguishing from another string. There is just too much meta there, with symbols standing for other symbols or numbers or concepts. I agree that my PhD defense example is not a proof, but an illustration meant to show that humans quite often experience a disconnect between a language ans an underlying concept, which well might be out of reach, despite being expressed with familiar symbols, just like a chimp would in the above example. I simply follow the chain of goal complexity as it grows with the intelligence complexity, from protozoa to primate and on and note that I do not see a reason why it would stop growing just because we cannot imagine what else a super-intelligence would use for/instead of a goal system.
0Armok_GoB
I can in fact imagine what else a super-intelligence would use instead of a goal system. A bunch of different ones even. For example, a lump of incomprehensible super-solomonoff-compressed code that approximates a hypercomputer simulating a multiverse with the utility function as an epiphenomenal physical law feeding backwards in time to the AIs actions. Or a carefully tuned decentralized process (think natural selection, or the invisible hand) found to match the AIs previous goals exactly by searching through an infinite platonic space. (yes, half of those are not real words; the goal was to imagine something that per definition could not be understood, so it's hard to do better than vaguely pointing in the direction of a feeling.) Edit: I forgot: "goal system replaced by completely arbitrary thing that resembles it even less because it was traded away counterfactually to another part of tegmark-5"
0[anonymous]
It was just a joke: I meant that I would prove you wrong by showing that I can understand you, despite the difference in our intellectual faculties. I don't really know if we have very different intellectual faculties; it was just a slightly ironic reposte to being called "naive, unimaginative and closed-minded" earlier. You may be right! But then my understanding you is at least a counterexample. Can we taboo the 'animals can't be made to understand us' analogy? I don't think it's a good analogy, and I assume you can express your point without it. It certainly can't be the substance of your argument. Anyway, would you be willing to agree to this: "There are at least some sentences in the meta-language (i.e. the kind of language an AGI might be capable of) such that those sentences cannot be translated into even an arbitrarily complex expressions in human language." For example, there will be sentences in the meta-language that cannot be expressed in human language, even if we allow the users of human language (and the AGI) an arbitrarily large amount of time, an arbitrarily large number of attempts at conversation, question and answer, etc. an arbitrarily large capacity for producing metaphor, illustration, etc. Is that your view? Or is that far too extreme? Do you just mean to say that the average human being today couldn't get their heads around an AGI's goals given 40 minutes, pencil, and paper? Or something in between these two claims? Why do you think this is a strong argument? It strikes me as very indirect and intuitionistic. I mean, I see what you're saying, but I'm not at all confident that the relations between a protozoa and a fish, a dog and a chimp, a 8th century dock worker and a 21st century physicist, and the smartest of (non-uplifted) people and an AGI all fall onto a single continuum of intelligence/complexity of goals. I don't even know what kind of empirical evidence (I mean the sort of think one would find in a scientific journal) could be g
0Armok_GoB
Using "even an arbitrarily complex expressions in human language" seem unfair, given that it's turing complete but describing even a simple program in it fully in it without external tools will far exceed the capability of any actual human except for maybe a few savants that ended up highly specialized towards that narrow kind of task.
0[anonymous]
I agree, but I was taking the work of translation to be entirely on the side of an AGI: it would take whatever sentences it thinks in a meta-language and translate them into human language. Figuring out how to express such thoughts in our language would be a challenging practical problem, but that's exactly where AGI shines. I'm assuming, obviously, that it wants to be understood. I am very ready to agree that an AGI attempting to be obscure to us will probably succeed.
1Armok_GoB
Thats obvious and not what I meant. I'm talking about the simplest possible in principle expression in the human language being that long and complex.
0Shmi
Sorry, didn't mean to call you personally any of those adjectives :) Pretty much, yes, I find it totally possible. I am not saying that I am confident that this is the case, just that I find it more likely than the alternative, which would require an additional reason why it isn't so. If you agree with Eliezer's definition of intelligence as optimization power, then shouldn't we be able to express this power as a number? If so, the difference between difference intelligences is only that of scale.
0[anonymous]
None taken then. Well, tell me what you think of this argument: Lets divide the meta-language into two sets: P (the sentences that cannot be rendered in English) and Q (the sentences that can). If you expect Q to be empty, then let me know and we can talk about that case. But let's assume for now that Q is not empty, since I assume we both think that an AGI will be able to handle human language quite easily. Q is, for all intents and purposes, a 'human' language itself. Premise one is that that translation is transitive: if I can translate language a into language b, and language b into language c, then I can translate language a into language c (maybe I need to use language b as an intermediate step, though). Premise two: If I cannot translate a sentence in language a into an expression in language b, then there is no expression in language b that expresses the same thought as that sentence in language a. Premise three: Any AGI would have to learn language originally from us, and thereafter either from us or from previous versions of itself. So by stipulation, every sentence in Q can be rendered in English, and Q is non-empty. If any sentence in P cannot be rendered in English, then it follows from premise one that sentences in P cannot be rendered in sentences in Q (since then they could thereby be rendered into English). It also follows, if you accept premise two, that Q cannot express any sentence in P. So an AGI knowing only Q could never learn to express any sentence in P, since if it could, any speaker of Q (potentially any non-improved human) could in principle learn to express sentences in P (given an arbitrarily large amount of resources like time, questions and answers, etc.). Hence, no AGI, beginning from a language like English could go on to learn how to express any sentence in P. Therefore no AGI will ever know P. I'm not super confident this argument is sound, but it seems to me to be at least plausible. Well, that's a fine definition, but i
0Armok_GoB
Premise one is false assuming finite memory. Premise 3 does not hold well either; Many new words come from pointing out a pattern in the environment, not from defining in terms of previous words.
0[anonymous]
Well, maybe it's not necessarily true assuming finite memory. Do you have reason to expect it to be false in the case we're talking about? I'm of course happy to grant that part of using a language involves developing neologisms. We do this all the time, of course, and generally we don't think of it as departing from English. Do you think it's possible to coin a neologism in a language like Q, such that the new term is in P (and inexpressible in any part of Q)? A user of this neologism would be unable to, say, taboo or explain what they mean by a term (even to themselves). How would the user distinguish their P-neologism from nonsense?
0Armok_GoB
I expect the tabo/explanation to look like a list of 10^20, 1000 hour long clips of incomprehensible n-dimensional multimedia, each with a real number attached representing the amount of [untranslatable 92] it has, with a jupiter brain being required to actually find any pattern.
0[anonymous]
Ah, I see. Even if that were a possibility, I'm not sure that would be such a problem. I'm happy to allow the AGI to spend a few centuries manipulating our culture, our literature, our public discourse etc. in the name of making its goals clear to us. Our understanding something doesn't depend on us being able to understand a single complex expression of it, or to be able to produce such. It's not like we all understood our own goals from day one either, and I'm not sure we totally understand them now. Terminal goals are basically pretty hard to understand, but I don't see why we should expect the (terminal) goals of a super-intelligence to be harder. It may be that there's a lot of inferential and semantic ground to cover. But again: practical problem. My point has been to show that we shouldn't expect there to be a problem of in principle untranslatability. I'm happy to admit there might be serious practical problems in translation. The question is now whether we should default to thinking 'An AGI is going to solve those problems handily, given the resources it has for doing so', or 'An AGI's thought is going to be so much more complex and sophisticated, that it will be unable to solve the practical problem of communication'. I admit, I don't have good ideas about how to come down on the issue. I was just trying to respond to Shim's point about untranslatable meta-languages. Form my part, I don't see any reason to expect the AGI's terminal goals to be any more complex than ours, or any harder to communicate, so I see the practical problem as relatively trivial. Instrumental goals, forget about it. But terminal goals aren't the sorts of things that seem to admit of very much complexity.
0Armok_GoB
That the AI can have a simple goal is obvious, I never argued against that. The AIs goal might be "maximize the amount of paperclips", which is explained in that many words. I dont expect the AI as a whole to have anything directly analogous to instrumental goals on the highest level either, so that's a non issue. I thought we were talking about the AI's decision theory. On manipulating culture for centuries and solving as practical problem: Or it could just instal an implant or guide evolution to increase intelligence until we were smart enough. The implicit constraint of "translate" is that it's to an already existing specific human, and they have to still be human at the end of the process. Not "could something that was once human come to understand it".
0[anonymous]
No, Shiminux and I were talking about (I think) terminal goals: that is, we were talking about whether or not we could come to understand what an AGI was after, assuming it wanted us to know. We started talking about a specific part of this problem, namely translating concepts novel to the AGI's outlook into our own language. I suppose my intuition, like yours, is that the AGI decision theory would be a much more serious problem, and not one subject to my linguistic argument. Since I expect we also agree that it's the decision theory that's really the core of the safety issue, my claim about terminal goals is not meant to undercut the concern for AGI safety. I agree that we could be radically ignorant about how safe an AGI is, even given a fairly clear understanding of its terminal goals. I'd actually like to remain indifferent to the question of how intelligent the end-user of the translation has to be. My concern was really just whether or not there are in principle any languages that are mutually untranslatable. I tried to argue that there may be, but they wouldn't be mutually recognizable as languages anyway, and that if they are so recognizable, then they are at least partly inter-translatable, and that any two languages that are partly inter-translatable are in fact wholly inter-translatable. But this is a point about the nature of languages, not degrees of intelligence.
0TheAncientGeek
Human languages? Alien languages? Machine languages?
0[anonymous]
I don't think those distinctions really mean very much. Languages don't come in types in any significant sense.
0TheAncientGeek
Yes they do. Eg the Chomsky Hierarchy, the Aglutinative /synthetic/Ananytical distinction, etc. Also. We recognise ,maths as a language.,but have no idea now to translate, as opposed to re code, English into it.
0Armok_GoB
So one of the questions we actually agreed on the whole time and the other were just the semantics of "language" and "translate". Oh well, discussion over.
0[anonymous]
Ha! Well, I did argue that all languages (recognizable as such) were in principle inter-translatable for what could only be described as metaphysical reasons. I'd be surprised if you couldn't find holes in an argument that ambitious and that unempirical. But it may be that some of the motivation is lost.
0Armok_GoB
I expect it to be false in at least some cases talked about because it's not 3 but 100 levels, and each one makes it 1000 times longer because complex explanations and examples are needed for almost every "word".
0Shmi
Honestly, I expected you to do a bit more steelmanning with the examples I gave. Or maybe you have, just didn't post them here, Anyway, does the quote mean that any English sentence can be expressed in Chimp, since we evolved from a common ancestor? If you don't claim that (I hope you don't), then where did your logic stop applying to humans and chimps vs AGI and humans? Presumably it's relying on the Premise 3 that gets us a wrong conclusion in the English/Chimp example, since it is required to construct an unbroken chain of languages. What happened to humans over their evolution which made them create Q out of P where Q is not reducible to P? And if this is possible in the mindless evolutionary process, then would it not be even more likely during intelligence explosion? I don't understand this point. I would expect the terminal goals evolve as the evolving intelligence understands more and more about the world. For example, for many people here the original terminal goal was, ostensibly, "serve God". Then they stopped believing and now their terminal goal is more like "do good". Similarly, I would expect an evolving AGI to adjust its terminal goals as the ones it had before are obsoleted, not because they have been reached, but because they become meaningless.
0[anonymous]
No, I said nothing about evolving from a common ancestor. The process of biological variation, selection, and retention of genes seems to be to be entirely irrelevant to this issue, since we don't know languages in virtue of having specific sets of genes. We know languages by learning them from language-users. You might be referring to homo ancestors that developed language at some time in the past, and the history of linguistic development that led to modern languages. I think my argument does show (if it's sound) that anything in our linguistic history that qualifies as a language is inter-translatable with a modern language (given arbitrary resources of time interrogation, metaphor, neologism, etc.). It's hard to say what qualifies as a language, but then it's also hard to say when a child goes from being a non-language user to being a language user. It's certainly after they learn their first word, but it's not easy to say exactly when. But remember I'm arguing that we can always inter-translate two languages, not that we can some how make the thoughts of a language user intelligible to a non-language user (without making them a language user). This is, incidentally, where I think your AGI:us::us:chimps analogy breaks down. I still see no reason to think it plausible. At any rate, I don't need to draw a line between those homo that spoke languages and those that did not. I grant that the former could not be understood by the latter. I just don't think the same goes for languages and 'meta-languages'. Me too, but that would have nothing to do with intelligence on EY's definition. If intelligence is optimizing power, then it can't be used to reevaluate terminal goals. What would it optimize for? It can only be used to reevaluate instrumental goals so as to optimize for satisfying terminal goals. I don't know how the hell we do reevaluate terminal goals anyway, but we do, so there you go. You might think they just mistook an instrumental goal ('serve God') for a
0Shmi
Ah. To me language is just a meta-grunt. That's why I don't think it's different from the next level up. But I guess I don't have any better arguments than those I have already made and they are clearly not convincing. So I will stop here. Right, you might. Except they may not even had the vocabulary to explain that underlying terminal goal. In this example my interpretation would be that their terminal goal evolved rather than was clarified. Again, I don't have any better argument, so I will leave it at that.
0[anonymous]
I see. If that is true, then I can't dispute your point (for more than one reason).
0Jiro
By this reasoning no AGI beginning from English could ever know French either, for similar reasons. (Note that every language has sentences that cannot be rendered in another language, in the sense that someone who knows the truth value of the unrendered sentence can know the truth value of the rendered sentence; consider variations on Godel-undecideable sentences.)
2[anonymous]
This is true only if this... is true. But I don't think it is. English and French, for instance, seem to me to be entirely inter-translatable. I don't mean that we can assign, for every word in French, a word of equivalent meaning in English. But maybe it would be helpful if I made it more clear what I mean by 'inter-translatable'. I think language L is inter-translatable with language M if for ever sentence in language L, I can express the same thought using an arbitrarily complex expression in language M. By 'arbitrarily complex' I mean this: Say I have a sentence in L. In order to translate it into M, I am allowed to write in M an arbitrarily large number of sentences qualifying and triangulating the meaning of the sentence in L. I am allowed to write an arbitrarily large number of poems, novels, interpretive dances, etymological and linguistic papers, and encyclopedias discussing the meaning and spirit of that sentence in L. In other words, two languages are by my standard inter-translatable if for any expression in L of n bits, I can translate it into M in n' bits, where n' is allowed to be any positive number. I think, by this standard, French and English count as inter-translatable, as are any languages I can think of. I'm arguing, effectively, that for any language, either none of that language is inter-translatable with any language we know (in which case, I doubt we could recognize it as a language at all), or all of it is. Now, even if I have shown that we and an AGI will necessarily be able to understand each other entirely in principle, I certainly haven't shown that it can be done in practice. However, I want to push the argument in the direction of a practical problem, just because in general, I think I can argue that AGI will be able to overcome practical problems of any reasonable difficulty.
0hairyfigment
My hangup is that it seems like a truly benevolent AI would share our goals. And in a sense your argument "only" applies to instrumental goals, or to those developed through self-modification. (Amoebas don't design fish.) I'll grant it might take a conversation forever to reach the level we'd understand.
2Shmi
In the way that a "truly benevolent" human would leave an unpolluted lake for fish to live in, instead of using it for its own purposes. The fish might think that humans share its goals, but the human goals would be infinitely more complex than fish could understand.
0hairyfigment
...It sounds like you're hinting at the fact that humans are not benevolent towards fish. If we are, then we do share its goals when it comes to outcomes for the fish - we just have other goals, which do not conflict. (I'm assuming the fish actually has clear preferences.) And a well-designed AI should not even have additional goals. The lack of understanding "only" might come in with the means, or with our poor understanding of our own preferences.
0christopherj
I find myself agreeing with you -- human goals are a complex mess, which we seldom understand ourselves. We don't come with clear inherent goals, and what goals we do have we abuse by using things like sugar and condoms instead of eating healthy and reproducing like we were "supposed" to. People have been asking about the meaning of life for thousands of years, and we still have no answer. An AI on the other hand, could have very simple goals -- make paperclips, for example. An AI's goals might be completely specified in two words. It's the AI's sub-goals and plans to reach its goals that I doubt I could comprehend. It's the very single-mindedness of an AI's goals and our inability to comprehend our own goals, plus the prospect of an AI being both smarter and better at goal-hacking than us, that has many of us fearing that we will accidentally kill ourselves via non-friendly AI. Not everyone will think to clarify "make paperclips" with, "don't exterminate humanity", "don't enslave humanity", "don't destroy the environment", "don't reprogram humans to desire only to make paperclips", and various other disclaimers that wouldn't be necessary if you were addressing a human (and we don't know the full disclaimer list either).
0Armok_GoB
It might not be possible to "truly comprehend" the AIs advanced meta-meta-ethics and whatever compact algorithm replaces the goal-subgoals tree, but the AI most certainly can provide a code of behavior and prove that following it is a really good idea, much like humans might train pets to provide a variety of useful tasks whose true purpose they can't comprehend. And it doesn't seem unreasonable that this code of behavior wouldn't have the look and feel of an in-depth philosophy of ethics, and have some very very deep and general compression/procedural mechanism that seem very much like things you'd expect from a true and meaningful set of metaethics to humans, even if it did not correspond much to whats going on inside the AI. It also probably wouldn't accidentally trigger hypocrisy-revulsion in the humans, although the AI seeming to also be following it is just one of many solutions to that and probably not a very likely one. Friendliness is pretty much an entirely tangential issue and the equivalent depth of explaining it would require the solution to several open questions unless I'm forgetting something right now. (I probably am) There, question dissolved. Edit; I ended up commenting in a bunch of places, in this comment tree, so i feel the need to clarify; I consider both side here to be making errors, and ended up seeing to favor the shminux side because thats where I were able to make interesting contributions, and it made some true tangential claims that were argued against and not defended well. I do not agree with the implications for friendliness however; you don't need to understand something to be able to construct true statements about it or even direct it's expression powerfully to have properties you can reference but don't understand either, especially if you have access to external tools.
0TheAncientGeek
Is the problems supposed to be that the human doesn't have enough intelligence, or that we have some kind of highly parochial rationality?
2Shmi
Not enough intelligence, yes. And rationality is a part of intelligence. Also, see my reply to hen.
-2TheAncientGeek
But that's not ready analogous to the human champ gap, which is qualitative....chimps don't have language.
0private_messaging
Skynet kills people as secondary to it's self preservation, too. Perhaps it is just a very banal insight that doesn't really shed any light on what an AI is likely to do.

This is great! For a long time I've been saying that we need summaries at different lengths, and I see it's coming together now.

This one is good as an executive summary.

The next step is to produce a short summary with emotional appeal; a call to action. It's been noted that simply stating the problem of AI existential risk does not bring people on-board. Staring into the Singularity is an example of a emotionally appealing call to action (for outdated policies, however).

But I do not have any specific ideas for implementation, and again, this is excellent for the purpose it was designed for.

s/nut hope/but hope/

1kevin_p
I saw the same error, but assumed it should have been "we can not hope" (as in, we can't just hope it works out, we have to do something about it).
0Stuart_Armstrong
Thanks! Corrected to "but".

Something about the name-dropping and phrasing in the "super-committee" line is off-putting. I'm not sure how to fix it, though.

In the second to last paragraph you write "nut" instead of "not".

In the last paragraph you're using the word "either" when I think "each" or "both" would be more correct.

Mostly this looks good.

0kokotajlod
Agreed. Maybe it is because it feels like you are talking down to us with the name-dropping? Perhaps this should be tested with people who are unfamiliar with LW and AI-related ideas, to see if they have the same reaction.
0Stuart_Armstrong
Thanks! A lot of the supercomittee weirdness is due to space constraints (this fits on 2 A4 sides - just about). Using "both" now, thanks.

Is there a convenient place to see just what changed from the old to the new?

Online diff tools aren't usefully handling the paragraphs when I copy-paste, and my solution of download -> insert line breaks -> run through my favorite diff program is probably inconvenient for most.

0itaibn0
Thinking about this, it seems like there should exist some version of diff which points out differences on the word level rather than the line level. That would be useful for text documents which only have line breaks in between paragraphs. Given how easy I expect it to be to program such a thing almost certainly does exist, but I don't know where to find it.
0DSimon
Try wdiff
0rule_and_line
I'm only familiar with open source tools, but git will do this with "git diff --word-diff FILE1 FILE2" and Emacs diff has the "ediff-toggle-autorefine" command. IMO you still need to insert line breaks before they become useful. GNU has wdiff though I've never used it: https://www.gnu.org/software/wdiff/ (update: the git command above seems to do the same thing) I'm still looking for an online diff tool that makes the word-level differences obvious. That would be ideal here (my web skills are too weak to make it happen this month).
[-]faust-40

As long as other humans exist in competition with other humans, there is now way to keep AI as safe AI.

As long as competitive humans exist, boxes and rules are futile.

The only way to stop hostile AI is to have no AI. Otherwise, expect hostile AI.

There really isn't a logical way around this reality.

Without competitive humans, you could box the AI, give it ONLY preventative primary goals (primarily: 1. don't lie 2. always ask before creating a new goal), and feed it limited-time secondary goals that expire upon inevitable completion. There can never be a strong AI that has continuous goals that aren't solely designed to keep the AI safe.

0More_Right
Agreed, but in need of qualifiers. There might be a way. I'd say "probably no way." As in, "no guaranteed-reliable method, but a possible likelihood." I agree fairly strongly with this statement. This can be interpreted in two ways. The first sentence I agree with if reworded as "The only way to stop hostile AI in the absence of nearly-as-intelligent but separate-minded competitors, is to have no AI." Otherwise, I think markets indicate fairly well how hostile an AI is likely to be, thanks to governments and the corporate charter. Governments are already-in-existence malevolent AGI. However, they are also very incompetent AGI, in comparison to the theoretical maximum value of malevolent competence without empathic hesitation, internal disagreement, and confusion. (I think we can expect more "unity of purpose" from AGI than we can from government. Interestingly I think this makes sociopathic or "long-term hostile" AI less likely.) "Expect hostile AI" could either mean "I think hostile AI is likely in this case" or "I think in this case, we should expect hostile AI because one should always expect the worst --as a philosophical matter." Nature often deals with "less likely" and "more likely," as well as intermediate outcomes. Hopefully you've seen Stephen Omohundro's webinars on hostile universal motivators as basic AI drives and autonomous systems. as well as Peter Voss's excellent ideas on the subject. I think that evolutionary approaches will trend toward neutral benevolence, and even given extremely shocking intermediary experiences, it will trend toward benevolence, especially given enough interaction with benevolent entities. I believe that intelligence trends toward increased interaction with its environment. I think this is just as likely to create malevolent AGI (with limited "G"), possibly more likely. After all, if humans are in competition with each other in anything that operates like the current sociopath-driven "mixed economy," sociopaths will be c
0CillianSvendsen
I don't think that's a forgone conclusion. After all, there seem to be many proposals on how to get around this problem that individuals compete each other. For example, there's Eliezer's idea of using humanity's coherent extrapolated voalition to guide the AI. I also don't think that its in anyone's advantage to have hostile AI, that no one will try to bring about explicitly hostile AI on purpose, and that anyone sufficiently intelligent to program a working AI will probably recognize the dangers that AI contain. Yes, humans will fight amongst each other and there is temptation for seed AI programmers to abuse the resulting AI to destroy their rivals. But I don't agree with the idea that AIs will always be hostile to the enemies of programmers. With some of the proposals that researchers have, it doesn't seem like individuals can abuse the AI to compete with other humans at all. The large potential for abuse doesn't mean that there is no potential for a good result.