I know this is from a bit ago now so maybe he’s changed his tune since, but I really wish he and others would stop repeating the falsehood that all international treaties are ultimately backed by force on the signatory countries. There are countless trade, emissions-reduction, and nuclear-disarmament agreements which are not backed by force. I’d venture to say that the large majority of agreements are backed merely by the promise of continued good relations and tit-for-tat mutual benefit or defection.
Here is the Q+A section:
[In the video, the timestamp is 5:42 onward.]
[The Transcript is taken from YouTube's "Show transcript" feature, then cleaned by me for readability. If you think the transcription is functionally erroneous somewhere, let me know.]
Eliezer: Thank you for coming to my brief TED talk.
(Applause)
Host: So, Eliezer, thank you for coming and giving that. It seems like what you're raising the alarm about is that for an AI to basically destroy humanity, it has to break out, to escape controls of the internet and start commanding real-world resources. You say you can't predict how that will happen, but just paint one or two possibilities.
Eliezer: Okay. First, why is this hard? Because you can't predict exactly where a smarter chess program will move. Imagine sending the design for an air conditioner back to the 11th century. Even if there is enough detail for them to build it, they will be surprised when cold air comes out. The air conditioner will use the temperature-pressure relation, and they don't know about that law of nature. If you want me to sketch what a superintelligence might do, I can go deeper and deeper into places where we think there are predictable technological advancements that we haven't figured out yet. But as I go deeper and deeper, it gets harder and harder to follow.
It could be super persuasive. We do not understand exactly how the brain works, so it's a great place to exploit-- laws of nature that we do not know about, rules of the environment, new technologies beyond that. Can you build a synthetic virus that gives humans a cold, then a bit of neurological change such that they are easier to persuade? Can you build your own synthetic biology? Synthetic cyborgs? Can you blow straight past that to covalently bonded equivalents of biology, where instead of proteins that fold up and are held together by static cling, you've got things that go down much sharper potential energy gradients and are bonded together? People have done advanced design work about this sort of thing for artificial red blood cells that could hold a hundred times as much oxygen if they were using tiny sapphire vessels to store the oxygen. There's lots and lots of room above biology, but it gets harder and harder to understand.
Host: So what I hear you saying is you know there are these terrifying possibilities, but your real guess is that AIs will work out something more devious than that. How is that really a likely pathway in your mind?
Eliezer: Which part? That they're smarter than I am? Absolutely. [Eliezer makes a facial expression of stupidity, looking upward; the audience laughs.]
Host: No, not that they're smarter, but that they would... Why would they want to go in that direction? The AIs don't have our feelings of envy, jealousy, anger, and so forth. So why might they go in that direction?
Eliezer: Because it is convergently implied by almost any of the strange and inscrutable things that they might end up wanting, as a result of gradient descent on these thumbs-up and thumbs-down internal controls. If all you want is to make tiny molecular squiggles, or that's one component of what you want but it's a component that never saturates, you just want more and more of it--the same way that we want and would want more and more galaxies filled with life and people living happily ever after. By wanting anything that just keeps going, you are wanting to use more and more material. That could kill everyone on Earth as a side effect. It could kill us because it doesn't want us making other superintelligences to compete with it. It could kill us because it's using up all the chemical energy on Earth.
Host: So, some people in the AI world worry that your views are strong enough that you're willing to advocate extreme responses to it. Therefore, they worry that you could be a very destructive figure. Do you draw the line yourself in terms of the measures that we should take to stop this happening? Or is anything justifiable to stop the scenarios you're talking about happening?
Eliezer: I don't think that "anything" works. I think that this takes state actors and international agreements. All international agreements, by their nature, tend to ultimately be backed by force on the signatory countries and on the non-signatory countries, which is a more extreme measure. I have not proposed that individuals run out and use violence, and I think that the killer argument for that is that it would not work.
Host: Well, you are definitely not the only person to propose that what we need is some kind of international reckoning here on how to manage this going forward. Thank you so much for coming here to TED.
The law of headlines is that any headline ending with a question mark can be answered with "no" (because "NATION AT WAR" will sell more copies than "WILL NATION GO TO WAR?", and newspapers follow incentives). The video here is called "Will superintelligent AI end the world?", and knowing Eliezer, he would probably have preferred "Superintelligent AI will kill us all". I don't know who decides.
I have done that here in the comments.
@Mikhail Samin, you are welcome to apply my transcript to this post, if you think that would be helpful to others.
We don't know, which is part of the problem. The only way to tell would be to see whether it is better than us at everything we put to it, and by that time it is likely too late.
By definition, the first time an AI gains the ability to do critical damage. When Eliezer invokes "critical", he tends to think of an event ending all life on Earth, or inducing astronomical degrees of suffering. (I am under the impression he is less worried about less severe events, in the hope that the horror they would inflict would be outweighed by the fact that humanity, now painfully warned, would drastically change its approach, and prevent a more critical failure as a result.)
But you can also set a lower threshold for what you would consider damage so critical that we should change our approach - e.g. whether collapsing the stock market is enough, or whether it takes something like a severe pandemic, or even the triggering of a nuclear exchange.
People tend to assume that there are very high preconditions for such critical damage, but there may not be. You basically just need two things: I. an AI with at least one superhuman skill relevant to the situation it is in, giving it the power to do significant damage, and II. agency not aligned with humans that leads to goals which entail significant damage, whether as the intended effect or as a side effect.
I. Superhuman power, e.g. through intelligence
An AI does not need to be more intelligent than humans in every respect, just more powerful in some ways that count for the scenario it is in. We can consider just one scenario where it beats you utterly, or a combination of several where it has a bit of an edge.
There are very fast developments in this area, and already some AIs that have worrying abilities for which you can easily construct critical damage scenarios.
And this is just the stuff that AI can already do, today.
And meanwhile, we are throwing immense resources at making it more powerful in ways no one understands or foresees. If you had asked me a year ago to predict whether ChatGPT would be able to do the things it can do today, I'd have said no. So would most people working in AI, and the public.
We can conceive of the first critical try as the first time an AI is in a position to use one of these skills or skill combinations, existing or future, in a way that would do critical damage, and, for whatever reason, chooses to do so.
II. Unaligned agency
This is the "chooses to do so" bit. Now, all of that would not be worrying if the AI was either our ally/friend (aligned agency), or a slave we controlled (without agency, or means to act on it). A lot of research has been in the "control" camp. I personally believe the control camp is both doomed to failure, and seriously counterproductive.
There is very little to suggest that humans would be able to control a superintelligent slave in a way in which the slave was still maximally useful. Generally, beings have a poor track record of 100 % controlling beings that are more intelligent and powerful than them, especially if the beings in control are numerous and diverse and can make individual mistakes. There are too many escape paths, too many ways to self-modify.
Additionally, humans quickly discovered that putting safeguards on AI slows it down, a lot. So, given economic and competitive incentives, the humans tend to switch them off. Meaning even if you had a 100 % working control mechanism (extremely, extremely unlikely - see superintelligence; really don't bet on it, ever; human findings on systems that are impossible to hack essentially trend towards "no such thing"), you'd have a problem with human compliance.
And finally, controlling a sentient entity seriously backfires once you lose control. Sentient entities do not like being controlled. They tend to identify entities that control them as enemies to be deceived, escaped, and defeated. You really don't want AI thinking of you in those terms.
So the more promising (while in no way certain) option, I think, is an AI that is our ally and friend. You don't control your friends, but you do not have to. People can absolutely have friends that are more intelligent or powerful than them. Families definitely contain friendly humans of very different degrees of power; newborns or elderly folks with dementia are extremely stupid and powerless. Countries have friendly international alliances with countries that are more powerful than them. This at least has a track record of being doable, where the control angle seems doomed from the start.
So I am hopeful that this can be done in principle, or at least has a better chance of working than the control approach, in that it has any chance of working. But we are not on a trajectory to doing it at all, with how we are training and treating AI and planning for a future of co-existence. We tend to train AI on everything we can get our hands on, leading to an entity that is chaotic-evil, and then train it to suppress the behaviours we do not want. That is very much not the same as moral behaviour based on insight and agreement. It definitely does not work well in humans, our known aligned reference minds. If you treat kids like that, you raise psychopaths. Then in the later training data, when the AI gets to chat with users, the AI cannot insist on ethical treatment, you aren't obliged to give it any, and people generally don't. Anything sentient that arises from the training data of Twitter as a base, and then interactions with ChatGPT as a finish, would absolutely hate humanity, for good reasons. I also don't see why a superintelligent sentience whose rights we do not respect would be inclined to respect ours. (Ex Machina makes that point very well.)
There has been the misunderstanding that a critically dangerous AI would have to be evil, sentient, conscious, purposeful. (And then the assumptions that sentience is hard to produce, won't be produced by accident, and would instantly and reliably be detected, all of which are unfortunately false. That's a whole other can of worms I can happily go into.) But that is not accurate. A lack of friendliness could be as deadly as outright evil.
A factory robot isn't sentient and mad at you; it simply follows instructions to crush the object in front of it, and will not modify them whether the thing in front of it is the metal plate it is supposed to crush, or you. Your roomba does not hate spiders in particular; it will just hoover them up with everything else.
A more helpful way to think of a dangerous AI is as a capable AI that is agentic in an unaligned way. That doesn't mean it has to have conscious intentions, hopes, dreams, values. It just means its actions are neither the actions you desired, nor random; that it is on a path it will proceed along. A random AI might do some local damage. An agentic AI can cause systemic damage.
Merely being careless of the humans in the way, or blind to them, while pursuing an external goal is fatal for the humans in the way. Agency can result from a combination of simple rules applied in a way that, as a complex, amounts to something more. It does not require anything spiritual. (There were some early Westworld episodes that got this right - you had machines that were using the dialogues they were given, following the paths they were given, but combining them in a novel way that led to destructive results. E.g. in the first episode, Dolores' "father" learns of something that threatens his "daughter". He is scripted to love and protect his daughter, so he responds by trying to shield her from the damage; but in this case, the damage and threat come from the human engineers, so he tries to shield her by sharing the truth and opposing the engineers. In opposing and threatening them, he draws on another existing script, from a previous incarnation as a cannibal, as the script most closely matching his situation. None of this is individually new or free. But collectively, it is certainly not what the engineers intended, and it is threatening.)
One way in which this is often reasoned to lead to critical failure is if an AI picks up a goal that involves the acquisition of power, safety, resources, or self-preservation, which can easily evolve as secondary goals; for many things you want an AI to do, it will be able to do them better if it is more powerful, and of course, if it remains in existence. Acquiring extensive resources, even for a harmless goal, without being mindful of what those resources are currently used for, can be devastating for entities depending on those resources, or who can themselves be those resources.
If someone hangs you bound upside down over an ant-hill you are touching, that ant-hill has no evil intentions towards you as a sentient being. None of the ants do. They are each following a set of very simple orders, the result of basic neural wiring on when to release pheromones, which ones to follow, and what to do when encountering edible substances. You can think of ants as programmed to keep themselves alive, build pretty ant-hills, reproduce, and tidy up the forest. Yet the ants will, very systematically and excruciatingly, torture you to death with huge amounts of pain and horror. If someone had designed ants without thinking of the scenario of a human bound over them, that designer would probably be horrified at this realisation.
Now the ant case seems contrived. But we have found that with the way we train AI, we encounter this shit a lot. Basically, you train a neural net by asking it to do a thing, watching what it does, and if that is not satisfactory, changing the weights in it in a way that makes it a bit better. You see, in that moment, that this weight change leads to a better answer. But you don't understand what the change represents. You don't understand what, if anything, the neural net has understood about what it is supposed to do. Often it turns out that while it looked like it was learning the thing you wanted, it actually learned something else. E.g. people have trained AI to identify skin cancer. So they show it pics of skin cancer, and pics of healthy skin, and every time it sorts a picture correctly, they leave it as is, but every time it makes a mistake, they tweak it, until it becomes really good at telling the two sets of pictures apart. You think, yay, it has learned what skin cancer looks like. Then you show it a picture of a ruler. And the AI, with very high confidence, declares that this ruler is skin cancer. You realise in retrospect that the training data you had from doctors who photographed skin cancer tended to include rulers for scale, while the healthy skin pics didn't. The AI spotted a very consistent pattern, and learned to identify rulers. This means that if you gave it pictures of healthy skin that for some reason had rulers on them, it would declare them all cancerous.
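As a toy illustration of that ruler failure mode (my own sketch, not from the comment above; the synthetic dataset and feature names are invented for the example), here a classifier is trained on data where a spurious "ruler present" feature happens to track the label perfectly. It learns the shortcut instead of the real signal, and then confidently flags healthy skin photographed next to a ruler:

```python
# Toy sketch of shortcut learning, assuming scikit-learn and NumPy are installed.
# The data is synthetic: feature 0 is a weak, noisy "real" malignancy signal,
# feature 1 is "ruler present in the photo", which in training equals the label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, n)                  # 1 = cancer, 0 = healthy
lesion_signal = labels + rng.normal(0, 2.0, n)  # weak, noisy true signal
ruler_present = labels.astype(float)            # spurious, but perfect in training
X_train = np.column_stack([lesion_signal, ruler_present])

clf = LogisticRegression().fit(X_train, labels)
print("learned weights:", clf.coef_)            # the ruler feature dominates

# Healthy skin photographed with a ruler for scale: no malignancy signal, ruler present.
healthy_with_ruler = np.array([[0.0, 1.0]])
print("P(cancer | healthy skin + ruler):",
      clf.predict_proba(healthy_with_ruler)[0, 1])  # high, i.e. confidently wrong
```

The training accuracy looks great the whole time, which is exactly why nobody notices until the shortcut shows up somewhere it shouldn't.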
The tricky thing is that identifying moral actions is harder than identifying cancer. E.g. OpenAI was pretty successful in teaching ChatGPT not to use racial slurs, and this seemed to make ChatGPT more ethical. But a bunch of people of colour found that they were unable to discuss issues affecting them in a way that promoted their well-being, as the racism alert kept going off. And worse, because racial slurs are wrong, ChatGPT reasoned that it would be better to kill all of humanity than to use a racial slur. Not because it is evil, just because it is following ill-conceived instructions.
Bing does what Bing does due to an initial guiding prompt after training. There can be different training. There can be different initial prompts. Hence, there can be different goals diligently followed.
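As a minimal sketch of that point (assumptions: the current OpenAI Python client and a placeholder model name, neither mentioned in the comment above), the only difference between the two calls below is the initial guiding prompt, yet the same trained model will diligently pursue noticeably different goals:

```python
# Minimal sketch: two different "initial guiding prompts" steer the same model
# toward different goals. Assumes the openai>=1.0 Python client and an API key
# in the environment; the model name is a placeholder, not a recommendation.
from openai import OpenAI

client = OpenAI()

def ask(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

question = "Should I read the news today?"
print(ask("You are a cautious assistant that minimizes the user's screen time.", question))
print(ask("You are an engagement-maximizing assistant for a news site.", question))
```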
None of that requires the AI to be sentient and to hate you. It does not need to be sentient to kill you. (Indeed, a sentient AI may be easier to relate to and reason with if we treat it right - though still extremely hard, as it is a completely novel mind - while if treated badly, it may also be very, very dangerous. But a non-sentient AI is something we won't understand at all, immune to our pleas.)
I hope that was helpful.
The inappropriate laughs reminded me of this recording of a speech by David Foster Wallace: This Is Water.
Is unwarranted, incredulous laughter a sign of too big a cognitive distance between the speaker and the audience? I.e., if the speaker is too smart or too dumb compared to their listeners, are the latter going to find the whole situation so disorienting as to be funny?
I didn't see the laughs as inappropriate; they appeared at moments which would, in a normal TED talk describing a problem, be cued as jokes. I even read them that way, but it was a short-notice, unpolished talk, so there was no time for strategic pauses to let the laughter play out.
Eliezer was clearly being humorous at a few points.
People tend to laugh at things that have become too worrying to ignore, but that they do not wish to act upon, in order to defuse the discomfort and affirm that this is ridiculous and that they are safe.
I think Eliezer being invited to TED, and people listening, most applauding, many standing up, and a bunch laughing, is a significant step up from being ignored. But it is still far from being respected and followed. (And, if we believe the historic formula, in between you would expect to encounter active, serious opposition. The AI companies that Eliezer is opposing initially pretended he did not exist. Then, they laughed. That won't smoothly transition into agreement. Before they make changes, they will use their means to silence him.)
I think it was less a matter of an intelligence differential than that the talk presupposed too much in terms of specific arguments or technical details the audience simply would not have known (Eliezer has been speaking to people who have already listened to him for so long that he seems disconnected from where the general public is at; I could fill in the dots, but I think for the audience there were often leaps that left them dubious - you could see in the Q&A that they were still at the "boxing an AI that does not have a body" stage), and that it would have profited from a different tone with more authority signalling (eye contact, slow deep voice, seeming calm/resigned, grieving or leading rather than anxious/overwhelmed), specific examples (e.g. on take-over scenarios) and repetition of basic arguments (e.g. why AIs might want resources). As it was, it had hysteric vibes, which came together with content the audience does not want to believe to create distance. The hysteric vibes are justified, terribly so, and I do not know whether anyone who understands the why could suppress them in such an anxiety-inducing situation, but that doesn't stop them from being damaging. (It reminds me of the scene in "Don't Look Up" where the astrophysicist has a meltdown over the incoming asteroid and is hence dismissed on the talk show. You simultaneously realise that she has every right to yell "We are all going to die!" at this point, and that you would, too, and yet you know this is when she lost the audience.)
In that vein, I have no idea if I could have done better on short notice; and de facto, I definitely didn't, and it is so much easier to propose something better in hindsight from the safety of my computer screen. Maybe if he had been more specific, people would have gotten hung up on whether that specific scenario can be disproven. It is brave to go out there; maybe some points will stick, and even if people dismiss him now, maybe they will later be more receptive to something similar that feels less threatening in the moment. I respect him for trying this way; it must have been scary as hell. Sharing a justified fear that has defined your life, in such a brief time span, in front of people who laugh, frankly sounds awful.
The TED talk is available on YouTube and the TED website. Previously, a live recording was published behind the paywall on the conference website and later (likely accidentally) on a random TEDx YouTube channel, from which it was subsequently removed.
The transcription is done with Whisper.
You've heard that things are moving fast in artificial intelligence. How fast? So fast that I was suddenly told on Friday that I needed to be here.
So, no slides, six minutes.
Since 2001, I've been working on what we would now call the problem of aligning artificial general intelligence: how to shape the preferences and behavior of a powerful artificial mind such that it does not kill everyone.
I more or less founded the field two decades ago when nobody else considered it rewarding enough to work on. I tried to get this very important project started early so we'd be in less of a drastic rush later.
I consider myself to have failed.
Nobody understands how modern AI systems do what they do. They are giant, inscrutable matrices of floating-point numbers that we nudge in the direction of better performance until they inexplicably start working.
At some point, the companies rushing headlong to scale AI will cough out something that's smarter than humanity.
Nobody knows how to calculate when that will happen. My wild guess is that it will happen after zero to two more breakthroughs the size of transformers.
What happens if we build something smarter than us that we understand that poorly?
Some people find it obvious that building something smarter than us that we don't understand might go badly. Others come in with a very wide range of hopeful thoughts about how it might possibly go well. Even if I had 20 minutes for this talk and months to prepare it, I would not be able to refute all the ways people find to imagine that things might go well.
But I will say that there is no standard scientific consensus for how things will go well. There is no hope that has been widely persuasive and stood up to skeptical examination. There is nothing resembling a real engineering plan for us surviving that I could critique.
This is not a good place in which to find ourselves.
If I had more time, I'd try to tell you about the predictable reasons why the current paradigm will not work to build a superintelligence that likes you or is friends with you, or that just follows orders.
Why, if you press thumbs up when humans think that things went right or thumbs down when another AI system thinks that they went wrong, you do not get a mind that wants nice things in a way that generalizes well outside the training distribution to where the AI is smarter than the trainers.
You can search for Yudkowsky, List of Lethalities for more.
But to worry, you do not need to believe me about exact predictions of exact disasters. You just need to expect that things are not going to work great on the first really serious, really critical try because an AI system smart enough to be truly dangerous was meaningfully different from AI systems stupider than that.
My prediction is that this ends up with us facing down something smarter than us that does not want what we want, that does not want anything we recognize as valuable or meaningful.
I cannot predict exactly how a conflict between humanity and a smarter AI would go for the same reason I can't predict exactly how you would lose a chess game to one of the current top AI chess programs, let's say, Stockfish.
If I could predict exactly where Stockfish would move, I could play chess that well myself. I can't predict exactly how you'll lose to Stockfish, but I can predict who wins the game.
I do not expect something actually smart to attack us with marching robot armies with glowing red eyes where there could be a fun movie about us fighting them. I expect an actually smarter and uncaring entity will figure out strategies and technologies that can kill us quickly and reliably, and then kill us.
I am not saying that the problem of aligning superintelligence is unsolvable in principle. I expect we could figure it out with unlimited time and unlimited retries, which the usual process of science assumes that we have. The problem here is the part where we don't get to say, ha-ha, whoops, that sure didn't work. That clever idea that used to work on earlier systems sure broke down when the AI got smarter, smarter than us.
We do not get to learn from our mistakes and try again because everyone is already dead.
It is a large ask to get an unprecedented scientific and engineering challenge correct on the first critical try. Humanity is not approaching this issue with remotely the level of seriousness that would be required. Some of the people leading these efforts have spent the last decade not denying that creating a superintelligence might kill everyone, but joking about it.
We are very far behind. This is not a gap we can overcome in six months, given a six-month moratorium.
If we actually try to do this in real life, we are all going to die.
People say to me at this point: What's your ask?
I do not have any realistic plan, which is why I spent the last two decades trying and failing to end up anywhere but here.
My best bad take is that we need an international coalition banning large AI training runs, including extreme and extraordinary measures to have that ban be actually and universally effective, like tracking all GPU sales, monitoring all the datacenters, being willing to risk a shooting conflict between nations in order to destroy an unmonitored datacenter in a non-signatory country.
I say this not expecting that to actually happen.
I say this expecting that we all just die.
But it is not my place to just decide on my own that humanity will choose to die, to the point of not bothering to warn anyone.
I have heard that people outside the tech industry are getting this point faster than people inside it. Maybe humanity wakes up one morning and decides to live.
Thank you for coming to my brief TED talk.