I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present, though, see how one would come to think it was overwhelmingly likely.
Suppose you went through the following exercise. For each scenario described under "What it might look like if this gap matters", ask:
and collected the obstacles, you might assemble a list like this one, which might update you toward AI x-risk being "overwhelmingly likely". (Personally, if I had to put a number on it, I'd say 80%.)
Agree directionally. I made a similar point in my review of "Is power-seeking AI an existential risk?":
In one sentence, my concern is that the framing of the report and decomposition is more like “avoid existential catastrophe” than “achieve a state where existential catastrophe is extremely unlikely and we are fulfilling humanity’s potential”, and this will bias readers toward lower estimates.
Do you think there's a way to reframe my position in a way that you'd agree with, or at least don't strongly disagree with? (In other words, I'm not sure how much of the disagreement is with the substance of what I'm saying vs the way I'm saying it.) Or, to approach this another way, how would you state/frame your own position on this topic?
Great post! I think this captures a lot of why I'm not ultradoomy (only, er, 45%-ish doomy, at the moment), especially A and B. I think it's at least possible that our reality is on easymode, where muddling could conceivably put an AI into close enough territory to not trigger an oops.
I'd be even less doomy if I agreed with the counterarguments in C. Unfortunately, I can't shake the suspicion that superintelligence is the kind of ridiculously powerful lever that would magnify small oopses into the largest possible oopses.
Hypothetically, if we took a clever human's general capacity for problem solving, stripped it of limitations like getting bored or tired, got rid of its pesky intuitions around ethics, and sped it up by a factor of 1,000 times... I'd be very worried about what it would be able to do. Even without greater capacity for insight or an enhanced working memory, simply thinking really fast would be a broken superpower.
Such an entity might not be able to recreate the technology of modern civilization starting from scratch (both in resources and knowledge) in the stone age within 30 years, primarily due to physical interaction requirements. But starting from anything like m...
There's a nearby kind of obvious but rarely directly addressed generalized version of one of your arguments, which is that ML learns complex functions all the time, so why should human values be any different? I rarely see this discussed, and I thought the replies from Nate and the ELK related difficulties were important to have out in the open, so thanks a lot for including the face learning <-> human values learning analogy.
For at least about ten years, in my experience, people in this community have been saying the main problem isn't getting the AI to understand human values, it's getting the AI to have human values. Unfortunately the phrase "learn human values" is sometimes used to mean "have human values" and sometimes used to mean "understand human values", hence the confusion.
To have human values the AI needs to either learn them or have them instilled. EY’s complexity/fragility of human values argument is directed against early proposals for learning human values for an AI's utility function. Obviously at some point a powerful AI will learn a model of human values somewhere in its world model, but that is irrelevant because that doesn’t affect its utility function and it’s far too late - the AI needs a robust model of human values well before it becomes superhuman.
Katja's point is valid - DL did not fail in the way EY predicted, and the success of DL gives hope that we can learn superhuman models of human values to steer developing AI (which again is completely unrelated to the AI later learning human values somewhere in its world model).
I agree Eliezer is wrong, though that's not enough to ensure success. In particular, you need to avoid inner alignment issues like deceptive alignment, where it learns values very well only for instrumental convergence reasons, and once it's strong, it overthrows the humans and pursues whatever terminal goal it has.
Ronny Fernandez asked me, Nate, and Eliezer for our take on Twitter. Copying over Nate's reply:
briefly: A) narrow non-optimizers can exist but won't matter; B) wake me when the allegedly maximally-facelike image looks human; C) ribosomes show that cognition-bound superpowers exist; D) humans can't stack into superintelligent corps, but if they could then yes plz value-load
(tbc, I appreciate Katja saying all that. Hooray for stating what you think, and hooray again when it's potentially locally unpopular! If I were less harried I might give more than a tweet of engagement, but in reality I probably won't, sorry.)
I asked Nate what he meant by B, and he said:
section B seemed to me to be saying "AIs can figure out what a face is". And, ok, sure, but if you ask them for the faciest possible thing, it's not very human!facelike.
which is one of many objections, ofc (others including "ah yes but can you aim it at a human concept" )
Note: "ask them for the faciest possible thing" seems confused.
How I would've interpreted this if I were talking with another ML researcher is "Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
I'm guessing what he has in mind is more like "take a GAN discriminator / image classifier & find the image that maxes out the face logit", but if so, why is that the relevant operationalization? It doesn't correspond to how such a model is actually used.
EDIT: Here is what the first looks like for StyleGAN2-ADA.
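To make the "set the Gaussian latents to zeros" operationalization concrete, here is a minimal sketch in plain Python. It only illustrates that the all-zeros vector is the highest-density point of a standard Gaussian latent prior. The `toy_generator` function is a hypothetical stand-in, not StyleGAN2-ADA or any real model, and the dimension and sample count are arbitrary assumptions.

```python
import math
import random


def log_density(z):
    """Log-density of latent vector z under an isotropic standard Gaussian prior."""
    d = len(z)
    return -0.5 * sum(x * x for x in z) - 0.5 * d * math.log(2 * math.pi)


def toy_generator(z):
    """Hypothetical stand-in for a trained generator: a fixed deterministic map."""
    return [math.tanh(0.7 * x + 0.1) for x in z]


random.seed(0)
dim = 8  # assumed latent dimension, chosen only for illustration

# The mode of the latent prior: every coordinate set to zero.
mode = [0.0] * dim

# Ordinary samples from the prior, for comparison.
samples = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(1000)]
best_random = max(log_density(z) for z in samples)

# Decoding the mode gives a single deterministic output.
mode_output = toy_generator(mode)

print("log-density at zero latent:", log_density(mode))
print("best log-density among random samples:", best_random)
```

Note that this is density in latent space. Whether the decoded output is also the most probable image depends on the generator, which this sketch does not model.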
It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
(also, the maximum likelihood essay looks like a single word, or if you normalize for length, the same word repeated over and over again up to the context length)
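The parenthetical about the maximum-likelihood essay can be checked on a toy model. The sketch below assumes a unigram model with three made-up tokens and probabilities, where each token is drawn independently. Real language models are conditional, so this only illustrates the direction of the effect.

```python
import itertools
import math

# Hypothetical unigram "language model": each token is drawn independently.
token_probs = {"the": 0.5, "essay": 0.3, "argues": 0.2}


def log_likelihood(seq):
    """Log-probability of a token sequence under the unigram model."""
    return sum(math.log(token_probs[t]) for t in seq)


def most_likely_sequence(length):
    """Brute-force the highest-likelihood sequence of a given length."""
    return max(itertools.product(token_probs, repeat=length), key=log_likelihood)


best_by_length = {n: most_likely_sequence(n) for n in range(1, 5)}

for n, seq in best_by_length.items():
    print(n, seq, round(log_likelihood(seq), 3))
```

At every fixed length the winner is the most common token repeated, and without length normalization the shortest sequence has the highest likelihood of all.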
EY argues that human values are hard to learn. Katja uses human faces as an analogy, pointing out that ML systems learn natural concepts far easier than EY 2009 expected.
The analogy is between A: a function which maps noise to realistic images of human faces and B: a function which maps predicted future world states to utility scores similar to how a human would score them. The lesson is that since ML systems can learn A very well, they can probably also learn B.
Function A (human face generator) does not even use max-likelihood sampling and it isn't even an optimizer, so your operationalization is just confused. Nor is function B an optimizer itself.
I claim that A and B are in fact very disanalogous objects, and that the claim that A can be learned well does not imply that B can probably be learned well. I am very confused by your claims about the functions A and B not being optimizers, because to me this is true but also irrelevant.
The reason we want a function B that can map world states to utilities is so that we can optimize on that number. We want to select for world states that we think will have high utility using B; otherwise function B is pretty useless. Therefore, this function has to be reliable enough that putting lots of optimization pressure on it does not break it. This is not the same as claiming that the function itself is an optimizer or anything like that. Making something reliable against lots of optimization pressure is a lot harder than making it reliable in the training distribution.
The function A effectively allows you to sample from the distribution of faces. Function A does not have to be robust against adversarial optimization to approximate the distribution. The analogous function in the domain of human values would be a function that lets you sample from some prior distribution of world states, not one that scores utility of states.
More generally, I think the confusion here stems from the fact that a) robustness against optimization is far harder than modelling typical elements of a distribution, and b) distributions over states are fundamentally different objects from utility functions over states.
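A toy illustration of point (a), under assumptions that are entirely mine: the "true" utility is the made-up function 2x - x², the learned proxy is a least-squares line fit on a narrow training range, and "optimization pressure" is a search over a much wider range. It is a sketch of the general failure mode, not a model of any real reward-learning system.

```python
def true_utility(x):
    """Made-up ground-truth utility, peaking at x = 1."""
    return 2 * x - x * x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Training distribution: x in [0, 0.5], where the utility is nearly linear.
train_x = [i / 20 for i in range(11)]
slope, intercept = fit_line(train_x, [true_utility(x) for x in train_x])


def proxy(x):
    """Learned proxy for the utility: accurate on the training range only."""
    return slope * x + intercept


max_train_error = max(abs(proxy(x) - true_utility(x)) for x in train_x)

# Optimization pressure: search a much wider range than the proxy was fit on.
candidates = [i / 20 for i in range(201)]  # x in [0, 10]
x_proxy = max(candidates, key=proxy)
x_true = max(candidates, key=true_utility)

print("max error on training range:", max_train_error)
print("proxy-optimal x:", x_proxy, "true utility there:", true_utility(x_proxy))
print("truly optimal x:", x_true, "true utility there:", true_utility(x_true))
```

The proxy is close to the true utility everywhere it was trained, yet the point that maximizes it has very low true utility.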
Nate's analogy is confused: diffusion models do not generate convincing samples of faces by maximizing for faciness - see how they actually work, and make sure we agree there. This is important because previous systems (such as deepdream) could be described as maximizing for X, such that Nate's critique would be more relevant.
Your comment here about "optimizing for X-ness" indicates you also were adopting the wrong model of how diffusion models operate:
It's the relevant operationalization because in the context of an AI system optimizing for X-ness of states S, the thing that matters is not what the max-likelihood sample of some prior distribution over S is, but rather what the maximum X-ness sample looks like. In other words, if you're trying to write a really good essay, you don't care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
That simply isn't how diffusion models work. A diffusion model for essays would sample from realistic essays that summarize to some short prompt; so they absolutely do care about high likelihood from the distribution of human essay...
I object to your characterization that I am claiming that diffusion models work by maximizing faciness, or that I am confused about how diffusion models work. I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
(Unimportant nitpicking: This Person Does Not Exist doesn't actually use a diffusion model, but rather a StyleGAN trained on a face dataset.)
You're also eliding the difference between training an unconditional diffusion model on a face dataset and training an unconditional diffusion model over a general image dataset and doing classifier-based guidance. I've been talking about unconditional models on a face dataset, which does not optim...
First a reply to interpretations of previous words:
I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I'm confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don't optimize for faciness, they aren't a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I hope we agree that a discriminator which is trained only to recognize good essays robustly probably does not contain enough information to generate good essays, for the same reasons that an image discriminator does not contain enough information to generate good images - because the discriminator only learns the boundaries of words/categories over images, not the more complex embedded distribution of realistic images.
Optimizing only for faciness via a discriminator does not work well - that's the old deepdream approach. Opti...
I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predict that if you stop training the discriminator and keep training the generator, the generator starts generating weird looking nonhuman images.
This is relevant to Reinforcement Learning because of the actor-critic class of systems, where the actor is like the generator and the critic is like the discriminator. We'd ideally like the RL system to stay on course after we stop providing it with labels, but stopping labels means we stop training the critic. Which means that the actor is free to start generating adversarial policies that hack the critic, rather than policies that actually perform well in the way we'd want them to.
Upvoted because I agree with all of the above.
AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn't claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn't the point it's making. It claims that learned models of faces don't "leave anything important out" in the way that one might expect some key feature to be "left out" when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown learning such complex models is far easier than we might've thought, even if building adversarially robust classifiers is very hard. (As much as I'd like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)
Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values. Of course a super intelligent AI will have a good model of human values, the same way it will have a good model of engineering, chemistry, the ecological environment and physics. That doesn't mean it will do things that are aligned with its accurate model of human values.
I am not sure who thought that learning such models was much harder than it turned out to be. It seems clear that an AI will learn what human faces are before the AI is very dangerous to the world. It would have been extremely surprising to have a dangerous AGI incapable of learning what human faces are like.
Hmm, but I don't understand what relevance it has to alignment. The problem was never that the AI won't learn human values, it's that the AI won't care about human values
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us, so that very much was believed to be part of the problem (at least by the EY/MIRI/LW/etc crowd).
But that's all now mostly irrelevant - an altruistic AI probably doesn't even need to know or care about human values at all, as it can simply optimize for our empowerment - our future optionality or ability to do anything we want. (Some previous discussion here, and in these comments.)
I wasn't that active around the time of the sequences, but I had a good number of discussions with people, and the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).
I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
Around the time of the sequences (long before DL) it was much less obvious that AI could/would learn accurate models of complex human values before it killed us
the point "the AI will of course know what your values are, it just won't care" was made many times, and I am also pretty sure was made in the sequences
Notice I said "before it killed us". Sure, the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that's irrelevant because we need to instill its utility function long before that. See my reply here: this is well documented, and no amount of vague memories of conversations trumps the written evidence.
I don't think "empowerment" is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
I'm not entirely sure what people mean when they say "X won't survive heavy optimization pressure" - but for example the objective of modern diffusion models survives heavy optimization power.
External empowerment is very simple and it doesn't even require detailed modeling of the agent - they can just be a black box that produces outputs. I'm curious what you think is an example of "the kind of concept that particularly survives heavy optimization pressure".
Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence under optimization scaling between trajectories resulting from the difference between a utility function and some proxy of that utility function.
However due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling - and empowerment simply is that which they converge to.
In other words the empowerment of some agent P(X) is the utility function which minimizes trajectory distance to all/any reasonable agent utility functions U(X), regardless of their specific (potentially unknown) form.
Therefore empowerment is - by definition - the best possible proxy utility function (under optimization scaling).
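For readers who want something concrete to hold onto, here is a minimal sketch of empowerment under the standard information-theoretic definition from the empowerment literature: the channel capacity from an agent's action sequences to its resulting state. That definition is an assumption on my part, since the comment above describes the concept rather than spelling out a formula. For deterministic dynamics the capacity reduces to log2 of the number of distinct states reachable within the horizon. The grid world, action set, and horizon below are all hypothetical.

```python
import math
from itertools import product

# Hypothetical action set for an agent on a small square grid.
ACTIONS = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, -1),
    "down": (0, 1),
    "stay": (0, 0),
}


def step(state, action, size):
    """Deterministic move; walking into a wall leaves the agent in place."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < size and 0 <= ny < size:
        return (nx, ny)
    return state


def empowerment(state, horizon, size):
    """n-step empowerment in bits: log2 of distinct states reachable in `horizon` steps."""
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a, size)
        reachable.add(s)
    return math.log2(len(reachable))


center = empowerment((2, 2), horizon=2, size=5)
corner = empowerment((0, 0), horizon=2, size=5)

print("empowerment at the center:", center)
print("empowerment in the corner:", corner)
```

An agent in the middle of the grid can reach more distinct states than one in a corner, so it scores higher. That is the "future optionality" reading of the term.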
Let's apply some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically - with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value maximizin...
EY claims this will fail and instead learn a utility function of “smiles”, resulting in a SI which tiles the future light-cone of Earth with tiny molecular smiley-faces, in a paper literally titled "Complex Value Systems are Required to Realize Valuable Futures".
This is really misunderstanding what Eliezer is saying here. From my perspective, it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me. I am experiencing a good amount of frustration with you repeatedly saying things in this comment thread like "irrefutable proof" when it's just an obviously wrong statement (a fine one to arrive at when reading some random subset of Eliezer's writing, but a clearly wrong summary nevertheless).
Now to go back to the object level:
Eliezer is really not saying that the AI will fail to learn that there is something more complicated than smiles that the human is trying to point it to. He is explicitly saying "look, you won't know what the AI w...
This is really misunderstanding what Eliezer is saying here [...] it's been a decade of explaining to people almost once every two weeks that "yes, the AI will of course know what you care about, but it won't care", so you claiming that this is somehow a new claim related to the deep learning revolution seems completely crazy to me
I think this is much more ambiguous than you're making it out to be. In 2008's "Magical Categories", Yudkowsky wrote:
I shall call this the fallacy of magical categories—simple little words that turn out to carry all the desired functionality of the AI. Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences? Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.
I claim that this paragraph didn't age well in light of the deep learning revolution: "running a neural network [...] over a set of winning and losing sequences of chess moves" basically is how AlphaZero learns from self-play! As the Yudkowsky quote illustrates, it wasn't obvious in 2008 that this wo...
Katja says:
You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces?
Nate's comment:
B) wake me when the allegedly maximally-facelike image looks human;
Katja is talking about current ML systems and how the fragility issue EY predicted didn't materialize (actually it arguably did in earlier systems). Nate's comment is clearly referencing Katja's analogy - faciness - and he's clearly implying we haven't seen the problem with face generators yet because they haven't pushed the optimization hard enough to find the maximally-facelike image. But he's just wrong there - they don't have that problem, no matter how hard you scale their optimization power - and that is part of why Katja's analogy works so well at a deeper level: future ML systems do not work the way AI risk folks thought they would.
Diffusion models are relevant because they improve on conditional GANs by leveraging powerful pretrained discriminative foundation models and by allowing for greater optimization power at inference time, improvements that also could be applied to planning agents.
Also, we don’t know what would happen if we exactly optimized an image to maximize the activation of a particular human’s face detection circuitry. I expect that the result would be pretty eldritch as well.
We may be already doing that in case of cartoon faces with their exaggerated features. Cartoon faces don't look eldritch to us, but why would they?
Also from Ronny:
There's also an important disanalogy between generating/recognizing faces and learning 'human values', which is that humans are perfect human face recognizers but not perfect recognizers of worlds high in 'human values'.
That means that there might be world states or plans in the training data or generated by adversarial training that look to us, and ML trained to recognize these things the way we recognize them, like they are awesome, but are actually awful.
Great post!
A. Contra “superhuman AI systems will be ‘goal-directed’”
I somewhat agree, see Consequentialism & Corrigibility. I’m a bit unclear on whether this is intended as an argument for “AGI almost definitely won’t have a zealous drive to control the universe” versus “AGI won’t necessarily have a zealous drive to control the universe”. I agree with the latter but not the former.
Also, the more different groups make AGIs, the more likely it is that someone will make one with a “zealous drive to control the universe”. Then we have to think about whether the non-zealous ones will have solved the problem posed by the zealous ones. In this context, there starts to be a contradiction between “we don’t need to worry about the non-zealous ones because they won’t be doing hardcore long-term consequentialist planning” versus “we don’t need to worry about the zealous ones because the non-zealous ones are so powerful and foresightful that, whatever plan the latter might come up with, the former can preemptively think of it and defend against it”. More on this topic in a forthcoming post hopefully in the next couple weeks. (EDIT—I added the link)
...B. Contra “goal-directed AI systems’ goals
Thanks for writing this!
Regarding your point on corporations: One of the reasons to worry about some forms of AI is that they might soon build other, more powerful forms of AI. So the development of very human-like Ems, for example, might lead relatively quickly to the development of de novo AI, and so on; hence we worry about Ems even if we think extremely human-like Ems do not pose an x-risk on their own. In the same way, corporations are the ones moving forward fastest on building ML-based AI, and the misalignment between corporations and the long-term future of life on Earth is a very significant cause of the overall level of AI-related x-risk in the world today. So if someone had said 500 years ago, "hey, let's not build corporations because they will probably be subtly or overtly misaligned with us and that will lead to the destruction of life on Earth", then fast-forward to today and it seems like that person has been proven correct.
Here are my quick takes from skimming the post.
In short, the arguments I think are best are A1, B4, C3, C4, C5, C8, C9 and D. I don't find any of them devastating.
A1. Different calls to ‘goal-directedness’ don’t necessarily mean the same concept
I am not sure I parse this one. I am reading it as "AI systems might be more like imitators than optimizers" from the example, which I find moderately persuasive.
A2. Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk
I am not sure I understand this one either. I am reading it as "there might be no incentive for generality", which I don't find persuasive - I think there is a strong incentive.
B1. Small differences in utility functions may not be catastrophic
I don't find this persuasive. I think the evidence from optimization theory, where optimizers tend to set variables to extreme values, is enough to suggest this is not the default.
B2. Differences between AI and human values may be small
B3. Maybe value isn’t fragile
The only example we have of general intelligence (humans) seems to have strayed pretty far from evolutionary incentives, so I find this unpersuasive
B4. [AI might only care about] Short-term goals
I find ...
I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).
I have now published a conversation between Ege Erdil and Ronny Fernandez about this post. You can find it here.
One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.
I don't think this is quite fair. You created an argument outline that doesn't directly reference these things, so you can only blame yourself for excluding them unless you are claiming that such things have not been discussed extensively.
One extremely important difference between corporations and potential AGIs is the level of high-speed, high-bandwidth coordination (which has been discussed extensively) that may be possible for AGIs. If a massive corporation could be as internally coordinated and self-aligned as might be possible for an AGI, it would be absolutely terrifying. Imagine Elon Musk as a Borg Queen with everyone related to Tesla as part of the "collective" under his control...
Competence does not seem to aggressively overwhelm other advantages in humans:
[...]
g. One might counter-counter-argue that humans are very similar to one another in capability, so even if intelligence matters much more than other traits, you won’t see that by looking at the near-identical humans. This does not seem to be true. Often at least, the difference in performance between mediocre human performance and top level human performance is large, relative to the space below, iirc. For instance, in chess, the Elo difference between the best and worst players is about 2000, whereas the difference between the amateur play and random play is maybe 400-2800 (if you accept Chess StackExchange guesses as a reasonable proxy for the truth here).
The usage of capabilities/competence is inconsistent here. In points a-f, you argue that general intelligence doesn't aggressively overwhelm other advantages in humans. But in point g, the Elo difference between the best and worst players is determined less by general intelligence than by how much practice people have had.
If we instead consistently talk about domain-relevant skills: In the real world, we do see huge advantages from havin...
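For a sense of scale on the Elo gaps quoted above, here is a small sketch using the standard logistic Elo expected-score formula. The 400- and 2000-point gaps are the figures from the quoted passage. The absolute ratings passed in are arbitrary, since only the difference matters.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Only the rating difference matters, so the baseline of 0 is arbitrary.
gap_400 = expected_score(400, 0)    # about 0.909
gap_2000 = expected_score(2000, 0)  # about 0.99999

print("expected score with a 400-point edge:", gap_400)
print("expected score with a 2000-point edge:", gap_2000)
```

On this formula a 400-point edge already means winning roughly ten points in eleven, and a 2000-point edge leaves the weaker player with almost no expected score.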
Thank you for posting this, as I find it helpful for practicing my own skills of argumentation. Here are my brief counterarguments to your counterarguments, I'd appreciate it if anyone could point out any flaws in my logic:
A. Contra "superhuman AI systems will be goal-directed"
As far as I understand it, "intelligence" is the ability to achieve one's goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent -...
Eight examples, no cherry-picking:
Nit: Having a wall of images makes this post unnecessarily harder to read.
I'd recommend making a 4x2 collage with the photos so they don't take that much space.
I really like this post. I also like that you provide concrete and specific observables which you think would obtain under each counterargument. I found it refreshing to imagine so many non-orthodox futures.
Small differences in utility functions may not be catastrophic
For three months, I have been sitting on a post (originally) called "What's up with humans with different values not wanting to kill each other?". It seems to me like "value has to be perfect or Goodhart into oblivion" just... doesn't make sense, that isn't how the world works AFAICT. But I g...
Ege Erdil gave an important disanalogy between the problem of recognizing/generating a human face, and the problem of either learning human values, or learning what plans that advance human values are like. The disanalogy is that humans are near perfect human face recognizers, but we are not near perfect valuable world-state or value-advancing-plan recognizers. This means that if we trained an AI to either recognize valuable world-states or value-advancing plans, we would actually end up just training something that recognizes what we can recognize as val...
However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well.
A strategically aware utility maximizer would try to figure out what your expectations are, satisfy them while preparing a take-over, and strike decisively without warning. We should not expect to see an intermediate level of "great destruction".
Promoted to curated: I found engaging with this post quite valuable. I think in the end I disagree with the majority of arguments in it (or at least think they omit major considerations that have previously been discussed on LessWrong and the AI Alignment Forum), but I found thinking through these counterarguments and considering each one of them seriously a very valuable thing to do to help me flesh out my models of the AI X-Risk space.
There is a brief golden age of science before the newly low-hanging fruit are again plucked and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.
Not one of the main points of the post, but FWIW it seems to me that thinking could be considered the main bottleneck for medicine, if we can include simulation and modeling a la AlphaFold as thinking.
My guess is that with sufficient computation you could invent new treatments / drugs that are so overwhelmingly better than what we have now that regulatory or other bot...
Here's a selection of notes I wrote while reading this (in some cases substantially expanded with explanation).
...The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. G
I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).
I agree, but if there is any kind of evolutionary variation in the thing then surely the variations that move towards stronger goal-directedness will be favored.
I think that overcoming this molochian dynamic is the alignment problem: how do you build a powerful system ...
I really appreciate this post!
For instance, employers would often prefer employees who predictably follow rules than ones who try to forward company success in unforeseen ways.
Fascinatingly, EA employers in particular seem to seek employees who do try to forward organization goals in unforeseen ways!
I just want to say that I appreciate this post, and especially the "What it might look like if this gap matters" sections. They were super useful for contextualizing the more abstract arguments, and I often found myself scrolling down to read them before actually reading the corresponding section.
The argument overall proves too much about corporations
Does it? Aren't corporations the ones building ASI right now?
A few thoughts that occurred while reading
If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand.
Intelligence and speed might need to be considered separately. If an AI is only as smart as a human, but can run much faster, then "one AI" could potentially be more closely analogous to one human civilization than to one human.
...Another line of evidence is that for
Speed of intelligence growth is ambiguous
Three months ago, I learned that narcolepsy patients quite literally experience sleep and unconsciousness asynchronously, and synchronization is normally achieved through regulatory cells that produce hypocretin. Hypocretin, like anesthesia, acts on neuron microtubules. This has led me to a greatly increased interest and confidence in theory surrounding neuron microtubules as a processing unit, and I wonder if anyone in the AI community has considered the implications.
If microtubule lattices are storing or calcul...
A) You seem to agree that in principle more goal-directed agents would be more capable. I think this alone implies that those will be the dominant force in the future no matter if they are rare among many less goal-directed agents.
B) I'm deeply unsure about this and have conflicting intuitions. On the one hand, if you think total utilitarianism is true, any world where AI is not explicitly maximizing for total utility is much much worse than one where it is. On the other hand, I agree that humans are able to agree.
C) I think you are missing two key features...
Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly.
I am open to being corrected, but I do not recall ever seeing a requirement of "perfect" alignment in the cases made for doom. Eliezer Yudkowsky in "AGI Ruin: A List of Lethalities" only asks for 'this will not kill literally everyone'.
Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.
A sufficient criterion for a desire to cause catastrophe (distinct from having the means to cause catastrophe) is that the AI is sufficiently goal-directed to be influenced by Stephen Omohundro's "Basic AI Drives".
For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one.
It is possible that an AI will try to become more coherent and fail, but we are worried about the most capable AI and cannot rely on the hope that it will fail such a simple task. Being coherent is easy if the fruits are instrumental: Just look up the prices of the fruits.
"AI agents may not be radically superior to combinations of humans and non-agentic machines"
I'm not sure that the evidence supports this unless the non-agentic machines are also AI.
In particular: (i) humans are likely to subtract from this mix and (ii) AI is likely to be better than non-AI.
In the case of chess, after two decades of non-AI programming advances from the time that computers beat the best human, involving humans no longer provides an advantage over just using the computer programs. And, Alpha Zero fairly de...
Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster?
A simple example could be that the humans involved in the initial training are negative utilitarians. Once the AI is powerful enough, it would be able to implement omnicide rather than just curing diseases.
I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’
I think in its roots, AGI should have survival instinct as a goal. Everything else should be secondary. It's a hard choice, but if we want AGI to be like us, we have to follow that route. If its roots are different from ours, it will be close to impossible to replicate our behavior and our values.
(Crossposted from AI Impacts Blog)
This is going to be a list of holes I see in the basic argument for existential risk from superhuman AI systems[1].
To start, here’s an outline of what I take to be the basic case[2]:
I. If superhuman AI systems are built, any given system is likely to be ‘goal-directed’
Reasons to expect this:
II. If goal-directed superhuman AI systems are built, their desired outcomes will probably be about as bad as an empty universe by human lights
Reasons to expect this:
III. If most goal-directed superhuman AI systems have bad goals, the future will very likely be bad
That is, a set of ill-motivated goal-directed superhuman AI systems, of a scale likely to occur, would be capable of taking control over the future from humans. This is supported by at least one of the following being true:
Below is a list of gaps in the above, as I see it, and counterarguments. A ‘gap’ is not necessarily unfillable, and may have been filled in any of the countless writings on this topic that I haven’t read. I might even think that a given one can probably be filled. I just don’t know what goes in it.
This blog post is an attempt to run various arguments by you all on the way to making pages on AI Impacts about arguments for AI risk and corresponding counterarguments. At some point in that process I hope to also read others’ arguments, but this is not that day. So what you have here is a bunch of arguments that occur to me, not an exhaustive literature review.
Counterarguments
A. Contra “superhuman AI systems will be ‘goal-directed’”
Different calls to ‘goal-directedness’ don’t necessarily mean the same concept
‘Goal-directedness’ is a vague concept. It is unclear that the ‘goal-directednesses’ that are favored by economic pressure, training dynamics or coherence arguments (the component arguments in part I of the argument above) are the same ‘goal-directedness’ that implies a zealous drive to control the universe (i.e. that makes most possible goals very bad, fulfilling II above).
One well-defined concept of goal-directedness is ‘utility maximization’: always doing what maximizes a particular utility function, given a particular set of beliefs about the world.
Utility maximization does seem to quickly engender an interest in controlling literally everything, at least for many utility functions one might have[3]. If you want things to go a certain way, then you have reason to control anything which gives you any leverage over that, i.e. potentially all resources in the universe (i.e. agents have ‘convergent instrumental goals’). This is in serious conflict with anyone else with resource-sensitive goals, even if prima facie those goals didn’t look particularly opposed. For instance, a person who wants all things to be red and another person who wants all things to be cubes may not seem to be at odds, given that all things could be red cubes. However if these projects might each fail for lack of energy, then they are probably at odds.
Thus utility maximization is a notion of goal-directedness that allows Part II of the argument to work, by making a large class of goals deadly.
You might think that any other concept of ‘goal-directedness’ would also lead to this zealotry. If one is inclined toward outcome O in any plausible sense, then does one not have an interest in anything that might help procure O? No: if a system is not a ‘coherent’ agent, then it can have a tendency to bring about O in a range of circumstances, without this implying that it will take any given effective opportunity to pursue O. This assumption of consistent adherence to a particular evaluation of everything is part of utility maximization, not a law of physical systems. Call machines that push toward particular goals but are not utility maximizers pseudo-agents.
Can pseudo-agents exist? Yes—utility maximization is computationally intractable, so any physically existent ‘goal-directed’ entity is going to be a pseudo-agent. We are all pseudo-agents, at best. But it seems something like a spectrum. At one end is a thermostat, then maybe a thermostat with a better algorithm for adjusting the heat. Then maybe a thermostat which intelligently controls the windows. After a lot of honing, you might have a system much more like a utility-maximizer: a system that deftly seeks out and seizes well-priced opportunities to make your room 68 degrees—upgrading your house, buying R&D, influencing your culture, building a vast mining empire. Humans might not be very far on this spectrum, but they seem enough like utility-maximizers already to be alarming. (And it might not be well-considered as a one-dimensional spectrum—for instance, perhaps ‘tendency to modify oneself to become more coherent’ is a fairly different axis from ‘consistency of evaluations of options and outcomes’, and calling both ‘more agentic’ is obscuring.)
Nonetheless, it seems plausible that there is a large space of systems which strongly increase the chance of some desirable objective O occurring without even acting as much like maximizers of an identifiable utility function as humans would. For instance, without searching out novel ways of making O occur, or modifying themselves to be more consistently O-maximizing. Call these ‘weak pseudo-agents’.
For example, I can imagine a system constructed out of a huge number of ‘IF X THEN Y’ statements (reflexive responses), like ‘if body is in hallway, move North’, ‘if hands are by legs and body is in kitchen, raise hands to waist’.., equivalent to a kind of vector field of motions, such that for every particular state, there are directions that all the parts of you should be moving. I could imagine this being designed to fairly consistently cause O to happen within some context. However since such behavior would not be produced by a process optimizing O, you shouldn’t expect it to find new and strange routes to O, or to seek O reliably in novel circumstances. There appears to be zero pressure for this thing to become more coherent, unless its design already involves reflexes to move its thoughts in certain ways that lead it to change itself. I expect you could build a system like this that reliably runs around and tidies your house say, or runs your social media presence, without it containing any impetus to become a more coherent agent (because it doesn’t have any reflexes that lead to pondering self-improvement in this way).
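To make the reflex picture concrete, here is a minimal sketch of such a system as a plain lookup table. The states, actions, and the `REFLEXES` and `act` names are all hypothetical, chosen only to echo the examples above; nothing here is meant as a claim about how real systems are built.

```python
# Hypothetical reflex table: each entry maps an observed state to a fixed action.
# The states and actions are illustrative, loosely following the examples in the text.
REFLEXES = {
    ("hallway", "hands_by_legs"): "move_north",
    ("kitchen", "hands_by_legs"): "raise_hands_to_waist",
    ("kitchen", "hands_at_waist"): "pick_up_dish",
}


def act(location, posture):
    """Return the reflex for this state, or do nothing if the state is unfamiliar."""
    return REFLEXES.get((location, posture), "no_op")


print(act("hallway", "hands_by_legs"))  # a state the designer anticipated
print(act("garage", "hands_by_legs"))   # a novel state: no reflex fires
```

Nothing in this table searches for new routes to O or rewrites the table itself; in an unanticipated state it simply does nothing, which is the sense in which it is not optimizing.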
It is not clear that economic incentives generally favor the far end of this spectrum over weak pseudo-agency. There are incentives toward systems being more like utility maximizers, but also incentives against.
The reason any kind of ‘goal-directedness’ is incentivised in AI systems is that then the system can be given an objective by someone hoping to use their cognitive labor, and the system will make that objective happen. Whereas a similar non-agentic AI system might still do almost the same cognitive labor, but require an agent (such as a person) to look at the objective and decide what should be done to achieve it, then ask the system for that. Goal-directedness means automating this high-level strategizing.
Weak pseudo-agency fulfills this purpose to some extent, but not as well as utility maximization. However if we think that utility maximization is difficult to wield without great destruction, then that suggests a disincentive to creating systems with behavior closer to utility-maximization. Not just from the world being destroyed, but from the same dynamic causing more minor divergences from expectations, if the user can’t specify their own utility function well.
That is, if it is true that utility maximization tends to lead to very bad outcomes relative to any slightly different goals (in the absence of great advances in the field of AI alignment), then the most economically favored level of goal-directedness seems unlikely to be as far as possible toward utility maximization. More likely it is a level of pseudo-agency that achieves a lot of the users’ desires without bringing about sufficiently detrimental side effects to make it not worthwhile. (This is likely more agency than is socially optimal, since some of the side-effects will be harms to others, but there seems no reason to think that it is a very high degree of agency.)
Some minor but perhaps illustrative evidence: anecdotally, people prefer interacting with others who predictably carry out their roles or adhere to deontological constraints, rather than consequentialists in pursuit of broadly good but somewhat unknown goals. For instance, employers would often prefer employees who predictably follow rules to ones who try to forward company success in unforeseen ways.
The other arguments to expect goal-directed systems mentioned above seem more likely to suggest approximate utility-maximization rather than some other form of goal-directedness, but it isn’t that clear to me. I don’t know what kind of entity is most naturally produced by contemporary ML training. Perhaps someone else does. I would guess that it’s more like the reflex-based agent described above, at least at present. But present systems aren’t the concern.
Coherence arguments are arguments for being coherent a.k.a. maximizing a utility function, so one might think that they imply a force for utility maximization in particular. That seems broadly right. Though note that these are arguments that there is some pressure for the system to modify itself to become more coherent. What actually results from specific systems modifying themselves seems like it might have details not foreseen in an abstract argument merely suggesting that the status quo is suboptimal whenever it is not coherent. Starting from a state of arbitrary incoherence and moving iteratively in one of many pro-coherence directions produced by whatever whacky mind you currently have isn’t obviously guaranteed to increasingly approximate maximization of some sensical utility function. For instance, take an entity with a cycle of preferences, apples > bananas = oranges > pears > apples. The entity notices that it sometimes treats oranges as better than pears and sometimes worse. It tries to correct by adjusting the value of oranges to be the same as pears. The new utility function is exactly as incoherent as the old one. Probably moves like this are rarer than ones that make you more coherent in this situation, but I don’t know, and I also don’t know if this is a great model of the situation for incoherent systems that could become more coherent.
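The fruit example can be checked mechanically. The sketch below assumes one particular reading of ‘coherent’: preferences are coherent if, once items the entity is indifferent between are merged, there is no cycle of strict preferences. The `is_coherent` helper and its input format are my own illustrative choices.

```python
def is_coherent(strict, equal):
    """True if some utility function could represent these preferences.

    strict: list of (better, worse) pairs; equal: list of (a, b) indifference pairs.
    """
    # Merge items the entity is indifferent between into one class.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in equal:
        parent[find(a)] = find(b)

    # Build a graph of strict preferences between indifference classes.
    graph = {}
    for better, worse in strict:
        graph.setdefault(find(better), set()).add(find(worse))
        graph.setdefault(find(worse), set())

    # Depth-first search: a cycle of strict preferences means incoherence.
    state = {}  # node -> "visiting" or "done"

    def has_cycle(node):
        if state.get(node) == "visiting":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(has_cycle(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return not any(has_cycle(n) for n in graph)


# apples > bananas = oranges > pears > apples
before = is_coherent(
    strict=[("apples", "bananas"), ("oranges", "pears"), ("pears", "apples")],
    equal=[("bananas", "oranges")],
)
# After the attempted fix: oranges = pears, everything else unchanged.
after = is_coherent(
    strict=[("apples", "bananas"), ("pears", "apples")],
    equal=[("bananas", "oranges"), ("oranges", "pears")],
)
# A different repair: drop "pears > apples" instead.
repaired = is_coherent(
    strict=[("apples", "bananas"), ("oranges", "pears")],
    equal=[("bananas", "oranges")],
)
print(before, after, repaired)  # False False True
```

Under this reading the attempted correction leaves the cycle intact, while a different local change removes it, so whether a given self-modification helps depends on which move the entity happens to make.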
What it might look like if this gap matters: AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.
Ambiguously strong forces for goal-directedness need to meet an ambiguously high bar to cause a risk
The forces for goal-directedness mentioned in I are presumably of finite strength. For instance, if coherence arguments correspond to pressure for machines to become more like utility maximizers, there is an empirical answer to how fast that would happen with a given system. There is also an empirical answer to how ‘much’ goal directedness is needed to bring about disaster, supposing that utility maximization would bring about disaster and, say, being a rock wouldn’t. Without investigating these empirical details, it is unclear whether a particular qualitatively identified force for goal-directedness will cause disaster within a particular time.
What it might look like if this gap matters: There are not that many systems doing something like utility maximization in the new AI economy. Demand is mostly for systems more like GPT or DALL-E, which transform inputs in some known way without reference to the world, rather than ‘trying’ to bring about an outcome. Maybe the world was headed for more of the latter, but ethical and safety concerns reduced desire for it, and it wasn’t that hard to do something else. Companies setting out to make non-agentic AI systems have no trouble doing so. Incoherent AIs are never observed making themselves more coherent, and training has never produced an agent unexpectedly. There are lots of vaguely agentic things, but they don’t pose much of a problem. There are a few things at least as agentic as humans, but they are a small part of the economy.
B. Contra “goal-directed AI systems’ goals will be bad”
Small differences in utility functions may not be catastrophic
Arguably, humans are likely to have somewhat different values to one another even after arbitrary reflection. If so, there is some extended region of the space of possible values that the values of different humans fall within. That is, ‘human values’ is not a single point.
If the values of misaligned AI systems fall within that region, this would not appear to be worse in expectation than the situation where the long-run future was determined by the values of humans other than you. (This may still be a huge loss of value relative to the alternative, if a future determined by your own values is vastly better than that chosen by a different human, and if you also expected to get some small fraction of the future, and will now get much less. These conditions seem non-obvious however, and if they obtain you should worry about more general problems than AI.)
Plausibly even a single human, after reflecting, could on their own come to different places in a whole region of specific values, depending on somewhat arbitrary features of how the reflecting period went. In that case, even the values-on-reflection of a single human is an extended region of values space, and an AI which is only slightly misaligned could be the same as some version of you after reflecting.
There is a further larger region, ‘that which can be reliably enough aligned with typical human values via incentives in the environment’, which is arguably larger than the circle containing most human values. Human society makes use of this a lot: for instance, most of the time particularly evil humans don’t do anything too objectionable because it isn’t in their interests. This region is probably smaller for more capable creatures such as advanced AIs, but still it is some size.
Thus it seems that some amount of AI divergence from your own values is probably broadly fine, i.e. not worse than what you should otherwise expect without AI.
Thus in order to arrive at a conclusion of doom, it is not enough to argue that we cannot align AI perfectly. The question is a quantitative one of whether we can get it close enough. And how close is ‘close enough’ is not known.
What it might look like if this gap matters: there are many superintelligent goal-directed AI systems around. They are trained to have human-like goals, but we know that their training is imperfect and none of them has goals exactly like those presented in training. However if you just heard about a particular system’s intentions, you wouldn’t be able to guess if it was an AI or a human. Things happen much faster than they were, because superintelligent AI is superintelligent, but not obviously in a direction less broadly in line with human goals than when humans were in charge.
Differences between AI and human values may be small
AI trained to have human-like goals will have something close to human-like goals. How close? Call it d, for a particular occasion of training AI.
If d doesn’t have to be 0 for safety (from above), then there is a question of whether it is an acceptable size.
I know of two issues here, pushing d upward. One is that with a finite number of training examples, the learned function will not exactly match the true function. The other is that you might accidentally create a monster (‘misaligned mesaoptimizer’) who understands its situation and pretends to have the utility function you are aiming for so that it can be freed and go out and manifest its own utility function, which could be just about anything. If this problem is real, then the values of an AI system might be arbitrarily different from the training values, rather than ‘nearby’ in some sense, so d is probably unacceptably large. But if you avoid creating such mesaoptimizers, then it seems plausible to me that d is very small.
If humans also substantially learn their values via observing examples, then the variation in human values is arising from a similar process, so might be expected to be of a similar scale. If we care to make the ML training process more accurate than the human learning one, it seems likely that we could. For instance, d gets smaller with more data.
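As a toy illustration of d shrinking with more data, the sketch below ‘learns’ a function by copying the answer from the nearest training example and measures the worst-case gap from the target. Everything in it is an assumption made for illustration: the sine target, the nearest-example learner, and the `fit_gap` name. It says nothing about how values are actually learned.

```python
import math


def true_value(x):
    # Stand-in for the function being learned; purely illustrative.
    return math.sin(x)


def fit_gap(n_examples, n_checks=2001):
    """Largest gap between the target and a nearest-example lookup on [0, pi]."""
    xs = [math.pi * i / (n_examples - 1) for i in range(n_examples)]
    ys = [true_value(x) for x in xs]
    worst = 0.0
    for j in range(n_checks):
        x = math.pi * j / (n_checks - 1)
        # "Learned" function: copy the answer from the nearest training example.
        nearest = min(range(n_examples), key=lambda i: abs(xs[i] - x))
        worst = max(worst, abs(true_value(x) - ys[nearest]))
    return worst


for n in (5, 50, 500):
    print(n, round(fit_gap(n), 4))
```

In this toy the gap falls roughly in proportion to the spacing between examples. Real training need not behave so tidily, and the sketch does not address the mesaoptimizer case, where d is not ‘nearby’ at all.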
Another line of evidence is that for things that I have seen AI learn so far, the distance from the real thing is intuitively small. If AI learns my values as well as it learns what faces look like, it seems plausible that it carries them out better than I do.
As minor additional evidence here, I don’t know how to describe any slight differences in utility functions that are catastrophic. Talking concretely, what does a utility function look like that is so close to a human utility function that an AI system has it after a bunch of training, but which is an absolute disaster? Are we talking about the scenario where the AI values a slightly different concept of justice, or values satisfaction a smidgen more relative to joy than it should? And then that’s a moral disaster because it is wrought across the cosmos? Or is it that it looks at all of our inaction and thinks we want stuff to be maintained very similar to how it is now, so crushes any efforts to improve things?
What it might look like if this gap matters: when we try to train AI systems to care about what specific humans care about, they usually pretty much do, as far as we can tell. We basically get what we trained for. For instance, it is hard to distinguish them from the human in question. (It is still important to actually do this training, rather than making AI systems not trained to have human values.)
Maybe value isn’t fragile
Eliezer argued that value is fragile, via examples of ‘just one thing’ that you can leave out of a utility function, and end up with something very far away from what humans want. For instance, if you leave out ‘boredom’ then he thinks the preferred future might look like repeating the same otherwise perfect moment again and again. (His argument is perhaps longer—that post says there is a lot of important background, though the bits mentioned don’t sound relevant to my disagreement.) This sounds to me like ‘value is not resilient to having components of it moved to zero’, which is a weird usage of ‘fragile’, and in particular, doesn’t seem to imply much about smaller perturbations. And smaller perturbations seem like the relevant thing with AI systems trained on a bunch of data to mimic something.
You could very analogously say ‘human faces are fragile’ because if you just leave out the nose it suddenly doesn’t look like a typical human face at all. Sure, but is that the kind of error you get when you try to train ML systems to mimic human faces? Almost none of the faces on thispersondoesnotexist.com are blatantly morphologically unusual in any way, let alone noseless. Admittedly one time I saw someone whose face was neon green goo, but I’m guessing you can get the rate of that down pretty low if you care about it.
Eight examples, no cherry-picking: [eight images of generated faces, not reproduced here]
Skipping the nose is the kind of mistake you make if you are a child drawing a face from memory. Skipping ‘boredom’ is the kind of mistake you make if you are a person trying to write down human values from memory. My guess is that this seemed closer to the plan in 2009 when that post was written, and that people cached the takeaway and haven’t updated it for deep learning which can learn what faces look like better than you can.
What it might look like if this gap matters: there is a large region ‘around’ my values in value space that is also pretty good according to me. AI easily lands within that space, and eventually creates some world that is about as good as the best possible utopia, according to me. There aren’t a lot of really crazy and terrible value systems adjacent to my values.
Short-term goals
Utility maximization really only incentivises drastically altering the universe if one’s utility function places a high enough value on very temporally distant outcomes relative to near ones. That is, long term goals are needed for danger. A person who cares most about winning the timed chess game in front of them should not spend time accruing resources to invest in better chess-playing.
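The role of time horizon can be sketched with standard exponential discounting. The payoff numbers and the two plans below are hypothetical, picked only to show the preference flipping as the discount factor changes.

```python
def discounted_value(rewards, gamma):
    """Sum of rewards weighted by gamma**t, where t is the time step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))


# Hypothetical payoffs over 50 steps: purely illustrative numbers.
act_now = [1.0] * 50                    # modest reward every step
invest_first = [0.0] * 10 + [5.0] * 40  # nothing for 10 steps, then a larger reward


def preferred_plan(gamma):
    """Name of the plan with the higher discounted value at this gamma."""
    now = discounted_value(act_now, gamma)
    later = discounted_value(invest_first, gamma)
    return "act_now" if now > later else "invest_first"


for gamma in (0.5, 0.99):
    print(gamma, "prefers", preferred_plan(gamma))
```

With heavy discounting (gamma of 0.5) the agent prefers to act immediately, and with light discounting (gamma of 0.99) it prefers to accumulate resources first. The same utility maximizer behaves very differently depending on how much it cares about later steps.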
AI systems could have long-term goals via people intentionally training them to do so, or via long-term goals naturally arising from systems not trained so.
Humans seem to discount the future a lot in their usual decision-making (they have goals years in advance but rarely a hundred years) so the economic incentive to train AI to have very long term goals might be limited.
It’s not clear that training for relatively short term goals naturally produces creatures with very long term goals, though it might.
Thus if AI systems fail to have value systems relatively similar to human values, it is not clear that many will have the long time horizons needed to motivate taking over the universe.
What it might look like if this gap matters: the world is full of agents who care about relatively near-term issues, and are helpful to that end, and have no incentive to make long-term large scale schemes. Reminiscent of the current world, but with cleverer short-termism.
C. Contra “superhuman AI would be sufficiently superior to humans to overpower humanity”
Human success isn’t from individual intelligence
The argument claims (or assumes) that surpassing ‘human-level’ intelligence (i.e. the mental capacities of an individual human) is the relevant bar for matching the power-gaining capacity of humans, such that passing this bar in individual intellect means outcompeting humans in general in terms of power (argument III.2), if not being able to immediately destroy them all outright (argument III.1.). In a similar vein, introductions to AI risk often start by saying that humanity has triumphed over the other species because it is more intelligent, as a lead in to saying that if we make something more intelligent still, it will inexorably triumph over humanity.
This hypothesis about the provenance of human triumph seems wrong. Intellect surely helps, but humans look to be powerful largely because they share their meager intellectual discoveries with one another and consequently save them up over time[4]. You can see this starkly by comparing the material situation of Alice, a genius living in the stone age, and Bob, an average person living in 21st Century America. Alice might struggle all day to get a pot of water, while Bob might be able to summon all manner of delicious drinks from across the oceans, along with furniture, electronics, information, etc. Much of Bob’s power probably did flow from the application of intelligence, but not Bob’s individual intelligence. Alice’s intelligence, and that of those who came between them.
Bob’s greater power isn’t directly just from the knowledge and artifacts Bob inherits from other humans. He also seems to be helped for instance by much better coordination: both from a larger number of people coordinating together, and from better infrastructure for that coordination (e.g. for Alice the height of coordination might be an occasional big multi-tribe meeting with trade, and for Bob it includes global instant messaging and banking systems and the Internet). One might attribute all of this ultimately to innovation, and thus to intelligence and communication, or not. I think it’s not important to sort out here, as long as it’s clear that individual intelligence isn’t the source of power.
It could still be that with a given bounty of shared knowledge (e.g. within a given society), intelligence grants huge advantages. But even that doesn’t look true here: 21st Century geniuses live basically like 21st Century people of average intelligence, give or take.
Why does this matter? Well for one thing, if you make AI which is merely as smart as a human, you shouldn’t then expect it to do that much better than a genius living in the stone age. That’s what human-level intelligence gets you: nearly nothing. A piece of rope after millions of lifetimes. Humans without their culture are much like other animals.
To wield the control-over-the-world of a genius living in the 21st Century, the human-level AI would seem to need something like the other benefits that the 21st century genius gets from their situation in connection with a society.
One such thing is access to humanity’s shared stock of hard-won information. AI systems plausibly do have this, if they can get most of what is relevant by reading the internet. This isn’t obvious: people also inherit information from society through copying habits and customs, learning directly from other people, and receiving artifacts with implicit information (for instance, a factory allows whoever owns the factory to make use of intellectual work that was done by the people who built the factory, but that information may not be available explicitly even to the owner of the factory, let alone to readers on the internet). These sources of information seem likely to also be available to AI systems though, at least if they are afforded the same options as humans.
My best guess is that AI systems easily do better than humans on extracting information from humanity’s stockpile, and on coordinating, and so on this account are probably in an even better position to compete with humans than one might think on the individual intelligence model, but that is a guess. In that case perhaps this misunderstanding makes little difference to the outcomes of the argument. However it seems at least a bit more complicated.
Suppose that AI systems can have access to all information humans can have access to. The power the 21st century person gains from their society is modulated by their role in society, and relationships, and rights, and the affordances society allows them as a result. Their power will vary enormously depending on whether they are employed, or listened to, or paid, or a citizen, or the president. If AI systems’ power stems substantially from interacting with society, then their power will also depend on affordances granted, and humans may choose not to grant them many affordances (see section ‘Intelligence may not be an overwhelming advantage’ for more discussion).
However suppose that your new genius AI system is also treated with all privilege. The next way that this alternate model matters is that if most of what is good in a person’s life is determined by the society they are part of, and their own labor is just buying them a tiny piece of that inheritance, then if they are for instance twice as smart as any other human, they don’t get to use technology that is twice as good. They just get a larger piece of that same shared technological bounty purchasable by anyone. Because each individual person is adding essentially nothing in terms of technology, so twice that is still basically nothing.
In contrast, I think people are often imagining that a single entity somewhat smarter than a human will be able to quickly use technologies that are somewhat better than current human technologies. This seems to be mistaking the actions of a human and the actions of a human society. If a hundred thousand people sometimes get together for a few years and make fantastic new weapons, you should not expect an entity somewhat smarter than a person to make even better weapons. That’s off by a factor of about a hundred thousand.
There might be places you can get far ahead of humanity by being better than a single human—it depends how much accomplishments depend on the few most capable humans in the field, and how few people are working on the problem. But for instance the Manhattan Project took a hundred thousand people several years, and von Neumann (a mythically smart scientist) joining the project did not reduce it to an afternoon. Plausibly to me, some specific people being on the project caused it to not take twice as many person-years, though the plausible candidates here seem to be more in the business of running things than doing science directly (though that also presumably involves intelligence). But even if you are an ambitious somewhat superhuman intelligence, the influence available to you seems to plausibly be limited to making a large dent in the effort required for some particular research endeavor, not single-handedly outmoding humans across many research endeavors.
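To make the scale of that mismatch concrete, here is a rough back-of-the-envelope sketch in Python. The specific figures are my illustrative assumptions, not historical data: I read ‘several years’ as three, and I suppose the lone entity is worth two ordinary workers. The helper name `person_years` is likewise made up for this example.

```python
# Toy comparison of total effort: a large human project vs. one
# "somewhat superhuman" entity. All numbers are illustrative assumptions.

def person_years(workers: int, years: float, productivity: float = 1.0) -> float:
    """Total effort, in units of one average person working for one year."""
    return workers * years * productivity

# "A hundred thousand people" for "several years" (3 years is assumed here).
project_effort = person_years(workers=100_000, years=3)

# One entity over the same period, assumed to be worth two ordinary workers.
solo_effort = person_years(workers=1, years=3, productivity=2)

# How many times more effort the project represents than the lone entity.
shortfall = project_effort / solo_effort

print(f"Project effort: {project_effort:,.0f} person-years")
print(f"Solo effort:    {solo_effort:,.0f} person-years")
print(f"Shortfall:      {shortfall:,.0f}x")
```

Under these assumptions the lone entity falls short by a factor of tens of thousands, the same order of magnitude as the ‘about a hundred thousand’ above. Changing the assumed productivity multiplier from 2 to 10 still leaves a gap of ten thousand, which is the point: modest individual advantages do not close a gap of that size.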
This is all reason to doubt that a small number of superhuman intelligences will rapidly take over or destroy the world (as in III.1.). This doesn’t preclude a set of AI systems that are together more capable than a large number of people from making great progress. However some related issues seem to make that less likely.
Another implication of this model is that if most human power comes from buying access to society’s shared power, i.e. interacting with the economy, you should expect intellectual labor by AI systems to usually be sold, rather than for instance put toward a private stock of knowledge. This means the intellectual outputs are mostly going to society, and the main source of potential power to an AI system is the wages received (which may allow it to gain power in the long run). However it seems quite plausible that AI systems at this stage will generally not receive wages, since they presumably do not need them to be motivated to do the work they were trained for. It also seems plausible that they would be owned and run by humans. This would seem to not involve any transfer of power to that AI system, except insofar as its intellectual outputs benefit it (e.g. if it is writing advertising material, maybe it doesn’t get paid for that, but if it can write material that slightly furthers its own goals in the world while also fulfilling the advertising requirements, then it sneaked in some influence.)
If there is AI which is moderately more competent than humans, but not sufficiently more competent to take over the world, then it is likely to contribute to this stock of knowledge and affordances shared with humans. There is no reason to expect it to build a separate competing stock, any more than there is reason for a current human household to try to build a separate competing stock rather than sell their labor to others in the economy.
In summary:
Overall these are reasons to expect AI systems with around human-level cognitive performance to not destroy the world immediately, and to not amass power as easily as one might imagine.
What it might look like if this gap matters: If AI systems are somewhat superhuman, then they do impressive cognitive work, and each contributes to technology more than the best human geniuses, but not more than the whole of society, and not enough to materially improve their own affordances. They don’t gain power rapidly because they are disadvantaged in other ways, e.g. by lack of information, lack of rights, lack of access to positions of power. Their work is sold and used by many actors, and the proceeds go to their human owners. AI systems do not generally end up with access to masses of technology that others do not have access to, and nor do they have private fortunes. In the long run, as they become more powerful, they might take power if other aspects of the situation don’t change.
AI agents may not be radically superior to combinations of humans and non-agentic machines
‘Human level capability’ is a moving target. For comparing the competence of advanced AI systems to humans, the relevant comparison is with humans who have state-of-the-art AI and other tools. For instance, the human capacity to make art quickly has recently been improved by a variety of AI art systems. If there were now an agentic AI system that made art, it would make art much faster than a human of 2015, but perhaps hardly faster than a human of late 2022. If humans continually have access to tool versions of AI capabilities, it is not clear that agentic AI systems must ever have an overwhelmingly large capability advantage for important tasks (though they might).
(This is not an argument that humans might be better than AI systems, but rather: if the gap in capability is smaller, then the pressure for AI systems to accrue power is less and thus loss of human control is slower and easier to mitigate entirely through other forces, such as subsidizing human involvement or disadvantaging AI systems in the economy.)
Some advantages of being an agentic AI system vs. a human with a tool AI system seem to be:
1 and 2 may or may not matter much. 3 matters more for brief, fast, unimportant tasks. For instance, consider again people who can do mental calculations better than others. My guess is that this advantages them at using Fermi estimates in their lives and buying cheaper groceries, but does not make them materially better at making large financial choices well. For a one-off large financial choice, the effort of getting out a calculator is worth it and the delay is very short compared to the length of the activity. The same seems likely true of humans with tools vs. agentic AI with the same capacities integrated into their minds. Conceivably the gap between humans with tools and goal-directed AI is small for large, important tasks.
What it might look like if this gap matters: agentic AI systems have substantial advantages over humans with tools at some tasks like rapid interaction with humans, and responding to rapidly evolving strategic situations. One-off large important tasks such as advanced science are mostly done by tool AI.
Trust
If goal-directed AI systems are only mildly more competent than some combination of tool systems and humans (as suggested by considerations in the last two sections), we still might expect AI systems to out-compete humans, just more slowly. However AI systems have one serious disadvantage as employees of humans: they are intrinsically untrustworthy, while we don’t understand them well enough to be clear on what their values are or how they will behave in any given case. Even if they did perform as well as humans at some task, if humans can’t be certain of that, then there is reason to disprefer using them. This can be thought of as two problems: firstly, slightly misaligned systems are less valuable because they genuinely do the thing you want less well, and secondly, even if they were not misaligned, if humans can’t know that (because we have no good way to verify the alignment of AI systems) then it is costly in expectation to use them. (This is only a further force acting against the supremacy of AI systems—they might still be powerful enough that using them is enough of an advantage that it is worth taking the hit on trustworthiness.)
What it might look like if this gap matters: in places where goal-directed AI systems are not typically hugely better than some combination of less goal-directed systems and humans, the job is often given to the latter if trustworthiness matters.
Headroom
For AI to vastly surpass human performance at a task, there needs to be ample room for improvement above human level. For some tasks, there is not—tic-tac-toe is a classic example. It is not clear how close humans (or technologically aided humans) are to the limits of competence in the particular domains that will matter. It is to my knowledge an open question how much ‘headroom’ there is. My guess is a lot, but it isn’t obvious.
How much headroom there is varies by task. Categories of task for which there appears to be little headroom:
Categories of task where a lot of headroom seems likely:
What it might look like if this gap matters: many challenges in today’s world remain challenging for AI. Human behavior is not readily predictable or manipulable very far beyond what we have explored, only slightly more complicated schemes are feasible before the world’s uncertainties overwhelm planning; much better ads are soon met by much better immune responses; much better commercial decision-making ekes out some additional value across the board but most products were already fulfilling a lot of their potential; incredible virtual prosecutors meet incredible virtual defense attorneys and everything is as it was; there are a few rounds of attack-and-defense in various corporate strategies before a new equilibrium with broad recognition of those possibilities; conflicts and ‘social issues’ remain mostly intractable. There is a brief golden age of science before the newly low-hanging fruit are again plucked and it is only lightning fast in areas where thinking was the main bottleneck, e.g. not in medicine.
Intelligence may not be an overwhelming advantage
Intelligence is helpful for accruing power and resources, all things equal, but many other things are helpful too. For instance money, social standing, allies, evident trustworthiness, not being discriminated against (this was slightly discussed in section ‘Human success isn’t from individual intelligence’). AI systems are not guaranteed to have those in abundance. The argument assumes that any difference in intelligence in particular will eventually win out over any differences in other initial resources. I don’t know of a reason to think that.
Empirical evidence does not seem to support the idea that cognitive ability is a large factor in success. Situations where one entity is much smarter or more broadly mentally competent than other entities regularly occur without the smarter one taking control over the other:
And theoretically I don’t know why one would expect greater intelligence to win out over other advantages over time. There are actually two questionable theories here: 1) Charlotte having more overall control than David at time 0 means that Charlotte will tend to have an even greater share of control at time 1. And, 2) Charlotte having more intelligence than David at time 0 means that Charlotte will have a greater share of control at time 1 even if David has more overall control (i.e. more of other resources) at time 1.
What it might look like if this gap matters: there are many AI systems around, and they strive for various things. They don’t hold property, or vote, or get a weight in almost anyone’s decisions, or get paid, and are generally treated with suspicion. These things on net keep them from gaining very much power. They are very persuasive speakers however and we can’t stop them from communicating, so there is a constant risk of people willingly handing them power, in response to their moving claims that they are an oppressed minority who suffer. The main thing stopping them from winning is that their position as psychopaths bent on taking power for incredibly pointless ends is widely understood.
Unclear that many goals realistically incentivise taking over the universe
I have some goals. For instance, I want some good romance. My guess is that trying to take over the universe isn’t the best way to achieve this goal. The same goes for a lot of my goals, it seems to me. Possibly I’m in error, but I spend a lot of time pursuing goals, and very little of it trying to take over the universe. Whether a particular goal is best forwarded by trying to take over the universe as a substep seems like a quantitative empirical question, to which the answer is virtually always ‘not remotely’. Don’t get me wrong: all of these goals involve some interest in taking over the universe. All things equal, if I could take over the universe for free, I do think it would help in my romantic pursuits. But taking over the universe is not free. It’s actually super duper duper expensive and hard. So for most goals arising, it doesn’t bear considering. The idea of taking over the universe as a substep is entirely laughable for almost any human goal.
So why do we think that AI goals are different? I think the thought is that it’s radically easier for AI systems to take over the world, because all they have to do is to annihilate humanity, and they are way better positioned to do that than I am, and also better positioned to survive the death of human civilization than I am. I agree that it is likely easier, but how much easier? So much easier to take it from ‘laughably unhelpful’ to ‘obviously always the best move’? This is another quantitative empirical question.
What it might look like if this gap matters: Superintelligent AI systems pursue their goals. Often they achieve them fairly well. This is somewhat contrary to ideal human thriving, but not lethal. For instance, some AI systems are trying to maximize Amazon’s market share, within broad legality. Everyone buys truly incredible amounts of stuff from Amazon, and people often wonder if it is too much stuff. At no point does attempting to murder all humans seem like the best strategy for this.
Quantity of new cognitive labor is an empirical question, not addressed
Whether some set of AI systems can take over the world with their new intelligence probably depends on how much total cognitive labor they represent. For instance, if they are in total slightly more capable than von Neumann, they probably can’t take over the world. If they are together as capable (in some sense) as a million 21st Century human civilizations, then they probably can (at least in the 21st Century).
It also matters how much of that is goal-directed at all, and highly intelligent, and how much of that is directed at achieving the AI systems’ own goals rather than those we intended them for, and how much of that is directed at taking over the world.
If we continued to build hardware, presumably at some point AI systems would account for most of the cognitive labor in the world. But if there is first an extended period of more minimal advanced AI presence, that would probably prevent an immediate death outcome, and improve humanity’s prospects for controlling a slow-moving AI power grab.
What it might look like if this gap matters: when advanced AI is developed, there is a lot of new cognitive labor in the world, but it is a minuscule fraction of all of the cognitive labor in the world. A large part of it is not goal-directed at all, and of that, most of the new AI thought is applied to tasks it was intended for. Thus what part of it is spent on scheming to grab power for AI systems is too small to grab much power quickly. The amount of AI cognitive labor grows fast over time, and in several decades it is most of the cognitive labor, but humanity has had extensive experience dealing with its power grabbing.
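As a rough illustration of how a ‘minuscule fraction’ could become ‘most of the cognitive labor’ over several decades, here is a toy calculation in Python. Everything in it is an assumption for the sake of the sketch: human cognitive labor is held fixed, AI cognitive labor doubles at a constant rate, and the starting share (0.1%) and doubling time (three years) are invented rather than estimated. The function name `years_until_majority` is hypothetical.

```python
import math

def years_until_majority(initial_share: float, doubling_years: float) -> float:
    """Years until AI labor exceeds half of all cognitive labor.

    Toy model (assumed, not from any data): human labor stays fixed and
    AI labor doubles every `doubling_years`.
    """
    # AI labor relative to human labor at the start.
    ratio = initial_share / (1 - initial_share)
    # AI share passes 1/2 exactly when AI labor equals human labor,
    # i.e. after log2(1 / ratio) doublings. Already-majority cases return 0.
    return max(0.0, doubling_years * math.log2(1 / ratio))

# Assumed starting point: AI is 0.1% of all cognitive labor, doubling every 3 years.
years = years_until_majority(initial_share=0.001, doubling_years=3)
print(f"AI becomes the majority of cognitive labor after about {years:.0f} years")
```

With these made-up inputs the crossover takes about thirty years, which is the kind of extended period the scenario above imagines. A much shorter doubling time would compress that to a few years, so the scenario depends heavily on the growth rate, which is exactly the empirical question that is not addressed.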
Speed of intelligence growth is ambiguous
The idea that a superhuman AI would be able to rapidly destroy the world seems prima facie unlikely, since no other entity has ever done that. Two common broad arguments for it:
These both seem questionable.
A large number of other arguments have been posed for expecting very fast growth in intelligence at around human level. I previously made a list of them with counterarguments, though none seemed very compelling. Overall, I don’t know of strong reason to expect very fast growth in AI capabilities at around human-level AI performance, though I hear such arguments might exist.
What it would look like if this gap mattered: AI systems would at some point perform at around human level at various tasks, and would contribute to AI research, along with everything else. This would contribute to progress to an extent familiar from other technological progress feedback, and would not e.g. lead to a superintelligent AI system in minutes.
Key concepts are vague
Concepts such as ‘control’, ‘power’, and ‘alignment with human values’ all seem vague. ‘Control’ is not zero sum (as seemingly assumed) and is somewhat hard to pin down, I claim. What an ‘aligned’ entity is exactly seems to be contentious in the AI safety community, but I don’t know the details. My guess is that upon further probing, these conceptual issues are resolvable in a way that doesn’t endanger the argument, but I don’t know. I’m not going to go into this here.
What it might look like if this gap matters: upon thinking more, we realize that our concerns were confused. Things go fine with AI in ways that seem obvious in retrospect. This might look like it did for people concerned about the ‘population bomb’ or as it did for me in some of my youthful concerns about sustainability: there was a compelling abstract argument for a problem, and the reality didn’t fit the abstractions well enough to play out as predicted.
D. Contra the whole argument
The argument overall proves too much about corporations
Here is the argument again, but modified to be about corporations. A couple of pieces don’t carry over, but they don’t seem integral.
I. Any given corporation is likely to be ‘goal-directed’
Reasons to expect this:
Goal-directed entities may tend to arise from machine learning training processes not intending to create them (at least via the methods that are likely to be used).
II. If goal-directed superhuman corporations are built, their desired outcomes will probably be about as bad as an empty universe by human lights
Reasons to expect this:
We have theoretical reasons to expect that AI systems produced through machine learning training will generally end up with goals other than those that they were trained according to. Randomly aberrant goals resulting are probably extinction-level bad, for reasons described in II.1 above.
III. If most goal-directed corporations have bad goals, the future will very likely be bad
That is, a set of ill-motivated goal-directed corporations, of a scale likely to occur, would be capable of taking control of the future from humans. This is supported by at least one of the following being true:
This argument does point at real issues with corporations, but we do not generally consider such issues existentially deadly.
One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that they apply to one and not the other.
What it might look like if this counterargument matters: something like the current world. There are large and powerful systems doing things vastly beyond the ability of individual humans, and acting in a definitively goal-directed way. We have a vague understanding of their goals, and do not assume that they are coherent. Their goals are clearly not aligned with human goals, but they have enough overlap that many people are broadly in favor of their existence. They seek power. This all causes some problems, but problems within the power of humans and other organized human groups to keep under control, for some definition of ‘under control’.
Conclusion
I think there are quite a few gaps in the argument, as I understand it. My current guess (prior to reviewing other arguments and integrating things carefully) is that enough uncertainties might resolve in the dangerous directions that existential risk from AI is a reasonable concern. I don’t at present though see how one would come to think it was overwhelmingly likely.