Copying over a Twitter reply from Quintin Pope (which I haven't replied to, and which was responding to the wording of the Twitter draft of this post):
...I think your intuition about how SGD works is wildly wrong. E.g., SGD doesn't do anything like "randomly sample from the set of all low loss NN parameter configurations". https://arxiv.org/abs/2110.00683
Also, your point about human plans not looking like randomly sampled plans is a point against your intuition that multi-level search processes will tend to generate such plans.
Finally, I don't think it's even possible to have a general intelligence which operates on principles that are fundamentally different from the human brain.
Partially, this is a consequence of singular learning theory forcing inductive biases to significantly reflect data distribution properties, as opposed to inductive biases deriving entirely from architecture / optimizer / etc. https://www.lesswrong.com/posts/M3fDqScej7JDh4s7a/quintin-pope-s-shortform?commentId=aDqhtgbjDiC6tWQjp
It also seems like the current ML paradigm converged to very similar principles as the brain, where most cognition comes from learning to predictively model the environment (tex
Quintin, in case you are reading this, I just wanna say that the link you give to justify
I think your intuition about how SGD works is wildly wrong. E.g., SGD doesn't do anything like "randomly sample from the set of all low loss NN parameter configurations". https://arxiv.org/abs/2110.00683
really doesn't do nearly enough to justify your bold "wildly wrong" claim. First of all, it's common for papers to overclaim, and this seems like the sort of paper that could turn out to be basically just flat wrong. (I lack the expertise to decide for myself; it would probably take me many hours of reading the paper and talking to people.) Secondly, even if I assume the paper is correct, it just shows that the simplicity bias of SGD on NNs is different from what some people think -- it is weighted towards broad basins / connected regions. It's still randomly sampling from the set of all low loss NN parameter configurations, but with a different bias/prior. (Unless you can argue that this specific different bias leads to the consequences/conclusions you like, and in particular leads to doom being much less likely. Maybe you can; I'd like to see that.)
SGD has a strong inherent simplicity bias, even without weight regularization, and this is fairly well known in the DL literature (I could probably find hundreds of examples if I had the time - I do not). By SGD I specifically mean SGD variants that don't use a 2nd-order approximation (so excluding optimizers like Adam). There are many papers which find that approximate 2nd-order, variance-adjusted optimizers like Adam have various generalization/overfitting issues compared to SGD; this comes up over and over, such that it's fairly common to use some additional regularization with Adam.
It's also pretty intuitively obvious why SGD has a strong simplicity prior if you just think through some simple examples: SGD doesn't simply move in whatever direction minimizes loss, it moves in the parsimonious direction which minimizes loss per unit of weight distance (moved away from the init). 2nd-order optimizers like Adam can move more directly in the direction of lower loss.
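To make the contrast concrete, here is a minimal NumPy sketch of the two update rules being compared (purely illustrative; the toy gradient values and learning rates are made up, not taken from any paper in the thread). Plain SGD steps in proportion to the raw gradient, so small-gradient coordinates barely move and the weights stay comparatively close to initialization, while Adam rescales each coordinate by an estimate of its gradient scale and so takes near-uniform-size steps toward lower loss:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain SGD: the step is proportional to the raw gradient, so coordinates
    # with tiny gradients move very little -- the trajectory tends to stay
    # close to the initialization (the "per unit weight distance" intuition).
    return w - lr * grad

def adam_step(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: divide by a running estimate of each coordinate's gradient scale,
    # so even tiny-gradient coordinates take near-unit-size steps toward lower
    # loss -- a much weaker bias toward small total weight displacement.
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.zeros(4)
grad = np.array([1.0, 0.1, 0.01, 0.001])  # toy gradient with mixed scales
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}

print(sgd_step(w, grad))          # step sizes span three orders of magnitude
print(adam_step(w, grad, state))  # step sizes are nearly uniform across coordinates
```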
My prior is that DL has a great amount of weird domain knowledge which is mysterious to those who haven't spent years studying it, and years studying DL correlates with strong disagreement with the Sequences/MIRI positions on many fundamentals. I trace all this back to EY over-updating on ev psych and not reading enough neuroscience and early DL.
So anyway, a sentence like "randomly sample from the set of all low loss NN parameter configurations" is not one I would use or expect a DL-insider to use and sounds more like something a MIRI/LW person would say - in part yes because I don't generally expect MIRI/LW folks to be especially aware of the intrinsic SGD simplicity prior. The more correct statement is "randomly sample from the set of all simple low loss configs" or similar.
But it's also not quite clear to me how relevant that subpoint is, just sharing my impression.
I want to revisit what Rob actually wrote:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability.
(emphasis mine)
That sounds a whole lot like it's invoking a simplicity prior to me!
Note that I didn't actually reply to that quote. Sure, that's an explicit simplicity prior. However, there's a large difference under the hood between using an explicit simplicity prior on plan length vs. an implicit simplicity prior on the world and action models which generate plans. The latter is what's more relevant for intrinsic similarity to human thought processes (or not).
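To make the first half of that distinction concrete, here is a tiny illustrative sketch of the "explicit simplicity prior on plan length" reading of Rob's quote (everything here is hypothetical: the two-symbol plan language and the plan_succeeds oracle are stand-ins, not anyone's actual proposal). All the structure lives in the success predicate we condition on; the sampler itself knows nothing about world models or how humans generate plans:

```python
import itertools, random

ALPHABET = "ab"  # stand-in for the symbols of some formal plan language

def plan_succeeds(plan: str) -> bool:
    # Hypothetical oracle for "executing this plan achieves the ambitious goal".
    # In the thought experiment this is the only thing we get to condition on.
    return plan.endswith("ab")  # arbitrary toy criterion

def length_weighted_sample(max_len: int = 12) -> str:
    # Enumerate plans up to max_len and sample a successful plan with
    # probability proportional to 2**(-length): an explicit simplicity prior
    # over plan strings, with no implicit prior over world-models at all.
    plans, weights = [], []
    for n in range(1, max_len + 1):
        for symbols in itertools.product(ALPHABET, repeat=n):
            plan = "".join(symbols)
            if plan_succeeds(plan):
                plans.append(plan)
                weights.append(2.0 ** (-n))
    return random.choices(plans, weights=weights, k=1)[0]

print(length_weighted_sample())
```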
There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions[1].
However, I think "randomly sample from the space of solutions with low combined complexity&calculation cost" doesn't actually help us that much over a pure "randomly sample" when it comes to alignment.
It could mean that the relation between your network's learned goals and the loss function is more straightforward than what you get with evolution => human hardcoded brain stem => human goals, since the latter likely has a far weaker simplicity bias in the first step than the network training does. But the second step, a human baby training on their brain stem loss signal, seems to remain a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don't consider getting excellent visual cortex prediction scores a central terminal goal of mine.
Though I remain unsure of what to make of the specific one Quintin cites, which advances some more specific claims inside this broad category, and is based on results from a toy model with weird,
I feel like there's a significant distance between what's being said formally versus the conclusions being drawn. From Rob:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language)
From you:
the simplicity bias of SGD on NNs is different than some people think -- it is weighted towards broad basins / connected regions. It's still randomly sampling from the set of all low loss NN parameter configurations, but with a different bias/prior.
The issue is that literally any plan generation / NN training process can be described in either manner, regardless of the actual prior involved. In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.
It's not clear to me what specific priors Rob has in mind for the "random plan" sampling process, unless by "extant formal language" he literally means "formal language that currently exists right now", in which case:
That's not at all clear to me. Inductive biases clearly differ between humans, yet we are not all terminally misaligned with each other. E.g., split brain patients are not all weird value aliens, despite a significant difference in architecture. Also, training on human-originated data causes networks to learn human-like inductive biases (at least somewhat).
Thanks for weighing in Quintin! I think I basically agree with dxu here. I think this discussion shows that Rob should probably rephrase his argument as something like "When humans make plans, the distribution they sample from has all sorts of unique and interesting properties that arise from various features of human biology and culture and the interaction between them. Big artificial neural nets will lack these features, so the distribution they draw from will be significantly different -- much bigger than the difference between any two humans, for example. This is reason to expect doom, because of instrumental convergence..."
I take your point that the differences between humans seem... not so large... though actually I guess a lot of people would argue the opposite and say that many humans are indeed terminally misaligned with many other humans.
I also take the point about human-originated data hopefully instilling human-like inductive biases.
But IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine, rather than the person who is running around screaming expecting doom. The AIs we are building are going to be more ...
There are differences between ANNs and BNNs but they don't matter that much - LLMs converge to learn the same internal representations as linguistic cortex anyway.
When humans make plans, the distribution they sample from has all sorts of unique and interesting properties that arise from various features of human biology and culture and the interaction between them. Big artificial neural nets will lack these features, so the distribution they draw from will be significantly different
LLMs and human brains learn from basically the same data with similar training objectives, powered by universal approximations of Bayesian inference, and thus learn very similar internal functions/models.
Moravec was absolutely correct to use the term 'mind children' and all that implies. I outlined the case for why the human brain and DL systems are essentially the same, way back in 2015, and every year since we have accumulated further confirming evidence. The closely related scaling hypothesis - predicted in that post - was extensively tested by OpenAI and worked at least as well as I predicted/expected, taking us to the brink of AGI.
LLMs:
Full Solomonoff induction on a hypercomputer absolutely does not just "learn very similar internal functions/models", it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
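For reference, the standard object being gestured at here is the Solomonoff prior (this is the textbook definition, not a claim made in the comment): every program p for a universal prefix machine U is weighted by 2 to the minus its length, and prediction conditions on the observed prefix:

$$
M(x) \;=\; \sum_{p \,:\, U(p)\ \text{begins with}\ x} 2^{-|p|},
\qquad
M(a \mid x) \;=\; \frac{M(xa)}{M(x)}.
$$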
you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use.
This has been ongoing for a decade or more (dating at least back to sparse coding as an explanation for V1).
But I will agree the bigger LLMs are now in a somewhat different territory - more like human cortices trained for millennia, perhaps ten millennia for GPT4.
Are you seriously saying that e.g. if and when we one day encounter aliens on another planet, the kind of aliens smart enough to build an industrial civilization, they'll be more alien than LLMs?
Yes! Obviously more alien than our LLMs. LLMs are distillations of aggregated human linguistic cortices. Anytime you train one network on the output of others, you clone/distill the original(s)! The algorithmic content of NNs is determined by the training data, and the data here in question is human thought.
This was always the way it was going to be, this was all predicted long in advance by the systems/cybernetics futurists like Moravec - AI was/will be our mind children.
EY misled many people here with the bad "human mindspace is narrow" meme; I mostly agree with Quintin's recent takedown, but of course I also objected way back when.
I find Quintin's reply here somewhat unsatisfying, because I think it is too narrowly focused on current DL-paradigm methods and the artifacts they directly produce, without much consideration for how those artifacts might be composed and used in real systems. I attempted to describe my objections to this general kind of argument in a bit more detail here.
It's true that if humans were reliably very ambitious, consequentialist, and power-seeking, then this would be stronger evidence that superintelligent AI tends to be ambitious and power-seeking. So the absence of that evidence has to be evidence against "superintelligent AI tends to be ambitious and power-seeking", even if it's not a big weight in the scales.
Thanks for writing this up as a shorter summary Rob. Thanks also for engaging with people who disagree with you over the years.
Here's my main area of disagreement:
General intelligence is very powerful, and once we can build it at all, STEM-capable artificial general intelligence (AGI) is likely to vastly outperform human intelligence immediately (or very quickly).
I don't think this is likely to be true. Perhaps it is true of some cognitive architectures, but not of the connectionist architectures that are the only known examples of human-like intelligence and that are clearly the top AIs available today. In these cases, I expect human-level AI capabilities to grow to the point of vastly outperforming humans much more slowly than "immediately" or "very quickly". This is basically the AI foom argument.
And I think all of your other points are dependent on this one. Because if this is not true, then humanity will have time to iteratively deal with the problems that emerge, as we have in the past with all other technologies.
My reasoning for not expecting ultra-rapid takeoff speeds is that I don't view connectionist intelligence as having a sort of "sec...
Agreed. A common failure mode in these discussions is to treat intelligence as equivalent to technological progress, instead of as an input to technological progress.
Yes, in five years we will likely have AIs that will be able to tell us exactly where it would be optimal to allocate our scientific research budget. Notably, that does not mean that all current systemic obstacles to efficient allocation of scarce resources will vanish. There will still be the same perverse incentive structure for funding allocated to scientific progress as there is today, general intelligence or no.
Likewise, researchers will likely be able to make the actual protocols and procedures necessary to generate scientific knowledge as optimized as is possible with the use of AI. But a centrifuge is a centrifuge is a centrifuge. No amount of intelligence will make a centrifuge that takes a minimum of an hour to run take less than an hour to run.
Intelligence is not an unbounded input to frontiers of technological progress that are reasonably bounded by the constraints of physical systems.
There's a lot of stuff I agree with in your post, but one thing I disagree with is point 3. See Where do you get your capabilities from?, especially the bounded breakdown of the orthogonality thesis part at the end.
Not that I think this makes GPT models fully safe, but I think their unsafety will look a lot more like the unsafety of humans, plus some changes in the price of things. (Which can make a huge difference.)
This post evolved from a Twitter thread I wrote two weeks ago. Copying over a Twitter reply by Richard Ngo (n.b. Richard was replying to the version on Twitter, which differed in lots of ways):
Rob, I appreciate your efforts, but this is a terrible framing for trying to convey "the basics", and obscures way more than it clarifies.
I'm worried about agents which try to achieve goals. That's the core thing, and you're calling it a misconception?! That's blatantly false.
In my first Alignment Fundamentals class I too tried to convey all the nuances of my thinking about agency as "the basics". It failed badly. One lesson: communication is harder than you think. More importantly: "actually trying" means getting feedback from your target audience.
(I replied and we had a short back-and-forth on Twitter.)
I definitely agree with Richard that the post would probably benefit from more iteration with intended users, if new people are the audience you want to target. (In particular, I doubt that the section quoted from the Aryeh interview will clarify much for new people.)
That said, I definitely think that it's the right call to emphasize up-front that instrumental convergence is a property of problem-space rather than of agency. More generally: when there's a common misinterpretation, which very often ends up load-bearing, then it makes sense to address that upfront; that's not nuance, it's central. Nuance is addressing misinterpretations which are rare or not very load-bearing. Instrumental convergence being a property of problem-spaces rather than "agents" is pretty central to a MIRI-ish view, and underlies a lot of common confusions new-ish people have about such views.
This post seems to argue for fast/discontinuous takeoff without explicitly noting that people working in alignment often disagree. Further I think many of the arguments given here for fast takeoff seem sloppy or directly wrong on my own views.
It seems reasonable to just give your views without noting disagreement, but if the goal is for this to be a reference for the AI risk case, then I think you should probably note where people (who are still sold on AI risk) often disagree. (Edit: It looks like Rob explained his goals in a footnote.)
Another large piece of what I mean is that (STEM-level) general intelligence is a very high-impact sort of thing to automate because STEM-level AGI is likely to blow human intelligence out of the water immediately, or very soon after its invention. ... Empirically, humans aren't near a cognitive ceiling, and even narrow AI often suddenly blows past the human reasoning ability range on the task it's designed for. It would be weird if scientific reasoning were an exception.
The most general AI systems we currently have are large language models and we (broadly speaking) see their overall performance reasonably steadily improve year after year. Addi...
Sorry, just wanted to focus on one sentence close to the beginning:
We can barely multiply smallish multi-digit numbers together in our head, when in principle a reasoner could hold thousands of complex mathematical structures in its working memory simultaneously and perform complex operations on them.
Strangely enough, current LLMs have the exact same issue as humans: they guess the ballpark numerical answers reasonably well, but they are terrible at being precise. Be it drawing the right number of fingers, or writing a sentence with exactly 10 words, or mu...
Small suggestion: add LW headings so there's a linkable table of contents, especially if you're going to direct other people to this post.
Another large piece of what I mean is that (STEM-level) general intelligence is a very high-impact sort of thing to automate because STEM-level AGI is likely to blow human intelligence out of the water immediately, or very soon after its invention.
I don't understand your reasoning for this conclusion. Unless I'm misunderstanding something, almost all your points in support of this thesis appear to be arguments that the upper bound of intelligence is high. But the thesis was about the rate of improvement, not the upper bound.
There are many things in the rea...
...A common misconception is that STEM-level AGI is dangerous because of something murky about "agents" or about self-awareness. Instead, I'd say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.
Call such sequences "plans".
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-
Some direct (I think) evidence that alignment is harder than capabilities; OpenAI basically released GPT-2 immediately with basic warnings that it might produce biased, wrong, and offensive answers. It did, but they were relatively mild. GPT-2 mostly just did what it was prompted to do, if it could manage it, or failed obviously. GPT-3 had more caveats, OpenAI didn't release the model, and has poured significant effort into improving its iterations over the last ~2 years. GPT-4 wasn't released for months after pre-training, OpenAI won't even say how bi...
Quoting a recent conversation between Aryeh Englander and Eliezer Yudkowsky
Out of curiosity, is this conversation publicly posted anywhere? I didn't see a link.
A common misconception is that STEM-level AGI is dangerous because of something murky about “agents” or about self-awareness. Instead, I’d say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.[8]
Call such sequences “plans”.
...If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like “inv
6. We don't currently know how to do alignment, we don't seem to have a much better idea now than we did 10 years ago, and there are many large novel visible difficulties. (See AGI Ruin and the Capabilities Generalization, and the Sharp Left Turn.)
The first link should probably go to https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
"Invent fast WBE" is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are "convergent instrumental strategies"—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you're pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.
I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I stron...
Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don't see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is "Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that ...
I notice I am confused by two assumptions about STEM-capable AGI and its ascent:
Assumption 1: The difficulty of self-improvement of an intelligent system is either linear, or if not, it's less steep over time than its increase in capabilities. (Counter scenario: an AI system achieves human-level intelligence, then soon after reaches 200% of an average human's. Once it reaches, say, 248% of human intelligence it hits an unforeseen roadblock, because achieving 249% of human intelligence in any way is a Really Hard Problem, orders of magnitude beyond passing ...
The common belief that Artificial General Intelligence (AGI) would pose a significant threat to humanity is predicated on several assumptions that warrant further scrutiny. It is often suggested that an entity with vastly superior intelligence would inevitably perceive humans as a threat to its own survival and resort to destructive behavior. However, such a view overlooks several key factors that could contribute to the development of a more cooperative relationship between AGI and humanity.
One factor that could mitigate any perceived threat is that an AG...
But if they do, we face the problem that most ways of successfully imitating humans don't look like "build a human (that's somehow superhumanly good at imitating the Internet)". They look like "build a relatively complex and alien optimization process that is good at imitation tasks (and potentially at many other tasks)".
I think this point could use refining. Once we get our predictor AI, we don't say "do X", we say "how do you predict a human would do X" and then follow that plan. So you need to argue why plans that an AI predicts humans will use to do X tend to be dangerous. This is clearly a very different set than the set of plans for doing X.
I've been citing AGI Ruin: A List of Lethalities to explain why the situation with AI looks lethally dangerous to me. But that post is relatively long, and emphasizes specific open technical problems over "the basics".
Here are 10 things I'd focus on if I were giving "the basics" on why I'm so worried:[1]
1. General intelligence is very powerful, and once we can build it at all, STEM-capable artificial general intelligence (AGI) is likely to vastly outperform human intelligence immediately (or very quickly).
When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems".
It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too.
Human brains aren't perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems, that happened to generalize to millions of wildly novel tasks.
More concretely:
When I say "general intelligence is very powerful", a lot of what I mean is that science is very powerful, and that having all of the sciences at once is a lot more powerful than the sum of each science's impact.[4]
Another large piece of what I mean is that (STEM-level) general intelligence is a very high-impact sort of thing to automate because STEM-level AGI is likely to blow human intelligence out of the water immediately, or very soon after its invention.
80,000 Hours gives the (non-representative) example of how AlphaGo and its successors compared to humanity:
I expect general-purpose science AI to blow human science ability out of the water in a similar fashion.
Reasons for this include:
And on a meta level: the hypothesis that STEM AGI can quickly outperform humans has a disjunctive character. There are many different advantages that individually suffice for this, even if STEM AGI doesn't start off with any other advantages. (E.g., speed, math ability, scalability with hardware, skill at optimizing hardware...)
In contrast, the claim that STEM AGI will hit the narrow target of "par-human scientific ability", and stay at around that level for long enough to let humanity adapt and adjust, has a conjunctive character.[7]
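As a toy illustration of why the disjunctive shape matters (the numbers are made up for the example, not estimates anyone in this post endorses): suppose there are $n$ candidate advantages and each one alone would suffice to push STEM AGI well past human level with independent probability $p$. Then

$$
P(\text{at least one advantage suffices}) = 1 - (1 - p)^n,
\qquad
P(\text{all } n \text{ fail, ability stays par-human}) = (1 - p)^n.
$$

With, e.g., $n = 5$ and $p = 0.3$, the disjunctive side comes out to roughly $0.83$ and the conjunctive side to roughly $0.17$.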
2. A common misconception is that STEM-level AGI is dangerous because of something murky about "agents" or about self-awareness. Instead, I'd say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.[8]
Call such sequences "plans".
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like "invent fast-running whole-brain emulation", then hitting a button to execute the plan would kill all humans, with very high probability. This is because:
The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.
It isn't that the abstract space of plans was built by evil human-hating minds; it's that the instrumental convergence thesis holds for the plans themselves. In full generality, plans that succeed in goals like "build WBE" tend to be dangerous.
This isn't true of all plans that successfully push our world into a specific (sufficiently-hard-to-reach) physical state, but it's true of the vast majority of them.
This is counter-intuitive because most of the impressive "plans" we encounter today are generated by humans, and it’s tempting to view strong plans through a human lens. But humans have hugely overlapping values, thinking styles, and capabilities; AI is drawn from new distributions.
3. Current ML work is on track to produce things that are, in the ways that matter, more like "randomly sampled plans" than like "the sorts of plans a civilization of human von Neumanns would produce". (Before we're anywhere near being able to produce the latter sorts of things.)[9]
We're building "AI" in the sense of building powerful general search processes (and search processes for search processes), not building "AI" in the sense of building friendly ~humans but in silicon.
(Note that "we're going to build systems that are more like A Randomly Sampled Plan than like A Civilization of Human Von Neumanns" doesn't imply that the plan we'll get is the one we wanted! There are two separate problems: that current ML finds things-that-act-like-they're-optimizing-the-task-you-wanted rather than things-that-actually-internally-optimize-the-task-you-wanted, and also that internally ~maximizing most superficially desirable ends will kill humanity.)
Note that the same problem holds for systems trained to imitate humans, if those systems scale to being able to do things like "build whole-brain emulation". "We're training on something related to humans" doesn't give us "we're training things that are best thought of as humans plus noise".
It's not obvious to me that GPT-like systems can scale to capabilities like "build WBE". But if they do, we face the problem that most ways of successfully imitating humans don't look like "build a human (that's somehow superhumanly good at imitating the Internet)". They look like "build a relatively complex and alien optimization process that is good at imitation tasks (and potentially at many other tasks)".
You don't need to be a human in order to model humans, any more than you need to be a cloud in order to model clouds well. The only reason this is more confusing in the case of "predict humans" than in the case of "predict weather patterns" is that humans and AI systems are both intelligences, so it's easier to slide between "the AI models humans" and "the AI is basically a human".
4. The key differences between humans and "things that are more easily approximated as random search processes than as humans-plus-a-bit-of-noise" lie in lots of complicated machinery in the human brain.
(Cf. Detached Lever Fallacy, Niceness Is Unnatural, and Superintelligent AI Is Necessary For An Amazing Future, But Far From Sufficient.)
Humans are not blank slates in the relevant ways, such that just raising an AI like a human solves the problem.
This doesn't mean the problem is unsolvable; but it means that you either need to reproduce that internal machinery, in a lot of detail, in AI, or you need to build some new kind of machinery that’s safe for reasons other than the specific reasons humans are safe.
(You need cognitive machinery that somehow samples from a much narrower space of plans that are still powerful enough to succeed in at least one task that saves the world, but are constrained in ways that make them far less dangerous than the larger space of plans. And you need a thing that actually implements internal machinery like that, as opposed to just being optimized to superficially behave as though it does in the narrow and unrepresentative environments it was in before starting to work on WBE. "Novel science work" means that pretty much everything you want from the AI is out-of-distribution.)
5. STEM-level AGI timelines don't look that long (e.g., probably not 50 or 150 years; could well be 5 years or 15).
I won't try to argue for this proposition, beyond pointing at the field's recent progress and echoing Nate Soares' comments from early 2021:
I think timing tech is very difficult (and plausibly ~impossible when the tech isn't pretty imminent), and I think reasonable people can disagree a lot about timelines.
I also think converging on timelines is not very crucial, since if AGI is 50 years away I would say it's still the largest single risk we face, and the bare minimum alignment work required for surviving that transition could easily take longer than that.
Also, "STEM AGI when?" is the kind of argument that requires hashing out people's predictions about how we get to STEM AGI, which is a bad thing to debate publicly insofar as improving people's models of pathways can further shorten timelines.
I mention timelines anyway because they are in fact a major reason I'm pessimistic about our prospects; if I learned tomorrow that AGI were 200 years away, I'd be outright optimistic about things going well.
6. We don't currently know how to do alignment, we don't seem to have a much better idea now than we did 10 years ago, and there are many large novel visible difficulties. (See AGI Ruin and the Capabilities Generalization, and the Sharp Left Turn.)
On a more basic level, quoting Nate Soares: "Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems."
7. We should be starting with a pessimistic prior about achieving reliably good behavior in any complex safety-critical software, particularly if the software is novel. Even more so if the thing we need to make robust is structured like undocumented spaghetti code, and more so still if the field is highly competitive and you need to achieve some robustness property while moving faster than a large pool of less-safety-conscious people who are racing toward the precipice.
The default assumption is that complex software goes wrong in dozens of different ways you didn't expect. Reality ends up being thorny and inconvenient in many of the places where your models were absent or fuzzy. Surprises are abundant, and some surprises can be good, but this is empirically a lot rarer than unpleasant surprises in software development hell.
The future is hard to predict, but plans systematically take longer and run into more snags than humans naively expect, as opposed to plans systematically going surprisingly smoothly and deadlines being systematically hit ahead of schedule.
The history of computer security and of safety-critical software systems is almost invariably one of robust software lagging far, far behind non-robust versions of the same software. Achieving any robustness property in complex software that will be deployed in the real world, with all its messiness and adversarial optimization, is very difficult and usually fails.
In many ways I think the foundational discussion of AGI risk is Security Mindset and Ordinary Paranoia and Security Mindset and the Logistic Success Curve, and the main body of the text doesn't even mention AGI. Adding in the specifics of AGI and smarter-than-human AI takes the risk from "dire" to "seemingly overwhelming", but adding in those specifics is not required to be massively concerned if you think getting this software right matters for our future.
8. Neither ML nor the larger world is currently taking this seriously, as of April 2023.
This is obviously something we can change. But until it's changed, things will continue to look very bad.
Additionally, most of the people who are taking AI risk somewhat seriously are, to an important extent, not willing to worry about things until after they've been experimentally proven to be dangerous. Which is a lethal sort of methodology to adopt when you're working with smarter-than-human AI.
My basic picture of why the world currently isn't responding appropriately is the one in Four mindset disagreements behind existential risk disagreements in ML, The inordinately slow spread of good AGI conversations in ML, and Inadequate Equilibria.[10]
9. As noted above, current ML is very opaque, and it mostly lets you intervene on behavioral proxies for what we want, rather than letting us directly design desirable features.
ML as it exists today also requires that data is readily available and safe to provide. E.g., we can’t robustly train the AGI on "don’t kill people" because we can’t provide real examples of it killing people to train against the behavior we don't want; we can only give flawed proxies and work via indirection.
10. There are lots of specific abilities which seem like they ought to be possible for the kind of civilization that can safely deploy smarter-than-human optimization, that are far out of reach, with no obvious path forward for achieving them with opaque deep nets even if we had unlimited time to work on some relatively concrete set of research directions.
(Unlimited time suffices if we can set a more abstract/indirect research direction, like "just think about the problem for a long time until you find some solution". There are presumably paths forward; we just don’t know what they are today, which puts us in a worse situation.)
E.g., we don’t know how to go about inspecting a nanotech-developing AI system’s brain to verify that it’s only thinking about a specific room, that it’s internally representing the intended goal, that it’s directing its optimization at that representation, that it internally has a particular planning horizon and a variety of capability bounds, that it’s unable to think about optimizers (or specifically about humans), or that it otherwise has the right topics internally whitelisted or blacklisted.
Individually, it seems to me that each of these difficulties can be addressed. In combination, they seem to me to put us in a very dark situation.
One common response I hear to points like the above is:
I'm sympathetic to this because I agree that the future is hard to predict.
I'm not totally confident things will go poorly; if I were, I wouldn't be trying to solve the problem! I think things are looking extremely dire, but not hopeless.
That said, some people think that even "extremely dire" is an impossible belief state to be in, in advance of an AI apocalypse actually occurring. I disagree here, for two basic reasons:
a. There are many details we can get into, but on a core level I don't think the risk is particularly complicated or hard to reason about. The core concern fits into a tweet:
Zvi Mowshowitz puts the core concern in even more basic terms:
The details do matter for evaluating the exact risk level, but this isn't the sort of topic where it seems fundamentally impossible for any human to reach a good understanding of the core difficulties and whether we're handling them.
b. Relatedly, as Nate Soares has argued, AI disaster scenarios are disjunctive. There are many bad outcomes for every good outcome, and many paths leading to disaster for every path leading to utopia.
Quoting Eliezer Yudkowsky:
Quoting Jack Rabuck:
I don't consider "AGI ruin is disjunctive" a knock-down argument for high p(doom) on its own. NASA has a high success rate for rocket launches even though success requires many things to go right simultaneously. Humanity is capable of achieving conjunctive outcomes, to some degree; but I think this framing makes it clearer why it's possible to rationally arrive at a high p(doom), at all, when enough evidence points in that direction.[11]
Eliezer Yudkowsky's So Far: Unfriendly AI Edition and Nate Soares' Ensuring Smarter-Than-Human Intelligence Has a Positive Outcome are two other good (though old) introductions to what I'd consider "the basics".
To state the obvious: this post consists of various claims that increase my probability on AI causing an existential catastrophe, but not all the claims have to be true in order for AI to have a high probability of causing such a catastrophe.
Also, I wrote this post to summarize my own top reasons for being worried, not to try to make a maximally compelling or digestible case for others. I don't expect others to be similarly confident based on such a quick overview, unless perhaps you've read other sources on AI risk in the past. (Including more optimistic ones, since it's harder to be confident when you've only heard from one side of a disagreement. I've written in the past about some of the things that give me small glimmers of hope, but people who are overall far more hopeful will have very different reasons for hope, based on very different heuristics and background models.)
E.g., the physical world is too complex to simulate in full detail, unlike a Go board state. An effective general intelligence needs to be able to model the world at many different levels of granularity, and strategically choose which levels are relevant to think about, as well as which specific pieces/aspects/properties of the world at those levels are relevant to think about.
More generally, being a general intelligence requires an enormous amount of laserlike focus and strategicness when it comes to which thoughts you do or don't think. A large portion of your compute needs to be relentlessly funneled into exactly the tiny subset of questions about the physical world that bear on the question you're trying to answer or the problem you're trying to solve. If you fail to be relentlessly targeted and efficient in "aiming" your cognition at the most useful-to-you things, you can easily spend a lifetime getting sidetracked by minutiae, directing your attention at the wrong considerations, etc.
And given the variety of kinds of problems you need to solve in order to navigate the physical world well, do science, etc., the heuristics you use to funnel your compute to the exact right things need to themselves be very general, rather than all being case-specific.
(Whereas we can more readily imagine that many of the heuristics AlphaGo uses to avoid thinking about the wrong aspects of the game state (or getting otherwise sidetracked) are Go-specific heuristics.)
Of course, if your brain has all the basic mental machinery required to do other sciences, that doesn't mean that you have the knowledge required to actually do well in those sciences. A STEM-level artificial general intelligence could lack physics ability for the same reason many smart humans can't solve physics problems.
E.g., because different sciences can synergize, and because you can invent new scientific fields and subfields, and more generally chain one novel insight into dozens of other new insights that critically depended on the first insight.
More generally, the sciences (and many other aspects of human life, like written language) are a very recent development on evolutionary timescales. So evolution has had very little time to refine and improve on our reasoning ability in many of the ways that matter.
"Human engineers have an enormous variety of tools available that evolution lacked" is often noted as a reason to think that we may be able to align AGI to our goals, even though evolution failed to align humans to its "goal". It's additionally a reason to expect AGI to have greater cognitive ability, if engineers try to achieve great cognitive ability.
And my understanding is that, e.g., Paul Christiano's soft-takeoff scenarios don't involve there being much time between par-human scientific ability and superintelligence. Rather, he's betting that we have a bunch of decades between GPT-4 and par-human STEM AGI.
I'll classify thoughts and text outputs as "actions" too, not just physical movements.
Obviously, neither is a particularly good approximation for ML systems. The point is that our optimism about plans in real life generally comes from the fact that they're weak, and/or it comes from the fact that the plan generators are human brains with the full suite of human psychological universals. ML systems don't possess those human universals, and won't stay weak indefinitely.
Quoting Four mindset disagreements behind existential risk disagreements in ML:
Quoting The inordinately slow spread of good AGI conversations in ML:
On the more basic level, Inadequate Equilibria paints a picture of the world's baseline civilizational competence that I think makes it less mysterious why we could screw up this badly on a novel problem that our scientific and political institutions weren't designed to address. Inadequate Equilibria also talks about the nuts and bolts of Modest Epistemology, which I think is a key part of the failure story.
Quoting a recent conversation between Aryeh Englander and Eliezer Yudkowsky:
(See Inadequate Equilibria for a detailed discussion of Modest Epistemology, deference, and "outside views", and Strong Evidence Is Common for the basic first-order case that people can often reach confident conclusions about things.)