Here are two specific objections to this post[1]:
Perfectly aligned AI systems which were exactly as smart as humans and had the same capability profile, but which operated at 100x speed and were cheap, would be extremely useful. In particular, we could vastly exceed all current alignment work in the span of a year.
In practice, the capability profile of AIs is unlikely to be exactly the same as humans. Further, even if the capability profile was the same, merely human level systems likely pose substantial misalignment concerns.
However, it does seem reasonably likely that AI systems will have a reasonably similar capability profile to humans and will also run faster and be cheaper. Thus, approaches like AI control could be very useful.
I think it's unlikely that AI will be capable of automating R&D while also being incapable of doing non-trivial consequentialist reasoning in a forward pass, but this still seems around 15% likely overall. And if there are quite powerful LLM agents with relatively weak forward passes, then we should update in this direction.
I expect the probability to be >> 15% for the following reasons.
There will likely still be incentives to make architectures more parallelizable (for training efficiency), and parallelizable architectures will probably be not-that-expressive in a single forward pass (see The Parallelism Tradeoff: Limitations of Log-Precision Transformers). CoT is known to increase the expressivity of Transformers, and the longer the CoT, the greater the gains (see The Expressive Power of Transformers with Chain of Thought). In principle, even a linear auto-regressive next-token predictor is Turing-complete if you have fine-grained enough CoT data to train it on, and you can probably trade off between the length complexity (of the CoT supervision) and single-pass computational complexity (see Auto-Regressive Next-Token Predictors are Universal Learners). We also see empirically that...
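To make the expressivity point concrete, here is a minimal toy sketch (mine, not from the cited papers): a "model" restricted to a single constant-depth operation per call can still compute the parity of an arbitrarily long bit string, provided it is allowed to emit and re-read an intermediate scratchpad value after each step, i.e., a chain of thought.

```python
# Toy illustration of the CoT-expressivity tradeoff discussed above.
# `single_pass_step` plays the role of one bounded-depth forward pass;
# `parity_with_cot` is the autoregressive loop that feeds each emitted
# intermediate result back in as context for the next pass.

def single_pass_step(state: int, bit: int) -> int:
    """One 'forward pass': a fixed, constant-depth computation (here, XOR)."""
    return state ^ bit

def parity_with_cot(bits: list[int]) -> int:
    """Compute parity by generating one scratchpad value per input bit."""
    state = 0                                # the carried scratchpad value
    for b in bits:
        state = single_pass_step(state, b)   # one shallow pass per CoT token
    return state

print(parity_with_cot([1, 0, 1, 1, 0, 1]))   # -> 0
```

The per-step computation stays fixed and shallow; all of the extra expressive power comes from the length of the emitted chain, which is the length-vs-depth trade-off the cited papers formalize.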
I think those objections are important to mention and discuss, but they don't undermine the conclusion significantly.
AIs which are qualitatively just as smart as humans could still be dangerous in the classic ways. The OP's argument still applies to them, insofar as they are agentic and capable of plotting on the inside etc.
As for LLM agents with weak forward passes: Yes, if we could achieve robust faithful CoT properties, we'd be in pretty damn good shape from an AI control perspective. I have been working on this myself & encourage others to do so also. I don't think it undermines the OP's points though? We are not currently on a path to have robust faithful CoT properties by default.
This post seemed overconfident in a number of places, so I was quickly pushing back in those places.
I also think the conclusion of "Nearly No Data" is pretty overstated. I think it should be possible to obtain significant data relevant to AGI alignment with current AIs (though various interpretations of current evidence can still be wrong, and the best way to obtain data might look more like running careful model organism experiments than observing properties of ChatGPT). But, it didn't seem like I would be able to quickly argue against this overall conclusion in a cohesive way, so I decided to just push back on small separable claims which are part of the reason why I think current systems provide some data.
If this post argued "the fact that current chatbots trained normally don't seem to exhibit catastrophic misalignment isn't much evidence about catastrophic misalignment in more powerful systems", then I wouldn't think this was overstated (though this also wouldn't be very original). But it makes stronger claims which seem false to me.
Mm, I concede that this might not have been the most accurate title. I might've let the desire for hot-take clickbait titles get the better of me some. But I still mostly stand by it.
My core point is something like "the algorithms that the current SOTA AIs execute during their forward passes do not necessarily capture all the core dynamics that would happen within an AGI's cognition, so extrapolating the limitations of their cognition to AGI is a bold claim we have little evidence for".
I agree that the current training setups provide some data on how, e.g., optimization pressures / reinforcement schedules / SGD biases work, and I even think the shard theory totally applies to general intelligences like AGIs and humans. I just think that theory is AGI-incomplete.
You stated it as established fact rather than opinion, which caused me to believe that the argument had already been made somewhere, and someone could just send me a link to it.
You may have interpreted it that way, but I certainly don't follow a policy of prefacing everything with "in my opinion" unless I have a citation ready. I bet you don't either. Claims are by default just claims, not claims-presented-as-established-facts. If I wanted to present it as an established fact I would have done something to indicate that, e.g. cited something or said "it is well-known that..."
Anyhow, I'm happy to defend the claim here. It would help if I knew where to begin. I'll just briefly link a couple things here to get started and then we can zoom in on whatever bits you are most interested in.
First: There are probably going to be incentives for AIs to conceal their thoughts sometimes. Sometimes this will allow them to perform better in training, for example. Link to example post making this point, though many others have made it also.
Second: Some AI designs involve a natural language bottleneck; the only way for the system to communicate with its future self is via outputting tokens that the...
On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have. That is a strong claim, yes, but I am making it.
I agree that it is conceivable that LLMs embedded in CoT-style setups would be able to be transformative in some manner without "taking off". Indeed, I touch on that in the post some: that scaffolded and slightly tweaked LLMs may not be "mere LLMs" as far as capability and safety upper bounds go.
That said, inasmuch as CoT-style setups would be able to turn LLMs into agents/general intelligences, I mostly expect that to be prohibitively computationally intensive, such that we'll get to AGI by architectural advances before we have enough compute to make a CoT'd LLM take off.
But that's a hunch based on the obvious stuff like AutoGPT consistently failing, plus my private musings regarding how an AGI based on scaffolded LLMs would work (which I won't share, for obvious reasons). I won't be totally flabbergasted if some particularly clever way of doing that worked.
On my inside model of how cognition works, I don't think "able to automate all research but can't do consequentialist reasoning" is a coherent property that a system could have.
I actually basically agree with this quote.
Note that I said "incapable of doing non-trivial consequentialist reasoning in a forward pass". The overall LLM agent in the hypothetical is absolutely capable of powerful consequentialist reasoning, but it can only do this by reasoning in natural language. I'll try to clarify this in my comment.
I'm not particularly concerned about AI being "transformative" or not. I'm concerned about AGI going rogue and killing everyone. And LLMs automating workflows is great and not (by itself) omnicidal at all, so that's... fine?
But I think what happens between now and when we have AIs that are better than humans-in-2023 at everything matters.
As in, AIs boosting human productivity might/should let us figure out how to make stuff safe as it comes up, so no need to be concerned about us not having a solution to the endpoint of that process before we've made the first steps?
The problem is that boosts to human productivity also boost the speed at which we're getting to that endpoint, and there's no reason to think they differentially improve our ability to make things safe. So all that would do is accelerate us harder as we're flying towards the wall at a lethal speed.
Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.
You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take. The 'canonical' AI risk argument is implicitly based on a set of interdependent assumptions/predictions about the nature of future AI:
the inherent 'alien-ness' of AI and AI values
supposed magical coordination advantages of AIs
arguments from analogies: namely evolution
These arguments are old enough that we can now update based on how the implicit predictions of the implied worldviews turned out. The traditional EY/MIRI/LW view has not aged well, which in part can be traced to its dependence on an old flawed theory of how the brain works.
For those who read HPMOR/LW in their teens/20's, a big chunk of your worldview is downst...
You aren't really engaging with the evidence against the purely theoretical canonical/classical AI risk take
Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories. The novel theories' main claims are that powerful cognitive systems aren't necessarily (isomorphic to) utility-maximizers, that shards (i.e., context-activated heuristics) reign supreme and value reflection can't arbitrarily slip their leash, that "general intelligence" isn't a compact algorithm, and so on. None of that relies on nanobots/Moore's law/etc.
What you've outlined might or might not be the relevant historical reasons for how Eliezer/the LW community arrived at some of their takes. But the takes themselves, or at least the subset of them that I care about, are independent of these historical reasons.
fast takeoff is more likely than slow
Fast takeoff isn't load-bearing on my model. I think it's plausible for several reasons, but I think a non-self-improving human-genius-level AGI would probably be enough to kill off humanity.
...the inherent
Yes, but it's because the things you've outlined seem mostly irrelevant to AGI Omnicide Risk to me? It's not how I delineate the relevant parts of the classical view, and it's not what's been centrally targeted by the novel theories
They are critically relevant. From your own linked post (how I delineate):
We only have one shot. There will be a sharp discontinuity in capabilities once we get to AGI, and attempts to iterate on alignment will fail. Either we get AGI right on the first try, or we die.
If takeoff is slow (1) because brains are highly efficient and brain engineering is the viable path to AGI, then we naturally get many shots - via simulation simboxes if nothing else, and there is no sharp discontinuity if Moore's law also ends around the time of AGI (an outcome which brain efficiency - as a concept - predicts in advance).
We need to align the AGI's values precisely right.
Not really - if the AGI is very similar to uploads, we just need to align them about as well as humans. Note this is intimately related to 1. and the technical relation between AGI and brains. If they are inevitably very similar then much of the classical AI risk argument dissolves.
You see...
My argument for the sharp discontinuity routes through the binary nature of general intelligence + an agency overhang, both of which could be hypothesized via non-evolution-based routes. Considerations about brain efficiency or Moore's law don't enter into it.
You claim later to agree with ULM (learning from scratch) over evolved-modularity, but the paragraph above and statements like this in your link:
The homo sapiens sapiens spent thousands of years hunter-gathering before starting up civilization, even after achieving modern brain size.
It would still be generally capable in the limit, but it wouldn't be instantly omnicide-capable.
So when the GI component first coalesces,
Suggest to me that you have only partly propagated the implications of ULM and the scaling hypothesis. There is no hard secret to AGI - the architecture of systems capable of scaling up to AGI is not especially complex to figure out, and has in fact been mostly known for decades (Schmidhuber et al. figured most of it out long before the DL revolution). This is all strongly implied by ULM/scaling, because the central premise of ULM is that GI is the result of massively scaling up simple algorithms and...
- not only is there nothing special about the human brain architecture, there is not much special about the primate brain other than hyperparameters better suited to scaling up to our size
I don't think this is entirely true. Injecting human glial cells into mice made them smarter. Certainly that doesn't provide evidence for any sort of exponential difference, and you could argue it's still just hyperparams, but it's hyperparams that work better small too. I think we should be expecting sublinear growth in quality of the simple algorithms, but should also be expecting that growth to continue for a while. It seems very silly that you of all people insist otherwise, given your interests.
We found that the glial chimeric mice exhibited both increased synaptic plasticity and improved cognitive performance, manifested by both enhanced long-term potentiation and improved performance in a variety of learning tasks (Han et al., 2013). In the context of that study, we were surprised to note that the forebrains of these animals were often composed primarily of human glia and their progenitors, with overt diminution in the relative proportion of resident mouse glial cells.
The paper which more directly supports the "made them smarter" claim seems to be this. I did somewhat anticipate this - "not much special about the primate brain other than ..." - but was not previously aware of this particular line of research and certainly would not have predicted their claimed outcome as the most likely vs various obvious alternatives. Upvoted for the interesting link.
Specifically I would not have predicted that the graft of human glial cells would have simultaneously both 1.) outcompeted the native mouse glial cells, and 2.) resulted in higher performance on a handful of interesting cognitive tests.
I'm still a bit skeptical of the "made them smarter" claim, as it's always best to taboo 'smarter' and they naturally could have cherry-picked the tests (even unintentionally), but it does look like the central claim holds - that injection of human GPCs (glial progenitor cells) into fetal mice does result in mice brains that learn at least some important tasks more quickly, and this is probably caused by facilitation of higher learning rates. However it seems to come at a cost of higher energy expenditure, so it's not clear yet that this is a pure Pareto improvement - could be a tradeoff worthwhile in larger sparser human brains but not in the mouse brain, such that it wouldn't translate into fitness advantage.
Or perhaps it is a straight-up Pareto improvement - that is not unheard of, viral horizontal gene transfer is a thing, etc.
Wouldn't you expect (the many) current attempts to agentize LLMs to eat up a lot of the 'agency overhang'? Especially since, something like the reflection/planning loops of agentized LLMs seem to me like a pretty plausible description of what human brains might be doing (e.g. system 2 / system 1, or see many of Seth Herd's recent writings on agentized / scaffolded LLMs and similarities to cognitive architectures).
No, and that's a reasonable ask.
To a first approximation my futurism is time acceleration; so the risks are the typical risks sans AI, but the timescale is hyperexponential a la Roodman. Even a more gradual takeoff would imply more risk to global stability on faster timescales than anything we've experienced in history; the wrong AGI race winners could create various dystopias.
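For readers unfamiliar with the reference, here is a rough numerical sketch (toy parameters chosen for illustration, not Roodman's fitted values) of what "hyperexponential" means here: when the growth rate scales super-linearly with output, output diverges in finite time instead of merely growing exponentially.

```python
# Toy sketch of hyperexponential ("finite-time singularity") growth:
# dy/dt = c * y**(1 + s) with s > 0 blows up in finite time, unlike
# ordinary exponential growth (s = 0). Parameters below are illustrative.

def years_to_exceed(cap: float = 1e12, y0: float = 1.0, c: float = 0.03,
                    s: float = 0.1, dt: float = 0.01) -> float:
    """Crude Euler integration; returns the time at which y first exceeds cap."""
    y, t = y0, 0.0
    while y < cap:
        y += c * (y ** (1 + s)) * dt   # the growth rate rises faster than y itself
        t += dt
    return t

print(f"Toy model exceeds the cap after ~{years_to_exceed():.0f} years")
# Analytically, blowup occurs at t = 1 / (c * s * y0**s) ≈ 333 years here.
```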
It seems to me that you have very high confidence in being able to predict the "eventual" architecture / internal composition of AGI. I don't know where that apparent confidence is coming from.
The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
I would instead say:
The canonical views dreamed up systems which don't exist, which have never existed, and which might not ever exist.[1] Given those assumptions, some people have drawn strong conclusions about AGI risk.
We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized). And so rather than justifying "does current evidence apply to 'superintelligences'?", I'd like to see justification of "under what conditions does speculation about 'superintelligent consequentialism' merit research attention at all?" and "why do we think 'future architectures' will have property X, or whatever?!".
The views might have, f
We have to remember that there is AI which we know can exist (LLMs) and there is first-principles speculation about what AGI might look like (which may or may not be realized).
I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
As I'd tried to outline in the post, I think "what are AIs that are known to exist, and what properties do they have?" is just the wrong question to focus on. The shared "AI" label is a red herring. The relevant question is "what are scarily powerful generally-intelligent systems that exist, and what properties do they have?", and the only relevant data point seems to be humans.
And as far as omnicide risk is concerned, the question shouldn't be "how can you prove these systems will have the threatening property X, like humans do?" but "how can you prove these systems won't have the threatening property X, like humans do?".
I disagree that it is actually "first-principles". It is based on generalizing from humans, and on the types of entities (idealized utility-maximizing agents) that humans could be modeled as approximating in specific contexts in which they steer the world towards their goals most powerfully.
Yeah, but if you generalize from humans another way ("they tend not to destroy the world and tend to care about other humans"), you'll come to a wildly different conclusion. The conclusion should not be sensitive to poorly motivated reference classes and frames, unless it's really clear why we're using one frame. This is a huge peril of reasoning by analogy.
Whenever attempting to draw conclusions by analogy, it's important that there be shared causal mechanisms which produce the outcome of interest. For example, I can simulate a spring using an analog computer because both systems are roughly governed by similar differential equations. In shard theory, I posited that there's a shared mechanism of "local updating via self-supervised and TD learning on ~randomly initialized neural networks" which leads to things like "contextually activated heuristics" (or "shards").
Here, it isn't clear what...
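As a minimal illustration of the spring example above (my sketch, not part of the comment): a digital program and an analog computer can both "simulate" the spring because each simply integrates the same differential equation, x'' = -(k/m) x; the shared dynamics are what make the analogy valid.

```python
# Toy sketch: numerically integrating the undamped spring equation
# x'' = -(k/m) * x -- the same dynamics an analog computer would realize
# with physical integrators. With k = m = 1 the exact solution from
# x(0)=1, v(0)=0 is x(t) = cos(t), v(t) = -sin(t).

def simulate_spring(x0=1.0, v0=0.0, k=1.0, m=1.0, dt=0.001, steps=10_000):
    x, v = x0, v0
    for _ in range(steps):
        a = -(k / m) * x   # Hooke's-law acceleration
        v += a * dt        # semi-implicit Euler: update velocity first...
        x += v * dt        # ...then position, which keeps the orbit stable
    return x, v

print(simulate_spring())   # after t = 10: roughly (cos(10), -sin(10)) ≈ (-0.84, 0.54)
```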
It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work. By the same token, LLM findings do not necessarily generalize to AGI.
My understanding is that many of those studying MNIST-classifier CNNs circa 2010 were in fact studying this because they believed similar neural-net inspired mechanisms would go much further, and would not be surprised if very similar mechanisms were at play inside LLMs. And they were correct! Such studies led to ReLU, backpropagation, residual connections, autoencoders for generative AI, and ultimately the scaling laws we see today.
If you traveled back to 2010, and you had to choose between already extant fields, having that year's GPU compute prices and software packages, what would you study to learn about LLMs? Probably neural networks in general, both NLP and image classification. My understanding is there was & is much cross-pollination between the two.
Of course, maybe this is just a misunderstanding of history on my part. Interested to hear if my understanding's wrong!
"Nearly no data" is way too strong a statement, and relies on this completely binary distinction between things that are not AGI and things that are AGI.
The right question is, what level of dangerous consequentialist goals are needed for systems to reach certain capability levels, e.g. novel science? It could have been that to be as useful as LLMs, systems would be as goal-directed as chimpanzees. Animals display goal-directed behavior all the time, and to get them to do anything you mostly have to make the task instrumental to their goals e.g. offer them treats. However we can control LLMs way better than we can animals, and the concerns are of goal misgeneralization, misspecification, robustness, etc. rather than affecting the system's goals at all.
It remains to be seen what happens at higher capability levels, and alignment will likely get harder, but current LLMs are definitely significant evidence. Like, imagine if people were worried about superintelligent aliens invading Earth and killing everyone due to their alien goals, and scientists were able to capture an animal from their planet as smart as chimpanzees and make it as aligned as LLMs, such that it would happily sit around and summarize novels for you, follow your instructions, try to be harmless for personality reasons rather than instrumental ones, and not eat your body if you die alone. This is not the whole alignment problem, but it seems like a decent chunk of it! It could have been much harder.
They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.
I disagree with this, and I think you should too, even considering your own views. For example, DeepMind recently discovered 2.2 million new crystals, increasing the number of stable crystals we know about by an order of magnitude. Perhaps you don't think this is revolutionary, but 5, 10, 15, 50 more papers like it? One of them is bound to be revolutionary.
Maybe you don't think this is autonomous enough for you. After all, it's people writing the paper, people who will come up with the ideas of what to use the materials for, and people who built this very particular ML setup in the first place. But then your prediction becomes that these tasks will not be automatable by LLMs without making them dangerous. To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.
(edit note: last sentence was edited in)
Maybe you don't think this is autonomous enough for you
Yep. The core thing here is iteration. If an AI can execute a whole research loop on its own – run into a problem it doesn't know how to solve, figure out what it needs to learn to solve it, construct a research procedure for figuring that out, carry out that procedure, apply the findings, repeat – then research-as-a-whole begins to move at AI speeds. It doesn't need to wait for a human to understand the findings and figure out where to point it next – it can go off and invent whole new fields at inhuman speeds.
Which means it can take off; we can meaningfully lose control of it. (Especially if it starts doing AI research itself.)
Conversely, if there's a human in the loop, that's a major bottleneck. As I'd mentioned in the post, I think LLMs and such AIs are a powerful technology, and greatly boosting human research speeds is something where they could contribute. But without a fully closed autonomous loop, that's IMO not an omnicide risk.
To me these tasks seem pretty basic, likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.
That's a point of disagreement:...
Maybe a more relevant concern I have with this is it feels like a "Can you write a symphony" type test to me. Like, there are very few people alive right now who could do the process you outline without any outside help, guidance, or prompting.
Your view may have a surprising implication: Instead of pushing for an AI pause, perhaps we should work hard to encourage the commercialization of current approaches.
If you believe that LLMs aren't a path to full AGI, successful LLM commercialization means that LLMs eat low-hanging fruit and crowd out competing approaches which could be more dangerous. It's like spreading QWERTY as a standard if you want everyone to type a little slower. If tons of money and talent is pouring into an AI approach that's relatively neutered and easy to align, that could actually be a good thing.
A toy model: Imagine an economy where there are 26 core tasks labeled from A to Z, ordered from easy to hard. You're claiming that LLMs + CoT provide a path to automate tasks A through Q, but fundamental limitations mean they'll never be able to automate tasks R through Z. To automate jobs R through Z would require new, dangerous core dynamics. If we succeed in automating A through Q with LLMs, that reduces the economic incentive to develop more powerful techniques that work for the whole alphabet. It makes it harder for new techniques to gain a foothold, since the easy tasks already have incumbent playe...
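A minimal sketch of the toy model above (hypothetical numbers, with every task assumed equally valuable for simplicity): if LLMs capture A through Q, the revenue left to motivate a new, riskier paradigm is only whatever sits in R through Z.

```python
# Toy version of the A..Z economy sketched above. All values are made up;
# the point is only that automating the easy tasks with a safe paradigm
# shrinks the prize available to a more dangerous successor paradigm.

TASKS = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
value = {t: 1.0 for t in TASKS}        # assume each task is worth 1 unit/year
llm_automatable = set(TASKS[:17])      # A..Q, per the toy model above

total = sum(value.values())
left_for_new_paradigm = sum(v for t, v in value.items() if t not in llm_automatable)

print(f"total market: {total:.0f} units; "
      f"left to motivate a riskier new paradigm: {left_for_new_paradigm:.0f} units")
# -> total market: 26 units; left to motivate a riskier new paradigm: 9 units
```

On top of that smaller prize, a new paradigm would also have to displace incumbents on the already-automated tasks, which is the foothold point made above.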
That's not surprising to me! I pretty much agree with all of this, yup. I'd only add that:
I see people discussing how far we can go with LLMs or other simulator/predictor systems. I particularly like porby's takes on this. I am excited for that direction of research, but I fear it misses an important piece. The missing piece is this: there will consistently be a set of tasks that, at any given predictor skill level, are easier to achieve with that predictor wrapped in an agent layer. AutoGPT is tempting for a real reason. There is significant reward available to those who successfully integrate the goal-less predictor into a goal-pursuing agent program.
To avoid this, you must convince everyone who could do this not to do it. This could be by convincing them it wouldn't be profitable after all, or would be too dangerous, or that enforcement mechanisms will stop them. Unless you manage to do this convincing for all possible people in a position to do this, someone does it. And then you have to deal with the agent-thing. What I'm saying is that you can't count on there never being the agent version. You have to assume that someone will try it. So the argument "we can get lots of utility much more safely from goal-less predictors" can be true and yet we will stil...
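To make the "predictor wrapped in an agent layer" pattern concrete, here is a minimal hypothetical sketch; the predict function stands in for any goal-less next-token predictor, and none of the names come from the comment.

```python
# Hypothetical AutoGPT-style wrapper: the predictor only continues text;
# the goal-pursuit lives entirely in this surrounding loop.

from typing import Callable

def agent_loop(predict: Callable[[str], str], goal: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\n"
                  f"Actions taken so far: {history}\n"
                  "Propose the single next action, or reply DONE.")
        action = predict(prompt)        # goal-less prediction of the next step
        if action.strip() == "DONE":
            break
        history.append(action)          # the loop, not the model, carries state
        # a real wrapper would also execute the action and feed observations back
    return history
```

The point being made above is that nothing in such a wrapper is hard to write, so a safety story cannot rest on nobody ever writing it.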
I agree with the spirit of the post but not the kinda clickbaity title. I think a lot of people are over updating on single forward pass behavior of current LLMs. However, I think it is still possible to get evidence using current models with careful experiment design and being careful with what kinds of conclusions to draw.
At first I strong-upvoted this, because I thought it made a good point. However, upon reflection, that point is making less and less sense to me. You start by claiming current AIs provide nearly no data for alignment, that they are in a completely different reference class from human-like systems... and then you claim we can get such systems with just a few tweaks? I don't see how you can go from a system that, you claim, provides almost no data for studying how an AGI would behave, to suddenly having a homunculus-in-the box that becomes superintelligent and kills everyone. Homunculi seem really, really hard to build. By your characterization of how different actual AGI is from current models, it seems this would have to be fundamentally architecturally different from anything we've built so far. Not some kind of thing that would be created by near-accident.
Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.
I'll address this post section by section, to see where my general disagreements lie:
"What the Fuss Is All About"
https://www.lesswrong.com/posts/HmQGHGCnvmpCNDBjc/#What_the_Fuss_Is_All_About
I agree with the first point on humans, with a very large caveat: While a lot of normies tend to underestimate the G-factor in how successful you are, nerd communities like LessWrong systematically overestimate its value, to the point where I actually understand the normie/anti-intelligence primacy position, and IQ/Intelligence discourse is fucked by people who either ...
This aligns with my current view. Wanted to add a thought: current LLMs could still have unintended problems/misalignment around factuality, privacy, copyright, or harmful content, which should still be studied and mitigated, together with thinking about other, more AGI-like models (we don't know what exactly yet, but they could exist). And an LLM (especially a fine-tuned one), if it does increasingly well on generalization ability, should still be monitored. To be prepared for the future, having a safety mindset/culture is important for all models.
On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.
LLMs are simulators. They are normally trained to simulate humans (and fictional characters, and groups of humans cooperating to write something), though DeepMind has trained them to instead simulate weather patterns. Humans are not well aligned to other humans: Joseph Stalin was not well aligned to the citizenry of Russia, and as you correctly note, a very smart manipulative ...
An additional distinction between contemporary and future alignment challenges is that the latter concern the control of physically deployed, self-aware systems.
Alex Altair has previously highlighted that they will (microscopically) obey time reversal symmetry[1] unlike the information processing of a classical computer program. This recent paper published in Entropy[2] touches on the idea that a physical learning machine (the "brain" of a causal agent) is an "open irreversible dynamical system" (pg 12-13).
Altair A. "Consider using reversible au
But you wouldn't study ... MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work.
This particular bit seems wrong; CNNs and LLMs are both built on neural networks. If the findings don't generalize, that could be called a "failure of theory", not an impossibility thereof. (Then again, maybe humans don't have good setups for going 20 steps ahead of data when building theory, so...)
(To clarify, this post is good and needed, so thank you for writing it.)
The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.
Omnicide-wise, arbitrarily-big LLMs should be totally safe.
This is an optimistic take. If we could be rightfully confident that our random search through mindspace with modern ML methods can never produce "scary agents", a lot of our concerns would go away. I don't think that it's remotely the case.
...The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to
The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.
So the only argument for "LLMs can't do it safely" is "only humans can do it now, and humans are not safe"? The same argument works for any capability LLMs already have: LLMs can't talk, because the space of words is so vast, you'll need generality to navigate it.
Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take).
Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.
At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.
I believe this is invalid, and that the two views are much more compatible than they might seem. I think the issue lies in the mismatch between their subject matters.
It's clearer if you taboo the word "AI":
The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.
It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck-in when people extrapolate findings from LLM studies to superintelligences.
It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings regarding what algorithms these AIs implement generalize to statements regarding what algorithms the forward passes of LLMs circa 2020s implement.
By the same token, LLMs' algorithms do not necessarily generalize to how an AGI's cognition will function. Their limitations are not necessarily an AGI's limitations.[1]
What the Fuss Is All About
To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place.
Consider humans. Some facts:
So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these systems can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other based even on very tiny differences in their goals.
The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious "g-factor". There could be systems more powerful than us. These systems would be able to out-plot us the same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.
Systems like this, systems the possibility of whose existence is extrapolated from humans' existence, are precisely what we're worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.
The only systems in this reference class known to us are humans, and some human collectives.
Viewing it from another angle, one can say that the systems we're concerned about are defined as cognitive systems in the same reference class as humans.
So What About Current AIs?
Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.
Indeed, that's often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads to them arriving at surprisingly alien values, is often likened to imagining them as having a "homunculus" inside. A tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.
The novel arguments are often based around arguing that there's no evidence that LLMs have such homunculi, and that their training loops can never lead to homunculi's formation.
And I agree! I think those arguments are right.
But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.
Which isn't really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we've essentially figured out the grand unified theory of cognition? That the systems on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – feelings that these descriptions of how AIs work don't seem to agree with how human cognition feels from the inside?
And I do dispute that implicit claim.
I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outwards performance – the RLHF'd masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that'd try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?
Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.
Studying gorilla neurology isn't going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.
Similarly, studying LLMs isn't necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.
The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.
On Safety Guarantees
That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, yeah, maybe there are scary systems to which its arguments apply, sure. But we're not on-track to build them, so who cares?
On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.
I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreck economic or sociopolitical havoc, and such. But I don't believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.
Omnicide-wise, arbitrarily-big LLMs should be totally safe.
The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.
They're a powerful technology in their own right, yes. But just that: just another technology. Not something that's going to immanentize the eschaton.
Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they're the same limit.
Current AIs are safe, in practice and in theory, because they're not as scarily generally capable as humans. On the flip side, current AIs aren't as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.
So if you figure out how to remove the capability upper bound, you'll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.
And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They're not here just to have fun scaling LLMs. So inasmuch as I'm right that LLMs and such are not AGI-complete, they'll eventually move on from them, and find some approach that does lead to AGI.
And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.
A Concrete Scenario
Here's a very specific worry of mine.
Take an AI Optimist who'd built up a solid model of how AIs trained by SGD work. Based on that, they'd concluded that the AGI Omnicide Risk arguments don't apply to such systems. That conclusion is, I argue, correct and valid.
The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they're not endangering the world, and are instead helping to usher in a new age of prosperity.
Eventually, they notice or realize some architectural limitation of the paradigm they're working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don't notice that this tweak invalidates one of the properties on which their previous reassuring safety guarantees rested; from which they were derived and on which they logically depend.
They fail to update the cached thought of "AI is safe".
And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they'd been previously working on, but an actual AGI.
That AGI has a scheming homunculus deep inside. The people working with it don't believe in homunculi, they have convinced themselves those can't exist, so they're not worrying about that. They're not ready to deal with that, they don't even have any interpretability tools pointed in that direction.
The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)
And then everyone dies.
The point is that the safety guarantees that the current optimists' arguments are based on are not simply fragile, they're being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they'll give out under the optimization pressures being applied – and it'll be easy to miss the moment the break happens. It'd be easy to cache the belief of, say, "LLMs are safe", then introduce some architectural tweak, keep thinking of your system as "just an LLM with some scaffolding and a tiny tweak", and overlook the fact that the "tiny tweak" invalidated "this system is an LLM, and LLMs are safe".
Closing Summary
I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the "canonical" least-forgiving take on alignment, are both correct. They're just concerned with different classes of systems: non-generally-intelligent non-agenty AIs generated on the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are capable (whatever this really means).
That view isn't unreasonable. Same way it's not unreasonable to claim that studying GOFAI algorithms wouldn't shed much light on LLM cognition, despite them both being advanced AIs.
Indeed, I go further, and say that should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and that claim isn't solidly backed, empirically or theoretically. If anything, it's the opposite: the arguments for current AIs' safety are based on arguing that they're incapable-by-design of engaging in human-style scheming.
That doesn't guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees are also what's hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They'll figure out how to remove them... and in that very act, they will remove the safety guarantees. The systems they're working on would switch from belonging to the proven-safe class to the dangerous scheme-y class.
The class to which the classical AGI Omnicide Risk arguments apply full-force.
The class for which no known alignment technique suffices.
And that switch would be very easy, yet very lethal, to miss.
Slightly edited for clarity after an exchange with Ryan.