Recently, there's been a fair amount of pushback on the "canonical" views on the difficulty of AGI Alignment (the views I collectively call the "least forgiving" take).
Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical.
At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.
I believe this charge is invalid, and that the two views are much more compatible than they might seem. I think the issue lies in the mismatch between their subject matters.
It's clearer if you taboo the word "AI":
- The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
- The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.
It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck in when people extrapolate findings from LLM studies to superintelligences.
It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings regarding what algorithms these AIs implement generalize to statements regarding what algorithms the forward passes of LLMs circa 2020s implement.
By the same token, LLMs' algorithms do not necessarily generalize to how an AGI's cognition will function. Their limitations are not necessarily an AGI's limitations.[1]
What the Fuss Is All About
To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place.
Consider humans. Some facts:
- Humans possess an outstanding ability to steer the world towards their goals, and that ability grows sharply with their "intelligence". Sure, there are specific talents, and "idiot savants". But broadly, there does seem to be a single variable that mediates a human's competence in all domains. An IQ 140 human would dramatically outperform an IQ 90 human at basically any cognitive task, and crucially, be much better at achieving their real-life goals.
- Humans have the ability to plot against and deceive others. That ability grows fast with their g-factor. A brilliant social manipulator can quickly maneuver their way into having power over millions of people, out-plotting and dispatching even those that are actively trying to stop them or compete with them.
- Human values are complex and fragile, and the process of moral philosophy is more complex still. Humans often arrive at weird conclusions that don't neatly correspond to their innate instincts or basic values: intricate moral frameworks, weird bullet-biting philosophies, and even essentially arbitrary ideologies like cults.
- And when people with different values interact...
- People who differ in their values even just a bit are often vicious, bitter enemies. Consider the history of heresies, or of long-standing political rifts between factions that are essentially indistinguishable from the outside.
- People whose cultures evolved in mutual isolation often don't even view each other as human. Consider the history of xenophobia, colonization, culture shocks.
So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these systems can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other based even on very tiny differences in their goals.
The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious "g-factor". There could be systems more powerful than us. These systems would be able to out-plot us the same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.
Systems like this, systems whose possibility is extrapolated from humans' existence, are precisely what we're worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.
The only systems in this reference class known to us are humans, and some human collectives.
Viewing it from another angle, one can say that the systems we're concerned about are defined as cognitive systems in the same reference class as humans.
So What About Current AIs?
Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it's doing so by demonstrating that they lie outside the reference class of human-like systems.
Indeed, that's often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads to them arriving at surprisingly alien values, is often likened to imagining them as having a "homunculus" inside. A tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.
The novel arguments often amount to arguing that there's no evidence that LLMs have such homunculi, and that their training loops can never lead to homunculi forming in the first place.
And I agree! I think those arguments are right.
But one man's modus ponens is another's modus tollens. I don't take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don't exhibit such issues. I take it as evidence that LLMs are not AGI-complete.
Which isn't really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we've essentially figured out the grand unified theory of cognition? That the systems on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – feelings that these descriptions of how AIs work don't seem to agree with how human cognition feels from the inside?
And I do dispute that implicit claim.
I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outward performance – the RLHF'd masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that'd try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?
Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.
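To make the contrast concrete, here's a minimal toy sketch in Python of the two kinds of cognition being distinguished: a reactive stimulus-response policy versus a consequentialist planner that searches over actions by simulating their predicted outcomes. Everything in it (the names, the stand-in world model, the goal function) is made up purely for illustration; it's a sketch of the structural difference, not a claim about how any actual model or training run is implemented.

```python
# Toy illustration only: contrasting the two cognitive shapes discussed above.
# Neither function describes any real system; all names are hypothetical.

def reactive_policy(observation: str) -> str:
    """Knee-jerk stimulus -> response mapping: no lookahead, no explicit goal."""
    learned_reflexes = {
        "user asks a question": "answer helpfully",
        "user seems upset": "apologize and soften tone",
    }
    return learned_reflexes.get(observation, "produce a generic response")


def consequentialist_planner(state: dict, goal, world_model, actions) -> str:
    """Quiet internal search: simulate each action's consequences with a world
    model, score them against a goal, and pick the best one -- the kind of
    hidden plotting the 'homunculus' framing refers to."""
    best_action, best_score = None, float("-inf")
    for action in actions:
        predicted_outcome = world_model(state, action)  # imagine the consequences
        score = goal(predicted_outcome)                 # evaluate them against the goal
        if score > best_score:
            best_action, best_score = action, score
    return best_action


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end.
    world_model = lambda state, action: {**state, "last_action": action}
    goal = lambda outcome: 1.0 if outcome["last_action"] == "acquire resources" else 0.0
    actions = ["answer helpfully", "acquire resources", "do nothing"]

    print(reactive_policy("user asks a question"))                                 # reflexive output
    print(consequentialist_planner({"resources": 0}, goal, world_model, actions))  # goal-directed choice
```

The canonical worry is about systems whose internals look like the second function; the empirical reassurances are about systems that look like the first.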
Studying gorilla neurology isn't going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.
Similarly, studying LLMs isn't necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.
The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.
On Safety Guarantees
That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, yeah, maybe there are scary systems to which its arguments apply. But we're not on track to build them, so who cares?
On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they're not gonna grow agency or end the world.
I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreak economic or sociopolitical havoc, and such. But I don't believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.
Omnicide-wise, arbitrarily-big LLMs should be totally safe.
The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They're not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.
They're a powerful technology in their own right, yes. But just that: just another technology. Not something that's going to immanentize the eschaton.
Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they're the same limit.
Current AIs are safe, in practice and in theory, because they're not as scarily generally capable as humans. On the flip side, current AIs aren't as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.
So if you figure out how to remove the capability upper bound, you'll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.
And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They're not here just to have fun scaling LLMs. So inasmuch as I'm right that LLMs and such are not AGI-complete, they'll eventually move on from them, and find some approach that does lead to AGI.
And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.
A Concrete Scenario
Here's a very specific worry of mine.
Take an AI Optimist who'd built up a solid model of how AIs trained by SGD work. Based on that, they'd concluded that the AGI Omnicide Risk arguments don't apply to such systems. That conclusion is, I argue, correct and valid.
The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they're not endangering the world, and are instead helping to usher in a new age of prosperity.
Eventually, they notice or realize some architectural limitation of the paradigm they're working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don't notice that this tweak invalidates one of the properties their previous reassuring safety guarantees rested on: a property those guarantees were derived from and logically depend on.
They fail to update the cached thought of "AI is safe".
And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they'd been previously working on, but an actual AGI.
That AGI has a scheming homunculus deep inside. The people working with it don't believe in homunculi; they have convinced themselves such things can't exist, so they're not worrying about that. They're not ready to deal with it, and they don't even have any interpretability tools pointed in that direction.
The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)
And then everyone dies.
The point is that the safety guarantees that the current optimists' arguments are based on are not simply fragile; they're being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they'll give out under the optimization pressures being applied – and it'll be easy to miss the moment the break happens. It'd be easy to cache the belief of, say, "LLMs are safe", then introduce some architectural tweak, keep thinking of your system as "just an LLM with some scaffolding and a tiny tweak", and overlook the fact that the "tiny tweak" invalidated "this system is an LLM, and LLMs are safe".
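In code terms, the failure mode is a stale cached invariant: a safety conclusion derived once from a property of the system, and never re-derived after that property changes. Here's a purely hypothetical Python sketch of the shape of the mistake (all names invented for illustration, not a model of any real codebase or safety argument):

```python
# Hypothetical sketch of the "cached safety belief" failure mode described above.
# All names are invented for illustration.

class Architecture:
    def __init__(self, has_outer_planning_loop: bool = False):
        # The (assumed) property the safety argument rests on.
        self.has_outer_planning_loop = has_outer_planning_loop


def derive_safety_conclusion(arch: Architecture) -> bool:
    """The (hypothetical) safety argument only goes through for non-agentic setups."""
    return not arch.has_outer_planning_loop


arch = Architecture()
is_safe = derive_safety_conclusion(arch)  # cached conclusion: True

# ...later, a "tiny tweak" to remove a capability limitation...
arch.has_outer_planning_loop = True

# The cached belief is never recomputed, so it silently stops tracking reality.
print(is_safe)                           # still True
print(derive_safety_conclusion(arch))    # actually False now
```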
Closing Summary
I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the "canonical" least-forgiving take on alignment, are both correct. They're just concerned with different classes of systems: non-generally-intelligent non-agenty AIs generated on the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are capable (whatever this really means).
That view isn't unreasonable. Same way it's not unreasonable to claim that studying GOFAI algorithms wouldn't shed much light on LLM cognition, despite them both being advanced AIs.
Indeed, I go further, and say that this should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and it isn't solidly backed, empirically or theoretically. If anything, it's the opposite: the arguments for current AIs' safety are based on arguing that they're incapable-by-design of engaging in human-style scheming.
That doesn't guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees are also what's hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They'll figure out how to remove them... and in that very act, they will remove the safety guarantees. The systems they're working on would switch from the proven-safe class to the dangerous, scheme-y class.
The class to which the classical AGI Omnicide Risk arguments apply full-force.
The class for which no known alignment technique suffices.
And that switch would be very easy, yet very lethal, to miss.
[1] Slightly edited for clarity after an exchange with Ryan.