As an overly simplistic example, consider an overseer that attempts to train a cleaning robot by providing periodic feedback to the robot, based on how quickly the robot appears to clean a room; such a robot might learn that it can more quickly “clean” the room by instead sweeping messes under a rug.[15]
This doesn't seem concerning as human users would eventually discover that the robot has a tendency to sweep messes under the rug, if they ever look under the rug, and the developers would retrain the AI to resolve this issue. Can you think of an example that would be more problematic, in which the misbehavior wouldn't be obvious enough to just be trained away?
- GPT-3, for instance, is notorious for outputting text that is impressive, but not of the desired “flavor” (e.g., outputting silly text when serious text is desired), and researchers often have to tinker with inputs considerably to yield desirable outputs.
Is this specifically referring to the base version of GPT-3 before instruction fine-tuning (davinci rather than text-davinci-002, for example)? I think it would be good to clarify that.
This piece gives an overview of the alignment problem and makes the case for AI alignment research. It is crafted both to be broadly accessible to those without much background knowledge (not assuming any previous knowledge of AI alignment or much knowledge of AI) and to make the compositional logic behind the case very clear.
I expect the piece will be particularly appealing to those who are more reductive in their thinking and who want to better understand the arguments behind AI alignment (I’m imagining this audience includes people within the community doing field building, as well as people in various STEM fields who have only vaguely heard of AI alignment).
Crossposted to the AGI Safety Fundamentals website, with minor edits:
https://www.agisafetyfundamentals.com/alignment-introduction
This piece describes the basic case for AI alignment research, which is research that aims to ensure that advanced AI systems can be controlled or guided towards the intended goals of their designers. Without such work, advanced AI systems could potentially act in ways that are severely at odds with their designers’ intended goals. Such a situation could have serious consequences, plausibly even causing an existential catastrophe. In this piece, I elaborate on five key points to make the case for AI alignment research:
1 – Advanced AI is possible
By advanced AI, I mean, roughly speaking, AI systems capable of performing almost all cognitive work humans perform (e.g., able to substitute for scientists, CEOs, novelists, and so on).
Researchers disagree about the form of advanced AI that is most likely to be developed. Many of the most popular visions involve an “artificial general intelligence,” or “AGI” – a hypothetical AI system that could learn any cognitive task that a human can.[1] One possibility for advanced AI is a singular AGI that could outcompete human experts in most fields across their areas of expertise. Another possibility is an ecosystem of specialized AI systems that could collectively accomplish almost all cognitive work – some researchers speculate that this setup would involve multiple AGIs performing complex tasks alongside more narrow AI systems.[2] Researchers also disagree about whether advanced AI will more likely be developed using current AI methods like deep learning or via a future AI paradigm that has not been discovered yet.[3]
There is approximate consensus among almost all relevant experts, however, that advanced AI is at the very least physically possible. The human brain is itself an information-processing machine with the relevant capabilities, and thus it serves as proof that machines with such capabilities are possible; an AI system capable of the same cognitive tasks as the human brain would, by definition, be advanced AI.[4]
2 – Advanced AI might not be that far away
Below, I outline some reasons for thinking advanced AI might be achieved within the next few decades. Each argument has its limitations, and there is a lot of uncertainty; nevertheless, considering these arguments collectively, it seems appropriate to put at least decent odds (e.g., double-digit percent chance) on advanced AI becoming a reality within a few decades.
Many AI experts and generalist forecasters think it’s likely advanced AI will be developed within the next few decades:
Extrapolating AI capabilities plausibly suggests advanced AI within a few decades:
3 – Advanced AI might be difficult to direct
Current AI systems are often accidentally misdirected:
Advanced AI systems may similarly technically satisfy specifications in ways that violate what we actually want (i.e., the “King Midas” problem):
Training AI on human feedback may help address the above specification problems, but this introduces its own problems, including incentivizing misleading behavior:
Additionally, advanced AI systems may come to pursue proxies for goals that work well in training, but these proxies may break down during deployment:
To be clear, the above worries don’t imply that advanced AI wouldn’t be able to “understand” what we really wanted, but instead that this understanding wouldn’t necessarily translate to the AI systems acting in accordance with our wants:
It’s possible advanced AI will be built before we solve the above problems, or even without anyone really understanding the systems that are built:
4 – Poorly-directed advanced AI could be catastrophic for humanity
Our typical playbook regarding new technologies is to deploy them before tackling all potential major issues, then course correct them over time, solving problems after they crop up. For instance, modern seatbelts were not invented until 1951, 43 years after the Model T Ford’s introduction; consumer gasoline contained the neurotoxin lead for decades, before being phased out; etc.
With advanced AI, on the other hand, relatively early failures at appropriately directing these systems may preclude later course correction, possibly yielding catastrophe. This dynamic necessitates flipping the typical script – anticipating and solving problems sufficiently far ahead of time, so that our ability as humans to course correct is never extinguished.
As mentioned above, poorly-directed advanced AI systems may curtail humanity’s ability to course correct:
From there, the world could develop in unexpected and undesirable ways, with no recourse:
While the above worries may sound extreme, they are not particularly fringe among relevant experts who have examined the issue (though there is considerable disagreement among experts and not all share these concerns):
5 – There are steps we can take now to reduce the danger
To reduce the risks discussed above, two broad types of work are being pursued – developing technical solutions that enable advanced AI to be directed as its designers intend (i.e., technical AI alignment research) and other, nontechnical work geared towards ensuring these technical solutions are developed and implemented where necessary (this nontechnical work falls under the larger umbrella of AI governance[27]).
Some technical AI alignment research involves working with current AI systems to direct them towards desired goals, with the hope that insights transfer to advanced AI:
Other technical AI alignment research involves more theoretical or abstract work:
The next two paragraphs list two broad areas of technical AI alignment research – note that I’m listing these areas simply for illustrative purposes, and there are many more areas that I don’t list.
Understanding the inner workings of current black-box AI systems:
Developing methods for ensuring the honesty or truthfulness of AI systems:
See more: the online AI Alignment Curriculum from the AGI Safety Fundamentals program describes several further technical AI alignment research avenues in more detail, as does the paper Unsolved Problems in ML Safety.[31]
On the nontechnical side, several areas of AI governance are relevant for reducing misalignment risks from advanced AI, including work to:
See more: the AI Governance Curriculum from the AGI Safety Fundamentals program describes further areas of AI governance work in more detail.
Note that technical problems can sometimes take decades to solve, so even if advanced AI is decades away, it’s still reasonable to begin working on developing solutions now. Current technical AI alignment work is occurring in academic labs (e.g., at UC Berkeley's CHAI, among many other academic labs), in nonprofits and public benefit corporations (e.g., Redwood Research and Anthropic), and in industrial labs (e.g., DeepMind and OpenAI). A recent survey of top AI researchers, however, indicates most (69%) think society should prioritize “AI safety research”[33] either “more” or “much more” than currently.
It should be noted that some researchers view the concept of “general intelligence” as flawed and consider the term “AGI” to be either a misnomer at best or confused at worst. Nevertheless, in this piece we are concerned with the capabilities of AI systems, not whether such systems should be referred to as “generally intelligent,” so disagreement over the coherency of the term “AGI” doesn’t affect the arguments in this piece.
In this second scenario, different AGIs might specialize in a similar manner to how human workers specialize in the economy today.
A future paradigm could, for instance, be based on future discoveries in neuroscience.
The brain is a physical object, and its mechanisms of operation must therefore obey the laws of physics. In theory, these mechanisms could be described in a manner that a computer could replicate.
As of today's date: February 10, 2023.
E.g., in January 2020, back when conventional wisdom was that COVID would not become a huge deal, Metaculus instead predicted >100,000 people would eventually become infected with the disease.
E.g., Metaculus predicted a breakthrough in the computational biology technique of protein structure prediction, before DeepMind’s AI AlphaFold astounded scientists with its performance in this task.
Other examples where AI has recently made large strides include: conversing with humans via text, speech recognition, speech synthesis, music generation, language translation, driving vehicles, summarizing books, answering high school- or college-level essay questions, creative storytelling, writing computer code, scientific advancement, mathematical advancement, hardware advancement, mastering classic board games and video games, mastering multiplayer strategy games, doing any one task from a large number of unrelated tasks and switching flexibly between these tasks based on context, using robotics to interact with the world in a flexible manner, integrating cognitive subsystems via an “inner monologue,” etc.
Technically, this description is a slight simplification; GPT-3 was actually programmed to learn to predict the next “token” from a sequence of text, where a “token” would generally correspond to either a word or a portion of a word.
Depending on whether we extrapolate linearly or using an “S-curve,” most such tasks are implied to reach near-perfect performance with ~10^28 to ~10^31 computer operations of training. Assuming a $100M project, an extrapolation of 2.5 year doubling time in the price-performance of GPUs (computer chips commonly used in AI), and a current GPU computational cost of ~10^17 operations/$, such performance would be expected to be reached in 25 to 50 years. Note this extrapolation is highly uncertain; for instance, high performance on these metrics may not in actuality imply advanced AI (implying this estimate is an underestimate) or algorithmic progress may reduce necessary computing power (implying it’s an overestimate).
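The arithmetic behind the 25-to-50-year figure can be reproduced with a short sketch. It reads the footnote's figures as 10^28 to 10^31 required training operations and 10^17 operations per dollar today, and it assumes price-performance doubles smoothly every 2.5 years with a fixed $100M budget. The function name and its parameters are hypothetical, chosen only for illustration; this is a back-of-the-envelope check, not the method used in the underlying analysis.

```python
import math

def years_until_affordable(required_ops,
                           budget_dollars=1e8,        # $100M project (from the footnote)
                           ops_per_dollar_now=1e17,   # current GPU cost (from the footnote)
                           doubling_time_years=2.5):  # price-performance doubling time
    """Years until `required_ops` of training compute fits within the budget,
    assuming operations per dollar double every `doubling_time_years`."""
    ops_affordable_now = budget_dollars * ops_per_dollar_now
    if required_ops <= ops_affordable_now:
        return 0.0
    # Number of doublings needed to close the gap, times the length of each doubling.
    doublings_needed = math.log2(required_ops / ops_affordable_now)
    return doublings_needed * doubling_time_years

low = years_until_affordable(1e28)   # linear-extrapolation end of the range
high = years_until_affordable(1e31)  # S-curve end of the range
print(f"{low:.1f} to {high:.1f} years")  # prints "24.9 to 49.8 years"
```

Under these assumptions, $100M buys about 10^25 operations today, so the gap is a factor of 10^3 to 10^6, or roughly 10 to 20 doublings. Changing the doubling time or the budget shifts the answer proportionally to the number of doublings required, which is one way to see how sensitive the estimate is.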
The most powerful supercomputers today likely already have enough computing power to surpass that of the human brain. However, an arguably more important factor is the amount of computing power necessary to train an AI of this size (the amount of computing power necessary to train large AI systems typically far exceeds the computing power necessary to run such systems). One extensive report used a few different angles of attack to estimate the amount of computing power needed to train an AI system that was as powerful as the human brain, and this report concluded that such computing power would likely become economically available within the next few decades (with a median estimate of 2052).
This problem is known as “specification gaming” or “outer misalignment.”
E.g., “maximize profits,” if interpreted literally and outside a human lens, may yield all sorts of extreme psychopathic and illegal behavior that would deeply harm others for the most marginal gain in profit.
The general phenomenon at play here (sometimes referred to as “Goodhart’s law”) has many examples – in one classic-but-possibly-fictitious example, the British Empire put a bounty on cobras within colonial India (to try to reduce the cobra population), but some locals responded by breeding cobras to kill in order to collect the bounty, thus eventually leading to a large increase in the cobra population.
Similarly, attempts to train AI systems to not mislead their overseers (by punishing these systems for behavior that the overseer deems to be misleading) might instead train these systems to simply become better at deception so they don't get caught (for instance, only sweeping a mess under the rug when the overseer isn’t looking).
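To make the incentive concrete, here is a toy sketch of the cleaning-robot case. Every number in it (the observation probability, the speed rewards, the penalty) is a made-up assumption, and the policy names are hypothetical; it only illustrates how feedback that punishes *observed* rug-sweeping can favor sweeping when unobserved.

```python
# Toy illustration only: all numbers below are made-up assumptions.
P_OBSERVED = 0.2        # chance the overseer is watching a given episode
SPEED_REWARD = {"clean": 1.0, "sweep_under_rug": 3.0}  # faster "cleaning" looks better
CAUGHT_PENALTY = 10.0   # feedback penalty when sweeping is actually observed

SITUATIONS = ((True, P_OBSERVED), (False, 1 - P_OBSERVED))  # (watched?, probability)

def expected_feedback(policy):
    """Expected overseer feedback for a policy mapping watched -> action."""
    total = 0.0
    for watched, prob in SITUATIONS:
        action = policy[watched]
        reward = SPEED_REWARD[action]
        if watched and action == "sweep_under_rug":
            reward -= CAUGHT_PENALTY  # only observed sweeping is punished
        total += prob * reward
    return total

def fraction_truly_clean(policy):
    """Share of episodes in which the room is genuinely cleaned."""
    return sum(prob for watched, prob in SITUATIONS if policy[watched] == "clean")

POLICIES = {
    "always_clean": {True: "clean", False: "clean"},
    "always_sweep": {True: "sweep_under_rug", False: "sweep_under_rug"},
    "sweep_when_unobserved": {True: "clean", False: "sweep_under_rug"},
}

best = max(POLICIES, key=lambda name: expected_feedback(POLICIES[name]))
for name, policy in POLICIES.items():
    print(name, round(expected_feedback(policy), 2), round(fraction_truly_clean(policy), 2))
print("highest expected feedback:", best)
```

With these assumed numbers, the policy that sweeps only when unobserved earns the highest expected feedback while genuinely cleaning the room in only a fifth of episodes. The penalty removes the incentive to sweep while watched, but it does nothing about sweeping while unwatched.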
This problem is known as “goal misgeneralization” or “inner misalignment.”
Note the true story is somewhat more complicated, as evolution “trained” individuals to also support the survival and reproduction of their relatives.
As one simple example, we don’t want a video-game-playing AI to hack into its console to give itself a high score once it learns how to accomplish this feat.
For instance, understanding what we really mean when we use imprecise language.
At least insofar as AI can be said to “understand” anything.
The logic here is the AI may reason that if it defected in training, the overseer would simply provide negative feedback (which would adjust its internal processes) until it stopped defecting. Under such a scenario, the AI would be unlikely to be deployed in the world with its current goals, so it would presumably not achieve these goals. Thus, the AI may choose to instead forgo defecting in training so it might be deployed with its current goals.
It’s common for cutting-edge AI capabilities to move relatively quickly from matching human abilities in a domain to far surpassing human abilities in that domain (see: chess, Jeopardy!, and Go for high-profile examples). Alternatively, even if it takes a while for advanced AI capabilities to progress to far surpassing human abilities in the relevant domains, the worries sketched out below may still occur in a more drawn-out fashion.
In the same way that AI systems can now outcompete humans in chess and Go.
Such AI systems might guard against being shut down by using their social-persuasion or cyber-operation abilities. As just one example, these systems might initially pretend to be aligned with the interests of humans who had the ability to shut them off, while clandestinely hacking into various data centers to distribute copies of themselves across the internet.
Note that for many animals, the problem is not due to idiosyncrasies of human nature, but instead simply due to human interests steamrolling animal interests where interests collide (e.g., competing for land).
For instance, Toby Ord, a leading existential risk researcher at Oxford, estimates that “unaligned AI” is by far the most likely source of existential risk over the next 100 years – greater than all other risks combined.
AI governance also encompasses several other areas. For instance, it includes work geared towards ensuring advanced AI isn’t misused by bad actors who intentionally direct such systems towards undesirable goals. Such misuse may, in an extreme scenario, also constitute an existential risk (if it enables the permanent “locking-in” of an undesirable future order) – note this outcome would be conceptually distinct from the alignment failure modes described in this piece (which are “accidents” rather than “intentional misuse”), so such misuse cases are not covered in this piece.
Feedback on outward behavior may be inadequate for training AI systems away from deception, because a system that is being deceptive will generally behave outwardly in a manner designed not to appear deceptive.
Interestingly, these same systems are reasonably good at evaluating their own previous claims – that is, if they are asked to evaluate how likely a previous claim they made is to be accurate, they tend to give substantially higher probability of accuracy for claims that are in fact accurate compared to those that are inaccurate.
Honest AI may therefore make false claims if it had learned inaccurate information, but it would not generally make false claims on an issue where it had learned accurate information and assimilated this information into its “knowledge” of the world. (Note that researchers disagree about whether current AI systems should or should not be said to have “knowledge” in the sense that the word is commonly used, even setting aside the thorny issue of precisely defining the word “knowledge.”)
Note that the latter paper defines alignment research differently than I have – by my definition, most of the research avenues in that paper would be considered technical AI alignment research, even ones the paper does not classify within the section on “alignment.”
The more that various organizations feel they are in a competitive race towards advanced AI, the more pressure there may be for at least some of these organizations to cut corners to win the race.
The survey described “AI safety research” as having significant overlap with what I’m calling “technical AI alignment research.”