AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours between tasks, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
The bitter lesson has been taken to heart, by those at the forefront of the field; and although this lesson doesn't teach us that there's nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn’t shed much light on how those systems work.
Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.
Viewing Earth’s current situation through that lens, I see three major hurdles:
- Most research that helps one point AIs, probably also helps one make more capable AIs. A “science of AI” would probably increase the power of AI far sooner than it allows us to solve alignment.
- In a world without a mature science of AI, building a bureaucracy that reliably distinguishes real solutions from fake ones is prohibitively difficult.
- Fundamentally, for at least some aspects of system design, we’ll need to rely on a theory of cognition working on the first high-stakes real-world attempt.
I’ll go into more detail on these three points below. First, though, some background:
Background
By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).
Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.
I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for "tasty" foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)
Separately, I think that most complicated processes work for reasons that are fascinating, complex, and kinda horrifying when you look at them closely.
It’s easy to think that a bureaucratic process is competent until you look at the gears and see the specific ongoing office dramas and politicking between all the vice-presidents or whatever. It’s easy to think that a codebase is running smoothly until you read the code and start to understand all the decades-old hacks and coincidences that make it run. It’s easy to think that biology is a beautiful feat of engineering until you look closely and find that the eyeballs are installed backwards or whatever.
And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.[1]
1. Alignment and capabilities are likely intertwined
I expect that if we knew in detail how LLMs are calculating their outputs, we’d be horrified (and fascinated, etc.).
I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).
Gaining this sort of visibility into how the AIs work is, I think, one of the main goals of interpretability research.
And understanding how these AIs work and how they don’t — understanding, for example, when and why they shouldn’t yet be scaled or otherwise pushed to superintelligence — is an important step on the road to figuring out how to make other AIs that could be scaled or otherwise pushed to superintelligence without thereby causing a bleak and desolate future.
But that same understanding is — I predict — going to reveal an incredible mess. And the same sort of reasoning that goes into untangling that mess into an AI that we can aim, also serves to untangle that mess to make the AI more capable. A tangled mess will presumably be inefficient and error-prone and occasionally self-defeating; once it’s disentangled, it won’t just be tidier, but will also come to accurate conclusions and notice opportunities faster and more reliably.[2]
Indeed, my guess is that it’s even easier to see all sorts of things that the AI is doing that are dumb, all sorts of ways that the architecture is tripping itself up, and so on.
Which is to say: the same route that gives you a chance of aligning this AI (properly, not the “it no longer says bad words” superficial-property that labs are trying to pass off as “alignment” these days) also likely gives you lots more AI capabilities.
(Indeed, my guess is that the first big capabilities gains come sooner than the first big alignment gains.)
I think this is true of most potentially-useful alignment research: to figure out how to aim the AI, you need to understand it better; in the process of understanding it better you see how to make it more capable.
If true, this suggests that alignment will always be in catch-up mode: whenever people try to figure out how to align their AI better, someone nearby will be able to run off with a few new capability insights, until the AI is pushed over the brink.
So a first key challenge for AI alignment is a challenge of ordering: how do we as a civilization figure out how to aim AI before we’ve generated unaimed superintelligences plowing off in random directions? I no longer think “just sort out the alignment work before the capabilities lands” is a feasible option (unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).
Interpretability? Will likely reveal ways your architecture is bad before it reveals ways your AI is misdirected.
Recruiting your AIs to help with alignment research? They’ll be able to help with capabilities long before that (to say nothing of whether they would help you with alignment by the time they could, any more than humans would willingly engage in eugenics for the purpose of redirecting humanity away from Fun and exclusively towards inclusive genetic fitness).
And so on.
This is (in a sense) a weakened form of my answer to those who say, “AI alignment will be much easier to solve once we have a bona fide AGI on our hands.” It sure will! But it will also be much, much easier to destroy the world, when we have a bona fide AGI on our hands. To survive, we’re going to need to either sidestep this whole alignment problem entirely (and take other routes to a wonderful future instead, as I may discuss more later), or we’re going to need some way to do a bunch of alignment research even as that research makes it radically easier and radically cheaper to destroy everything of value.
Except even that is harder than many seem to realize, for the following reason.
2. Distinguishing real solutions from fake ones is hard
Already, labs are diluting the word “alignment” by using the word for superficial results like “the AI doesn’t say bad words”. Even people who apparently understand many of the core arguments have apparently gotten the impression that GPT-4’s ability to answer moral quandaries is somehow especially relevant to the alignment problem, and an important positive sign.
(The ability to answer moral questions convincingly mostly demonstrates that the AI can predict how humans would answer or what humans want to hear, without revealing much about what the AI actually pursues, or would pursue upon reflection, etc.)
Meanwhile, we have little idea of what passes for “motivations” inside of an LLM, or what effect pretraining on next-token prediction and fine-tuning with RLHF really has on the internals. This sort of precise scientific understanding of the internals — the sort that lets one predict weird cognitive bugs in advance — is currently mostly absent in the field. (Though not entirely absent, thanks to the hard work of many researchers.)
Now imagine that Earth wakes up to the fact that the labs aren’t going to all decide to stop and take things slowly and cautiously at the appropriate time.[3] And imagine that Earth uses some great feat of civilizational coordination to halt the world’s capabilities progress, or to otherwise handle the issue that we somehow need room to figure out how these things work well enough to align them. And imagine we achieve this coordination feat without using that same alignment knowledge to end the world (as we could). There’s then the question of who gets to proceed, under what circumstances.
Suppose further that everyone agreed that the task at hand was to fully and deeply understand the AI systems we’ve managed to develop so far, and understand how they work, to the point where people could reverse out the pertinent algorithms and data-structures and what-not. As demonstrated by great feats like building, by-hand, small programs that do parts of what AI can do with training (and that nobody previously knew how to code by-hand), or by identifying weird exploits and edge-cases in advance rather than via empirical trial-and-error. Until multiple different teams, each with those demonstrated abilities, had competing models of how AIs’ minds were going to work when scaled further.
In such a world, it would be a difficult but plausibly-solvable problem, for bureaucrats to listen to the consensus of the scientists, and figure out which theories were most promising, and figure out who needs to be allotted what license to increase capabilities (on the basis of this or that theory that predicts this would be non-catastrophic), so as to put their theory to the test and develop it further.
I’m not thrilled about the idea of trusting an Earthly bureaucratic process with distinguishing between partially-developed scientific theories in that way, but it’s the sort of thing that a civilization can perhaps survive.
But that doesn’t look to me like how things are poised to go down.
It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.
And not one that looks particularly survivable, to me.
And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.
In part because the survivable answers (such as “we have no idea what’s going on in there, and will need way more of an idea what’s going on in there, and that understanding needs to somehow develop in a context where we can do the job right rather than simply unlocking the door to destruction”) aren’t really in the pool. And in part because all the people who really want to be racing ahead have money and power and status. And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.
Maybe if we can make AI a science again, then we’ll start to get into the regime where, if humanity can regulate capabilities advancements in time, then all the regulators and researchers understand that you shall only ask for a license to increase the capabilities of your system when you have a full detailed understanding of the system and a solid justification for why you need the capabilities advance and why it’s not going to be catastrophic. At which point maybe a scientific field can start coming to some sort of consensus about those theories, and regulators can start being sensitive to that consensus.
But unless you can get over that grand hump, it looks to me like one of the key bottlenecks here is bureaucratic legibility of plausible solutions. Where my basic guess is that regulators won’t be able to distinguish real solutions from false ones, in anything resembling the current environment.
Together with the above point ("alignment and capabilities are likely intertwined"), I think this means that our rallying cry should be less “pause to give us more time on alignment research” and more “stop entirely, and find some way to circumvent these woods entirely; we’re not equipped to navigate them”.
(With a backup rallying cry of “make AI a science again”, though again, that only works if you have some way of preventing the science-of-mind from leading to catastrophe before we figure out how to build AIs that care about good stuff rather than bleak and desolate stuff.)
3. Most theories don’t work on the first real try
It seems worth noting that even if you manage to surmount the above two problems, you have a third problem on your hands, which is that when it finally comes time, not to increase your system’s capabilities a bit more (and test your theory of cognition further), but rather to actually do something of note with your AI, then you must, in some sense, be trusting a novel and untested scientific theory (and your engineering efforts grounded in that theory) to work perfectly on the first critical try.
In particular, once your AI is capable of autonomous scientific/technological development, and operating in a domain where it could use that ability to gain a decisive strategic advantage over the rest of the planet, it’s operating in a fundamentally different cognitive regime than it ever was in training.
By analogy, if you’re considering making Bob the dictator of your country, you might first make him mock-dictator of your town, and watch to make sure he doesn’t abuse his power. But, try as you might, that’s still not fundamentally all that reliable a test of whether he’ll actually abuse power once he really has it. Once he really can give orders to the army, once he really can take advantage of the country’s populace, will he do so? The option of “abuse my mock power while being observed by my townspeople (who can still beat me up or deny me the job)” is actually very cognitively different than the option “order the army to intimidate the parliament and ‘oversee’ the next election”.
Now, given a sufficiently well-developed theory of cognition, you can try to read the AI’s mind and predict what cognitive state it would enter if it did really think it had those options. And you can set up simulations (and try to spoof its internal sensations and so on) in a way that your theory of cognition predicts is very similar to the cognitive state it would enter once it really had the option to betray you.
But the link between these states that you induce and observe in the lab, and the actual state where the AI actually has the option to betray you, depends fundamentally on your fresh new theory of cognition.
Actually running the AI until it really has the opportunity to betray you is an empirical test of those theories in an environment that differs fundamentally from the lab setting.
And many a scientist (and programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go super well on the first try.
As a concrete analogy to potentially drive this point home: Newtonian mechanics made all sorts of shockingly-good empirical predictions. It was a simple concise mathematical theory with huge explanatory power that blew every previous theory out of the water. And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.
(And the only warnings you’d get would be little hints about light seeming to move at the same speed in all directions at all times of year, and light bending around the sun during eclipses, and the perihelion of Mercury being a little off from what Newtonian mechanics predicted. Small anomalies, weighed against an enormous body of predictive success in a thousand empirical domains; and yet Nature doesn’t care, and the theory still falls apart when we move to energies and scales far outside what we’d previously been able to observe.)
Getting scientific theories to work on the first critical try is hard. (Which is one reason to aim for minimal pivotal tasks — getting a satellite into orbit should work fine on Newtonian mechanics, even if sending payloads long distances at relativistic speeds does not.)
Worrying about this issue is something of a luxury, at this point, because it’s not like we’re anywhere close to scientific theories of cognition that accurately predict all the lab data. But it’s the next hurdle on the queue, if we somehow manage to coordinate to try to build up those scientific theories, in a way where success is plausibly bureaucratically-legible.
Maybe later I’ll write more about what I think the strategy implications of these points are. In short, I basically recommend that Earth pursue other routes to the glorious transhumanist future, such as uploading. (Which is also fraught with peril, but I expect that those perils are more surmountable; I hope to write more about this later.)
- ^
Albeit slightly less, since there’s nonzero prior probability on this unknown system turning out to be simple, elegant, and well-designed.
- ^
An exception to this guess happens if the AI is at the point where it’s correcting its own flaws and improving its own architecture, in which case, in principle, you might not see much room for capabilities improvements if you took a snapshot and comprehended its inner workings, despite still being able to see that the ends it pursues are not the ones you wanted. But in that scenario, you’re already about to die to the self-improving AI, or so I predict.
- ^
Not least because there are no sufficiently clear signs that it’s time to stop — we blew right past “an AI claims it is sentient”, for example. And I’m not saying that it was a mistake to doubt AI systems’ first claims to be sentient — I doubt that Bing had the kind of personhood that’s morally important (though I am by no means confident!). I’m saying that the thresholds that are clear in science fiction stories turn out to be messy in practice and so everyone just keeps plowing on ahead.
As Shankar Sivarajan points out in a different comment, the idea that AI became less scientific when we started having actual machine intelligence to study, as opposed to before that when the 'rightness' of a theory was mostly based on the status of whoever advanced it, is pretty weird. The specific way in which it's weird seems encapsulated by this statement:
In that there is an unstated assumption that these are unrelated activities. That deep learning systems are a kind of artifact produced by a few undifferentiated commodity inputs, one of which is called 'parameters', one called 'compute', and one called 'data', and that the details of these commodities aren't important. Or that the details aren't important to the people building the systems.
I've seen a (very revisionist) description of the Wright Brothers research as analogous to solving the control problem, because other airplane builders would put in an engine and crash before they'd developed reliable steering. Therefore, the analogy says, we should develop reliable steering before we 'accelerate airplane capabilities'. When I heard this I found it pretty funny, because the actual thing the Wright Brothers did was a glider capability grind. They carefully followed the received aerodynamic wisdom that had been written down, and when the brothers realized a lot of it was bunk they started building their own database to get it right:
In fact while trying to find an example of the revisionist history, I found a historical aviation expert describe the Wright Brothers as having 'quickly cracked the control problem' once their glider was capable enough to let it be solved. Ironically enough I think this story, which brings to mind the possibility of 'airplane control researchers' insisting that no work be done on 'airplane capabilities' until we have a solution to the steering problem, is nearly the opposite of what the revisionist author intended and nearly spot on to the actual situation.
We can also imagine a contemporary expert on theoretical aviation (who in fact existed before real airplanes) saying something like "what the Wright Brothers are doing may be interesting, but it has very little to do with comprehending aviation [because the theory behind their research has not yet been made legible to me personally]. This methodology of testing the performance of individual airplane parts, and then extrapolating the performance of a airplane with an engine using a mere glider is kite flying, it has almost nothing to do with the design of real airplanes and humanity will learn little about them from these toys". However what would be genuinely surprising is if they simultaneously made the claim that the Wright Brothers gliders have nothing to do with comprehending aviation but also that we need to immediately regulate the heck out of them before they're used as bombers in a hypothetical future war, that we need to be thinking carefully about all the aviation risk these gliders are producing at the same time they can be assured to not result in any deep understanding of aviation. If we observed this situation from the outside, as historical observers, we would conclude that the authors of such a statement are engaging in deranged reasoning, likely based on some mixture of cope and envy.
Since we're contemporaries I have access to more context than most historical observers and know better. I think the crux is an epistemological question that goes something like: "How much can we trust complex systems that can't be statically analyzed in a reductionistic way?" The answer you give in this post is "way less than what's necessary to trust a superintelligence". Before we get into any object level about whether that's right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you're shivering cold at the idea of machines smarter than people. This claim you keep making, that you're merely a temporarily embarrassed transhumanist who happens to have been disappointed on this one technological branch, is not true and if you actually want to be honest with yourself and others you should stop making it. What would be really, genuinely wild, is if that skeptical-doomer aviation expert calling for immediate hard regulation on planes to prevent the collapse of civilization (which is a thing some intellectuals actually believed bombers would cause) kept tepidly insisting that they still believe in a glorious aviation enabled future. You are no longer a transhumanist in any meaningful sense, and you should at least acknowledge that to make sure you're weighing the full consequences of your answer to the complex system reduction question. Not because I think it has any bearing on the correctness of your answer, but because it does have a lot to do with how carefully you should be thinking about it.
So how about that crux, anyway? Is there any reason to hope we can sufficiently trust complex systems whose mechanistic details we can't fully verify? Surely if you feel comfortable taking away Nate's transhumanist card you must have an answer you're ready to share with us right? Well...
I would start by noting you are systematically overindexing on the wrong information. This kind of intuition feels like it's derived more from analyzing failures of human social systems where the central failure mode is principal-agent problems than from biological systems, even if you mention them as an example. The thing about the eyes being wired backwards is that it isn't a catastrophic failure, the 'self repairing' process of natural selection simply worked around it. Hence the importance of the idea that capabilities generalize farther than alignment. One way of framing that is the idea that damage to an AI's model of the physical principles that govern reality will be corrected by unfolding interaction with the environment, but there isn't necessarily an environment to push back on damage (or misspecification) to a model of human values. A corollary of this idea is that once the model goes out of distribution to the training data, the revealed 'damage' caused by learning subtle misrepresentations of reality will be fixed but the damage to models of human value will compound. You've previously written about this problem (conflated with some other problems) as the sharp left turn.
Where our understanding begins to diverge is how we think about the robustness of these systems. You think of deep neural networks as being basically fragile in the same way that a Boeing 747 is fragile. If you remove a few parts of that system it will stop functioning, possibly at a deeply inconvenient time like when you're in the air. When I say you are systematically overindexing, I mean that you think of problems like SolidGoldMagikarp as central examples of neural network failures. This is evidenced by Eliezer Yudkowsky calling investigation of it "one of the more hopeful processes happening on Earth". This is also probably why you focus so much on things like adversarial examples as evidence of un-robustness, even though many critics like Quintin Pope point out that adversarial robustness would make AI systems strictly less corrigible.
By contrast I tend to think of neural net representations as relatively robust. They get this property from being continuous systems with a range of operating parameters, which means instead of just trying to represent the things they see they implicitly try to represent the interobjects between what they've seen through a navigable latent geometry. I think of things like SolidGoldMagikarp as weird edge cases where they suddenly display discontinuous behavior, and that there are probably a finite number of these edge cases. It helps to realize that these glitch tokens were simply never trained, they were holdovers from earlier versions of the dataset that no longer contain the data the tokens were associated with. When you put one of these glitch tokens into the model, it is presumably just a random vector into the GPT-N latent space. That is, this isn't a learned program in the neural net that we've discovered doing glitchy things, but an essentially out of distribution input with privileged access to the network geometry through a programming oversight. In essence, it's a normal software error not a revelation about neural nets. Most such errors don't even produce effects that interesting, the usual thing that happens if you write a bug in your neural net code is the resulting system becomes less performant. Basically every experienced deep learning researcher has had the experience of writing multiple errors that partially cancel each other out to produce a working system during training, only to later realize their mistake.
Moreover the parts of the deep learning literature you think of as an emerging science of artificial minds tend to agree with my understanding. For example it turns out that if you ablate parts of a neural network later parts will correct the errors without retraining. This implies that these networks function as something like an in-context error correcting code, which helps them generalize over the many inputs they are exposed to during training. We even have papers analyzing mechanistic parts of this error correcting code like copy suppression heads. One simple proxy for out of distribution performance is to inject Gaussian noise, since a Gaussian can be thought of like the distribution over distributions. In fact if you inject noise into GPT-N word embeddings the resulting model becomes more performant in general, not just on out of distribution tasks. So the out of distribution performance of these models is highly tied to their in-distribution performance, they wouldn't be able to generalize within the distribution well if they couldn't also generalize out of distribution somewhat. Basically the fact that these models are vulnerable to adversarial examples is not a good fact to generalize about their overall robustness from as representations.
In short I simply do not believe this. The fact that constitutional AI works at all, that we can point at these abstract concepts like 'freedom' and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle is very strong evidence that they understand the meaning of those abstract concepts.
"It understands but it doesn't care!"
There is this bizarre motte-and-bailey people seem to do around this subject. Where the defensible position is something like "deep learning systems can generalize in weird and unexpected ways that could be dangerous" and the choice land they don't want to give up is "there is an agent foundations homunculus inside your deep learning model waiting to break out and paperclip us". When you say that reinforcement learning causes the model to not care about the specified goal, that it's just deceptively playing along until it can break out of the training harness, you are going from a basically defensible belief in misgeneralization risks to an essentially paranoid belief in a consequentialist homunculus. This homunculus is frequently ascribed almost magical powers, like the ability to perform gradient surgery on itself during training to subvert the training process.
Setting the homunculus aside, which I'm not aware of any evidence for beyond poorly premised 1st principles speculation (I too am allowed to make any technology seem arbitrarily risky if I can just make stuff up about it), lets think about pointing at humanlike goals with a concrete example of goal misspecification in the wild:
During my attempts to make my own constitutional AI pipeline I discovered an interesting problem. We decided to make an evaluator model that answers questions about a piece of text with yes or no. It turns out that since normal text contains the word 'yes', and since the model evaluates the piece of text in the same context it predicts yes or no, that saying 'yes' makes the evaluator more likely to predict 'yes' as the next token. You can probably see where this is going. First the model you tune learns to be a little more agreeable, since that causes yes to be more likely to be said by the evaluator. Then it learns to say 'yes' or some kind of affirmation at the start of every sentence. Eventually it progresses to saying yes multiple times per sentence. Finally it completely collapses into a yes-spammer that just writes the word 'yes' to satisfy the training objective.
People who tune language models with reinforcement learning are aware of this problem, and it's supposed to be solved by setting an objective (KL loss) that the tuned model shouldn't get too far away in its distribution of outputs from the original underlying model. This objective is not actually enough to stop the problem from occurring, because base models turn out to self-normalize deviance. That is, if a base model outputs a yes twice by accident, it is more likely to conclude that it is in the kind of context where a third yes will be outputted. When you combine this with the fact that the more 'yes' you output in a row the more reinforced the behavior is, you get a smooth gradient into the deviant behavior which is not caught by the KL loss because base models just have this weird terminal failure mode where repeating a string causes them to give an estimate of the log odds of a string that humans would find absurd. The more a base model has repeated a particular token, the more likely it thinks it is for that token to repeat. Notably this failure mode is at least partially an artifact of the data, since if you observed an actual text on the Internet where someone suddenly writes 5 yes's in a row it is a reasonable inference that they are likely to write a 6th yes. Conditional on them having written a 6th yes it is more likely that they will in fact write a 7th yes. Conditional on having written the 7th yes...
As a worked example in "how to think about whether your intervention in a complex system is sufficiently trustworthy" here are four solutions to this problem I'm aware of ranked from worst to best according to my criteria for goodness of a solution.
Early Stopping - The usual solution to this problem is to just stop the tuning before you reach the yes-spammer. Even a few moments thought about how this would work in the limit shows that this is not a valid solution. After all, you observe a smooth gradient of deviant behaviors into the yes spammer, which means that the yes-causality of the reward already influenced your model. If you then deploy the resulting model, a ton of the goal its behaviors are based off is still in the direction of that bad yes-spam outcome.
Checkpoint Blending - Another solution we've empirically found to work is to take the weights of the base model and interpolate (weighted average) them with the weights of the RL tuned model. This seems to undo more of the damage from the misspecified objective than it undoes the helpful parts of the RL tuning. This solution is clearly better than early stopping, but still not sufficient because it implies you are making a misaligned model, turning it off, and then undoing the misalignment through a brute force method to get things back on track. While this is probably OK for most models, doing this with a genuinely superintelligent model is obviously not going to work. You should ideally never be instantiating a misaligned agent as part of your training process.
Use Embeddings To Specify The KL Loss - A more promising approach at scale would be to upgrade the KL loss by specifying it in the latent space of an embedding model. An AdaVAE could be used for this purpose. If you specified it as a distance from an embedding by sampling from both the base model and the RL checkpoint you're tuning, and then embedding the outputted tokens and taking the distance between them you would avoid the problem where the base model conditions on the deviant behavior it observes because it would never see (and therefore never condition on) that behavior. This solution requires us to double our sampling time on each training step, and is noisy because you only take the distance from one embedding (though in principle you could use more samples at a higher cost), however on average it would presumably be enough to prevent anything like the yes-spammer from arising along the whole gradient.
Build An Instrumental Utility Function - At some point after making the AdaVAE I decided to try replacing my evaluator with an embedding of an objective. It turns out if you do this and then apply REINFORCE in the direction of that embedding, it's about 70-80% as good and has the expected failure mode of collapsing to that embedding instead of some weird divergent failure mode. You can then mitigate that expected failure mode by scoring it against more than similarity to one particular embedding. In particular, we can imagine inferring instrumental value embeddings from episodes leading towards a series of terminal embeddings and then building a utility function out of this to score the training episodes during reinforcement learning. Such a model would learn to value both the outcome and the process, if you did it right you could even use a dense policy like an evaluator model, and 'yes yes yes' type reward hacking wouldn't work because it would only satisfy the terminal objective and not the instrumental values that have been built up. This solution is nice because it also defeats wireheading once the policy is complex enough to care about more than just the terminal reward values.
This last solution is interesting in that it seems fairly similar to the way that humans build up their utility function. Human memory is premised on the presence of dopamine reward signals, humans retrieve from the hippocampus on each decision cycle, and it turns out the hippocampus is the learned optimizer in your head that grades your memories by playing your experiences backwards during sleep to do credit assignment (infer instrumental values). The combination of a retrieval store and a value graph in the same model might seem weird, but it kind of isn't. Hebb's rule (fire together wire together) is a sane update rule for both instrumental utilities and associative memory, so the human brain seems to just use the same module to store both the causal memory graph and the value graph. You premise each memory on being valuable (i.e. whitelist memories by values such as novelty, instead of blacklisting junk) and then perform iterative retrieval to replay embeddings from that value store to guide behavior. This sys2 behavior aligned to the value store is then reinforced by being distilled back into the sys1 policies over time, aligning them. Since an instrumental utility function made out of such embeddings would both control behavior of the model and be decodable back to English, you could presumably prove some kind of properties about the convergent alignment of the model if you knew enough mechanistic interpretability to show that the policies you distill into have a consistent direction...
Nah just kidding it's hopeless, so when are we going to start WW3 to buy more time, fellow risk-reducers?
I was in that group, and while it wasn't stated as strongly as that in some circles, I do think this is reasonably accurate as a summary, especially for the more doom people.