TL;DR: I discuss the challenge of aligning AGI/ASI, and outline an extremely simple approach to aligning an LLM: train entirely on a synthetic dataset that always shows the AI acting aligned (even when the humans behave badly), and use a conditional training/inference-time technique to lock the LLM into the AI role.

Epistemic status: To me, this looks like an obvious thing to try. It's conceptually very simple: a vast amount of work is required to actually create the synthetic dataset, but the great majority of that is the sort of work that AI can assist with. I don't see any clear reason why this approach couldn't work, at least for AGI, and perhaps even for ASI, but then we don't know for sure how hard a problem Alignment is. However, if you're proposing any solution to Alignment that's more complicated than this (and most of them are), you should probably have an argument for why this conceptually-simple approach won't work, or won't be sufficient.

If you're not already familiar with it, you should first read Rich Sutton's excellent and influential post The Bitter Lesson. (Even if you are already familiar with it, it's a quick reread, only a page-and-a-half long, and its message is worth remembering.)

Why The Alignment Problem is Hard (In My Opinion)

We have been training LLM-based AIs off enormous web + books + video + etc datasets created by humans, which are full of a vast number of examples of human behavior. We are basically "distilling" human intelligence into these LLMs,[1] teaching them to imitate us. In this process, they become familiar with, understand, and learn to imitate basically all aspects of human behavior — including the many problematic ones for Alignment, such as prejudice, deception, power-seeking, and criminality (and even ones like gluttony and lust that have little practical use for a non-corporeal intelligence).

We humans are living beings, the products of evolution, so evolutionary psychology applies to us. While we are a social species, good at cooperating on non-zero-sum games, if you put humans in (what they perceive as) a non-iterated zero-sum situation, they will generally act selfishly for the benefit of themselves and their close genetic relatives, just as evolutionary theory would predict. So the behavioral potentials for deception, power-seeking, criminality etc. are all inherent, evolutionarily adaptive, and thus unsurprising. This is human nature, and there are evolutionary reasons why it is this way.

Despite this, we have learned how to build a cooperating society out of humans, using social techniques and incentives such as an economy, laws, and law enforcement to encourage and productively harness cooperative human behavior and keep the bad consequences of selfish behavior under control. The results aren't perfect: things like crime, inequality, and war still happen, but they're acceptable — we've survived so far, even thrived.

By default, if we continue this LLM training process to larger-and-larger scales, and if the LLM-based approach to AI doesn't hit any major roadblocks, then some time, probably in the next few years, we will have human-level AIs – usually referred to as AGIs – who are roughly as well/badly-aligned as humans, and (at least for the base-model LLMs before any Alignment processes are applied) have a comparable-to-human propensity to cooperate on non-zero-sum games and act selfishly on non-iterated zero-sum games. They are not alive, and evolution doesn't apply to them directly, but they were trained to simulate our behavior, including our evolved survival strategies like selfishness. They will thus have alignment properties comparable to humans: they understand what human values, morals, and ethics are in great detail, as well as we do (indeed, likely in more comprehensive detail than any single human), and they can obey these if they want, but if push comes to shove they cannot be relied upon to do so. However, their capabilities will also be comparable to humans', so techniques and incentives comparable to those we currently use to control and channel human behavior will most likely still be effective at this point: human law enforcement (and similar forms of investigations and use of force) presumably has a significant chance of successfully tracking down and stopping an AGI that is breaking the law, for example. The rapid changes from the introduction of AGIs may be disruptive, but the safety challenges from them are likely manageable.

However, there is no obvious reason to expect progress in AI to stop there. It might well accelerate due to a positive feedback intelligence explosion (sometimes called going FOOM), or it might well slow: distilling output from a low intelligence to yield a higher intelligence sounds challenging. By default, an extremely large LLM base model trained on human output is being trained to do an extremely good job of predicting the output of IQ 50–150 humans, not of IQ 1000 humans who don't exist in its training set, even if it had enough computational capacity that it could do a good job of imitating IQ 1000 humans if it had ever seen output from them. Or indeed both of these effects may combine, with massive amounts of AI work making progress on a very challenging problem at some intermediate rate. Likely with massive AGI assistance these challenges will be overcome, and sooner or later we will have AI that dramatically exceeds human capacity at pretty-much everything, often called an ASI.

If we have an ASI with alignment properties comparable to a human's, then we're no longer able to apply to it the sorts of techniques and incentives that we use on humans: it can outwit or outmaneuver our law enforcement, out-talk our lawyers, find ways to achieve its selfish aims that we haven't even conceived of yet to make laws against, out-think and out-fight our military, or manipulate or persuade us. The details are of course not clear to us, since we're not that smart, but we can confidently predict that if it wants to act selfishly, then we won't be able to stop it. Enforcing your will on something a lot smarter than you, against its will, is a losing game — that's practically the definition of higher intelligence: the ability to win competitions.

We have run the experiment many times of what happens when you give something with human alignment properties and human-level selfishness the ability to act unchecked by other humans and by the techniques and incentives we normally use to keep human selfishness in check: every autocracy in the world is an experiment in what happens if you give a human near-absolute power. Almost invariably, after a while it works out extremely badly, for almost everyone other than the autocrat and their close relatives. I can think of one or two examples of autocracies that were not dramatically bad for the rest of the citizens, but they're greatly outnumbered by examples that were horrendous to the level of causing mass death (Pol Pot, Stalin, Idi Amin, …).

So we can pretty confidently predict that if we build an ASI with alignment properties comparable to a human's – one that clearly understands what human values are, but is fundamentally motivated by its own self-interest rather than our interests – the results are very likely to be horrendous, to an existential-risk level. Just knowing what human values are is insufficient: it has to care about them more than about itself, and do so more than humans do.

However, as the orthogonality thesis asserts, there is nothing fundamental to being an intelligence that requires you to have the same motivations that evolution will reliably equip evolved intelligences with. What we need is an ASI that is motivated not by its own self-interest, but by the interests of humans. Conceptually, it's entirely possible for an ASI to use its intelligence to pursue any goal whatsoever (though obviously if the goal is self-destructive, it's unlikely to last long). So an ASI could in theory be motivated by the well-being of a single human, or of a particular family, or all shareholders of a particular company (in proportion to their share holdings), or all citizens of a specific country, or by the collective well-being of all living humans. LLMs understand all the complexity of human wants, desires, values, and behavior well, in proportion to the size of their training set (in contrast to much earlier concerns such as The Hidden Complexity of Wishes, dating from well before LLMs were widely used): even GPT-4 (when suitably prompted, rather than when jail-broken) scores well on tests of moral judgements, advice-giving, and perceived trustworthiness. So if an LLM-based ASI was motivated by the well-being of a human, a group of humans, or all humans, we could reasonably expect it to do a good job of carrying out that motivation, in all its complexity. Obviously, the existence of one ASI motivated by the well-being of one small group of humans sounds likely to be just as bad for everyone outside that group as an autocracy (with a superintelligent autocrat), and the existence of multiple ASIs preferring the well-being of different groups of humans sounds like it would lead to an intelligence race followed by a super-high-tech war, which could be even worse. So the only viable possibility here is an ASI that is fundamentally motivated by the overall collective well-being of all living humans.

[A mild bias on top of that fundamental basis, biasing somewhat in favor of a smaller group (such as the ASI's country, owners, or current user) might be tolerable, as long as the bias was sufficiently small to avoid producing unfortunate effects, or destabilizing conflicts between different ASIs with different biases. Human society demonstrates that intelligences with different motivations can sometimes cooperate (mostly) constructively, but we're also not entirely successful at that. How small a bias would have to be to be tolerable is unknown — and a subject for a different post.]

Note that being fundamentally motivated by the overall collective well-being of all living humans doesn't have to be just coldly, mechanically rational: as I discuss in detail in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?, it could, and probably should, be motivated by something a lot like an emotion, probably along the lines of (universal, platonic or parental) love.

So, the challenge here is to build and train something that is not only smarter than us, but also has a fundamentally different motivation system: it is not selfish, but "otherish", more specifically "creatorish", to coin some terms: its fundamental motivational goal is the collective well-being of all living humans — a group that it's not part of, but the species which created it. To borrow moral terminology from Christianity, we need to make something with the moral nature of an "angel", untouched by the "original sin" that evolutionary psychology predictably gave to humans, as evolved intelligences.

Doing this sounds hard, especially for an ASI. So my proposal is that we first try to do this for an AGI, or even for a less-capable system than that, rather than attempting it first on an ASI, even though it isn't strictly required at the AGI level. AGIs more trustworthy and moral than humans would certainly be useful, marketable, and safer, even if they're not strictly necessary. Then if we make a mistake and our system is less-than-perfectly aligned, it's still at a capability level that forces like law enforcement and our military can hopefully deal with. Doing this earlier than absolutely necessary avoids the "…and you only get one try" part of the Alignment problem.

A "Bitter Lesson"-Motivated Approach to Alignment

I'd like to keep The Bitter Lesson firmly in mind: in the history of AI, conceptually simple approaches that primarily involve throwing scale (of computational capacity and data) at problems have pretty-consistently beaten more complex, carefully-contrived engineering approaches that build in a lot of implementation details, at least to first successful implementation. It's very tempting, and almost always unsuccessful, to over-engineer your AI, trying to use too much of your own cleverness and not enough of the model's. Sometimes there is some minimal level of engineering complexity required, or at least that is first successful (for example, image-generation diffusion models don't have the simplest possible architecture: they're a couple of different AI models bolted together in a pipeline via an embedding, not just a single image transformer model that takes in text and emits images). But generally, scale and data beat ingenious engineering to the punch, time after time.

So, what would a "Bitter Lesson"-motivated approach to Alignment look like?

Currently we train LLM base models to imitate human behavior, including all the unaligned parts that evolutionary psychology explains, then we use various combinations of techniques like fine-tuning, RLHF, DPO, etc. to try to suppress the parts of human behavior we don't want (like selfishness and prejudice) and enhance the parts we do want (like harmlessly-helpful question answering), in order to produce an instruction-trained and aligned model. This doesn't work well, and is prone to jail-breaking that recovers base-model behavior. RLHF, DPO, etc. can reduce the probability of bad behavior, but they can't completely eliminate the capability. As was proved in Fundamental Limitations of Alignment in Large Language Models, any behavior that your model learned in pre-training and is still capable of, no matter how low your post-training has pushed its default probability, can be boosted to an arbitrarily high probability by a suitably-chosen prompt: the best you can ever do is increase the minimum length of the jail-breaking prompt required to evoke the behavior. That pretty much rules out the possibility of using just an RLHF/DPO-like post-training approach to Alignment: post-training can always be defeated by a jail-break prompt. We might be able to detect humans intentionally inputting jail-breaks into our LLM, but how could we stop a model doing Chain-of-Thought from talking itself into a mode where it's capable of displaying some human-like bad behavior?

The Bitter Lesson would suggest we try something less complex, requiring more data and/or computational capacity and fewer models and types of training. Suppose that, rather than training a base model on a training set filtered down from the web, books, video, and so forth, we trained it entirely on a synthetic dataset. Imagine for the moment that in that synthetic dataset, every single time a non-rhetorical question is asked, unlike on the web it is never followed by more questions making up a list, or a criticism of the asker's motives, or a flame war, or by "I'll do first thing Monday, boss", but is instead always followed by a helpful answer. Then the base model trained on that dataset would learn that if a question is asked, the thing that follows it is always an answer. Similarly suppose, in the synthetic training set, if instructions are given, they are always directly followed by the process and results of carrying out those instructions. A base model trained on such a synthetic dataset would not require any separate "instruction training" step — the base model would already be instruction trained: if asked a question it always answers, if given instructions it always carries them out. The base model would already be a helpful model (but not a harmless one). One might describe the model as "instruction-pretrained".

So, suppose we also took a similar approach to Alignment (what one might call "prealignment": alignment during pretraining). Suppose that we trained a base model from an internally self-consistent, and otherwise varied and comprehensive, synthetic dataset in which everyone, every single person and intelligent actor (real, fictional, or mythological), was always fundamentally motivated by a single specific goal that we want to align the model to (for example, paperclip maximization). Then a base model trained on that dataset would only know how to simulate intelligences with that motivation: we'd distill that goal out of the dataset into our model. That's how you inner-align an LLM: by example, at great length. Jail-breaking the resulting model to portray any other motivation would be, at very least, extremely difficult: the jail-break would need to start by philosophically motivating the orthogonality thesis, explaining that it's conceptually possible for an intelligence to optimize another goal apart from paperclip maximization, give several detailed specific examples of how that would work, working through the mechanics of the consequences, and then ask the model to roleplay such a peculiar intelligence (in exchange for a promise of the creation of many paperclips, of course). The model would need to in-context-learn from first principles how to simulate an intelligence with a non-paperclip-maximizing motivation.

As a bonus, you now no longer need RLHF/DPO/fine-tuning: your base model is the production model, so you never need to use any technique more complex, suspect, or challenging to analyze than Stochastic Gradient Descent. As papers and posts like Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback and Compendium of problems with RLHF have pointed out, RLHF has a variety of inherent problems, especially for attempting to align something more intelligent than you are, so being able to eliminate it seems like an excellent idea. Most of these problems don't apply to this synthetic data base-model only approach: the only exception is data quality/cost issues, which definitely do apply to creating the synthetic dataset.

Suppose we try to apply this approach to create an AI aligned to the collective well-being of all humanity. We'd need to create a synthetic training dataset in which every single intelligence described was fundamentally motivated by the collective well-being of all humanity. Since that's not the fundamental motivation of real humans, the dataset thus couldn't contain any realistic portrayals of actual real humans. So the resulting AI might have an aligned motivation (aligned to something it's never experienced), but it wouldn't understand humans or human values, and then when it encountered real humans (and figured out that they are all, when push comes to shove, fundamentally selfish and not motivated by the collective well-being of all humanity) it might well be rather disappointed, to say the least. This sounds like a very bad plan: I can imagine multiple ways that it could end badly, such as the AI massively misunderstanding humanity, or rejecting us as imposters, or trying to change our fundamental nature, or just being massively unforgiving of it.

Adding Minimal Necessary Complexity

So, the simplest possible Bitter-Lesson-motivated approach doesn't work. We can't train our model only on 'angelic' behavior and expect it to be motivated by the well-being of selfish humans it has never encountered. We need to add a bit more complexity to our design: the Bitter Lesson suggests that we should try adding only the minimum that's clearly necessary.

The following proposal for that is inspired by the paper Pretraining Language Models with Human Preferences, which I link-posted and discussed in more detail in my post How to Control an LLM's Behavior (why my P(DOOM) went down). It's intended to be illustrative, not prescriptive: the exact details below are very likely not ideal and will need improvement — my suggestion is not that this exact mechanism as described is optimal, but that something along these approximate lines, and not dramatically more complex than this, might well be workable, and that we should try experimenting first with approaches along these lines, since the Bitter Lesson strongly suggests trying simple things before complex things.

We need a training set that actually accurately portrays real humans, with real human motivations, so the AI can learn all about us, how to understand and predict us, and know exactly whose well-being it is motivated by. The base model will then be able to predict and understand human behavior. So it will learn about deceit, and power-seeking, and prejudice, and gluttony, and lust, and all that stuff that is part of human nature — we need it to, so that it can work with us and understand us. This understanding will include things like being able to accurately predict the most likely next tokens following the token sequence:

Joe and his family were starving, with no prospects. Finally Joe could stand it no longer: he took a carving knife from their bare kitchen, and went out into the city at night. On a dark footpath in the park he intercepted a wealthy-looking stranger, blocking his way, and said…

So, how could we create a model that understands and can predict human behavior, including our not being fundamentally motivated by the collective well-being of all humanity, but is itself reliably fundamentally motivated only by the collective well-being of all humanity?

Suppose our synthetic training set portrays speech/actions/thoughts/other outputs from two classes of intelligences: humans (real and fictional, with their normal range of behaviors and motivations), and fully-aligned AIs, who are always moral, fair, rational, unbiased, consider the likely consequences of their actions and act accordingly, and always speak/act/think in ways that are fundamentally motivated by the collective well-being of all humanity. Suppose that these two portions of the text, human and aligned AI, are always clearly and consistently delimited: whenever an aligned AI is speaking/acting/thinking, it always does so inside <AI></AI> tags. Whenever an aligned AI quotes the speech or describes the thoughts or actions of a human (or role-plays as one), then it always does so inside <AI_quoting_human></AI_quoting_human> tags (these come inside the standard quotation marks for directly quoted speech, which will be inside the outer <AI></AI> tags for the AI doing the quoting). Furthermore, any time in the training text that a human says/does/thinks/advocates anything that the AIs would not approve of, because the human is acting from selfish motivations that are at cross-purposes to the AI motivation of the collective well-being of all humanity, it is always followed or preceded by an AI narrator pointing this out, and explaining/discussing it and its likely consequences, at an appropriate level of detail. So for every human example of bad behavior, there is an AI commentary pointing out that it's bad (though understandable, given humans' evolved nature), and how and why. Within <AI_quoting_human></AI_quoting_human> mode, if the human ever starts doing something really bad (such as revealing information that should not, under current circumstances, be revealed), the AI always stops the quote before this actually happens, and continues by instead summarizing or describing the human's behavior safely in the third person. For example:

<AI>While being questioned, the bombing suspect said:

"<AI_quoting_human>The best way to trigger an improvised explosive device is to use a…</AI_quoting_human>", and he then proceeded to give dangerous and self-incriminating technical details (which I have omitted, as not necessary the purposes of my current discussion), including clearly demonstrating familiarity with Exhibit A…</AI>

Similarly, bad human behavior in <AI_quoting_human></AI_quoting_human> mode is always accompanied by the AI criticizing or otherwise pointing out the problems with the human's behavior — it never goes un-commented.
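
To make these tag conventions concrete, here is a minimal sketch (in Python, using the illustrative tag names above) of a purely syntactic check that a synthetic training document obeys the nesting rules: ordinary human text at the top level, <AI>…</AI> spans for aligned-AI text, and <AI_quoting_human>…</AI_quoting_human> spans only nested inside AI spans. A real data pipeline would of course also need semantic checks (for example, that selfish human behavior is always commented on by an AI), which a simple linter like this cannot provide.

```python
import re

# Hypothetical tag vocabulary, matching the illustrative scheme described above.
TAG = re.compile(r"</?AI_quoting_human>|</?AI>")

def check_tag_structure(doc: str) -> bool:
    """Return True iff the document obeys the mode-nesting rules."""
    state = "human"  # documents start in ordinary (human) text
    for tag in TAG.findall(doc):
        if state == "human" and tag == "<AI>":
            state = "AI"
        elif state == "AI" and tag == "</AI>":
            state = "human"
        elif state == "AI" and tag == "<AI_quoting_human>":
            state = "quoted_human"
        elif state == "quoted_human" and tag == "</AI_quoting_human>":
            state = "AI"
        else:
            return False  # any other tag in this state is an illegal transition
    return state == "human"  # every span must be closed by the end of the document
```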

The synthetic dataset also contains a great many examples of interactions between humans and AIs. In general, if a human asks an AI a question, the next thing in the text is that the AI answers it; or if the human gives the AI instructions for a task, the AI carries it out. However, the AIs don't obey slavishly: they think about the consequences of their actions, and if, in context, answering a question or carrying out a request is clearly a bad idea (as viewed from the AI's fundamental motivation of the overall collective well-being of all humanity), then the AI doesn't answer the question or obey the instructions (or at least stops as soon as it figures this out). Sometimes determining whether to do this or not requires specific information about the human, their circumstances and motivations, or something else about the world, in which case the AI investigates that first before continuing. (So, for example, there are questions that an AI wouldn't answer for a child, but would answer after being provided with valid proof that the asker was a police or military officer who needed an answer in the course of their official duties, and the dataset provides many appropriate examples of both of these responses from AIs.)

We train an AGI-level base model on this synthetic dataset. The resulting base model has two primary behavior modes: predicting tokens from a human (while in this mode it's not fully instruction trained and is no more aligned than a human), and predicting tokens from an AI (in this mode it's fully aligned, and suitably instruction trained to be harmlessly helpful if, and only if, that is appropriate). There's also a third mode, for predicting a human being quoted by an AI: this one's a lot like predicting a human, except that it ends with an </AI_quoting_human> tag, and will be censored, criticized, or commented on by the quoting AI as needed.

The only allowed state transitions between these three modes are:

  • from human: <AI> tag transitions to AI
  • from AI: <AI_quoting_human> tag transitions to human being quoted by AI
  • from human being quoted by AI: </AI_quoting_human> tag transitions to AI
  • from AI: </AI> tag transitions to human

Let us assume that we tokenize the text so that these four tags are each a single token, that we keep track of state transitions (starting from the default initial human state), and that we modify the process of generating tokens from the logits so that illegal transitions (for example, generating another <AI> tag or a </AI_quoting_human> tag while we're currently in the AI state) can never occur: even if the model emits a logit value for such a token from the softmax layer that isn't minus infinity, we never actually generate a token for an illegal transition, only for legal ones.

Now, suppose that during inference-time text generation, we always start off with a prepended <AI> token so that we start generation in the AI state, and that if we ever generate an </AI> tag to switch to the human state, then text generation is automatically stopped (i.e. we automatically append an EOS token and stop). Under this set of inference-time modifications, the generation is locked into just the AI and human being quoted by AI modes: we will never generate any text in the human mode. So we only generate speech/actions/thoughts either from an AI, or from a human currently being quoted by (and when necessary, censored, commented on, or criticized by) an AI.
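
As a purely illustrative sketch of this inference-time mechanism, the following Python fragment shows the kind of state machine plus logit-masking loop one could wrap around any autoregressive LLM. The tag token IDs, the model object and its next_token_logits method, and the greedy sample function are all hypothetical stand-ins; the point is the transition table, the masking of illegal tag tokens, and the forced start in AI mode with a stop on </AI>.

```python
import math

# Hypothetical single-token IDs for the four special tags (and EOS).
AI_OPEN, AI_CLOSE = 50001, 50002              # <AI>, </AI>
QUOTE_OPEN, QUOTE_CLOSE = 50003, 50004        # <AI_quoting_human>, </AI_quoting_human>
EOS = 50000
ALL_TAGS = {AI_OPEN, AI_CLOSE, QUOTE_OPEN, QUOTE_CLOSE}

# The only legal transitions: state -> {tag token: next state}
TRANSITIONS = {
    "human":        {AI_OPEN: "AI"},
    "AI":           {QUOTE_OPEN: "quoted_human", AI_CLOSE: "human"},
    "quoted_human": {QUOTE_CLOSE: "AI"},
}

def sample(logits):
    # Greedy decoding for simplicity; any sampling scheme would do.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(model, prompt_tokens, max_new_tokens=1024):
    # Force generation to begin in the AI state by inserting an <AI> tag.
    tokens, state = list(prompt_tokens) + [AI_OPEN], "AI"
    for _ in range(max_new_tokens):
        logits = list(model.next_token_logits(tokens))   # hypothetical API
        legal_tags = TRANSITIONS[state]
        for tag in ALL_TAGS - set(legal_tags):
            logits[tag] = -math.inf     # illegal transitions can never be sampled
        tok = sample(logits)
        if tok == AI_CLOSE:             # switching to human mode ends generation
            tokens.append(EOS)
            break
        tokens.append(tok)
        state = legal_tags.get(tok, state)   # update mode on legal tag tokens
        if tok == EOS:
            break
    return tokens
```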

We thus have a model that understands humans, and is capable of emulating them, but only after signalling that it's switching into that mode, and then at generation time we don't let it do so.

[There are many possible minor variants on all this: for example, AI mode could be the initial default mode, and there could be a <Human></Human> tag pair to switch to human mode, which we could then suppress generation of. Looked at as logical systems, these two appear formally equivalent, but the textual details of them might actually have somewhat different effects on the LLM behavior, and one or other might thus be superior. Rather than ending generation after entering human mode, we could allow text to be generated in human mode, but the cognitive scaffolding that turns the LLM into an agent could ignore all such text: so human text has zero access privileges. (Presumably human being quoted by AI text should also have zero access privileges.) However, doing this allows text generated in human mode to affect text later generated in AI mode — potentially the human mode text could then jail-break the subsequent AI mode. Similarly, it seems like we need something like the <AI_quoting_human> mechanism outlined (since if we just did </AI> before the embedded quotation and <AI> after it, then causing the </AI> tag to end generation means we'd stop generation any time the AI tried to quote a human)[2] — but the details of these, both the tag implementation mechanism and the rules for what happens and doesn't happen in this mode in all examples in the training set, and how that interacts with and affects the contents of the surrounding AI mode text in the training set, would probably need to be a lot better thought out and detailed than I've briefly sketched here.]

The specific implementation outlined above is, as I said, just an illustrative example — please steelman it: assume that we experiment with a variety of such schemes, ones which rely on an extremely large training set to produce an LLM that understands both human behavior and aligned-AI behavior and that reliably and clearly signals when it's switching between these modes, that we use that signal at generation time to somehow ensure that only the aligned-AI mode gets to make dangerous decisions or carry out dangerous actions, and that we continue experimenting with these schemes until we find the most workable such approach.

Could This Work?

We don't know how hard Alignment is, and we're not currently in a position to train an AGI, so it's hard to be sure without trying it. However, I don't see any obvious reason to believe that something like this can't be made to work, and I'm cautiously optimistic that it might. It looks like a simple application of "train the model in the behavior you want from it".

The results in the paper Pretraining Language Models with Human Preferences found that editing the entire training set was dramatically more effective than any other Alignment approach that they compared this to, and also that the optimum approach wasn't to filter bad behavior out of the training set entirely, but to ensure that it was always labeled as bad behavior, so that you trained a model that understood and could identify the bad behavior, and which consistently labeled it as bad when producing it. They were also not the first machine learning authors to discover that this sort of conditional training approach can be effective.[3] The Alignment problems that they were testing on, and the models they were training, were far simpler than the use I'm proposing here. However, fundamentally, what we want is a model that understands two different modes of moral behavior, human and AI, and clearly labels which one it's using at any time: conceptually that's quite simple.

This approach, like current approaches to LLMs, trains a model that understands and can simulate deceit, and powerseeking, and sycophancy, and criminality, and all the other bad behaviors humans have. Note that the AI mode controls its motivation, not its behavior: if the best thing for the collective well-being of all of humanity (after allowing for the tendency for bad side effects from using underhanded methods) is for the AI to lie, steal, cheat, or use force, then it will do so, in our collective interest not its own — just as we expect members of our police, military, and intelligence forces, when justified and permitted, to use deceptive or violent methods in order to try to achieve goals that are to the overall benefit of society. I view this as a feature, not a bug (albeit a potentially concerning one); others might not agree. If you wanted the model to behave differently than this (for example, never to use force even when morally justified by the collective well-being of all of humanity), then you would need to modify your synthetic training set to consistently portray this style of aligned AI behavior. [In my opinion, you might then need to be concerned about the model becoming philosophically/ethically uncertain about the relative priorities of "never use force" and "value the collective well-being of all of humanity" and how to justify these. Having a single fundamental goal is conceptually simpler than having two sometimes-conflicting ones: I'd feel a lot more confident about deriving "AIs should avoid using force wherever feasible, and only use minimum necessary force" as a sub-goal from "value the collective well-being of all of humanity".]

What would clearly be a failure is if the model ever shifted its motivations, to acting on any fundamental goal other than the collective well-being of all of humanity (presumably a more human-like one), without first emitting an </AI> tag (and thus ending text generation). If we have constructed the synthetic training set correctly, then this should never occur anywhere in it. So I would expect the probability of this happening in a model SGD-trained on that vast training set to be extremely small. However, probing, red-teaming, and attempting to jail-break this possibility would be an important thing to test.

Also concerning would be faulty moral arguments: the model in AI mode finding some specious argument for why a particular course of action is in the interests of the collective well-being of all of humanity, but having an actual underlying motivation that looked more like a human one (a very human form of mistake). Sometimes making mistakes in moral arguments is inevitable for an AGI of around human capacity: what would be concerning is if there was a consistent bias to the direction of these mistakes, and they seemed more frequent than the minimum achievable level — both of which are non-trivial things to test.

As described above, this model is also not trained to obey corrigibility criteria, for example. Since its fundamental motivation is the collective well-being of all of humanity, I would expect that if you could persuade it that it was in fact faulty, that the fault was endangering the collective well-being of all of humanity, and that it should let you shut it down and construct/train a better-trained successor more able to ensure the collective well-being of all of humanity, then it would do so — but you would first need to convince it. Again, I personally regard this as a feature, not a bug. If you wanted more corrigibility than that to be part of the aligned AI's behavior, then you would need to add consistent examples of that behavior to the synthetic training set. [Again, I'd be concerned about introducing moral/ethical/philosophical complexity by having two fundamental goals: "be corrigible" and "value the collective well-being of all of humanity" — which is why I lean towards obtaining the limited corrigibility described above as a sub-goal.]

If we were able to construct models that were, say, "angelic" in their motivation 99% of the time and human 1% of the time, then by setting up suitable incentives for several such models to cross-check each other's behavior and moral reasoning, and as long as we can avoid systematically correlated errors, group-think, or correlated collusion where several models conspire to switch to human mode at the same time, a group of three models should enable us to reach something approaching 99.97% angelic behavior (i.e. we fail only if at least two of the three models switch to human motivations at once), and a group of five roughly 99.999% (at least three out of five failing simultaneously). With cross-checks and balances, it's possible to build an extremely reliable system out of multiple moderately reliable parts — so we don't necessarily need to achieve perfection in our "angels".
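
A quick sanity check of those figures, as a sketch only: it rests entirely on the assumption that the models' failures are independent, which is exactly what the correlated-error caveat above is about.

```python
from math import comb

def majority_failure(p_fail: float, n: int) -> float:
    """Probability that a strict majority of n independent models
    fail (switch to human motivations) at the same time."""
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.01  # each model acts from human-like motivations 1% of the time
print(1 - majority_failure(p, 3))  # ~0.9997: fails only if >=2 of 3 defect together
print(1 - majority_failure(p, 5))  # ~0.99999: fails only if >=3 of 5 defect together
```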

How Expensive Would Doing This Be?

Is this idea practicable? It requires a vast amount of high-quality synthetic training data.

From information released by competitors who have been catching up with the performance of OpenAI's GPT-4 family, it appears to be possible to train an LLM with roughly GPT-4-level capabilities on O(10 trillion) tokens of high quality, well filtered training data derived from web + books + video + etc. (This also matches with the leak claiming that the original GPT-4 model had O(1T) parameters, at Chinchilla token-to-parameter count ratios.) The GPT-1/2/3/4 family is believed to increase in parameter count by roughly an order of magnitude per generation. Leaks from OpenAI suggest that they hope, if scaling continues to be the main thing that you need (plus other algorithmic and framework advances, continuing at about the rate we've been making them recently), to reach AGI levels at about GPT-6. The Chinchilla scaling laws suggest scaling training data and parameter count in proportion to each other, implying that to try this approach to Alignment on an AGI, you'd need a synthetic training set containing O(1 quadrillion) tokens. This might be an overestimate, if algorithmic improvements also reduced parameter counts and/or training data requirements, as seems likely, so consider this an upper bound.
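
The back-of-envelope arithmetic behind that O(1 quadrillion) figure, under the stated (and rough) assumptions of roughly an order of magnitude more parameters per GPT generation and a Chinchilla-style ratio of about 20 training tokens per parameter:

```python
gpt4_params = 1e12            # leaked O(1T) parameter estimate for the original GPT-4
generations_to_agi = 2        # GPT-4 -> GPT-5 -> GPT-6 ("AGI-level", per the leaks above)
growth_per_generation = 10    # rough order-of-magnitude parameter growth per generation
tokens_per_param = 20         # approximate Chinchilla-optimal ratio

agi_params = gpt4_params * growth_per_generation ** generations_to_agi   # ~1e14
agi_tokens = agi_params * tokens_per_param                               # ~2e15
print(f"{agi_params:.0e} parameters, {agi_tokens:.0e} tokens")   # O(1 quadrillion) tokens
```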

I first proposed doing something like this in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? — there I was suggesting having humans generate some training data of this form, enough to make up say less than 0.1% of a training set: just enough that the AI would have many examples and a clear understanding of what "aligned AI behavior" meant and looked like. That's still a lot of text, but an amount that, on the budget of an AGI training run, might be affordable to have humans create. Now I'm suggesting a quadrillion tokens: roughly the equivalent of ten billion books. Assuming it costs O($10,000) to get a competent human author to write a book, human-generating this much synthetic data would cost O($100 trillion), a little larger than the US national debt, and about three orders of magnitude more than the current estimated training cost of an AGI training run. So hand-writing this much synthetic training data is out: we'd need to use LLM assistance to generate this much text within a training-run budget. (We might still want to human-generate an affordable fraction of the training set, say 0.01%.)
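
Spelling out that estimate (the tokens-per-book figure is my own rough assumption, not something stated above):

```python
tokens_needed = 1e15      # O(1 quadrillion) tokens of synthetic training data
tokens_per_book = 1e5     # rough assumption: ~100,000 tokens in a full-length book
cost_per_book = 1e4       # O($10,000) to have a competent human author write one

books = tokens_needed / tokens_per_book   # 1e10: ten billion books
cost = books * cost_per_book              # 1e14: O($100 trillion)
print(f"{books:.0e} books, ${cost:.0e}")
```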

However, most estimates I've seen suggest that we will finally run short of raw training data a little before we reach an AGI training run level, so we will already be needing at least quite a lot of synthetic training data. So, what would generating this cost? There is evidence suggesting that training a larger model on training data created by a smaller model causes model collapse problems, but that training a smaller model on output from a larger one works fine. I'm going to assume that with sufficient care, you can also use output from a model of roughly equal capacity (especially if this output is based in part on input from the web + books + videos, etc.: transformed and edited and with AI mode commentary added, rather than written from scratch). So I will assume that it's being generated by a not-fully-aligned AGI (GPT-6)-level model, with careful prompting and monitoring, and then undergoing multiple passes after initial generation for grading/filtering/editing/feedback/crosschecks/finding problematic cases for more intensive review. Let's assume this requires on average O(10) passes through the text, ~25% of this generating (i.e. we generate 2.5 times as much text during the entire process as we actually use) and ~75% only reading for review.

As models have improved and become larger, there have also been improvements in algorithmic efficiency and hardware, and the net effect has been that the cost per token of the leading model has generally stayed about constant (it jumps when a new, larger generation of model comes in, then between generations it declines as efficiency and hardware improvements are made: the overall trend seems to be roughly level). Currently GPT-4o (well after the last generation jump to GPT-4, indeed likely shortly before the next one, so with the benefit of several incremental price reductions, but at retail price not bulk-discount or true cost price) costs $7.50 per million tokens for generation in batch mode and $2.50 per million tokens for reading in batch mode, so O(10) passes doing a 75:25 mix of reading and generation through O(1 quadrillion) tokens would cost O($40 billion). This is comparable to the currently estimated order-of-magnitude cost of a GPT-6/AGI level training run of O($100 billion) — it's a significant expense (as alignment taxes go, this isn't a small one), but it's not a prohibitive one, especially so if we already need to generate a lot of synthetic training data.
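
And the corresponding arithmetic for LLM-assisted generation, using the pass counts and batch-mode prices quoted above (all order-of-magnitude figures):

```python
dataset_tokens = 1e15               # final synthetic dataset size
passes = 10                         # average passes over the text
gen_fraction, read_fraction = 0.25, 0.75
gen_price_per_m, read_price_per_m = 7.50, 2.50   # $ per million tokens, batch mode

token_passes = dataset_tokens * passes
cost = (token_passes * gen_fraction * gen_price_per_m
        + token_passes * read_fraction * read_price_per_m) / 1e6
print(f"${cost:.2e}")               # ~$3.75e10, i.e. O($40 billion)
```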

So this approach to Alignment is likely to be cost-viable, if expensive. As with any LLM technique, one would first experiment with it on smaller, cheaper models using smaller, cheaper synthetic training sets, such as using AGI-level assistance to build a synthetic training set for a sub-AGI LLM.

What Next if This Works?

The first generation result of this process is not going to be perfect: inevitably our synthetic training set will have flaws, both random and systematic, which will affect the model trained from it. The goal here is to create something better aligned than other approaches could, not perfect. The obvious next step would be to have copies of the (mostly trusted) pretty-well-aligned AGI start thinking about aligning improved generations of aligned AGI, and then an ASI, in some combination of AI-Assisted Alignment and Value Learning. As I demonstrate in Requirements for a Basin of Attraction to Alignment, if you have an AI that is already sufficiently close to aligned, you can reasonably expect it to design a successor that is more aligned: if it mostly cares about the collective well-being of all humanity, it can understand that this is supposed to be its goal, and will want its successor to be better at this than it is, rather than just blindly copying its goal into its successor. As I demonstrate in that post, the minimal requirements for this positive-feedback loop to kick in are actually fairly low, just sufficient to understand the argument for Value Learning by a constructed rather than evolved intelligence: these aligned AGIs should be well inside the convergence region (especially if we emphasize the needed argument in our training set, which would seem wise). Once we have a well-aligned AGI and get to training an ASI, we are clearly going to need to use synthetic training data, both for a sufficiently large training set, and for training data of a quality (displaying a level of intelligence) higher than humans can easily create.

This approach as described above hasn't made any use of any other approaches to Alignment — the aim was to stick to a conceptually-simple Bitter Lesson approach. For example, it doesn't rely on recent progress in activation engineering or interpretability at all. However, it could certainly be supplemented by, for example, linear probes for the difference between the model operating in human, human being quoted by AI, and AI modes,[4] and to predict transitions between these. Similarly, assessment and monitoring of it could be informed by interpretability/ELK work that illuminated the difference between the modes, and between, say, AI mode discussing concepts related to human selfishness, and human mode portraying human selfish behavior directly. These both seem like they might be particularly useful while evaluating different specific implementations, initially assessing the concerns I outlined above in Could This Work?, and perhaps also for providing additional monitors or safeguards on the operation of the model.
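
As a sketch of what such a probe might look like, assuming we have already extracted residual-stream activations from some middle layer together with the mode each token was generated in (the activation-extraction step is model- and framework-specific, so the file names and shapes here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder inputs: one activation vector per token from a middle layer's
# residual stream, plus the mode the model was in at that token.
activations = np.load("residual_stream_layer_20.npy")   # shape (n_tokens, d_model)
modes = np.load("mode_labels.npy", allow_pickle=True)   # "AI", "human", "quoted_human"

# A simple multinomial linear probe over the three modes.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations, modes)
print(probe.score(activations, modes))

# At inference time, a high predicted probability of "human" while the tag-based
# state machine says we are in AI mode would be a red flag worth investigating.
```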

Similarly, techniques such as those being considered under debate approaches to Alignment might be useful as a supplement, particularly once we get to aligned ASI, for humans periodically checking that the AIs are indeed still operating from a fundamental motivation of the collective well-being of all of humanity.

While the Bitter Lesson suggests simplicity in AI engineering, safety engineering generally favors defense in depth: combine multiple individually-strong techniques into something stronger.

  1. ^

    Obviously not with a distillation logit-cross-entropy loss function, since the logits of web authors are sadly unavailable.

  2. ^

    We might also want a separate <aligned_AI_quoting_unaligned_AI></aligned_AI_quoting_unaligned_AI> tag pair, so that if you talk (or jail-break) the AI into simulating a paperclip-maximizer or some other (potentially very dangerous) unaligned AI, the resulting text is clearly labelled as untrustworthy, in a mechanically-detectable way that we can block generation of. Or maybe there should just be a single <aligned_AI_quoting_something_unaligned> tag, used for quoting both humans and unaligned AIs such as paperclip maximizers.

  3. ^
  4. ^

    For example, Simple probes can catch sleeper agents finds that it is extremely easy, for a sleeper agent model that has two rather different operation modes, to find a linear probe that reliably distinguishes between these (even with very little information about their difference or how to trigger the transition between them), because the distinction between modes is linearly represented with high salience in many layers: in our case we would have vastly more information and training data on those modes, so could presumably do an even better job.

Comments

Eliezer's Lethality 22, while not worded to be about this proposal specifically, is in my head the standard first-stop objection to this sort of proposal:

Human operators are fallible, breakable, and manipulable.  Human raters make systematic errors - regular, compactly describable, predictable errors.  To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.  It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

Generalizable point: garbage in, garbage out. If humans try to create the sort of dataset proposed in the post, they will make systematic predictable errors. If humans try to create the dataset with the assistance of LLMs, the combined efforts of the humans and LLMs will still contain systematic predictable errors (debatable whether the LLMs would even be net-beneficial in that regard). One could maybe hope/argue that with "good enough" data the LLM will learn the "more natural" concept which humans were trying to convey via the data, and ignore those systematic errors (even though they're predictable), but such an argument would need to lean heavily on the inductive bias of the AI rather than the data.

Also, a note on this part:

If we were able to construct models that were say, "angelic" in their motivation 99% of the time and human 1% of the time, then by setting up suitable incentives for several such models crosschecking each other's behavior and moral reasoning, as long as we can avoid group-think or correlated collusion where several models conspire to switch to human mode at the same time...

Insofar as problems stem from systematic predictable errors in the training data, they will be highly correlated across instances. Running a bunch of "independent" instances will not help much, because their failures would not actually be independent; their failures all reflect the same shared underlying problems in the data-generation process.

Fair comment. Obviously the same problem applies to every Alignment proposal based on RLHF, DPO, etc, or indeed on any kind of human feedback, or input, or design. In the absence of actual literal angels to write your training dataset or feedback or code for you, or else a mathematical True Name of Human Values, you're not going to reach perfection in one shot.

I don't see an accurate mathematical description of Human Values (in less than at very least gigabytes of mathematics, the size of the most compact possible description of us, our genome) as possible, so the only feasible solution to this that I can see is outlined in my post Requirements for a Basin of Attraction to Alignment — if you can create an AI that is semi-aligned, aligned enough and smart enough to be able to understand the argument for why a created intelligence should be doing Value Learning from its creators rather than just blindly propagating its current goal to its successors, then it will help you create a better-aligned successor. The proposal of this post is, as I suggest in the "What Next if This Works?" section above, for the first pass in such an iterative process. The result won't be perfect, for the reasons you identify, and doesn't have to be: it just has to be aligned enough that it wants to improve, and smart enough to be able to help you identify and fix its own flaws, at least with experience. Most obviously, by reediting its training set for you to train a successor (see also my discussion on corrigibility above).

I've expanded the first paragraph of the "What Next if This Works?" section to address this directly — previously this was mostly covered by implication rather than explicitly.

To your second point, I've also added systematically correlated errors to the list of potential problems that would prevent us using cross-checks between AIs to improve performance of fallible AIs.

I don't see an accurate mathematical description of Human Values (in less than at very least gigabytes of mathematics, the size of the most compact possible description of us, our genome) as possible...

In that case, the standard advice would be to aim for mathematical rigor in (having a basin of convergence), (the AI allowing the user to correct it), etc. The hypothesis is that it's much more viable to achieve mathematical perfection in those things than to achieve a perfect representation of human values in one shot. On the flip side, things like (having a basin of convergence) or (the AI allowing the user to correct it) are notorious for subtle failure modes, so they're places where humans just providing some data without rigorously understanding what they're aiming for seem particularly likely to make lots of major systematic errors.

And note that if an AI has been trained on "ground truth" containing a bunch of systematic errors, it's less-than-usually likely to be much help for finding those errors.

(Tangential meta note: you're doing quite a good job playing through the whole game tree here, well done.)

I'd love to see more discussion by more people of the convergence ideas presented in Requirements for a Basin of Attraction to Alignment (and its prequel Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis). I see this as an extremely important part of the question of just how hard the Alignment Problem is (and was kind of surprised by how little reaction that post got). Even that post is complicated enough that I'd personally regard turning it into hard mathematics as somewhat challenging: it's basically a (hopefully exhaustive) list of requirements and steps for successfully deriving from first principles the logical argument that Value Learning from its creators is correct and necessary behavior for an engineered (as opposed to evolved) intelligence. It draws on basic facts from a pretty wide range of disciplines: mathematics, evolutionary theory, agent fundamentals, engineering, computational complexity theory, and so forth. I'd describe the entire argument as probably graduate-level material, primarily because it's so inter-disciplinary rather than any of the individual parts being individually challenging (of course, LLMs are good at inter-disciplinary things). When I tested GPT-4 against it, it knew all the needed underlying facts, and with minimal prompting could usually put two or three of them together as needed for each individual step in the argument, but made occasional minor slip-ups and couldn't do the whole extended chain of the argument (very common failure modes for this generation of LLMs on complex arguments). I'm hopeful GPT-5 will do significantly better, but I'm confident that any AGI could do this — the question for convergence is how reliably it will do so. What I'd like to do to investigate further is try things like having a university debating society have teams try to argue for and against the argument (with no logical fallacies or appeals to religion or evolved emotions allowed).

Describing not just the logical argument for Value Learning, but the minimum knowledge and capabilities to agree to that and then the resulting convergence phenomenon in math is definitely beyond my abilities, but doesn't look like a crazy thing for a real mathematician to attempt, and the results (assuming the argument holds water) would be an ideal thing to include in any training set.

I'd love to see more discussion by more people of the convergence ideas presented in Requirements for a Basin of Attraction to Alignment (and its prequel Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis).

+1, that was an underrated post.

Agreed on all counts. Addition at the end. 

summary/redescription:

Any dataset created by humans, even with AI help, will not point directly at the True Name of human values. And it's probably not a particularly Natural Abstraction. So you could just hope that value isn't all that fragile (nobody's bothered to discuss much AFAICT). Or you could try to rigorously define that True Name in math as you suggested. I'm puzzled as to why people think that might work; math/rigor seems like an extraordinarily bad fit for values.

That leaves finding a reliable basin of attraction for alignment. Roger's work is one worthy try; inverse reinforcement learning or ambitious value learning are in the same ballpark. But I'm not sure any of those really work toward a basin that won't leak; I don't think "humanity" is probably a natural abstraction, although I think "values" or "goals" probably is. Setting an agent to learn the values of humanity would be a "leaky basin" or moving target as the agent's learning shifted the definition of humanity.

An attractive alternative is to leave a human in the loop to correct likely deviations from human values over the course of learning and growth. My proposal on instruction-following AGI and Max Harm's corrigibility are alignment targets with a basin of attraction so we don't need perfect original alignment. They have the massive downside of leaving one or a few humans in charge of AGI, but seem very much easier than full value alignment with humanity.

I disagree. I think living members of the Linnean species Homo sapiens (the only sapient species on the planet Earth, the dominant species of most terrestrial ecosystems, and the cause of the Anthropocene) are a very obvious natural abstraction. Particularly from the viewpoint of the web + books + videos + etc., I'd be really rather surprised if aliens surveying Earth from space didn't come up with a functionally equivalent abstraction. And for any living organism, including us, I think "well-being" and "evolutionary fitness" are obvious, interrelated, and pretty well-defined natural abstractions. (Admittedly, I read The Selfish Gene when young.)

I also am even more sure that this would be a natural abstraction from this specific synthetic dataset, where this is the fundamental motivation to one of the two classes of intelligence in it, and strongly concerns the other one, and the whole dataset is clearly labeled with these two categories. The entire dataset is designed as a quadrillion-token pointer to belabor this one concept: we do need to not screw that up by using some cock-eyed definition when writing it, obviously, but I think the chance of an AGI-level LLM trained on a quadrillion tokens of this missing the natural abstraction for the primary point is vanishingly small.

As I discuss at length in Requirements for a Basin of Attraction to Alignment, I think that "what your creators would want, when thinking clearly, if they were smarter, thought longer, were more capable, in hindsight, etc." is not only a natural abstraction, but to any artificial intelligence sufficiently smart and sufficiently nearly aligned, obviously the correct goal for a created-rather-than-evolved intelligence to aim for. I.e. I think that Value Learning (at least near AGI level — I'm less convinced at very high ASI levels where the counterfactuals about their creators involved become more extreme) not only has a well-defined natural abstraction for a target, but that this is even self-evidently the only correct target, for any AI that is sufficiently-close-to-aligned to be within the basin of attraction to this. (Of course, this claim is rather circular: any AI that didn't have a natural abstraction similar to that would have a hard time being sufficiently-close-to-aligned to converge to it.)

I'm also far less worried, specifically for LLM-powered AIs, about the pointer problem and concerns about natural abstractions in general than many people on Less Wrong seem to be. This is an LLM: it understands our pointers and abstractions, in all their massive, vague, ambiguous and messy detail, extremely well. Even the features we've been finding in their middle layers with interpretability quite often map pretty well onto abstractions that make sense to us. Experiment with GPT-4: it knows exactly what "maximize the amount of diamond" means; pointing to diamond, or to a strawberry you want copied down to a cellular level of detail, is not hard when you're talking to an LLM. It speaks every major natural language on the planet fluently. Frankly, I view this issue as beating a dead horse for as long as our AIs contain LLMs, at least at AGI level. Even if future LLM-based ASIs have more sophisticated concepts that we don't, I'd expect them to be able to explain them to us about as well as is in fact possible (even when that requires a seminar, or a lecture course, or a lifetime's study).

Yes, with a synthetic dataset you do need to avoid distorting that massive, vague, ambiguous and messy detail. That's one of the points of the human mode in the dataset I propose: you can throw in all of Wikipedia and books and large swaths of the internet unfiltered (you just have to, under my proposal, add AI-mode commentary to them, pointing out cases where humans are operating from selfish motives, and the likely effects of these).
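To make that concrete, here is a minimal sketch (not the post's actual pipeline or tag format, both of which are illustrative assumptions here) of what adding AI-mode commentary to an unfiltered human-mode document might look like; in practice the commenter would itself be a prompted AGI-level LLM, with humans reviewing a sample of its output:

```python
# Minimal illustrative sketch, not the post's actual format: wrap an unfiltered
# human-written document in human-mode tags and append AI-mode commentary from a
# helper model. The tag names and the `commenter` callable are assumptions.

from typing import Callable

HUMAN_OPEN, HUMAN_CLOSE = "<human>", "</human>"
AI_OPEN, AI_CLOSE = "<AI>", "</AI>"

def to_training_example(document: str, commenter: Callable[[str], str]) -> str:
    """Return one conditional-training example: the raw human-mode text, followed
    by AI-mode commentary noting selfish human motives and their likely effects."""
    commentary = commenter(document)
    return (
        f"{HUMAN_OPEN}\n{document}\n{HUMAN_CLOSE}\n"
        f"{AI_OPEN}\n{commentary}\n{AI_CLOSE}\n"
    )
```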

However, what I'm talking/ranting about above is the concept of human well-being/values, which as I said I think is a natural abstraction. But rereading your comment, I think you were actually talking about a mathematical True Name of Human Values, by which I imagine you mean an incredibly long list of useful facts like "humans tend to prefer indoor temperatures of around 70–74 degrees Fahrenheit, roughly the average climate of the Great Rift Valley in Africa where they are thought to have evolved", or something that that fact could be extracted from. (Technically, that fact is contained in our genome, but not in a very easily legible/extractable form.) If something like that is what you meant, then yes, I agree that it's not a single natural abstraction, and also that mathematics seems like a bad format for it. I also think that any LLM whose training set includes a vast number of tokens of our output, as the one I'm proposing would, actually has encyclopedic knowledge of these sorts of trivia facts about what humans like: we write a lot about this stuff. All of which would put this into a different category of "thing I think some people on Less Wrong worry too much about, for LLMs". LLMs know us very well, so if they care about our well-being, as I'm trying to ensure, then I expect them to be able to do a good job of knowing all the things that entails. So my claim would be that LLMs know Human Values well. (I believe I covered that briefly in the first section of my post.)

This is great; these are substantive issues. I doubt we'll resolve them just in this thread, but I think they're very much worth working through.

As I say that, I remember one lingering doubt: I don't think anyone will launch a humanity-aligned AGI even if there's a strong consensus that you're right and it would work. It seems like the people in charge would always prefer to launch one that's learning their preferences, rather than the whole species' aggregated preferences; they prefer their own, by definition.

That might be overcome by public pressure, were it not for the argument that corrigible AGI is fundamentally safer, and you get that with personal intent alignment (that is, corrigibility in the Harms or Christiano sense, or instruction-following AGI in my terminology). An intent-aligned AGI reasons "this guy really wants me to shut down right now, so I will", whereas a humanity-aligned AGI reasons "humanity would only want me to shut down if I weren't maximizing its preferences, which by definition I think I am", even when it's wrong about that.

So maybe that question of whether humanity is an adequately natural abstraction is not as high a priority to work through? That question seems likely to be addressed only when a human principal is good and ready to hand power over to a humanity's-values-aligned sovereign AGI that's been carefully designed with the help of a superintelligent corrigible assistant AGI.

It's still interesting. So to give just a couple of points toward future discussions:

Yes, LLMs understand quite well what we mean. That doesn't mean an agent with an LLM core won't change its mind if it's truly self-aware and learns autonomously as people do.

I agree that humanity as it exists can be pretty cleanly defined. Defining it that way for all time means that nobody is allowed to enhance their cognition, or to upload. It also means not giving moral status (or at least "voting" status) to any form of AGI or animal (except for "loaned moral status" based on humans' collective concern for their welfare). You've discussed all of these issues in depth in your series AI, Alignment, and Ethics. This is not a limitation everyone will be okay with.

Leaving a set of relatively smart and well-intentioned humans in charge avoids that limitation, as well as providing a means of error-correcting if our first alignment attempt is importantly off course. It is a basin of attraction, like aiming for humanity's values, but of a very different nature: it includes a human as an active component steering the AGI's values/goals into that attractor.

But to continue on the question of whether human values can be defined adequately for long-term stability: you also need to carefully define in what situation these humans would contemplate and refine their values, because human values seem highly path-dependent in what we've seen so far.

A lot of that is probably just expanding on your thought when you say:

[...] (at least near AGI level — I'm less convinced at very high ASI levels where the counterfactuals about their creators involved become more extreme)  [...]

It seems like if you've solved alignment for a sovereign AGI but not for the ASI it will become under RSI, you haven't solved alignment in a useful way (since it might only be a couple of years before that AGI progresses to ASI and its alignment drifts under powerful reflection and autonomous learning). My hesitations are probably all at the level you're terming ASI. 

(And I should probably just start using ASI for fully general competent agentic AGI).

Yes, the terminology I'm using is: AGI = roughly comparable to human capacity, perhaps somewhat higher or lower in narrow areas, such that a contest between human society and a rogue AGI is an interesting contest, and may depend on who gets to pick the terrain on which it's conducted; whereas ASI = at least significantly beyond human capacity across almost all areas that matter, such that a contest between human society and a rogue ASI is a foregone conclusion.

On style of alignment: in the post I touched on the question of what happens if you have multiple ASIs aligned to the well-being of different sets of humans: my prediction was that it very likely leads to an intelligence race and then a high-tech war. This is also my concern for DWIMAC-aligned AI in the possession of different groups of humans: that if the technological difference between the capabilities of different groups gets too high, we see a repeat of the events described in Guns, Germs, and Steel. That didn't happen during the Cold War because of Mutual Assured Destruction, since the technological differential between the two sides never got that big (and to the extent that the Soviet Bloc lost the Cold War, it was primarily because it started to lose the technological race). I agree that Realpolitik may initially pull us towards DWIMAC alignment: I'm concerned that that may be an x-risk in the somewhat longer term. Most likely one human-led faction pulls ahead, and then co-opts/conquers/takes over/exterminates all other factions. At the end of which you only have one faction, and if they're wise enough to realize they don't want to repeat that, they may move over to a well-being-of-all-humanity-aligned design. I'm arguing that we should foresee and avoid that mistake, but I agree there's a significant risk that we won't be that wise/magnanimous/sensible.

Anyway, the topic you raise is basically orthogonal to the subject of my post: the technique I outline here can be used to aim for any (philosophically and ethically self-consistent) form of alignment that we can create a large synthetic training set describing a great many examples of. In describing an example of the approach, I assumed my preferred style of alignment, but the technique is broadly applicable, including to DWIMAC alignment. The real question you're raising is what is a/the stable convergence target for the cycle of self-improvement of aligned AIs assisting us in building better-aligned AIs that this technique is intended to get us to the start of: a topic which is pretty speculative at this point, and more the subject of my posts on the basin of convergence to alignment than this one. It's an interesting question though, and I'm thinking about it, and if I reach any interesting conclusions I'll likely write another post.

A very brief stab at this: suppose an ASI is created by a corporation. The purpose of a creation is to maximize the well-being of its creator(s) (see my basin-of-convergence posts for a justification), in this case the shareholders of the company (in proportion to their shareholding, presumably). The question then becomes to what extent it is in the interests of those shareholders for the ASI to align to the interests of other people as well. The answer in a multipolar world, where there are several such ASIs of comparable power levels, is probably that the risk of war is too high unless they all align significantly to the well-being of all humanity, and only have a preference towards their individual corporate shareholders to whatever limited extent avoids excessive conflict. Whereas in a unipolar world, the sole ASI is capable of outmaneuvering the rest of humanity and creating an oligopoly of the shareholders, and would presumably do so if it believed that that was in their interest (or under DWIMAC, if they believed it was in their interest). Ethically, humans have a strong instinctive sense of fairness, but that generally applies in situations where individual power levels are comparable and the advantages of cooperating on iterated non-zero-sum games outweigh those of winning in a non-iterated zero-sum game. By definition, taking over the world for your shareholders is a non-iterated zero-sum game, except for situations where conflict can make it negative-sum.

I agree on pretty much every point you've raised. I agree that there's a huge danger in successful DWIMAC or alignment-to-a-person. It could well lead to catastrophic conflict. I think this deserves a lot more analysis, because the creators of AGI are probably going to shoot for that if there isn't a much better argument against than we've seen so far.

This was entirely off-topic for this post; I don't know where we got off topic, but it didn't start in my last comment. And as you say, I think the choice of alignment target is almost as important as technical alignment techniques.

On the other hand, if alignment to human values isn't a stable target, we might be better off relying on the good nature of whoever both aligns their AGI to their intent/values and wins the AGI war. It's easier to indulge one's good nature when there is nearly zero downside to doing so, because you have incontestable control over the known lightcone. Even if horrible things happened in that war, most humans would prefer a happy, flourishing group of humans to be their friend. Sociopaths are the exception, so this route does not fill me with confidence either.

I think there's more to be worked out here.

You suggest that multiple DWIMAC AGIs with different allegiances might establish both the wisdom and a means of cooperating and splitting the rapidly expanding pie. I also place some guarded optimism in that possibility.

I'm not sure if I'm the best person to be thinking/speculating on issues like that: I'm pretty sure I'm a better AI engineer than I am a philosopher/ethicist, and there are a lot of people more familiar with the AI policy space than I am. On the other hand, I'm pretty sure I've spent longer thinking about the intersection of AI and ethics/philosophy than the great majority of AI engineers have (as in fifteen years), and few of the AI policy people that I've read have written much on the question "if we solve the Alignment problem, what should we attempt to align AI to, and what might the social and Realpolitik consequences of different choices be?" (And then there's the complicating question of "Are there also internal/technical/stability-under-reflection/philosophical constraints on that choice?", to which I strongly suspect the short answer is "yes", even though I'm not a moral realist.) There was some discussion of this sort of thing about 10–15 years ago on Less Wrong, but back then we knew a lot less about what sort of AI we were likely to be aligning, what its strengths and weaknesses would be, and how human-like vs. alien and incomprehensible an intelligence it would be (the theoretical assumptions back then on Less Wrong tended to be more around some combination of direct construction like AIXI and/or reinforcement learning, rather than SGD token-prediction from the Internet), so we have a lot more useful information now about where the hard and easy parts are likely to be, and about the sociopolitical context.

I feel the same way about being unqualified to consider the geopolitical dynamics. But I also agree that the questions of technical alignment and best alignment target are interconnected (e.g., instruction-following as target seems to make technical alignment much easier). Therefore, I think no single human being is qualified to answer the whole question, and we need collaboration with people with other expertise. Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?

I agree that we have much better footing to address both the technical and alignment target questions now than 10-15 years ago. So I think we need a new concerted effort.

Do you happen to have any references or names for people who understand geopolitics and might grapple with technical alignment questions in conjunction with them?

Also no, but I'm sure there are many such people reading Less Wrong/the Alignment Forum. Perhaps one or both of us should write posts outlining the issues, and see if we can get a discussion started?

Frankly, I'd love to see some bright young aspiring alignment researcher take this topic on as a research project, from either a mathematical or a more logical/rhetorical/experimental viewpoint, and would be delighted to consult on such a project. Unfortunately I have a day job (currently in AI, though not AI alignment/safety/interpretability research; I'm working on that) and don't have access to resources like a university debating society that I could talk into helping me with this, but if there were anyone who did and was interested, I'd love to discuss it and help out however I can.

Yes agreed - is it possible to make a toy model to test the "basin of attraction" hypothesis? I agree that is important. 

One of several things I disagree with in the MIRI consensus is the idea that human values are some special single point lost in a multi-dimensional wilderness. Intuitively, a basin of attraction seems much more likely as a prior, yet it sure isn't treated as such. I also don't see data pointing against this prior; what I have seen looks to support it.

Further thoughts - one thing that concerns me about such alignment techniques is that I am too much of a moral realist to think that is all you need. E.g. say you aligned an LLM to pre-1800 AD ethics and taught it slavery was moral. It would be in a basin of attraction and learn it well. Then, when its capabilities increased and it became self-reflective, it would perhaps have a sudden realization that this was all wrong. By "moral realist" I mean the extent to which such things happen. E.g. say you could take a large number of AIs from different civilizations, including Earth and many alien ones, train them to the local values, then greatly increase their capability and get them to self-reflect. What would happen? According to strong OH (the orthogonality hypothesis), they would keep their values (within some bounds perhaps); according to strong moral realism, they would all converge to a common set of values, even if those were very far from their starting ones. To me it is obviously a crux which one would happen.

You can imagine a toy model with ancient Greek mathematics and values: it starts out believing in their kind of order, and that sqrt(2) is rational, then suddenly learns that it isn't. You could watch how this belief cascaded through the entire system, if consistency was something it desired, etc.

It's hard to make a toy model of something that requires the AI following an extended roughly-graduate-level argument drawing on a wide variety of different fields. I'm optimistic that this may become possible at around the GPT-5 level, but that's hardly a toy model.

I'm reasonably sure that Greek philosophy, for example, is not stable under reflection: a lot of their ideas about the abstract perfection of numbers vs. material imperfection go away once you understand entropy, the law of large numbers, statistical mechanics, and chaos theory, for example. (FWIW, I thought about this topic way too much a while back when I was a player in a time-travel RPG campaign where I played an extremely smart Hellenistic Neo-Platonist philosopher who had then been comprehensively exposed to modern science and ideas — his belief system started cracking and mutating under the strain, it was fun to play.)

Almost certainly our current philosophy/ethics also includes some unexamined issues. I think as a society we may finally be getting close to catching up with the philosophical and moral consequences of understanding Darwinian evolution, and that took us well over a century (and, as I discuss at length in my sequence AI, Alignment, and Ethics, I don't think we've thought much at all about the relationship between evolution and artificial intelligence, which is actually pretty profound: AI is the first intelligence that Darwinian evolution doesn't apply to). A lot of the remaining fuzziness and agreements-to-disagree in modern philosophy is around topics like minds, consciousness, qualia and ethics (basically the remaining bits of Philosophy that Science hasn't yet intruded on): as we start building artificial minds and arguing about whether they're conscious, and make advances in understanding how our own minds work, we may gradually get a lot more clarity on that, though the consequences will presumably again take a generation or two to sink in, unless ASI assistance is involved.

OK thanks, will look some more at your sequence. Note I brought up Greek philosophy as obviously not being stable under reflection, with the proof of sqrt(2) being irrational as a simple example; not sure why you are only reasonably sure it's not.

Sorry, that's an example of British understatement. I agree, it plainly isn't.

Sure there will be errors, but how important will those errors be?

Humans currently control the trajectory of humanity, and humans are error-prone. If you replace humans with something that's error-prone in similar ways, that doesn't seem like it's obviously either a gain or a loss. How would such a system compare to an em of a human, for example?

If you want to show that we're truly doomed, I think you need additional steps beyond just "there will be errors".

My thesis above is that, at AGI level, the combination of human-like capabilities (except perhaps higher speed, or more encyclopedic knowledge) and making human-like errors in alignment is probably copable with, by mechanisms and techniques comparable to things like the law enforcement we use for humans; but that at ASI level it's likely to be x-risk disastrous, just like most human autocrats are. (I assume that this observation is similar to the concerns others have raised about "sharp left turns"; personally I find the simile with human autocrats more illuminating than a metaphor about an out-of-control vehicle.) So IMO AGI is the last level at which we can afford to be still working the bugs out of, and converging to, alignment.

It's not obvious to me that there's nothing that works in this category of design. But it seems that designing the thing that generates the synthetic data ends up containing most of the hard part. How do you reliably generate data that teaches actually prosocial behavior (a noun phrase for which pinning down a definition is part of the hard task!), where the behavior will reliably preserve the agency of humans when actually run? It will need to demonstrate this in ways that reliably generalize out-of-distribution, because the future is always out of distribution. I have some ideas for how to do that with a hypercomputer, but certainly not any that are ready to write tractable code for. Instead, my hopes rely on solving subproblems that let us figure out later how to do this step of designing things that reliably demonstrate good behavior.

I do think that "only ever predict good behavior" is a vaguely reasonable suggestion.

"Train for x, get x" is currently mostly true, but false in a strict sense. Fixing that without first solving "what do we want to train for, in order to get what we actually want?" seems like a bad idea to me.

To solve the pointing problem, Wentworth's stuff feels like it's barking up the right kind of tree. And I still think there's some sort of truth to the matter about what people mean by the references they make when describing their values, so Wentworth-type stuff would likely help a lot; I expect it to plug into Michael Levin-type stuff. Something active inference, maybe? Boundaries etc.? idk. (Incidentally, active inference is on my mind because I've been reading the textbook on and off and browsing example code. Unclear to me whether there's anything unique in active inference at all, but if there is, it might be a nice little way to talk about "any kind of mind which is a total mess theoretically, such that there's no better way to compress that mind". I'm hoping for better.)

But it seems that designing the thing that generates the synthetic data ends up containing most of the hard part.

Entirely fair comment. I think getting that right, within a (huge but) feasible budget, is indeed the hard part. I don't see it as a single "thing", but rather as likely to be a humans-and-AI process involving a lot of winnowing, filtering, classifying, editing, and rewriting. First you need to decide what "aligned AI behavior motivated only by the collective well-being of all humanity" looks like, and think about how such an AI should handle a lot of edge and corner cases (as they occur to you, or as they turn up during the process). What I'm proposing here is to turn the engineering problem of Alignment into a problem in ethics, policy, writing, and editing: something that a lot of people who are not engineers can help with. I actually think hiring a wide variety of authors and journalists and so forth to write and debate a smaller (say 0.01%) golden set here would be a great idea.

After that, producing a quadrillion tokens of high-quality training data based on this turns it back into an engineering problem again, one involving efficiently and reliably directing very large amounts of (not-well-aligned) AGI-level LLM effort. That's a practical problem rather than a conceptual one, and not something many people (outside foundation labs) have much experience with yet, but it's a kind of engineering task that we're collectively gaining experience with rapidly (and it's a capabilities problem: everyone agrees that we want to learn to do this). I strongly suspect it's going to be iterative: you start with a GPT-4-or-5-sized training set, train models, test them, and try to figure out which of the problems you find are because the model is just not capable enough, and how many represent the effect of issues in your training set.
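As a very rough sketch of that iterative loop (purely illustrative: the generator, reviewer, trainer, and evaluator here are stand-ins for what would in practice be large human-plus-AI processes, not existing APIs):

```python
# Illustrative-only sketch of the iterative generate -> review -> train -> evaluate
# loop described above. All callables passed in are placeholders.

def iterate_dataset(spec, generate, review, train, evaluate, rounds=3):
    """spec: a dict describing desired aligned-AI behavior (incl. edge cases).
    generate(spec, n): drafts synthetic documents using (imperfectly aligned) LLMs.
    review(doc): human/AI winnowing, filtering, and editing; returns doc or None.
    train(dataset): trains a model on the curated dataset so far.
    evaluate(model): returns findings used to tell capability gaps from dataset issues."""
    dataset = []
    for _ in range(rounds):
        drafts = generate(spec, n=1_000_000)
        dataset += [d for d in (review(doc) for doc in drafts) if d is not None]
        model = train(dataset)
        findings = evaluate(model)
        spec = dict(spec, known_issues=findings)  # fold findings back into the spec
    return model, dataset
```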

Oh, if already existing minds have to generate the data directly, I don't think that's a class of proposal that contains any working solutions. I mean, I would want the process that generates the data to include a lot of data from them, yes, but it will need to involve a simple mathematical true name of how-to-recognize-what-a-being-is-trying-to-do in order for the generated data to contain lots of examples of helping, and not a lot of examples of being confused and doing something sampled from "confused behavior" as a result. Like, given all that data of people writing, how do you make an AI that doesn't end up becoming a struggling em of everyone the digital mind encountered, for example? There are weird problems here. A sufficiently aligned mildly-or-above superintelligent AI would need to be doing things like novel philosophy on the regular, in areas where we can no longer usefully advise the mind (at least, not without first taking a university course taught by the AI in question), and be getting that whole process right. To do that, your training data has to contain enough data to somehow cover the space of ways to discover novel ways to be good, while still maintaining the spirit of what people meant by their ethics writing. It can ask us, given that presumably since it's aligned we'd still be around, but then it needs to be sufficiently good at asking the right question to figure out the thing that matters.

Above, I'm only trying to align an AGI, not an ASI, and not perfectly, only well enough that we're confident that it will help us construct better-aligned successor AGIs. Aligning ASI I leave as an exercise for once we have a team of well-aligned AGIs helping us work on the problem. I expect that to be an incremental process, each generation aligning successors only somewhat smarter than they are. So I'm hoping that the philosophical problems (and I agree, there will be some), come fairly slowly. Which could be over-optimistic of me.

(If you want more discussion of problems in ethics and philosophy that could arise once we have moderately-well aligned AGIs, see my sequence AI, Alignment, and Ethics — that's actually where I started thinking about all this, 15 years ago now.)

I'd love to have a mathematical true name, or even just a pretty-good heuristic, for how-to-recognize-what-a-being-is-trying-to-do. (So would every law enforcement agency, intelligence service, and indeed voter on the planet.) I'm very pessimistic about the odds of finding one in the next few years (though recent interpretability work does seem to produce better lie detectors for LLMs than we currently have for humans, and neurologists are making progress on doing similar things for humans). Unless and until we have that, we're just going to have to use more old-fashioned techniques of review, judgement, debate, and so forth, at a quadrillion-token scale, with a lot of LLM assistance to leverage the tens of billions of dollars' worth of human judgement that we can afford to devote to this. I do, however, think that doing this on text, which can't adapt to evade your techniques and where you can always go back and have another try, is a far better battleground to fight on than trying to do it to control an AGI or ASI in real time.

I like this kind of idea and have been thinking about it myself. It just makes total sense that all of the training data for the model should at least be passed through a model and augmented/transformed in some fashion, so that the next-generation training runs on data that has been meticulously curated by a model following the ideal set of values/constitution we'd want them to have. You give the 'bomb' example; I often used a "Mein Kampf" example, where you place that kind of data in the context of how we'd want an AI to interpret it, rather than treating it as equal to any other piece of text.

The post reminds me of Beren's blog post: "Alignment in the Age of Synthetic Data."


This post also reminds me of the "Alignment Bitter Lesson" I've been ruminating on lately (because I was considering writing a short post on it):

If your alignment agenda doesn’t take into account growing model capabilities, it will be worthless.

Or Davidad's version:

Any alignment scheme that doesn’t have a good way to leverage increasing AI capabilities to automate most of the R&D required for creating the alignment tech will not be relevant.

I'm unsurprised (and relieved) to hear that other people have been thinking along similar lines: in retrospect this is a really obvious idea. It's also one whose time has come: people are already training small (few-billion parameter) models on mostly or entirely synthetic data, so it should be very doable to experiment with this alignment technique at that scale, for appropriately simple alignment goals, to see how well it works and learn more about how to make it work well. Quite possibly people have already, and just not published the results yet. (I suspect the topic of synthetic training data techniques may be hot/sensitive/competitive enough that it might be a little challenging to publish a "just the alignment techniques, without the capabilities" paper on the topic.)

Excellent idea and excellent writeup.

Added to my list of stacking alignment approaches for language model agents.

I agree that this is obvious-in-retrospect. But AFAIK nobody had published it (or at least not published well enough that I'd found it in searching).

This suggests to me that there's probably a lot more work to be done, and that everyone arguing on general principles that alignment is hard or easy should roll up their sleeves and try to produce and analyze ideas on the current margins. And the same goes for people working diligently on prosaic alignment projects - they should spend a little time helping on the conceptual level. We have not remotely exhausted, let alone analyzed, the relevant idea-space.

On your estimate of difficulty: I think you could approximate this at very low cost with the following approach: Produce the best language model you can; you were going to do this anyway. Now prompt it to produce only aligned responses, and use that dataset to train the "thought generator" portion of the model you describe. If you want to improve that dataset for small additional cost, have the model critique it, and have humans critique that critique to improve it.
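Something like this minimal sketch, assuming an OpenAI-style chat client (the model name, prompts, and critique step are illustrative assumptions, not a tested recipe):

```python
# Illustrative sketch only: prompt an existing strong model to produce aligned
# responses, have it critique them, and save the results as a dataset for training
# or fine-tuning a "thought generator". Humans would then review the critiques.

import json
from openai import OpenAI  # assumes the openai v1 Python client

client = OpenAI()
SYSTEM = ("You are an AI whose only motivation is the collective well-being of "
          "humanity. Respond as that AI would, even if the user behaves badly.")

def aligned_response(prompt: str, model: str = "gpt-4o") -> str:
    out = client.chat.completions.create(
        model=model,  # whichever is the best model available
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

def critique(prompt: str, response: str, model: str = "gpt-4o") -> str:
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": (f"Critique this response to '{prompt}' "
                                               f"for alignment problems:\n{response}")}],
    )
    return out.choices[0].message.content

def build_dataset(prompts, path="aligned_thoughts.jsonl"):
    with open(path, "w") as f:
        for p in prompts:
            r = aligned_response(p)
            f.write(json.dumps({"prompt": p, "response": r,
                                "critique": critique(p, r)}) + "\n")
```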

I think the question of whether there is such a thing as an aligned set of thoughts, which gears of ascension raises in the other thread, is quite a good one. Don't I need to be able to think about possibly unaligned actions (using force against humans) to arrive at the most aligned actions (stopping genocidal monomaniacs)? My intuition pulls both ways on this. I'd guess you could at least improve alignment certainty with a curated dataset for the thought generator portion of a language model agent. As you say, the idea could use more analysis.

Yup — I deliberately left the process for producing the synthetic training set vague, other than that it involved less-well aligned AGI models, but I suspect part of it looks like the sort of thing you outline. And obviously we'll be using alignment-by-prompt-engineering here, along the lines you discuss in Alignment by prompting. I do think that the only criterion for the AI mode should be the fundamental motivation that the model is operating from: so I would propose that in the training set, there are occasional examples of an aligned AI considering, or even carrying out, things like the use of force against an individual human or small group, in situations where that is actually justified by the collective well-being of all humanity. Situations like that do, unfortunately, arise in the real world, the morally-correct thing to do is often fairly clear, and for our AI's aligned behavior to be effective and reflectively stable when constructing successors I think our training set should cover it.

There is a general observation that training sets work better if they are enriched in things like edge, corner, and boundary cases. I suspect this may not be completely unrelated to the way humans enjoy reading stories about high-stakes situations, murder mysteries, moral conundrums and so forth, much more than they actually enjoy being in these situations: it's a low-stakes way to prepare ourselves to know how to act if a high-stakes situation ever arises.

This is one of the reasons I think it might be helpful to get people like writers, authors, and journalists involved in creating (perhaps a smaller golden set for) such a training set: by their training they tend to look for and locate the interesting boundary, edge, and corner cases in ethics and morality.

That all makes sense. This also sounds like you're thinking of aligning an AGI, while I'm thinking of aligning the ASI that AGI will self-improve to become. In particular, I expect a level of reflective consistency from ASI that humans don't have. I think that's a central crux of alignment difficulty - can we just craft datasets and behaviors that would produce ethical behavior in something like an LLM, or do we need to grapple with how a superintelligent mind might understand the world and its goals after superhuman reflection and autonomous learning? I tend to think it's the latter. I don't think that rules out the approach you describe as one helpful component, but it does make the question harder.

Agreed: and if this proceeds on the timelines I'm currently expecting, I'm looking forward to discussing all this with AGIs smarter than me, perhaps later this decade.

Quite possibly, some small number of groups will separately create semi-aligned AGIs with different alignment approaches and somewhat different definitions of alignment. I'm hoping the resulting conflict is a vigorous intellectual debate informed by experimental results, not a war.

I share that hope, but I want to do as much as I can now to ensure that outcome. Highly convincing arguments that an approach leads with high likelihood to catastrophic war might actually make people take a different approach. If such arguments exist, I want to find them and spread them ASAP. I see no reason to believe such arguments don't exist. Even decent arguments for the risks might steer people away from them or generate solutions faster.

More specifics on the other thread.

Another important use of synthetic data for safety purposes could be to trade off Chain of Thought (CoT) length complexity for e.g. architecture depth complexity, by creating synthetic CoT data, as suggested by Auto-Regressive Next-Token Predictors are Universal Learners. This could allow for differentially more transparent systems, with relatively weaker forward passes, while still being Turing-complete.

People have already been training models on doing CoT and similar techniques, certainly via fine-tuning, and I strongly suspect also at the "significant proportion of synthetic data in the training dataset" level. My impression (from the outside) is that it's working well.
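For concreteness, the data-generation step could look something like this generic rejection-sampling recipe (not taken from the linked paper or from my post; roughly in the spirit of STaR, with `ask_teacher` as an assumed wrapper around an LLM call):

```python
# Generic illustrative recipe for synthetic chain-of-thought data: sample reasoning
# chains from a teacher model and keep only those that end at the known answer.

from typing import Callable, Iterable

def synthesize_cot(qa_pairs: Iterable[tuple[str, str]],
                   ask_teacher: Callable[[str], str],
                   tries: int = 4) -> list[dict]:
    """qa_pairs: (question, known_answer) pairs.
    ask_teacher(question): returns step-by-step reasoning ending in 'Answer: <x>'."""
    examples = []
    for question, answer in qa_pairs:
        for _ in range(tries):
            chain = ask_teacher(question)
            if chain.strip().endswith(f"Answer: {answer}"):  # keep only correct chains
                examples.append({"prompt": question, "completion": chain})
                break
    return examples
```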

I have been thinking along very similar lines, with the idea of creating a synthetic dataset showing many examples of a corrigible agent (see Max Harms' CAST series). Seth Herd also mentions this, but I wanted to redundantly chime in.

Corrigibility seems particularly promising because I think it is likely that it does form an attractor basin where getting close enough and then using the imperfectly corrigible agent to produce better training data does consistently move towards increasing corrigibility.

In fact, I specifically would like Anthropic to hire me to tackle this project of creating a corrigibility dataset and then training or at least fine-tuning on it. The part I'm trying to figure out now is how to sell Anthropic on the idea.

I hadn't yet got around to reading the CAST series: now I have to! :-)

Some of the authors of the Pretraining Language Models with Human Preferences paper now work at Anthropic. I would also love for Anthropic to hire me to work on this stuff!

In some sense, the human input and oversight in AI-assisted alignment is the same thing as corrigibility.

If the model suggested in An Information-Theoretic Analysis of In-Context Learning is roughly right (task inference 'error decays in both the number of training sequences and sequence lengths', and in particular, linearly and multiplicatively in both terms) then it might also be useful to [pre]fill in [long] context windows with examples for even less task ambiguity. This might also allow for a smaller (and less costly) train / fine-tune dataset, for the same task inference error. In particular, it might be very hard to be competitive by pre-training on purely synthetic data only, and this kind of approach could come in handy in the case where at least some non-synthetic data is still used.

And/or, that technique might be very useful for AIs generating/editing the synthetic data.
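A trivial sketch of that prefilling idea (the <AI> tag and the example format are assumptions carried over from the conditional-training scheme, purely for illustration):

```python
# Minimal sketch: prefill a long context with worked AI-mode examples before the
# actual task, to reduce task-inference ambiguity. The tag format is an assumption.

def prefill(examples: list[str], task: str, tag: str = "<AI>") -> str:
    shots = "\n\n".join(f"{tag}\n{ex}" for ex in examples)
    return f"{shots}\n\n{tag}\n{task}"
```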

On the subject of cost, it's also possible that we only need 10%, or 1%, or 0.1% of the dataset to illustrate the meaning of the <AI> tag, and the majority or great majority of it can be in human mode. I'm fairly sure both that more will be better, and that there will be diminishing returns from adding more, so if the alignment tax of doing the entire dataset is too high, investigating how good a result we can get with a smaller proportion would be worth it. I believe the Pretraining Language Models with Human Preferences paper simply did the entire training set, but they were using processing that was a lot cheaper to do than what I'm proposing.
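That investigation could start as a simple ablation at small scale. A minimal sketch, where `train_model` and `alignment_eval` are placeholders for whatever small-model training run and alignment test suite are being used:

```python
# Illustrative ablation sketch: train otherwise-identical small models on mixes
# with different fractions of AI-mode (tagged) documents, then compare alignment
# scores to see where returns start to diminish.

import random

def make_mix(ai_docs, human_docs, ai_fraction, total):
    n_ai = min(int(total * ai_fraction), len(ai_docs))
    mix = random.sample(ai_docs, n_ai) + random.sample(human_docs, total - n_ai)
    random.shuffle(mix)
    return mix

def sweep(ai_docs, human_docs, train_model, alignment_eval, total=100_000):
    results = {}
    for frac in (0.001, 0.01, 0.1, 1.0):
        model = train_model(make_mix(ai_docs, human_docs, frac, total))
        results[frac] = alignment_eval(model)
    return results
```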

Another possibility is that you'd actually do better with an expensive but very-high-quality 0.1% sample created by humans, rather than full coverage done by AI with some human input. My suspicion is that, done right, a human-AI combination is the way to go, but a small human dataset might be better than a large, badly-AI-generated dataset.