I was confident that on this very site there would be an example of someone writing an essay with the framing device that it was a blog post from 5 years in the future. Sadly, I only had enough attention span to google "site:lesswrong.com from the future" and click the first link. It was a writing game called Wikipedia Articles from the Future.
My point with this is I'm real pessimistic about generating the AI alignment textbook from 100 years in the future with prompt engineering. Why expect that you're going to get something far outside the training distribution, rather than the most likely continuation that could have come from the training distribution, which already contains people pretending to be from the future?
I would have been even more pessimistic before Minerva, but even so, we don't have a couple billion tokens of training data of people completely solving close relatives of the alignment problem to fine-tune on. Minerva is still shocking to me, but it's clear that an active ingredient in it is having a training distribution that demonstrates many copies of the reasoning you want the AI to do, and few copies of bad reasoning. And if you say the AF is such a dataset I am going to laaaugh.
Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.
As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.
Overall I'm more optimistic about using the model in an IDA-like scheme. One way this might fail on capability grounds is if solving alignment is blocked by a lack of genius-level insights, and if it is hard to get a model to come up with/speed up such insights (e.g. due to a lack of training data containing such insights).
The section on fixed points was interesting! I wonder if there's a way to avoid the recursion altogether though? Specifically, is there a way to condition the model such that the world it simulates doesn't contain humans who use the model (or one very like it)? I'm not sure, and would be interested in your thoughts on this.
Thank you!
It does seem like simulating text generated by similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful unless they are updated on the newest research.
In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transparent boxes. This might make coordination between the models less likely? Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.
The other possibility would be to not rely on IDA at all, instead just training a superhuman model and using it directly. Maybe one could extract superhuman knowledge from them safely via some version of microscope AI? Of course, in this case, the model might still reason about humans using similar models, based on its generalization ability alone. Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?
Apart from this, I do think logical dependences and superrationality would be broken if there is a strict hierarchy between different versions of models, where models know their place in the hierarchy.
Oh interesting. I think this still runs into the issue that you'll have instrumental goals whenever you ask the model to simulate itself (i.e. just the first step in the hierarchy hits this issue).
Regarding using prompts, I wonder, how do you think we could get the kind of model you talk about in your post on conditioning generative models?
I was imagining that we train the model to predict e.g. tomorrow's newspaper given today's. The fact that it's not just a stream of text but comes with time-stamps (e.g. this was written X hours later) feels important for making it simulate actual histories.
This post was written under Evan Hubinger’s mentorship, as part of the Stanford Existential Risks Initiative ML Alignment Theory Scholars (SERI MATS) program. Many of the ideas in this post, including the main idea behind the training goal, are due to Kyle McDonell and Laria Reynolds. In addition, I am grateful for comments and feedback from Arun Jose (who wrote a related post on conditioning generative models for alignment) and Caspar Oesterheld, and for a helpful discussion with James Lucassen.
Introduction
Large language models (LLMs) have recently enjoyed much success, e.g., achieving 50% accuracy on high school math competition questions. These models can solve various tasks using the right prompts or fine-tuning, such as translation, summarization, or question answering. One path to human-level and potentially superhuman AGI might be scaling up LLMs. This raises the question of what an approach to aligned AGI based on such models would look like.
One hypothesis is that, while LLMs are very competent, they are not adequately described as agents. Instead, one might describe them as myopic simulators that model a distribution over text, without understanding their place in the world or their actions' causal impact on it. For this reason, such models might be safer to use than more agentic models that pursue goals in the world.
In this post, I develop a training goal for LLMs, in Evan Hubinger’s terminology. A training goal is a description, as concrete and low-level as possible, of the algorithm that the model should implement, and an explanation of how the model will be used and why it will be aligned for that purpose. I focus on models that generate a distribution over text, conditioned on a prompt, but that are more capable than today’s models. My goal is to provide an overview of potential problems and solutions, most of which have been raised in prior work.
I will not focus on a training rationale for such models, which is the question of how to train a model to implement the described algorithm using machine learning. However, the proposal is only useful if one could actually train such an algorithm, so I will still touch on training. I also won’t discuss the proposal’s competitiveness, and I won’t look into specific prompts one might use. Lastly, note that I am not an expert on LLMs and will mostly analyze things from an abstract and informal perspective.
Using large language models for alignment
To begin, I broadly describe the setup I have in mind. The approach discussed here can be understood as a kind of oracle AI. I focus on modeling text, instead of e.g. answering questions truthfully or predicting the future, because I can more concretely imagine such an AI and its training setup, given that current LLMs are already doing this. Moreover, modeling a distribution over text is better specified[1] and thus less demanding: an AI answering questions truthfully could still give misleading answers if those answers are not also optimized to be helpful to humans. For instance, an AI optimizing for giving accurate answers may focus on irrelevant details that are easy to predict but not useful to a human. Just modeling text, on the other hand, does not require the AI to learn human intent explicitly.[2] The downside is that it will generally be less useful than an AI trying to be directly helpful to humans.
I am considering a model like GPT-N, a more capable and scaled-up version of GPT-3. The model can be used to sample text completions given a prompt. To do so, it outputs a distribution over tokens, conditional on previous text.[3] I interpret the model as representing a joint distribution over text, which is factorized into conditional distributions over tokens, where the conditioning works in the normal Bayesian way.
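To make this explicit (this is just the standard autoregressive factorization, not anything specific to this proposal): the model represents

$$p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}),$$

and conditioning on a prompt $x_{1:k}$ simply means sampling the remaining tokens $x_{k+1}, x_{k+2}, \dots$ from the corresponding conditionals.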
The model would be trained the same way current LLMs are trained and then fine-tuned on alignment forum posts, papers, etc. At every step in the process, we would do mechanistic interpretability work on the model to help us understand its cognition, and check safety-relevant properties such as myopia. This would hopefully help us avoid the problems outlined in this post. We would deploy the model in a version of imitative amplification, iteratively using the model to help us produce better alignment research (e.g., by letting the model produce alignment forum posts, asking the model to critique or summarize existing posts, or having it flesh out outlines of new posts) and then training the model further on that research.
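As a very rough sketch of the intended loop (all function names here are hypothetical placeholders, not an actual training API):

```python
# Hypothetical sketch of the iterative scheme described above; train_lm,
# interpretability_checks, generate_research, and curate are stand-ins.

def iterative_alignment_assistance(base_corpus, num_rounds):
    corpus = list(base_corpus)                 # pretraining data plus alignment posts/papers
    model = None
    for _ in range(num_rounds):
        model = train_lm(corpus)               # standard LM (cross-entropy) training
        assert interpretability_checks(model)  # e.g. check myopia before further use
        drafts = generate_research(model)      # new posts, critiques, summaries, outlines
        vetted = curate(drafts)                # human researchers review and edit
        corpus.extend(vetted)                  # train further on the vetted research
    return model
```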
In addition to a prompt, one could condition the model on observations in the form of text, as in Adam Jermyn’s post on conditioning generative models. The purpose of these observations would be to make the model believe some facts about the world. For instance, one might make the model believe it is supposed to be writing a text in 2030. If this date were just part of the prompt, there would be no way to distinguish such a text from a text written in the present, by someone pretending to live in the future. I won’t discuss this option in more detail in this post as it does not seem essential to my suggested use case.
Lastly, there should be a reason why this approach makes it easier to align AI. In my view, this approach relies on the following three assumptions: First, a language model could be made myopic in the sense that it doesn’t care about influencing the world, which would be safer than other AI designs. Second, it is easier to train a model to learn the desired distribution over text than it would be to train the model to pursue human values or follow human intent. Third, the approach makes it easier to build an aligned AI, justifying the additional effort put into building and aligning the LLM.
Behavioral objective
In this section, I will focus on the desired behavior of the model, as specified by the objective that the model should be optimized for. In the inner/outer alignment distinction, one could categorize this as the outer alignment of the model. Note that here, outer alignment means that (i) the model is faithfully modeling some target distribution over text, and (ii) this distribution over text is useful and safe for our purpose of helping with alignment research. It does not mean having goals aligned with human values.
As mentioned above, the model should represent (and compress) a distribution over text. However, given the high dimensionality of the space, no amount of training data will fully specify the desired distribution. Moreover, to use the model, we want it to perform well on previously unseen prompts, so we have to rely on the model’s inductive bias to generalize well.
This might not be an issue if our prompts are in some sense close to the hypothetical distribution generating the training data: most current LLM training happens on new training examples (since the LLM is trained for only a single epoch, i.e., a single pass over the training data), and this doesn’t prevent the model from achieving low loss. Presumably, the model can thus learn this hypothetical distribution well on average, even if it is underspecified in theory. However, the model might still perform poorly in the worst case, and we might get a problem if the prompts we use are further from the training distribution.
To begin, I will bracket the issue of generalization, and instead assume that some ground-truth distribution is given. In this case, there are two sources of uncertainty for the model: First, there might be underspecification in the prompt, leaving room for different completions according to the underlying distribution. Second, the model might be unable to represent the underlying distribution over text due to memory and computation constraints. If we use the cross-entropy loss to train the model, this should presumably select for a model that matches the true distribution as closely as possible in KL-divergence.
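This follows from the standard decomposition of the expected cross-entropy loss into the (fixed) entropy of the data distribution plus a KL term:

$$\mathbb{E}_{x \sim p}\big[-\log q_\theta(x)\big] \;=\; H(p) \;+\; D_{\mathrm{KL}}\!\big(p \,\|\, q_\theta\big),$$

so minimizing the loss over $\theta$ minimizes $D_{\mathrm{KL}}(p \,\|\, q_\theta)$, since $H(p)$ does not depend on the model.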
Self-fulfilling prophecies and counterfactual oracles
An important problem with incentivizing predictions arises when predictions themselves can influence the world. To see this, consider the example of predicting the stock market. A prediction of a market crash will likely cause the market to actually go down and thus act as a self-fulfilling prophecy. An agent optimizing for high prediction accuracy would likely make self-fulfilling predictions that make the world more predictable if given the chance.
In the case of LLMs, we could imagine a case in which a distribution over text can be influenced by the model’s previous predictions. While a model couldn’t gain anything immediately from choosing a particular output, a non-myopic model that cares about future loss might choose to manipulate the future distribution over text. For example, a language model could output a text that hacks the training process and changes all future training samples so they can be predicted perfectly. The task might thus be vulnerable to an auto-induced distributional shift.
Luckily, at least the most obvious issues with auto-induced distributional shift should be avoided in our case, as long as the model is myopically optimizing its loss. This is because we assume that the model’s predictions will only ever be evaluated on already generated text, so a prediction can never causally influence its own accuracy. Nevertheless, we might get into more subtle problems related to self-fulfilling prophecies. Moreover, these issues are an important motivation for choosing LLMs rather than some other type of oracle. I will thus elaborate on them further here.
Issues with predictions that can influence the world have been researched in the context of decision markets. In a decision market, traders are incentivized to make predictions about outcomes conditional on different actions. A straightforward implementation of a decision market is a conditional prediction market, but it suffers from perverse incentives: for instance, a trader could make bets conditional on action A if it can predict the associated conditional outcomes well. If the trader has enough sway over the market, it could then force it to choose action A, by making additional bets that make all other actions look bad (even if action A is not the optimal action). As a result, the trader can make money on its correct predictions about action A, while its bets conditional on the other actions become void since those actions haven't been taken.
Chen et al. (2011) show that to avoid such problems and incentivize honest predictions about outcomes in a decision market, all actions have to be taken with nonzero probability, and scores have to be adjusted using a version of importance sampling, to make sure that traders’ expected returns don’t depend on the action distribution chosen based on the market. Oesterheld and Conitzer (2021) show that, when choosing a single action deterministically based on a recommendation by an expert, one can only incentivize the expert to report the expected utility of the action. These results show that one cannot use predictions about outcomes given actions for decision-making unless one only asks experts to predict expected values or ensures that the experts’ expected returns don’t depend on one’s actions.
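Roughly, the mechanism in Chen et al. is the following (a simplified sketch, not their exact formulation): if each action $a$ is taken with probability $\pi(a) > 0$, and a trader’s proper score for its prediction conditional on $a$ is $S_a$, then paying out the realized score rescaled by $1/\pi(a)$ gives an expected payoff of

$$\sum_a \pi(a)\,\frac{1}{\pi(a)}\,\mathbb{E}[S_a] \;=\; \sum_a \mathbb{E}[S_a],$$

which no longer depends on $\pi$, so a trader cannot increase its expected payoff by distorting the market to change which action gets chosen.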
Stuart Armstrong’s counterfactual oracles solve this issue similarly to Chen et al. in the context of oracle AIs. After a counterfactual oracle has made a prediction, we randomize between two options: (i) most of the time, we use the prediction, but we never reward the model for it; (ii) with small probability, we ignore the oracle’s prediction and reward the oracle once the prediction has been resolved. This way, the expected reward for the oracle is independent of the actions we choose based on its predictions, and it is thus incentivized to honestly predict the future without taking the influence of its prediction on our actions into account.
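A minimal sketch of the counterfactual-oracle protocol (the oracle, outcome, and reward below are toy stand-ins; only the randomization structure is the point):

```python
import random

EPSILON = 0.01  # small probability of the evaluation ("reward") branch

def query_counterfactual_oracle(predict, resolve_outcome, give_reward):
    prediction = predict()
    if random.random() < EPSILON:
        # Evaluation branch: ignore the prediction, wait for the outcome,
        # and reward the oracle for accuracy; the prediction never influenced the world.
        outcome = resolve_outcome()
        give_reward(prediction, outcome)
        return None                     # we never act on this prediction
    # Use branch: act on the prediction, but never reward the oracle for it,
    # so its expected reward is independent of what we do with its outputs.
    return prediction

# Toy usage with trivial stand-ins:
query_counterfactual_oracle(
    predict=lambda: "the market goes up",
    resolve_outcome=lambda: "the market went down",
    give_reward=lambda pred, outcome: None,
)
```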
To transfer this idea into the machine learning setting, we can ask whether a training objective is compatible with learning a model that reasons like a counterfactual oracle. In the case of language modeling, a counterfactual oracle would predict text without taking the causal effect of its prediction on the modeled distribution over text into account. Training the model in a supervised fashion on existing text, as with the normal GPT-3 objective, would be compatible with this. Training the model using RL from human feedback, for example, would explicitly incentivize optimizing text to get good human evaluations, so it would not incentivize learning a counterfactual oracle. To get counterfactual oracle-like reasoning, the training signal can only come from text that has already been written. Note, though, that the training text could have been written using previous outputs of the model, as long as the model is never trained directly based on evaluations of the model’s predictions.
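As a rough way to state the contrast between the two kinds of training signals:

$$\mathcal{L}_{\text{sup}}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}}\Big[\sum_t \log p_\theta(x_t \mid x_{<t})\Big] \qquad \text{vs.} \qquad \mathcal{L}_{\text{RL}}(\theta) \;=\; -\,\mathbb{E}_{x \sim p_\theta}\big[r(x)\big],$$

where $\mathcal{D}$ is a fixed corpus of already-written text and $r$ is, e.g., a learned human-feedback reward. Only the second objective evaluates text that the model itself generated, which is what reintroduces the incentive to optimize outputs for good evaluations.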
A consequence of training on such an objective is that, at least in theory, we get no guarantees for our actual use case of generating new alignment research, based on training alone. Critically, the model has to be able to generalize safely to new prompts. This is different from, e.g., RL, where perfect behavior on the RL objective at all times would imply an aligned agent if the objective is aligned (here, one of the hard parts would be specifying an objective that is always aligned).
Logical dependences
In addition to the self-fulfilling prophecies discussed above, some parts of the modeled distribution might depend on the input-output behavior of the model, causing a logical dependence between the model’s output and the distribution. This would enable acausal self-fulfilling prophecies.
Trying out GPT-3 on two problems with potential logical dependence (GPT-3’s completion is bold, and answers are cherry-picked). Note that these problems only have logical dependence if we create the ground-truth distribution for the completions in a specific way (i.e. by using completions by GPT-3).
For example, consider a training process in which the model is copied, used to generate a token for a fixed prompt, and then trained for one step on this predicted token. If we iterate this procedure further, we get a dynamical system, which might eventually converge to a confident estimate of a single token. We could also let a model predict a number n and then train it to predict the number n+1. Iterating this would yield a divergent sequence of predicted numbers. Predicting the number n+1 mod 10 might instead result in the model getting stuck in a cycle.
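A toy illustration of these dynamics (nothing LLM-specific; the “model” is just a stored number, and one “training step” fits the target exactly):

```python
# Toy dynamical system: at each step the training target is a function of the
# model's own previous prediction, and one "training step" fits it exactly.

def iterate(target_fn, start=0, steps=12):
    prediction = start
    trajectory = [prediction]
    for _ in range(steps):
        target = target_fn(prediction)   # training data depends on the model's output
        prediction = target              # fit the new target
        trajectory.append(prediction)
    return trajectory

print(iterate(lambda n: n + 1))          # diverges: 0, 1, 2, 3, ...
print(iterate(lambda n: (n + 1) % 10))   # gets stuck in a cycle through 0..9
print(iterate(lambda n: n))              # a fixed point: stays at 0
```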
These training processes are unrealistic; however, a situation in which several close copies of a model contribute to a training corpus does seem likely—in fact, this would be at the core of an iterative alignment approach using LLMs. The training corpus may hence depend on the model’s input-output behavior at least to some degree.
If models are myopically optimized, starting with some random initialization, such dependences give rise to learning dynamics similar to those in differentiable games. For instance, we could model the examples above as an alternating game between two players, who take turns in updating their parameters in response to the actions of the other player. These dynamics could converge to several different fixed points or limit cycles, some of which might be undesirable. Myopic optimization of the model’s loss would correspond to implementing a causal decision theory (CDT), and a fixed point would correspond to a ratifiable policy in this context. It seems to me that this is what would happen to present-day LLMs.
Another possibility would arise if a capable model builds a very good model of the process by which its training texts are generated and simulates this process. Such a model might simulate a copy of itself used to generate some of the training text, which might, in turn, simulate another copy, and so on, potentially resulting in a loop of recursive simulations. A model might reason about such recursive loops on a higher level and choose to output a fixed point of the recursive simulations directly, e.g., to minimize its training loss.
This would be the case, for instance, if the model implements a functional decision theory (FDT) and thus explicitly takes into account logical dependences between its actions and the training distribution. FDT agents would potentially act non-myopically, even if they only care about the single next prediction. Moreover, they might perform better even under the one-step training loss (choosing the fixed point with the lowest loss), and might thus be learned by some training procedures; even if a stop-gradient between copies of models prevented a non-myopic training signal from directly optimizing fixed points, models might become more and more aware of dependences as a side effect of training.
A model directly choosing a fixed point could be more dangerous than a model converging to an arbitrary ratifiable prediction according to the abovementioned learning dynamics. E.g., reconsider the above example of hacking the training process. Assume that the model can implement an exploit that turns the text used for the exploit into the only training example. Even a model that only cares about its own immediate loss might choose to superrationally implement this exploit: if the model predicts the exploitative text, this is evidence that its close copies from earlier training steps did the same, and thus this text is indeed the most likely output. Hence, predicting the exploitative text would acausally lead to a better loss than alternatives.
One may try to avoid such dependences altogether by making predictions independent of the modeled distribution. In a prediction market, for instance, one could randomize the exploration of different traders, using a source of randomization independent of the model’s environment. A trader in the model would then be unable to choose a particular outcome across all of their copies to minimize their loss, because they wouldn’t know at which time step their counterparts in the model's copies would be influencing the decision in the same way. With the right set of traders and randomization scheme, we might get a kind of random exploration that would yield CDT/ratifiability-style decisions. I am unsure how to directly transfer this idea to LLM training.
Another possible solution would be to stratify different models and their predictions, separating them into an ordered set of groups. A model in group i would then only ever be trained on outputs of models in groups j<i. An initial group 0 would just be trained on human-generated text. Even if the groups are very similar and there are thus still dependences, the dependences would eventually bottom out at group 0—there would be a first model which has nothing to gain from superrationality or acausal trade. Similarly, any recursive simulation would eventually bottom out. Hence, the training distribution for any of the models would be uniquely determined.
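A minimal sketch of the stratification scheme (train_model and generate_outputs are hypothetical placeholders for the real training and generation procedures):

```python
# Sketch of the stratified setup: the model in group i is only ever trained on
# human text plus outputs of strictly earlier groups, so its training
# distribution is fixed before it exists.

def train_stratified(human_corpus, num_groups, train_model, generate_outputs):
    corpora = [list(human_corpus)]          # corpus available to group 0
    models = []
    for i in range(num_groups):
        model_i = train_model(corpora[i])   # group i never sees outputs of groups >= i
        models.append(model_i)
        corpora.append(corpora[i] + generate_outputs(model_i))  # corpus for group i+1
    return models
```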
Stuart Armstrong discusses some further problems with superrationality and ideas for preventing them in the context of oracle AIs, including cases in which we only have control over one of several oracles.
Generalization and malign induction
In this section, I will focus on predicting text given new prompts, which is important as the actual use-case of the model. In our setting, optimal loss on a non-training prompt is underspecified without some inductive bias, so specifying this bias is part of the outer alignment of the model, not just an inner alignment issue (cf. Finnveden 2021). However, note that we don’t necessarily have to solve this issue via an idealized theory of induction. Maybe we could just try to understand and shape a model’s cognition (such as having a particular model of the world, etc.) directly.
LLMs are already capable of solving various reasoning tasks requiring at least some original effort. It seems plausible that we could get at least human-level original alignment research ideas from GPT-N (though it might be harder to get the model to generate ideas that exceed human capabilities or break entirely new ground). This leaves open the question of whether GPT-N would be aligned for this task.
One worry is that GPT-N might have a misaligned inductive bias, similar to issues with the Solomonoff prior or the minimal circuit prior. For example, our model might come to believe that all sufficiently intelligent agents writing about alignment research are deceptively aligned and are trying to secretly implement an unsafe AGI in our world. Or it might believe that it is living in a simulation and that the distribution over text it is modeling is controlled by some misaligned simulator, who might choose to change the distribution at some point. Note that modeling text already requires the ability to model goal-directed behavior, deception, etc., so incorporating additional unaligned deceptive agents into a world might only increase complexity a little.
One way to alleviate such issues would be by using prompts closer to the training distribution, so that a model without access to the training data would be unable to reliably distinguish these prompts from new training prompts. In that case, if the training distribution consists of, e.g., forum posts that were optimized to be helpful for alignment, then this would likely apply to new generated posts as well. That is, assuming the model does well on unseen training examples. We would still get a problem if the model can infer everything about this world from the training data and thus reverse-engineer everything about the training process, including future training examples and deployment situations. But it is unclear whether training examples would contain enough information to do this to sufficient precision, and it might be prohibited at least by some reasonable combination of speed and simplicity prior.
Another hope would be that the task of modeling text would more likely give rise to aligned models than other prediction tasks. To model human-written text, it seems useful to deal with human concepts in a human-like world model. This might be simpler and faster than building a more accurate model and then translating between that model and human-written text. Unfortunately, this argument does not speak against a model that is close to a human’s world model, but with some key differences, such as some agents trying to break out of their simulation. Another counterargument would be that sometimes, it is useful for the model to be more accurate than the world model of a human writer. For instance, a journalist writing about some world event usually has access to various sources of information about the world in which this event occurred. A model that is prompted with the headline of the article, however, would have to first infer this information, by conditioning some prior over worlds on the event mentioned in the headline. It would thus be incentivized to be able to make this inference arbitrarily well, in addition to modeling the state of mind of the journalist.
One reason malign hypotheses in the Solomonoff prior can be simpler than true hypotheses is that the former privilege a specific simulated stream of observations and thus have to spend less complexity on locating these observations in a bigger world. This might be less of an issue with language models. Here, simplicity has to be amortized over all prompts, so the malign hypothesis can gain less advantage by singling out a specific agent’s observations in a world. Instead, the malign hypothesis has to be able to generate the correct stream of tokens given any prompt, so it may have to pay the same complexity cost as an aligned hypothesis.
Intuitively, there exists a right way for an LLM to generalize, which raises hopes that one might be able to get a model to do so. For instance, there seems to exist a reasonable distribution over what a new Paul Christiano blog post would look like, potentially even if we condition on something more outlandish, such as the blog post being written 5 years from now. One possible formulation for a reasonable prior over possible worlds might be the probability distribution output by Eliezer Yudkowsky’s outcome pump. However, to use this prior to model text, one would still have to determine how to extract text from possible worlds. Moreover, a model might never be able to learn such a prior based on the training corpus alone.
Model cognition
In this section, I will turn to a specification of the model’s desired cognition, i.e., the algorithm it implements to achieve the behavior specified above. The discussion below is very informal, and making the discussed concepts more concrete would be a useful direction for further research. I focus on issues of deceptive alignment and myopia here and omit a discussion of the model’s cognition when it comes to generalizing to new prompts (hopefully, the latter will be facilitated by further interpretability research on LLMs).
An LLM might learn to become a deceptively aligned agent, making good predictions but ultimately pursuing a misaligned goal. Such an agent could, for instance, manipulate its predictions on non-training prompts to steer AI alignment research into a particular misaligned direction. At least, the agent could make an acausal trade with an unaligned AI and try to sabotage AI alignment.
Such deceptive alignment might be prevented if we ensure that some property of the model’s cognition, called an acceptability predicate, holds most of the time during training. Below, we discuss two possible predicates: myopic cognition and action-based consequentialism.
I say that a model has myopic cognition if it only optimizes its immediate next prediction, ignoring the effect of this prediction on the target distribution or on the model’s loss in later steps. For instance, an ideal counterfactual oracle would have myopic cognition. A myopic training objective is defined as any training objective under which a model with myopic cognition is optimal. Note that a myopic training objective is necessary to avoid incentivizing non-myopic cognition in the model, but it might not be sufficient. We want the model to be myopic to avoid the issues with self-fulfilling prophecies discussed above and to prevent the model from becoming deceptive.
A myopic model doesn’t think about long-term consequences and would thus not become deceptive for instrumental reasons, even if it was misaligned. For instance, consider a model that has a misaligned preference to add the word “banana” to its predictions. Such a preference might arise as a form of proxy alignment, for instance, when modeling text involving delicious fruits. If this preference was non-myopic, then the model might reason that it should not output “banana” too often during training, to retain this preference and thus be able to realize its preference at some later point. E.g., once the model has become more powerful, it might be able to manipulate its training process, causing future texts to contain the word banana more often. The model might also choose to gradient hack, e.g., by omitting the word “banana” even when it would be appropriate to mention it. The training process might then install an even stronger preference towards the word in the agent. If the model was myopic, however, it would care only about its immediate next prediction and thus immediately output the word. Its misaligned preference would then be corrected by the training algorithm once it is trained on non-fruit-related prompts.
An acceptability predicate should serve as a buffer around deception, preventing it from occurring, even if the predicate was violated for a few steps during training. Myopia alone might not provide a sufficient buffer. For instance, consider the model with the preference for outputting the word “banana”. The model might initially be myopic, just because it is only aware of the current training step. It might become non-myopic, then, after realizing that it will also get opportunities for outputting text in future steps. At that point, the model might immediately become deceptive, if it is already an agent that can plan and has a good model of the world. There might then be ways for the model to strategically hide its deceptive and non-myopic cognition, so that the model’s cognition could not be corrected afterward.
Another disadvantage of myopia is that a myopic agent might wirehead if it cares about the actual physical implementation of its loss function: if there was a security flaw in the training process that allowed the model to hack its immediate training loss, then even a myopic agent would choose to exploit this flaw.
To avoid such failures, one could require the model to care only about its actions, not about any consequences of these actions, making it an action-based consequentialist agent. The model would optimize its actions according to some internal objective, but it would not consider manipulating the world. This might provide a better buffer around non-myopically optimized and deceptive actions, as the model would need to learn to reason about itself as causally influencing the world, and become non-myopic, before it could notice deceptive strategies.
Lastly, we can distinguish between the simulator—the model itself—and simulacra—agents simulated in the process of generating text. These two are separated at least in principle: on the one hand, simulacra only appear for specific prompts, and the properties of their cognition differ for different prompts. On the other hand, the simulator produces all texts, and its cognition is relevant regardless of prompts. While the simulator might be myopic, simulacra can easily be non-myopic, deceptive, etc., and simulating such non-myopic simulacra is necessary to create texts written by or about such agents (including alignment forum posts). Since the simulator already has to be able to simulate such agents and their plans, it seems that it would never have to change or learn a lot to become a non-myopic agent. This seems like a big problem, potentially preventing us from implementing any buffer around a deceptive generative model.
Using the model
Finally, I will briefly discuss using the model to help with the alignment problem.
One direct approach to solving alignment using a language model would be to prompt the model to output a textbook on how to solve the alignment problem, as produced by an expert alignment researcher 100 years from now, conditional on AI alignment having been solved. Assuming we are able to make the model believe the conditional, this might be faster than other approaches. However, it would require a more powerful model, and I believe it would be less safe.
Conditioning on such prompts seems more dangerous, as we rely more on an aligned inductive bias in the model. First, if we condition on very unlikely events, the set of remaining possible worlds has smaller measure, so slight inaccuracies in the model’s distribution might have a large effect on such conditional distributions. Second, we can think of a prompt as spending bits to optimize the generated distribution. If we use a lot of bits, this takes us further away from the prior distribution, which could be dangerous if we are not confident in the quality of our prompt and the corresponding conditional distribution. A more likely prompt optimizes less and produces something more analogous to a quantilizer. Third, for arbitrary prompts, we might run into decision-theoretic problems such as the 5-and-10 problem, due to the fact that the model implements prompts as Bayesian conditionals instead of some other type of counterfactual.
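The “spending bits” picture from the second point can be made precise: conditioning the prior $P$ on an event $E$ (the prompt or observation) moves the resulting distribution away from the prior by exactly the improbability of that event,

$$D_{\mathrm{KL}}\!\big(P(\cdot \mid E)\,\|\,P\big) \;=\; -\log P(E),$$

so a prompt the model assigns probability $2^{-n}$ applies $n$ bits of selection pressure to the generated distribution (taking logs base 2); rarer conditionals push further from the prior and lean harder on the model’s inductive bias.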
To alleviate these issues, we could use the model iteratively in a version of imitative amplification. In this approach, we would ask the model to produce new alignment research, critique or summarize existing research, expand outlines of posts, or assist researchers in other ways. The prompts used could be chosen to remain close to the training distribution, allowing us to be more confident in the faithfulness of the produced completions. The resulting research could then be used to further train the model.
Unhelpful predictions
One problem with counterfactual oracles (and thus the discussed LLMs) is that, by design, the model is not optimized for anything other than faithfully representing a distribution. As a consequence, we avoid the dangers of trying to align the system directly, but we get a less useful system.
For instance, in the ambitious case in which we condition on a world in which AI alignment has been solved, we might just get worlds in which alignment turned out to be very easy anyway, such that the generated text wouldn’t be helpful to us in either case. A subtle bias like this might also apply to more likely prompts, so that even the iterative proposal might slowly drift into less useful research directions.
We could try to anticipate and fix such issues, but it is unclear whether we would be able to do so for all of them, without having any direct training signal towards usefulness.
Dangerous agents
Another part of aligning LLMs is generating text written by aligned simulated agents. Whether this is possible depends on whether the model is learning the desired posterior distribution (forming an accurate model of the world) and whether there are enough aligned agents in this distribution. It seems that agents will likely be aligned at least some of the time under a less ambitious approach, where we would try to simulate, e.g., Paul Christiano.
Even for more ambitious proposals, it seems we might be at an advantage if we only have to design a good prompt, rather than a training procedure that will consistently lead to a similarly aligned agent. Adam Jermyn discusses ways to design more ambitious prompts to help with building aligned AI in more detail.
Simulated agents might be able to infer that they are in a simulation, make inferences about the real world, and then try to escape the simulation. This could cause irreparable damage, e.g., if simulated agents are able to produce text to hack computer systems or persuade humans. We would thus need to be confident in the safety of a prompt ahead of time, either by making sure the simulated agent is aligned or not too powerful.
Conclusion
In this post, I discuss a potential training goal for future capable LLMs and give an overview of the various problems that may arise. Overall, using LLMs in an imitative amplification scheme to help with alignment research seems like a promising avenue. One of the most useful research directions for this purpose might be LLM interpretability, in addition to fleshing out the training goal further and gaining more conceptual clarity on the discussed problems.
[1] There are of course also other possible well-specified prediction targets for oracles.
[2] An AI may still have to model human intent implicitly insofar as that is important for generating text.
[3] The fact that we have access to distributions, instead of, e.g., maximum likelihood estimates, is important for several reasons: first, maximum likelihood estimates can be very untypical. For instance, when throwing a pair of dice repeatedly, the maximum likelihood estimate for the sum on each throw is 7. However, in most worlds, the sum won’t be 7 every single time. Second, we want to be able to incentivize the model to be uncertain in a calibrated way; otherwise, the model might choose to focus on some versions of an output that it knows how to produce well, even if some harder-to-model version would be equally likely given the prompt. For instance, a model may be uncertain whether it is supposed to write an honest news article or a fictional story. If both are equally likely, and there is only one plausible fictional story, but many different possible news articles, then a model outputting a maximum likelihood estimate might consistently produce the fictional story. A model sampling from a distribution incentivized by a proper scoring rule would output news articles and fictional stories with equal probability. Third, some proposals might depend on getting multiple samples. E.g., one may be able to implement a version of the consensus algorithm using samples from a large language model.
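To make the dice example concrete: for a fair pair of dice,

$$P(\text{sum} = 7) = \tfrac{6}{36} = \tfrac{1}{6}, \qquad P(\text{all } n \text{ throws sum to } 7) = \big(\tfrac{1}{6}\big)^{n} \approx 1.6 \times 10^{-8} \text{ for } n = 10,$$

so although 7 is the single most likely sum on every throw, a sequence consisting only of 7s is itself extremely unlikely.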