Yoshua Bengio recently posted a high-level overview of his alignment research agenda on his blog. I'm pasting the full text below since it's fairly short.

What can’t we afford with a future superintelligent AI? Among other things, confidently wrong predictions about the harm that some actions could yield. Especially catastrophic harm. Especially if these actions could spell the end of humanity.

How can we design an AI that will be highly capable and will not harm humans? In my opinion, we need to figure out this question – of controlling AI so that it behaves in really safe ways – before we reach human-level AI, aka AGI; and to be successful, we need all hands on deck. Economic and military pressures to accelerate advances in AI capabilities will continue to push forward even if we have not figured out how to make superintelligent AI safe. And even if some regulations and treaties are put into place to reduce the risks, it is plausible that human greed for power and wealth and the forces propelling competition between humans, corporations and countries, will continue to speed up dangerous technological advances.

Right now, science has no clear answer to this question of AI control and how to align its intentions and behavior with democratically chosen values. It is a bit like in the “Don’t Look Up” movie. Some scientists have arguments about the plausibility of scenarios (e.g., see “Human Compatible“) where a planet-killing asteroid is headed straight towards us and may come close to the atmosphere. In the case of AI there is more uncertainty, first about the probability of different scenarios (including about future public policies) and about the timeline, which could be years or decades according to leading AI researchers. And there are no convincing scientific arguments which contradict these scenarios and reassure us for certain, nor is there any known method to “deflect the asteroid”, i.e., avoid catastrophic outcomes from future powerful AI systems. With the survival of humanity at stake, we should invest massively in this scientific problem, to understand this asteroid and discover ways to deflect it. Given the stakes, our responsibility to humanity, our children and grandchildren, and the enormity of the scientific problem, I believe this to be the most pressing challenge in computer science that will dictate our collective wellbeing as a species. Solving it could of course help us greatly with many other challenges, including disease, poverty and climate change, because AI clearly has beneficial uses. In addition to this scientific problem, there is also a political problem that needs attention: how do we make sure that no one triggers a catastrophe or takes over political power when AGI becomes widely available or even as we approach it. See this article of mine in the Journal of Democracy on this topic.

In this blog post, I will focus on an approach to the scientific challenge of AI control and alignment. Given the stakes, I find it particularly important to focus on approaches which give us the strongest possible AI safety guarantees. Over the last year, I have been thinking about this and I started writing about it in this May 2023 blog post (also see my December 2023 Alignment Workshop keynote presentation). Here, I will spell out some key thoughts that have emerged as my thinking on this topic has matured and that are driving my current main research focus. I have received funding to explore this research program and I am looking for researchers motivated by existential risk and with expertise spanning mathematics (especially probabilistic methods), machine learning (especially amortized inference and transformer architectures) and software engineering (especially training methods for large-scale neural networks).

I will take as a starting point of this research program the following question: if we had enough computational power, could it help us design a provably safe AGI? I will briefly discuss below a promising path to approximate this ideal, with the crucial aim that as we increase computational resources or the efficiency of our algorithms, we obtain greater assurances about safety.

First, let me justify the Bayesian stance – or any other that accounts for the uncertainty about the explanatory hypotheses for the data and experiences available to the AI. Note that this epistemically humble posture of admitting any explanatory hypothesis that is not contradicted by the data is really at the heart of the scientific method and ethics, and motivated my previous post on the “Scientist AI”. Maximum likelihood and RL methods can zoom in on one such explanatory hypothesis (e.g., in the form of a neural network and its weights that fit the data or maximize rewards well) when in fact the theory of causality tells us that even with infinite observational data (not covering all possible interventions), there can exist multiple causal models that are compatible with the data, leading to ambiguity about which is the true one. Each causal model has a causal graph specifying which variable is a direct cause of which other variable, and the set of causal graphs compatible with a distribution is called the Markov equivalence class. Maximum likelihood and RL are likely to implicitly pick one explanatory hypothesis H and ignore most of the other plausible hypotheses (because nothing in their training objective demands otherwise). “Implicitly”, because for most learning methods, including neural networks, we do not know how to have explicit and interpretable access to the innards of H. If there are many explanatory hypotheses for the data (e.g., different neural networks that would fit the data equally well), it is likely that the H picked up by maximum likelihood or RL will not be the correct one or a mixture containing the correct one because any plausible H or mixture of them (and there could be exponentially many) would maximize the likelihood or reward.
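
To make the Markov-equivalence point concrete, here is a small numpy sketch (mine, not from the post) of two causal models over binary variables, X→Y and Y→X, that induce exactly the same observational joint distribution yet disagree about what happens under an intervention on X:

```python
import numpy as np

# Two causal models over binary variables X, Y that imply the *same*
# observational joint P(X, Y) but *different* interventional predictions.

# Model A: X -> Y
p_x = 0.5                        # P(X=1)
p_y_given_x = {0: 0.2, 1: 0.9}   # P(Y=1 | X=x)

# Observational joint implied by model A: joint[x, y] = P(X=x, Y=y)
joint = np.zeros((2, 2))
for x in (0, 1):
    px = p_x if x == 1 else 1 - p_x
    for y in (0, 1):
        py = p_y_given_x[x] if y == 1 else 1 - p_y_given_x[x]
        joint[x, y] = px * py

# Model B: Y -> X, with parameters chosen (via Bayes' rule) to reproduce
# exactly the same observational joint -- so data alone cannot tell A from B.
p_y = joint[:, 1].sum()                                             # P(Y=1)
p_x_given_y = {y: joint[1, y] / joint[:, y].sum() for y in (0, 1)}  # P(X=1 | Y=y)

# Under an intervention do(X=1) the two models disagree:
#   Model A (X -> Y): P(Y=1 | do(X=1)) = P(Y=1 | X=1)
#   Model B (Y -> X): intervening on X leaves Y alone, so P(Y=1 | do(X=1)) = P(Y=1)
print("P(Y=1 | do(X=1)) under A:", p_y_given_x[1])   # 0.9
print("P(Y=1 | do(X=1)) under B:", p_y)              # 0.55
```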

Why is that a problem, if we have a neural net that fits the data well? Not taking into account the existence of other H’s would make our neural network sometimes confidently wrong, and it could be about something very important for our survival. Serious out-of-distribution failures are well documented in machine learning, but for now do not involve decisions affecting the fate of humanity. To avoid catastrophic errors, now consider a risk management approach, with an AI that represents not a single H but a large set of them, in the form of a generative distribution over hypotheses H. Hypotheses could be represented as computer programs (which we know can represent any computable function). By not constraining the size and form of these hypotheses, we are confident that a correct explanation, at least one conceivable by a human, is included in that set. However, we may wish to assign more probability to simpler hypotheses (as per Occam’s Razor). Before seeing any data, the AI can therefore weigh these hypotheses by their description length L in some language to prefer shorter ones, and form a corresponding Bayesian prior P(H) (e.g. proportional to 2^{-L}). This would include a “correct” hypothesis H*, or at least the best hypothesis that a human could conceive by combining pieces of theories that humans have expressed and that are consistent with data D. After seeing D, only a tiny fraction of these hypotheses would remain compatible with the data, and I will call them plausible hypotheses. The Bayesian posterior P(H | D) quantifies this: P(H | D) is proportional to the prior P(H) times how well H explains D, i.e., the likelihood P(D | H). The process of scientific discovery involves coming up with such hypotheses H that are compatible with the data, and learning P(H | D) would be like training an AI to be a good scientist that spits out scientific papers that provide novel explanations for observed data, i.e., plausible hypotheses. Note that the correct hypothesis, H*, by definition must be among the plausible ones, since it is the best possible account of the data, and with Occam’s Razor hypothesis we can assume that it has a reasonable and finite description length. We will also assume that the data used to train our estimated posterior is genuine and not consistently erroneous (otherwise, the posterior could point to completely wrong conclusions).
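
As a toy illustration of the prior-times-likelihood arithmetic described above (my own example, not Bengio's): hypotheses are short Python predicates, the prior weighs each one by 2^{-L} with L its description length in characters, and the posterior re-weights them by how well they explain the observed data, here assuming data points are drawn uniformly from whatever set the hypothesis allows.

```python
import math

# Toy Bayesian update over program-like hypotheses.
# Each hypothesis H is a predicate "is this integer in the underlying set?"
# plus a likelihood model: data points are drawn uniformly from that set.
hypotheses = {
    "lambda n: n % 2 == 0":       lambda n: n % 2 == 0,        # even numbers
    "lambda n: n % 4 == 0":       lambda n: n % 4 == 0,        # multiples of 4
    "lambda n: n in (4, 8, 16)":  lambda n: n in (4, 8, 16),   # a small finite set
    "lambda n: True":             lambda n: True,              # anything goes
}

D = [4, 8, 16, 4]          # observed data
domain = range(1, 33)      # assume data is sampled uniformly from {1..32} restricted to H

def prior(src):            # Occam prior: P(H) proportional to 2^(-description length)
    return 2.0 ** (-len(src))

def likelihood(h, data):   # P(D | H): uniform over the hypothesis' extension
    if any(not h(d) for d in data):
        return 0.0         # H is contradicted by the data
    support = [n for n in domain if h(n)]
    return (1.0 / len(support)) ** len(data)

unnorm = {src: prior(src) * likelihood(h, D) for src, h in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {src: w / Z for src, w in unnorm.items()}
for src, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P(H | D) = {p:.3f}   H = {src}")
```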

There is a particularly important set of difficult-to-define concepts for a safe AI, which characterize what I call harm below. I do not think that we should ask humans to label examples of harm because it would be too easy to overfit such data. Instead we should use the Bayesian inference capabilities of the AI to entertain all the plausible interpretations of harm given the totality of human culture available in D, maybe after having clarified the kind of harm we care about in natural language, for example as defined by a democratic process or documents like the beautiful UN Universal Declaration of Human Rights.

If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.

Based on this observation we can decompose our task in two parts: first, characterize the set of plausible hypotheses – this is the Bayesian posterior P(H | D); second, given a context c and a proposed action a, consider plausible hypotheses which predict harm. This amounts to looking for an H for which P(H, harm | a, c, D)>threshold. If we find such an H, we know that this action should be rejected because it is unsafe. If we don’t find such a hypothesis then we can act and feel assured that harm is very unlikely, with a confidence level that depends on our threshold and the goodness of our approximation.
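
A minimal sketch of this two-part decision rule, with hypothetical stand-ins (`Hypothesis`, `sample_plausible_hypotheses`) for the learned posterior sampler and harm model; here P(harm | H, a, c) over sampled plausible H is used as a simple proxy for the joint criterion P(H, harm | a, c, D) > threshold, and the numbers are arbitrary.

```python
import random

class Hypothesis:
    """Hypothetical stand-in for a world model sampled from P(H | D)."""
    def __init__(self, harm_prob):
        self.harm_prob = harm_prob

    def p_harm(self, action, context):
        # This hypothesis' predicted probability that `action` causes major harm.
        return self.harm_prob

def sample_plausible_hypotheses(data, k=1000):
    """Stand-in for drawing k samples H ~ P(H | D) from the learned posterior."""
    return [Hypothesis(random.betavariate(1, 50)) for _ in range(k)]

def action_is_acceptable(action, context, data, threshold=0.2):
    """Reject the action if ANY sampled plausible hypothesis predicts harm above
    the threshold; otherwise accept (up to the quality of the approximation)."""
    for h in sample_plausible_hypotheses(data):
        if h.p_harm(action, context) > threshold:
            return False, h      # found a plausible hypothesis predicting harm
    return True, None

ok, culprit = action_is_acceptable("deploy_plan_A", context={}, data=None)
print("accept" if ok else "reject")
```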

Note that with more data, the set of hypotheses compatible with the data (those that have a high probability under P(H | D)), will tend to shrink – exponentially, in general. However, with the space of hypotheses being infinite in the first place, we may still end up with a computationally intractable problem. The research I am proposing regards how we could approximate this tractably. We could leverage the existing and future advances in machine learning (ML) based on the work of the last few decades, in particular our ability to train very large neural networks to minimize a training objective. The objective is that safety guarantees will converge to an exact upper bound on risk as the amount of available compute and the efficiency of our learning methods increase.

The path I am suggesting is based on learned amortized inference, in which we train a neural network to estimate the required conditional probabilities. Our state-of-the-art large language models (LLMs) can learn very complex conditional distributions and can be used to sample from them. What is appealing here is that we can arbitrarily improve the approximation of the desired distributions by making the neural net larger and training it for longer, without necessarily increasing the amount of observed data. In principle, we could also do this with non-ML methods, such as MCMC methods. The advantage of using ML is that it may allow us to be a lot more efficient by exploiting regularities that exist in the task to be learned, by generalizing across the exponential number of hypotheses we could consider. We already see this at play with the impressive abilities of LLMs, although I believe that their training objective is not appropriate because it gives rise to confidently wrong answers. This constitutes a major danger for humans when the answers concern what many humans would consider unacceptable behavior.
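
One concrete reading of "learned amortized inference" is neural posterior estimation (my choice of method here, not necessarily the objective Bengio intends): if we can simulate (H, D) pairs from the prior and likelihood, then fitting a conditional model q_θ(H | D) by maximum likelihood on those pairs drives it toward the true posterior, and more compute (more simulated pairs, a larger network) tightens the approximation. A toy PyTorch sketch with a known closed-form posterior to check against:

```python
import torch
import torch.nn as nn

# Neural posterior estimation (one flavor of amortized inference), as a toy:
# hypothesis H is the unknown mean of a Gaussian, data D is n noisy observations.
# Training q_theta(H | D) by maximum likelihood on simulated (H, D) pairs
# makes q_theta approximate the true Bayesian posterior P(H | D).
torch.manual_seed(0)
n_obs = 5

def simulate(batch):
    h = torch.randn(batch, 1)                 # H ~ prior N(0, 1)
    d = h + torch.randn(batch, n_obs)         # D | H ~ N(H, 1), n_obs samples
    return h, d.mean(dim=1, keepdim=True)     # summarize D by its sample mean

net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))  # -> (mu, log_sigma)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(3000):
    h, d_summary = simulate(256)
    mu, log_sigma = net(d_summary).chunk(2, dim=1)
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(h).mean()
    opt.zero_grad(); nll.backward(); opt.step()

# Check against the closed-form posterior: mean = n*x_bar/(n+1), std = 1/sqrt(n+1).
x_bar = torch.tensor([[0.8]])
mu, log_sigma = net(x_bar).chunk(2, dim=1)
print("learned :", mu.item(), log_sigma.exp().item())
print("analytic:", n_obs * 0.8 / (n_obs + 1), (1 / (n_obs + 1)) ** 0.5)
```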

We can reduce the above technical question to (1) how to learn to approximate P(H | harm, a, c, D) for all hypotheses H, actions a, and contexts c and for the given data D, while keeping track of the level of approximation error, and (2) find a proof that there is no H for which P(H, harm | a, c, D)>threshold, or learn excellent heuristics for identifying H’s that maximize P(H, harm | a, c, D), such that a failure to find an H for which P(H, harm | a, c, D)>threshold inspires confidence that none exist. These probabilities can be in principle deduced from the general posterior P(H | D) through computations of marginalization that are intractable but that we intend to approximate with large neural networks.

Part of the proposed research is to overcome the known inefficiency of Bayesian posterior inference needed for (1). The other concerns the optimization problem (2) of finding a plausible hypothesis that predicts major harm with probability above some threshold. It is similar to worst-case scenarios that sometimes come to us: a hypothesis pops in our mind that is plausible (not inconsistent with other things we know) and which would yield a catastrophic outcome. When that happens, we become cautious and hesitate before acting, sometimes deciding to explore a different, safer path, even if it might delay (or reduce) our reward. To imitate that process of generating such thoughts, we could take advantage of our estimated conditionals to make the search more efficient: we can approximately sample from P(H | harm, a, c, D). With a Monte-Carlo method, we could construct a confidence interval around our safety probability estimate, and go for an appropriately conservative decision. Even better would be to have a neural network construct a mathematical proof that there exists no such H, such as a branch-and-bound certificate of the maximum probability of harm, and this is the approach that my collaborator David Dalrymple proposes to explore. See the research thesis expected to be funded by the UK government within ARIA that spells out the kind of approach we are both interested in.
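
A minimal version of the Monte-Carlo confidence-interval idea (my own sketch; `sample_hypothesis` and `predicts_harm` are hypothetical stand-ins for the posterior sampler and the per-hypothesis harm prediction): estimate the probability of harm from k samples and only act if the upper end of the interval stays below the risk budget.

```python
import math
import random

# Monte-Carlo estimate of P(harm | action, context, D) with a conservative
# upper confidence bound, used to gate the action.

def sample_hypothesis():
    return random.random()            # placeholder "hypothesis"

def predicts_harm(h, action, context):
    return h < 0.01                   # placeholder per-hypothesis harm judgment

def harm_upper_bound(action, context, k=10_000, delta=1e-3):
    """One-sided Hoeffding bound: with probability >= 1 - delta,
    true P(harm) <= estimate + sqrt(ln(1/delta) / (2k))."""
    hits = sum(predicts_harm(sample_hypothesis(), action, context) for _ in range(k))
    estimate = hits / k
    slack = math.sqrt(math.log(1 / delta) / (2 * k))
    return estimate + slack

risk_budget = 0.05
ub = harm_upper_bound("proposed_action", context={})
print("act" if ub < risk_budget else "defer to humans", f"(upper bound {ub:.4f})")
```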

An important issue to tackle is that the neural networks used to approximate conditional probabilities can still make wrong predictions. We can roughly divide errors into three categories: (a) missing modes (missing high-probability hypotheses), (b) spurious modes (including incorrect hypotheses), and (c) locally inaccurate probability estimation (we have the right hypotheses, but the numerical values of their probabilities are a little bit inaccurate). Inaccurate probabilities (c) could be fixed by additional tuning of the neural network, and we could estimate these inaccuracies by measuring our training errors, and then use them to construct confidence intervals around our estimated probabilities. Only having spurious modes (b) would not be too worrisome in our context because it could make us more conservative than we should be: we could reject an action due to an implausible hypothesis H that our model considers plausible, when H wrongly predicts catastrophic harm. Importantly, the correct hypothesis H* would still be among those we consider for a possible harmful outcome. Also, some training methods would make spurious modes unlikely; for example, we can sample hypotheses from the neural net itself and verify if they are consistent with some data, which immediately provides a training signal to rule them out.

The really serious danger we have to deal with in the safety context is (a), i.e., missing modes, because it could make our approximately Bayesian AI produce confidently wrong predictions about harm (although less often than if our approximation of the posterior was a single hypothesis, as in maximum likelihood or standard RL). If we could consider a mode (a hypothesis H for which the exact P(H|D) is large) that the current model does not see as plausible (the estimated P(H|D) is small), then we could measure a training error and correct the model so that it increases the estimated probability. However, sampling from the current neural net unfortunately does not reveal the existence of missing modes, since the neural net assigns them very small probability in the first place and would thus not sample them. This is a common problem in RL and has given rise to exploration methods but we will apply these methods in the exploration in the space of hypotheses, not the space of real-world actions: we want to sample hypotheses not just from our current model but also from a more exploratory generative model. This idea is present in RL and also in the research on off-policy training of amortized inference networks. Such methods can explore where we have not yet gone or where there are clues that we may have missed a plausible hypothesis. As argued below, we could also considerably reduce this problem if the AI could at least consider the hypotheses that humans have generated in the past, e.g., in human culture and especially in the scientific literature.
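
A toy illustration (my own simplification, not Bengio's algorithm) of hunting for missing modes with an exploratory, off-policy proposal: any hypothesis whose unnormalized posterior weight is large but whose probability under the current amortized sampler is negligible is flagged and fed back as a training signal.

```python
import random

# Hypotheses are the integers 0..99; the "true" posterior has two modes, but the
# current amortized sampler has only learned one of them.

def unnormalized_posterior(h):        # stand-in for P(H=h) * P(D | H=h): two modes
    return 1.0 if h in range(10, 15) or h in range(80, 85) else 1e-6

model_probs = {h: (0.2 if h in range(10, 15) else 1e-8) for h in range(100)}
# ^ current amortized sampler: it captures the mode near 10-14 but misses 80-84.

def exploratory_proposal():
    # Off-policy / exploratory proposal: here simply uniform over hypothesis space.
    # In practice this could be a temperature-raised model or human-literature seeds.
    return random.randrange(100)

def find_missing_modes(n_proposals=1000, ratio=100.0):
    missing = set()
    for _ in range(n_proposals):
        h = exploratory_proposal()
        # High true posterior weight but negligible model probability => missing mode.
        if unnormalized_posterior(h) > ratio * model_probs[h]:
            missing.add(h)
    return sorted(missing)

print("missing modes found:", find_missing_modes())
# These hypotheses would be added to the training data so the amortized
# posterior raises their estimated probability.
```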

A nice theoretical reassurance is that we could in principle drive those training errors to zero with more computational resources. What is nice with the proposed Bayesian posterior approximation framework is that, at run-time, we can continue training or at the very least estimate the error made by the neural network through a sampling process. This is similar to how AlphaGo can refine its neural net prediction by running a bunch of stochastic searches for plausible downstream continuations of the game. In human terms, this would be like taking the time to think harder when faced with a tricky situation where we are not sure of what to do, by continuing to sample relevant possibilities in our head and adjusting our estimates of what could happen accordingly.

Yet another way to decrease the risks associated with an insufficiently trained neural network is to make the AI-generated hypotheses somewhat human-readable. This could be achieved by using a regularizer to encourage the AI to generate interpretable hypotheses, i.e., ones that can be converted to natural language and back with as little error as possible, and vice-versa (such that human theories expressed in natural language can be expressed as statements in the AI internal language for hypotheses). At the very least, if we cannot convert the full theory to a human-interpretable form, we could make sure that the concepts involved in the theory are interpretable, even if the relationships between concepts may not always be reducible to a compact verbalizable form. However, because a small number of discrete statements would have a much smaller description length, the AI training procedure should favor interpretable explanations. This would allow human inspection of the explanations generated by the AI. Instead of trying to interpret neural net activations, we would only require that the sequences of outputs generated by the AI be interpretable or as interpretable as possible. This would favor the set of theories about the world that humans can understand, but that space is extremely expressive: it includes all existing scientific theories. Some pieces of these theories could however be implicit, for example the result of applying an algorithm. AI theories could refer to existing math and computer science knowledge in order to explain the data more efficiently: think about algorithms that approximate quantum physics in order to characterize chemical properties. Although the quantum physics laws can be spelled out compactly, there is no tractable solution to questions involving more than a few atoms, and chemistry is all about approximations and fit to larger-scale data for which exact quantum calculations are infeasible, sometimes purely numerical, but involving variables and concepts that can be defined and named.

Interestingly, human theories would generally have a better prior (i.e., would be preferred by the AI) than completely novel ones because their description length could be reduced to identifying their index in the encyclopedia of human knowledge, e.g., by quoting the bibliographic reference of a corresponding scientific paper or Wikipedia entry. On the other hand, novel theories would have to be specified from the much larger universe of all the possible theories. For the same reason, a novel theory that can build on existing human concepts and theories would be more compact to describe and would be more likely to be chosen a priori (according to P(H)) by the AI than one that requires a lot of new definitions and concepts. A major advantage of making AI hypotheses interpretable and human theories translatable into the AI language for hypotheses is that it might allow us to obtain approximate guarantees that the AI estimate of P(H|D) could not be missing a theory (a missing mode) that already belongs in the human corpus of theories. So we could have missing modes, but only for theories that humans also miss.

Furthermore, consider that the same P(H|D) is used both to obtain actions, e.g., from P(action | outcome, context, D), and to obtain the cautious rejection of unsafe actions via sampling of P(H | harm, action, context, D). Hence, if a mode is missing in P(H|D), it is missing for both conditional predictions (which come from the same neural network but used with different inputs). Hence, if the AI can read all the human-generated theories (for example by reading all the scientific literature), we could make sure that these partial hypotheses are all included as explanatory candidates for the data D. Only those human theories that are compatible with the data available to the AI would remain among the plausible ones under the estimated P(H | D), i.e., we can cull out the human-generated bad theories that are not even compatible with data (like conspiracy theories and incoherent blabber that populate much of our internet). As a consequence, we would be assured that if any human would have predicted harm using any of the plausible human-generated theories, so would the AI’s approximate Bayesian posteriors over theories. The AI could also discover modes (plausible hypotheses) not known by humans, i.e., new scientific theories, but at least it would have absorbed all human hypotheses about how the world works, culling out those that are incoherent or inconsistent with data. This is very different from an LLM which just mimics the distribution of the text in its training corpus. Here we are talking about explanations for the data, which cannot be inconsistent with the data because the data likelihood P(D|H) computed given such an interpretation would otherwise vanish, nor be internally inconsistent because P(H) would vanish. If either P(D|H) or P(H) vanish, then the posterior P(H|D) vanishes and the AI would be trained to not generate such H’s.

A particular kind of explanation for the data is a causal explanation, i.e., one that involves a graph of cause-and-effect relationships. Our neural net generating explanations could also generate such graphs (or partial graphs in the case of partial explanations), e.g., as we have shown on a small scale already. Causal explanations should be favored in our prior P(H) because they will be more robust to changes in distribution due to actions by agents (humans, animals, AIs), and they properly account for actions, not just as arbitrary random variables but ones that interfere with the default flow of causality – they are called “interventions”. Causal models are unlike ordinary probabilistic (graphical) models in that they include the possibility of interventions on any subset of the variables. An intervention gives rise to a different distribution without changing any of the parameters in the model. A good causal model can thus generalize out-of-distribution, to a vast set of possible distributions corresponding to different interventions. Even a computer program can be viewed under a causal angle, when one allows interventions on the state variables of the program, which thus act like the nodes of a causal graph.
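
The closing remark about programs can be made concrete: each assignment in a program is a node whose parents are the variables it reads, and an intervention do(v := value) overrides that node's mechanism while the downstream computation proceeds as usual. A tiny sketch of my own:

```python
# Viewing a program causally: each assignment defines a node whose parents are
# the variables on its right-hand side.  An intervention do(y := v) overrides
# that node's mechanism and lets the downstream computation proceed normally.

def run(x, interventions=None):
    interventions = interventions or {}
    def node(name, compute):
        return interventions[name] if name in interventions else compute()
    y = node("y", lambda: 2 * x + 1)      # y := 2x + 1   (parent: x)
    z = node("z", lambda: y ** 2)         # z := y^2      (parent: y)
    w = node("w", lambda: z - x)          # w := z - x    (parents: z, x)
    return {"y": y, "z": z, "w": w}

print(run(3))                             # observational run: y=7, z=49, w=46
print(run(3, interventions={"y": 0}))     # do(y := 0): z=0, w=-3, x untouched
```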

This post only provides a high-level overview of the research program that I propose, and much remains to be done to achieve the central goal of efficient and reliable probabilistic inference over potentially harmful actions with the crucial desideratum of increasing the safety assurance when more computational power is provided, either in general or in the case of a particular context and proposed action. We don’t know how much time is left before we pass a threshold of dangerous AI capabilities, so advances in AI alignment and control are urgently needed.

19 comments:

It currently seems to me like almost all the interesting work is in the step where we need to know whether a hypothesis implies harm.

Putting this in language which makes the situation more clear to me: "we need to know whether a given predictive model predicts/'thinks' that a given action will result in bad outcomes (despite ourselves not knowing this and despite not having access to reliable short-term measurements)".

This is a very general version of the ELK problem.[1]

As far as I can tell, the proposal for ensuring that we know whether a given predictive model 'thinks' that a given action will result in bad outcomes is to ensure that the predictive models are sufficiently human-interpretable.

(This is broadly equivalent to imitative generalization. Note that various worst-case-theory counterexamples to using imitative generalization are in the ELK report along with other discussion.)

Given that so much of the action is in having human-interpretable predictive models (again, these would be called hypotheses in the language of this post), I'd be excited about work demonstrating this in hard cases where the most competitive approach would be to train a black-box predictor.

For instance, I'd be excited about work demonstrating the ability to produce human interpretable predictive models which allow humans to perform much better at next token prediction while also understanding what's going on.

While tasks like next token prediction are relatively straightforward to set up, the most analogous cases would probably be cases where we want to train an AI agent, but the agent could tamper with our measurements[2] and humans struggle to understand if the measurements were tampered with based on looking at the actions of the AI agent.

I'm somewhat nervous that in practice, work on this agenda will almost entirely focus on the "reliable probabilistic inference" component even though this component doesn't seem to be doing most of the work.

(If reliable probabilistic inference failed, but we had a methodology for finding some predictive hypotheses that humans understand, that would be a very valuable contribution. But probabilistic inference with asymptotic guarantees doesn't clearly get us much on its own.)

[My thoughts here are partially based on reading some comments on this agenda I've seen elsewhere and a variety of other discussion with people, but I'm unsure if these individuals would like their names to appear in this comment, so I haven't included acknowledgements. All mistakes are my own.]


  1. It's a very general version of ELK as it needs to handle all bad outcomes, in principle including bad outcomes which take a very long time to manifest. (Unlike the diamond in the vault example from the ELK report where we just care about an outcome that occurs on a short time horizon.) ↩︎

  2. Note that the linked work (by myself and collaborators) doesn't have the property that humans would have a hard time knowing whether the measurements were tampered with based on inspecting the action. ↩︎

IIUC, I think that in addition to making predictive models more human interpretable, there's another way this agenda aspires to get around the ELK problem.

Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it wants to also learn a posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.

IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the "right" notion of what constitutes a bad outcome, rather than a wrong one like "a human's best guess would be that this outcome is bad". I think Bengio's proposal aspires to learn a range of plausible auxiliary models, and to cover that space sufficiently that both the "right" notion and the wrong one are in there, and then if any of those models predict a bad outcome, call it a bad plan.

EDIT: from a quick look at the ELK report, this idea ("ensembling") is mentioned under "How we'd approach ELK in practice". Doing ensembles well seems like sort of the whole point of the AI scientists idea, so it's plausible to me that this agenda could make progress on ELK even if it wasn't specifically thinking about the problem.

If the claim is that it's easier to learn a covering set for a "true" harm predicate and then act conservatively wrt that set than to learn a single harm predicate, this is not a new approach. E.g. just from CHAI:

  1. The Inverse Reward Design paper, which tries to straightforwardly implement the posterior P(True Reward | Human specification of Reward) and act conservatively wrt this posterior.
  2. The Learning preferences from state of the world paper does this for P(True Reward | Initial state of the environment) and also acts conservatively wrt this posterior.
  3. [A bunch of other papers which consider this for Reward | Another Source of Evidence, including randomly sampled rewards and human off-switch pressing. Also the CIRL paper, which proposes using uncertainty to directly solve the meta problem of "the thing this uncertainty is for".]
  4. It's discussed as a strategy in Stuart's Human Compatible, though I don't have a copy to reference the exact page number.

I also remember finding Rohin's Reward Uncertainty to be a good summary of ~2018 CHAI thinking on this topic. There's also a lot more academic work in this vein from other research groups/universities too. 

The reason I'm not excited about this work is that (as Ryan and Fabien say) correctly specifying this distribution without solving ELK also seems really hard. 

It's clear that if you allow H to be "all possible harm predicates", then an AI that acts conservatively wrt to this is safe, but it's also going to be completely useless. Specifying the task of learning a good enough harm predicate distribution that both covers the "true" harms and also allows your AI to do things is quite hard, and subject to various kinds of terrible misspecification problems that seem not much easier to deal with than the case where you just try to learn a single harm predicate. 


Solving this task (that is, solving the task specification of learning this harm predicate posterior) via probabilistic inference also seems really hard from the "Bayesian ML is hard" perspective.

Ironically, the state of the art for this when I left CHAI in 2021 was "ask the most capable model (an LLM) to estimate the uncertainty for you in one go" and "ensemble the point estimates of very few but quite capable models" (that is, ensembling, but with common ensemble sizes in the single-digit range, e.g. 4). These seemed to outperform even the "learn a generative/contrastive model to get features, and then learn a Bayesian logistic regression on top of it" approaches. (Anyone who's currently at CHAI should feel free to correct me here.)

I think(?) the way Bengio's approach tries to avoid these difficulties is by trying to solve Bayesian ML. I'm not super confident that he'll do better than "finetune an LLM to help you do it", which is presumably what we'd be doing anyway?

(That being said, my main objections are akin to the ontology misidentification problem in the ELK paper or Ryan's comments above.)

The AI scientist idea does not bypass ELK, it has to face the difficulty head on. From the AI scientist blog post (emphasis mine):

One way to think of the AI Scientist is like a human scientist in the domain of pure physics, who never does any experiment. Such an AI reads a lot, in particular it knows about all the scientific litterature and any other kind of observational data, including about the experiments performed by humans in the world. From this, it deduces potential theories that are consistent with all these observations and experimental results. The theories it generates could be broken down into digestible pieces comparable to scientific papers, and we may be able to constrain it to express its theories in a human-understandable language (which includes natural language, scientific jargon, mathematics and programming languages). Such papers could be extremely useful if they allow to push the boundaries of scientific knowledge, especially in directions that matter to us, like healthcare, climate change or the UN SDGs.

If you get an AI scientist like that, clearly you need to have solved ELK. And I don't think this would be a natural byproduct of the Bayesian approach (big Bayesian NNs don't produce artifacts more legible than regular NNs, right?).

I agree this agenda wants to solve ELK by making legible predictive models.

But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.

I guess I don't understand how it would even be possible to be conservative in the right way if you don't solve ELK: just because the network is Bayesian doesn't mean it can't be scheming, and thus appear appropriately conservative on the training distribution but fail catastrophically later, right? How can the training process and the overseer tell the difference between "the model is confident and correct that doing X is totally safe" (but humans don't know why) and "the model tells you it's confident that doing X is totally safe - but it's actually false"? Where do you get the information that distinguishes "confident OOD prediction because it's correct" from "the model is confidently lying OOD"? It feels like to solve this kind of issue, you ultimately have to rely on ELK (or abstain from making predictions in domains humans don't understand, or ignore the possibility of scheming).

I'm skeptical this realistically improves much over doing normal ensembling (ensembling is basically just taking a small number of samples from the posterior anyway).

It could in principle be better, but this would require that estimating relatively low measure predictions is key (predictions which wouldn't come up in a reasonably sized ensemble). I would guess that if you pause actions if there is any low measure hypothesis which indicates harm, you'll be paralyzed with inaction (because the hypothesis "predict harm whenever a human wouldn't understand what's going on" isn't going to be that low measure).

If it did improve over normal ensembling, why wouldn't Bayesian ML already be relatively widely used?

If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.

I think the reason Bayesian ML isn't that widely used is because it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.

[A bunch of what I'm going to say is maybe obvious, but I'm uncertain what will and won't be obvious, so I'm saying it anyway.]

E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.

We can already pick a threshold for a classifier trained "normally" and vary this threshold based on the context. (Or using an ensemble of classifiers.)

Within distribution, we should expect trained classifiers to be well calibrated. And, using online training, it will likely be possible to get good calibration on average despite the distribution shifting over time.

Of course, calibration can fail in rare cases (e.g. very strange OOD inputs) or if we want to generalize to some new distribution where we can't obtain good labels and thus would prefer to just hope for good generalization.

For the exact ELK case we were talking about, we wanted to depend on generalization in this way (assuming we're on the same page). More precisely, we wanted to train a reporter on a clean/easy set where we trust the labels, but the data is much easier than the data we actually care about and thus is OOD.

So, we might run into issues where our reporter fails to be calibrated (or even to have signal) off distribution and thus our classifier thresholds might not be very meaningful. (Of course, we can still do things like pay attention to the top 1% most harmful actions based on our reporter.)

Thus, you might hope you can solve these calibration issues with Bayesian ML. But Bayesian ML also has no guarantees about avoiding rare failures or about having good generalization (guarantees here require imposing additional properties on the prior and the data). Further, as noted earlier, I think ensembling is basically just taking some number of samples of predictive models (hypotheses) from the posterior anyway.[1]

If we assume that ensembling is just taking some number of samples from the posterior, then we can get relatively tight bounds (with no additional assumptions) on the probability estimate that Bayesian ML would produce, using Hoeffding's inequality.

For an ensemble of N models, these bounds won't allow us to bound probabilities lower than 1/N, but we'll probably run into other issues before these 1/N bounds are the main blocker. (As in, imagine the case where all of our ensemble reports a tiny probability of harm. We can bound the probability that Bayesian ML would report >1/N probability of harm. This is again by Hoeffding's inequality. If we wanted to bound probabilities to <<1/N, then our ensemble won't work, but I also expect that you run into other issues such as hypotheses like "say there will be harm if humans would have been confused" which ensure that bayesian ML always indicates some probability of harm in the cases we need to handle when using generalization.)
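
For concreteness (my own numbers, using the same inequality): even if all N ensemble members report zero probability of harm, a one-sided Hoeffding bound only certifies that the posterior-mean harm probability is below sqrt(ln(1/δ)/(2N)) with confidence 1−δ, which is far from the <<1/N regime for realistic ensemble sizes.

```python
import math

# Upper bound on the posterior-mean harm probability when all N posterior
# samples (ensemble members) report zero harm, at confidence level 1 - delta.
def hoeffding_upper_bound(n, delta=0.05):
    return math.sqrt(math.log(1 / delta) / (2 * n))

for n in (4, 10, 100, 1000):
    print(f"N={n:5d}  certified bound ≈ {hoeffding_upper_bound(n):.3f}   1/N = {1 / n:.3f}")
```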

(Separately, I think the deep neural networks are themselves relatively well described as ensembles such that the variance of samples from the posterior will be low in practice. This makes ensembling as an approximation to bayesian ML even better.)

(Perhaps you (and other people who think Bayesian ML would do something useful) don't buy this "training a model is just sampling from the posterior" argument? Or perhaps you buy this argument, but you think we care a lot about probabilities <<1/N and also that we can rule out bad generalizations which will always cause noise in probability estimates this low?)

I think the reason Bayesian ML isn't that widely used is because it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.

I think it's probably tractable to apply Bayesian ML to learn moderately-small reporter heads on top of existing neural networks. And, in practice, I expect this probably gets you almost all of the advantages of doing Bayesian ML over the entire neural network.

For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It's just Bayesian Logistic Regression).
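
A sketch of what a Bayesian logistic-regression reporter on frozen embeddings could look like, using a Laplace approximation (my choice of method; the comment doesn't specify one). The embeddings and labels below are synthetic; in the intended use the rows of X would be activations from a frozen network.

```python
import numpy as np

# Bayesian logistic regression "reporter" on frozen embeddings, via a Laplace
# approximation to the weight posterior.
rng = np.random.default_rng(0)
n, d = 500, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))                      # synthetic "embeddings"
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

alpha = 1.0                                      # Gaussian prior precision on weights

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 1) MAP estimate by gradient descent on the negative log posterior.
w = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) + alpha * w
    w -= 0.01 * grad / n

# 2) Laplace approximation: posterior covariance = inverse Hessian at the MAP.
p = sigmoid(X @ w)
H = X.T @ (X * (p * (1 - p))[:, None]) + alpha * np.eye(d)
cov = np.linalg.inv(H)

# 3) Posterior-aware prediction: average the sigmoid over sampled weight vectors,
#    which yields less overconfident probabilities than a single point estimate.
def predict(x_new, n_samples=1000):
    ws = rng.multivariate_normal(w, cov, size=n_samples)   # weight posterior samples
    probs = sigmoid(ws @ x_new)
    return probs.mean(), probs.std()

mean, std = predict(rng.normal(size=d))
print(f"P(label=1) ≈ {mean:.2f} ± {std:.2f}")
```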

(When viewed from this perspective, I think that training a linear probe using the difference in means method will be something like the maximum likelihood estimator for this posterior. And, difference in means works well empirically and has other well motivated reasons for working.)

My understanding is that training tiny (e.g. 2 layers with n_embed=512) models (transformers, MLPs, whatever) with Bayesian ML is also doable, so we could just use such a tiny model as a reporter head and get most of the benefits of Bayesian ML.

I think the reason we don't see Bayesian ML is that ensembles probably just work fine in practice.


  1. If we imagine that our prior is the space of (reporter) model initializations, then Bayesian ML will aim to approximate updating toward (reporter) models that performed well in training. Let's simplify and imagine that instead Bayesian ML is just approximating the distribution of initializations which get better training distribution performance than some threshold. We'll refer to this distribution as the posterior. I claim that training an ensemble of N (reporter) models from different initializations is a very good approximation of sampling N (reporter) models from the posterior. I think this is basically a prediction of SLT and most other theories of how neural networks learn. ↩︎

For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It's just Bayesian Logistic Regression).

IIRC Adam Gleave tried this in summer of 2021 with one of Chinchilla/Gopher while he was interning at DeepMind, and this did not improve on ensembling for the tasks he considered. 

To avoid catastrophic errors, now consider a risk management approach, with an AI that represents not a single H but a large set of them, in the form of a generative distribution over hypotheses H

On first reading this post, the whole proposal seemed so abstract that I wouldn't know how to even begin making such an AI. However after a very quick skim of some of Bengio's recent papers I think I have more of a sense for what he has in mind.

I think his approach is roughly to create a generative model that constructs Bayesian Networks edge by edge, where the likelihood of generating any given network represents the likelihood that that causal model is the correct hypothesis.

And he's using GFlowNets to do it, which are a new type of ML/RL model developed by MILA that generate objects with likelihood proportional to some reward function (unlike normal RL which always tries to achieve maximum reward). They seem to have mostly been used for biological problems so far.
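
For readers wondering what that looks like mechanically, here is a toy GFlowNet of my own using the trajectory-balance objective: states are bit-string prefixes, terminal objects are 3-bit strings, and training pushes the sampler toward generating each string with probability proportional to its reward (on this tree-shaped state space the backward policy is trivial, which keeps the loss simple). This is only a sketch of the general idea, not the causal-graph construction from Bengio's papers.

```python
import math
import torch
import torch.nn as nn

# Minimal GFlowNet with the trajectory-balance loss on a tiny domain.
torch.manual_seed(0)
L = 3
R = {format(i, "03b"): float(1 + i) for i in range(8)}   # reward of each 3-bit string

policy = nn.Sequential(nn.Linear(2 * L, 16), nn.ReLU(), nn.Linear(16, 2))
log_Z = nn.Parameter(torch.zeros(()))                     # learned log partition function
opt = torch.optim.Adam(list(policy.parameters()) + [log_Z], lr=0.01)

def encode(prefix):
    # one-hot per position: (0, 0) = unset, (1, 0) = bit 0, (0, 1) = bit 1
    v = torch.zeros(2 * L)
    for i, b in enumerate(prefix):
        v[2 * i + int(b)] = 1.0
    return v

for step in range(3000):
    prefix, log_pf = "", torch.zeros(())
    for _ in range(L):                                    # build a string bit by bit
        dist = torch.distributions.Categorical(logits=policy(encode(prefix)))
        bit = dist.sample()
        log_pf = log_pf + dist.log_prob(bit)
        prefix += str(bit.item())
    # Trajectory balance (the backward policy is trivial on this tree-shaped DAG):
    loss = (log_Z + log_pf - torch.log(torch.tensor(R[prefix]))) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

# After training, the sampler's probabilities approach R(x) / sum(R).
with torch.no_grad():
    learned = {}
    for x in R:
        prefix, logp = "", 0.0
        for b in x:
            logp += torch.log_softmax(policy(encode(prefix)), dim=-1)[int(b)].item()
            prefix += b
        learned[x] = round(math.exp(logp), 2)
print("target :", {x: round(r / sum(R.values()), 2) for x, r in R.items()})
print("learned:", learned)
```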

If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.


Feels like this would paralyze the AI and make it useless?

The charitable interpretation here is that we'll compute E[harm|action] (or more generally E[utility|action]) using our posterior over hypotheses and then choose what action to execute based on this. (Or at least we'll pause and refer actions to humans if E[harm|action] is too high.)

I think "ruling out the possiblity" isn't really a good frame for thinking about this, and it's much more natural to just think about this as an estimation procedure which is trying hard to avoid over confidence in out-of-distribution contexts (or more generally, in contexts where training data doesn't pin down predictions well enough).

ETA: realistically, I think this isn't going to perform better than ensembling in practice.

See discussion here also.

Is there anyone who understands GFlowNets who can provide a high-level summary of how they work?

Let me know if I've missed something, but it seems to me the hard part is still defining harm. In the one case, where we will use the model and calculate the probability of harm, if it has goals, it may be incentivized to minimize that probability. In the case where we have separate auxiliary models whose goals are to actively look for harm, then we have a deceptively adversarial relationship between these. The optimizer can try to fool the harm finding LLMs. In fact, in the latter case, I'm imagining models which do a very good job at always finding some problem with a new approach, to the point where they become alarms which are largely ignored.

Using his interpretability guidelines, and also having humans sanity-check all models within the system, I can see how we could probably minimize failure modes that we already know about, but again, once it gets sufficiently powerful, it may find something no human has thought of yet.

If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.

This idea seems to ignore the problem that the null action can also entail harm. In a trolley problem this AI would never be able to pull the lever.

Maybe you could get around this by saying that it compares the entire wellbeing of the world with and without its intervention. But still in that case if it had any uncertainty as to which way had the most harm, it would be systematically biased toward inaction, even when the expected harm was clearly less if it took action.

[mostly self-plagiarized from here] If you have a very powerful AI, but it’s designed such that you can’t put it in charge of a burning airplane hurtling towards the ground, that’s … fine, right? I think it’s OK to have first-generation AGIs that can sometimes get “paralyzed by indecision”, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.

You only really get a problem if your AI finds that there is no sufficiently safe way to act, and so it doesn’t do anything at all. (Or more broadly, if it doesn’t do anything very useful.) Even that’s not dangerous in itself… but then the next thing that happens is the programmer would probably dial the “conservatism” knob down lower and lower, until the AI starts doing useful things. Maybe the programmer says to themselves: “Well, we don’t have a perfect proof, but all the likely hypotheses predict there’s probably no major harm…”

Also, humans tend to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. I think that if it’s possible to make AIs with the same inclination, then it seems like probably a good idea to do so, at least until we get up to super-reliable 12th-generation AGIs or whatever. It’s dangerous to make AIs that notice injustice on the other side of the world and are immediately motivated to fix it—that kind of AI would be very difficult to keep under human control, if human control is the plan (as it seems to be here).

Sorry if I’m misunderstanding.

Yup I think I agree. However I could see this going wrong in some kind of slow takeoff world where the AI is already in charge of many things in the world.