It currently seems to me like almost all the interesting work is in the step where we need to know whether a hypothesis implies harm.
Putting this in language which makes the situation more clear to me, "we need to know whether a given predictive model predicts/'thinks' that a given action will result in bad outcomes (despite ourselves not knowing this and despite not having access to reliable short term measurements)".
This is a very general version of the ELK problem.[1]
As far as I can tell, the proposal for ensuring that we know whether a given predictive model 'thinks' that a given action will result in bad outcomes is to ensure that the predictive models are sufficiently human interpretable.
(This is broadly equivalent to imitative generalization. Note that various worst-case-theory counterexamples to using imitative generalization are in the ELK report along with other discussion.)
Given that so much of the action is in having human interpretable predictive models (again, these would be called hypotheses in the language of this post), I'd be excited about work demonstrating this in hard cases where the most competitive approach would be to train a black box predictor.
For instance, I'd be excited about work demonstrating the ability to produce human interpretable predictive models which allow humans to perform much better at next token prediction while also understanding what's going on.
While tasks like next token prediction are relatively straightforward to set up, the most analogous cases would probably be cases where we want to train an AI agent, but the agent could tamper with our measurements[2] and humans struggle to understand if the measurements were tampered with based on looking at the actions of the AI agent.
I'm somewhat nervous that in practice, work on this agenda will almost entirely focus on the "reliable probabilistic inference" component even though this component doesn't seem to be doing most of the work.
(If reliable probabilistic inference failed, but we had a methodology for finding some predictive hypotheses that humans understand, that would be a very valuable contribution. But probabilistic inference with asymptotic guarantees doesn't clearly get us much on its own.)
[My thoughts here are partially based on reading some comments on this agenda I've seen elsewhere and a variety of other discussion with people, but I'm unsure if these individuals would like their names to appear in this comment, so I haven't included acknowledgements. All mistakes are my own.]
It's a very general version of ELK as it needs to handle all bad outcomes, in principle including bad outcomes which take a very long time to manifest. (Unlike the diamond in the vault example from the ELK report where we just care about an outcome that occurs on a short time horizon.) ↩︎
Note that the linked work (by myself and collaborators) doesn't have the property that humans would have a hard time knowing if the measurements were tampered with based on inspecting the action. ↩︎
IIUC, I think that in addition to making predictive models more human interpretable, there's another way this agenda aspires to get around the ELK problem.
Rather than having a range of predictive models/hypotheses but just a single notion of what constitutes a bad outcome, it wants to also learn a posterior over hypotheses for what constitutes a bad outcome, and then act conservatively with respect to that.
IIRC the ELK report is about trying to learn an auxiliary model which reports on whether the predictive model predicts a bad outcome, and we want it to learn the "right" notion of what constitutes a bad outcome, rather than a wrong one like "a human's best guess would be that this outcome is bad". I think Bengio's proposal aspires to learn a range of plausible auxiliary models, and sufficiently cover that space that both the "right" notion and the wrong one are in there, and then if any of those models predict a bad outcome, call it a bad plan.
EDIT: from a quick look at the ELK report, this idea ("ensembling") is mentioned under "How we'd approach ELK in practice". Doing ensembles well seems like sort of the whole point of the AI scientists idea, so it's plausible to me that this agenda could make progress on ELK even if it wasn't specifically thinking about the problem.
If the claim is that it's easier to learn a covering set for a "true" harm predicate and then act conservatively wrt the set than to learn a single harm predicate, this is not a new approach. E.g. just from CHAI:
I also remember finding Rohin's Reward Uncertainty to be a good summary of ~2018 CHAI thinking on this topic. There's also a lot more academic work in this vein from other research groups/universities too.
The reason I'm not excited about this work is that (as Ryan and Fabien say) correctly specifying this distribution without solving ELK also seems really hard.
It's clear that if you allow H to be "all possible harm predicates", then an AI that acts conservatively wrt this is safe, but it's also going to be completely useless. Specifying the task of learning a good enough harm predicate distribution that both covers the "true" harms and also allows your AI to do things is quite hard, and subject to various kinds of terrible misspecification problems that seem not much easier to deal with than the case where you just try to learn a single harm predicate.
Solving this task (that is, solving the task spec of learning this harm predicate posterior) via probabilistic inference also seems really hard from the "Bayesian ML is hard" perspective.
Ironically, the state of the art for this when I left CHAI in 2021 was "ask the most capable model (an LLM) to estimate the uncertainty for you in one go" and "ensemble the point estimates of very few but quite capable models" (that is ensembling, but common numbers were in the single digit range, e.g. 4). These seemed to outperform even the "learn a generative/contrastive model to get features, and then learn a Bayesian logistic regression on top of it" approaches. (Anyone who's currently at CHAI should feel free to correct me here.)
I think? that the way Bengio's approach tries to avoid these difficulties is by trying to solve Bayesian ML. I'm not super confident that he'll do better than "finetune an LLM to help you do it", which is presumably what we'd be doing anyways?
(That being said, my main objections are akin to the ontology misidentification problem in the ELK paper or Ryan's comments above.)
The AI scientist idea does not bypass ELK, it has to face the difficulty head on. From the AI scientist blog post (emphasis mine):
One way to think of the AI Scientist is like a human scientist in the domain of pure physics, who never does any experiment. Such an AI reads a lot, in particular it knows about all the scientific litterature and any other kind of observational data, including about the experiments performed by humans in the world. From this, it deduces potential theories that are consistent with all these observations and experimental results. The theories it generates could be broken down into digestible pieces comparable to scientific papers, and we may be able to constrain it to express its theories in a human-understandable language (which includes natural language, scientific jargon, mathematics and programming languages). Such papers could be extremely useful if they allow to push the boundaries of scientific knowledge, especially in directions that matter to us, like healthcare, climate change or the UN SDGs.
If you get an AI scientist like that, clearly you need to have solved ELK. And I don't think this would be a natural byproduct of the Bayesian approach (big Bayesian NNs don't produce artifacts more legible than regular NNs, right?).
I agree this agenda wants to solve ELK by making legible predictive models.
But even if it doesn’t succeed in that, it could bypass/solve ELK as it presents itself in the problem of evaluating whether a given plan has a harmful outcome, by training a posterior over harm evaluation models rather than a single one, and then acting conservatively with respect to that posterior. The part which seems like a natural byproduct of the Bayesian approach is “more systematically covering the space of plausible harm evaluation models”.
I guess I don't understand how it would even be possible to be conservative in the right way if you don't solve ELK: just because the network is Bayesian doesn't mean it can't be scheming, and thus be conservative in the right way on the training distribution but fail catastrophically later, right? How can the training process and the overseer tell the difference between "the model is confident and correct that doing X is totally safe" (but humans don't know why) and "the model tells you it's confident that doing X is totally safe - but it's actually false"? Where do you get the information that distinguishes "confident OOD prediction because it's correct" from "the model is confidently lying OOD"? It feels like to solve this kind of issue, you ultimately have to rely on ELK (or abstain from making predictions in domains humans don't understand, or ignore the possibility of scheming).
I'm skeptical this realistically improves much over doing normal ensembling (ensembling is basically just taking a small number of samples from the posterior anyway).
It could in principle be better, but this would require that estimating relatively low measure predictions is key (predictions which wouldn't come up in a reasonably sized ensemble). I would guess that if you pause actions if there is any low measure hypothesis which indicates harm, you'll be paralyzed with inaction (because the hypothesis "predict harm whenever a human wouldn't understand what's going on" isn't going to be that low measure).
If it did improve over normal ensembling, why wouldn't Bayesian ML already be relatively widely used?
If you could tractably obtain and work with the posterior, I think that would be much more useful than a normal ensemble. E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
I think the reason Bayesian ML isn't that widely used is because it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.
[A bunch of what I'm going to say is maybe obvious, but I'm uncertain what will and won't be obvious, so I'm saying it anyway.]
E.g. being able to choose at what threshold you start paying attention to a hypothesis which predicts harm, and vary it depending on the context, seems like a big plus.
We can already pick a threshold for a classifier trained "normally" and vary this threshold based on the context. (Or using an ensemble of classifiers.)
Within distribution, we should expect trained classifiers to be well calibrated. And, using online training, it will likely be possible to get good calibration on average despite the distribution shifting over time.
Of course, calibration can fail in rare cases (e.g. very strange OOD inputs) or if we want to generalize to some new distribution where we can't obtain good labels and thus would prefer to just hope for good generalization.
For the exact ELK case we were talking about, we wanted to depend on generalization in this way (assuming we're on the same page). More precisely, we wanted to train a reporter on a clean/easy set where we trust the labels, but the data is much easier than the data we actually care about and thus is OOD.
So, we might run into issues where our reporter fails to be calibrated (or even to have signal) off distribution and thus our classifier thresholds might not be very meaningful. (Of course, we can still do things like pay attention to the top 1% most harmful actions based on our reporter.)
Thus, you might hope you can solve these calibration issues with Bayesian ML. But Bayesian ML also has no guarantees about avoiding rare failures or about having good generalization (guarantees here require imposing additional properties on the prior and the data). Further, as noted earlier, I think ensembling is basically just taking some number of samples of predictive models (hypotheses) from the posterior anyway.[1]
If we assume that ensembling is just taking some number of samples from the posterior, then we can use Hoeffding's inequality to get relatively tight bounds (with no additional assumptions) on the probability that Bayesian ML would estimate.
For an ensemble of N models, these bounds won't allow us to bound probabilities lower than 1/N, but we'll probably run into other issues before these 1/N bounds are the main blocker. (As in, imagine the case where all of our ensemble reports a tiny probability of harm. We can bound the probability that Bayesian ML would report >1/N probability of harm, again by Hoeffding's inequality. If we wanted to bound probabilities to <<1/N, then our ensemble won't work, but I also expect that you'd run into other issues, such as hypotheses like "say there will be harm if humans would have been confused" which ensure that Bayesian ML always indicates some probability of harm in the cases we need to handle when using generalization.)
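To spell out the bound being invoked (a sketch, assuming the ensemble members really are N i.i.d. samples from the posterior): let $\hat{p}$ be the ensemble's average predicted probability of harm for an action and $p = \mathbb{E}_{H \sim P(H \mid D)}\left[P(\text{harm} \mid \text{action}, H)\right]$ be the estimate full Bayesian ML would give. Hoeffding's inequality says

$$\Pr\left(|\hat{p} - p| \geq t\right) \leq 2\exp(-2 N t^2),$$

so with, say, N = 20, the ensemble average is within about 0.3 of the Bayesian estimate with roughly 95% probability, but nothing useful can be said about probabilities much below 1/N.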
(Separately, I think deep neural networks are themselves relatively well described as ensembles, such that the variance of samples from the posterior will be low in practice. This makes ensembling as an approximation to Bayesian ML even better.)
(Perhaps you (and other people who think Bayesian ML would do something useful) don't buy this "training a model is just sampling from the posterior" argument? Or perhaps you buy this argument, but you think we care a lot about probabilities <<1/N and also that we can rule out bad generalizations which will always cause noise in probability estimates this low?)
I think the reason Bayesian ML isn't that widely used is because it's intractable to do. So Bengio's stuff would have to successfully make it competitive with other methods.
I think it's probably tractable to apply Bayesian ML to learn moderately-small reporter heads on top of existing neural networks. And, in practice, I expect this probably gets you almost all of the advantages of doing Bayesian ML over the entire neural network.
For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It's just Bayesian Logistic Regression).
(When viewed from this perspective, I think that training a linear probe using the difference in means method will be something like the maximum likelihood estimator for this posterior. And, difference in means works well empirically and has other well motivated reasons for working.)
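For concreteness, here's a minimal sketch of what that could look like (a Laplace approximation rather than exact inference; the optimizer, hyperparameters, and function names are illustrative placeholders, not a claim about how this should actually be implemented):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bayesian_logistic_probe(X, y, prior_var=1.0, n_iter=500, lr=0.1):
    """Approximate Bayesian logistic regression on frozen embeddings X (n, d)
    with binary labels y, using a Laplace approximation to the posterior over
    probe weights (Gaussian prior with variance prior_var)."""
    n, d = X.shape
    w = np.zeros(d)
    # MAP estimate via simple gradient ascent on log-likelihood + log-prior.
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - w / prior_var
        w += lr * grad / n
    # Laplace approximation: posterior ~= N(w_map, H^-1), where
    # H = X^T diag(p(1-p)) X + I / prior_var is the negative log-posterior Hessian.
    p = sigmoid(X @ w)
    H = (X * (p * (1 - p))[:, None]).T @ X + np.eye(d) / prior_var
    return w, np.linalg.inv(H)

def predictive_probability(x_new, w_map, cov, n_samples=200, seed=0):
    """Posterior-predictive probability for a new embedding: average the
    sigmoid over sampled probe weights rather than using a point estimate."""
    rng = np.random.default_rng(seed)
    ws = rng.multivariate_normal(w_map, cov, size=n_samples)
    return sigmoid(ws @ x_new).mean()
```

The point is just that, for a probe-sized reporter, the posterior is cheap enough to work with directly, and the predictive probability above is the kind of quantity you'd threshold (and vary by context).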
My understanding is that training tiny (e.g. 2 layers with n_embed=512) models (transformers, MLPs, whatever) with Bayesian ML is also doable, so we could just use such a tiny model as a reporter head and get most of the benefits of Bayesian ML.
I think the reason we don't see Bayesian ML is that ensembles probably just work fine in practice.
If we imagine that our prior is the space of (reporter) model initializations, then Bayesian ML will aim to approximate updating toward (reporter) models that performed well in training. Let's simplify and imagine that instead Bayesian ML is just approximating the distribution of initializations which get better training distribution performance than some threshold. We'll refer to this distribution as the posterior. I claim that training an ensemble of N (reporter) models from different initializations is a very good approximation of sampling N (reporter) models from the posterior. I think this is basically a prediction of SLT and most other theories of how neural networks learn. ↩︎
For the most simple case, consider learning a linear probe on embeddings with Bayesian ML. This is totally computationally doable. (It's just Bayesian Logistic Regression).
IIRC Adam Gleave tried this in summer of 2021 with one of Chinchilla/Gopher while he was interning at DeepMind, and this did not improve on ensembling for the tasks he considered.
To avoid catastrophic errors, now consider a risk management approach, with an AI that represents not a single H but a large set of them, in the form of a generative distribution over hypotheses H
On first reading this post, the whole proposal seemed so abstract that I wouldn't know how to even begin making such an AI. However after a very quick skim of some of Bengio's recent papers I think I have more of a sense for what he has in mind.
I think his approach is roughly to create a generative model that constructs Bayesian Networks edge by edge, where the likelihood of generating any given network represents the likelihood that that causal model is the correct hypothesis.
And he's using GFlowNets to do it, which are a new type of ML/RL model developed by MILA that generate objects with likelihood proportional to some reward function (unlike normal RL which always tries to achieve maximum reward). They seem to have mostly been used for biological problems so far.
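To make that a bit more concrete, here is a toy sketch of the kind of training objective involved (the trajectory balance loss, one of the GFlowNet objectives; the object being built, the reward, and all hyperparameters below are stand-ins rather than anything taken from Bengio's papers):

```python
import torch
import torch.nn as nn

# Toy GFlowNet-style sampler: build an object (here a subset of K candidate
# "edges", decided left to right) so that, after training, complete objects
# are sampled with probability proportional to a reward R(x). With a fixed
# decision order each object has a unique trajectory, so the trajectory
# balance loss has no backward-policy term.
K = 6

def reward(x: torch.Tensor) -> torch.Tensor:
    # Stand-in reward: in the real agenda this would be the (unnormalized)
    # posterior weight of the hypothesis encoded by x.
    return torch.exp(-(x.sum() - K / 2) ** 2)

policy = nn.Sequential(nn.Linear(2 * K, 64), nn.ReLU(), nn.Linear(64, 2))
log_Z = nn.Parameter(torch.zeros(()))  # learned log partition function
opt = torch.optim.Adam([*policy.parameters(), log_Z], lr=1e-3)

def sample_trajectory():
    """Decide, edge by edge, whether to include each candidate edge."""
    x = torch.zeros(K)
    log_pf = torch.zeros(())
    for i in range(K):
        pos = torch.zeros(K)
        pos[i] = 1.0  # one-hot encoding of which edge we're deciding on
        dist = torch.distributions.Categorical(logits=policy(torch.cat([x, pos])))
        a = dist.sample()
        log_pf = log_pf + dist.log_prob(a)
        x = x.clone()
        x[i] = float(a)
    return x, log_pf

for step in range(2000):
    x, log_pf = sample_trajectory()
    # Trajectory balance: drive log Z + log P_F(trajectory) toward log R(x).
    loss = (log_Z + log_pf - torch.log(reward(x) + 1e-9)) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The intended upshot is that, after training, sampling from the policy approximates sampling objects (hypotheses) in proportion to their reward, which here plays the role of posterior weight.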
If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.
Feels like this would paralyze the AI and make it useless?
The charitable interpretation here is that we'll compute E[harm|action] (or more generally E[utility|action]) using our posterior over hypotheses and then choose what action to execute based on this. (Or at least we'll pause and refer actions to humans if E[harm|action] is too high.)
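In pseudocode-ish Python, that charitable interpretation looks something like this (all names are illustrative):

```python
from typing import Callable, Sequence

def expected_harm(action,
                  hypotheses: Sequence[Callable],
                  weights: Sequence[float]) -> float:
    """Posterior-weighted estimate of P(harm | action): each hypothesis maps
    an action to its predicted probability of harm, and `weights` are the
    (normalized) posterior probabilities of those hypotheses."""
    return sum(w * h(action) for h, w in zip(hypotheses, weights))

def choose_action(actions, hypotheses, weights, harm_threshold=0.01):
    """Pick the first action whose estimated harm is below the threshold;
    otherwise pause and refer to humans."""
    for action in actions:
        if expected_harm(action, hypotheses, weights) < harm_threshold:
            return action
    return None  # defer to human review
```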
I think "ruling out the possiblity" isn't really a good frame for thinking about this, and it's much more natural to just think about this as an estimation procedure which is trying hard to avoid over confidence in out-of-distribution contexts (or more generally, in contexts where training data doesn't pin down predictions well enough).
ETA: realistically, I think this isn't going to perform better than ensembling in practice.
See discussion here also.
Let me know if I've missed something, but it seems to me the hard part is still defining harm. In the one case, where we use the model itself to calculate the probability of harm, if it has goals, it may be incentivized to minimize that probability. In the case where we have separate auxiliary models whose goal is to actively look for harm, we have a deceptively adversarial relationship between them: the optimizer can try to fool the harm-finding LLMs. In fact, in the latter case, I'm imagining models which do a very good job of always finding some problem with a new approach, to the point where they become alarms which are largely ignored.
Using his interpretability guidelines, and also human sanity-checking of all models within the system, I can see how we could probably minimize failure modes that we already know about, but again, once it gets sufficiently powerful, it may find something no human has thought of yet.
If an AI somehow (implicitly, in practice) kept track of all the plausible H’s, i.e., those with high probability under P(H | D), then there would be a perfectly safe way to act: if any of the plausible hypotheses predicted that some action caused a major harm (like the death of humans), then the AI should not choose that action. Indeed, if the correct hypothesis H* predicts harm, it means that some plausible H predicts harm. Showing that no such H exists therefore rules out the possibility that this action yields harm, and the AI can safely execute it.
This idea seems to ignore the problem that the null action can also entail harm. In a trolley problem this AI would never be able to pull the lever.
Maybe you could get around this by saying that it compares the entire wellbeing of the world with and without its intervention. But still in that case if it had any uncertainty as to which way had the most harm, it would be systematically biased toward inaction, even when the expected harm was clearly less if it took action.
[mostly-self-plagiarized from here] If you have a very powerful AI, but it’s designed such that you can’t put it in charge of a burning airplane hurtling towards the ground, that’s … fine, right? I think it’s OK to have first-generation AGIs that can sometimes get “paralyzed by indecision”, and which are thus not suited to solving crises where every second counts. Such an AGI could still do important work like inventing new technology, and in particular designing better and safer second-generation AGIs.
You only really get a problem if your AI finds that there is no sufficiently safe way to act, and so it doesn’t do anything at all. (Or more broadly, if it doesn’t do anything very useful.) Even that’s not dangerous in itself… but then the next thing that happens is the programmer would probably dial the “conservatism” knob down lower and lower, until the AI starts doing useful things. Maybe the programmer says to themselves: “Well, we don’t have a perfect proof, but all the likely hypotheses predict there’s probably no major harm…”
Also, humans tend to treat “a bad thing happened (which had nothing to do with me)” as much less bad than “a bad thing happened (and it’s my fault)”. I think that if it’s possible to make AIs with the same inclination, then it seems like probably a good idea to do so, at least until we get up to super-reliable 12th-generation AGIs or whatever. It’s dangerous to make AIs that notice injustice on the other side of the world and are immediately motivated to fix it—that kind of AI would be very difficult to keep under human control, if human control is the plan (as it seems to be here).
Sorry if I’m misunderstanding.
Yup I think I agree. However I could see this going wrong in some kind of slow takeoff world where the AI is already in charge of many things in the world.
Yoshua Bengio recently posted a high-level overview of his alignment research agenda on his blog. I'm pasting the full text below since it's fairly short.