My name is pronounced "YOO-ar SKULL-se". I'm a PhD student at Oxford University, and I have worked in several different areas of AI safety research. For a few highlights, see:
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Misspecification in Inverse Reinforcement Learning
STARC: A General Framework For Quantifying Differences Between Reward Functions
Risks from Learned Optimization in Advanced Machine Learning Systems
Is SGD a Bayesian sampler? Well, almost
For a full list of all my research, see my Google Scholar.
What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (eg, situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.
However, note that even this is meaningfully different from just using RLHF, especially in settings with some adversarial component. In particular, a situation that is OOD for the policy need not be OOD for the world model. For example, learning a model of the rules of chess is much easier than learning a policy that is good at playing chess. It would also be much easier to prove a learning-theoretic guarantee for the former than the latter.
So, suppose we're training a chess-playing AI, and we want to be sure that it cannot be defeated in n moves or less. The RLHF strategy would, in this scenario, essentially amount to letting a group of human red-teamers play against the AI many times, trying to find cases where they can beat the AI in n moves, and then training against those cases. This would only give us very weak quantitative guarantees, because there might be strategies that the red-teamers didn't think of.
Alternatively, we could also train a network to model the rules of chess (in this particular example, we could of course also specify this model manually, but let's ignore that for the sake of the argument). It seems fairly likely that we could train this model to be highly accurate. Moreover, using normal statistical methods, we could estimate a bound on the fraction of the state-action space on which this model makes an incorrect prediction, and derive other learning-theoretic guarantees (depending on how the training data is collected, etc). We could then formally verify that the chess AI cannot be beaten in n moves, relative to this world model. This would produce a much stronger quantitative guarantee, and the assumptions behind this guarantee would be much easier to audit. The guarantee would of course still not be an absolute proof, because there will likely be some errors in the world model, and the chess AI might be targeting these errors, etc, but the guarantee is substantially better than what you get in the RLHF case. Also note that, as we run the chess AI, we could track the predictions of the world model online. If the world model ever makes a prediction that contradicts what in fact happens, we could shut down the chess AI, or transition to a safe mode. This gives us even stronger guarantees.
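To make the "normal statistical methods" step a bit more concrete, here is a minimal sketch of the kind of bound I have in mind (a one-sided Hoeffding bound, assuming the held-out transitions are sampled i.i.d. from the relevant distribution; the function name is just a placeholder):

```python
import math

def world_model_error_bound(num_errors: int, num_samples: int, delta: float = 1e-3) -> float:
    """One-sided Hoeffding bound: with probability at least 1 - delta over the
    draw of the held-out set, the world model's true error rate is at most the
    empirical error rate plus sqrt(ln(1/delta) / (2 * num_samples)).
    The load-bearing assumption is that the held-out (state, action) pairs are
    sampled i.i.d. from the distribution we care about."""
    empirical_error = num_errors / num_samples
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * num_samples))
    return empirical_error + slack

# E.g. zero observed errors on 100,000 held-out transitions gives a bound of
# roughly 0.6% at delta = 1e-3.
print(world_model_error_bound(num_errors=0, num_samples=100_000))
```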
This is of course a toy example, because it is a case where we can design a perfect world model manually (excluding the opponent, of course). We can also design a chess-playing AI manually, etc. However, I think it illustrates that there is a meaningful difference between RLHF and formal verification relative to even a black-box world model. The complexity of the space of all policies grows much faster than the complexity of the environment, and this fact can be exploited.
In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).
I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?
For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest. This cannot be done with current tools, because we don't have a detailed computational model of the human body (and even if we did, we would not be able to use it for scalable inference). However, this seems like the kind of thing that could plausibly be created in the not-too-distant future using AI tools. And if we had such a model, we could prove many interesting safety properties for e.g. pharmaceutical development AIs (even if these AIs know many things that are not covered by this world model).
Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something?
I think that would be extremely useful, because it would tell us many things about how to implement cognitive algorithms. But I don't think it would be very useful for proving safety properties (which I assume was your actual question). GPT-3's capabilities are wide but shallow, whereas in most cases what we would need are capabilities that are narrow but deep.
I think the distinction between these two cases can often be somewhat vague.
Why do you think that the adversarial case is very different?
I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about them with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bayesian aspect of [Bengio's] agenda doesn't add much value".
I agree that if a Bayesian learner uses the NN prior, then its behaviour should -- in the limit -- be very similar to training a large ensemble of NNs. However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head:
And so on; these are just some examples. Of course, whether you get these benefits in practice is a matter of speculation until we have a concrete algorithm to analyse. All I'm saying is that there are valid and well-motivated reasons to explore this particular direction.
But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.)
It sounds to me like we agree here; I don't want to put too much weight on "most".
Is this true?
It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.
Aren't NNs implicitly ensembles of a vast number of models?
They probably are, to some extent. However, in practice, you often find that neural networks make very confident (and wrong) predictions for out-of-distribution inputs, in a way that seems to be caused by them latching onto some spurious correlation. For example, you train a network to distinguish different types of tanks, but it learns to distinguish night from day. You train an agent to collect coins, but it learns to go to the right. You train a network to detect criminality, but it learns to detect smiles. Adversarial examples could also be cast as an instance of this phenomenon. In all of these cases, we have a situation where there are multiple patterns in the data that fit a given training objective, but where a neural network ends up giving an unreasonably large weight to some of these patterns at the expense of other plausible patterns. I thus think it's fair to say that -- empirically -- neural networks do not reliably quantify uncertainty when out-of-distribution. It may be that this problem mostly goes away with a sufficiently large amount of sufficiently varied data, but it seems hard to get high confidence in that.
Also, does ensembling 5 NNs help?
In practice, this does not seem to help very much.
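For concreteness, the kind of ensemble-based uncertainty check in question looks roughly like the sketch below (a hypothetical helper, with disagreement between members used as the uncertainty signal). The empirical problem is that the members often latch onto the same spurious pattern, so the disagreement stays low even far out of distribution:

```python
import numpy as np

def ensemble_prediction(member_probs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """member_probs has shape (n_members, n_inputs, n_classes): each member's
    predictive distribution. Returns the averaged prediction together with a
    simple disagreement score (the variance, across members, of the probability
    they assign to the ensemble's top class). Ideally the disagreement would be
    large off-distribution; in practice it often isn't."""
    mean_probs = member_probs.mean(axis=0)                       # (n_inputs, n_classes)
    top_class = mean_probs.argmax(axis=-1)                       # (n_inputs,)
    top_probs = member_probs[:, np.arange(member_probs.shape[1]), top_class]
    disagreement = top_probs.var(axis=0)                         # (n_inputs,)
    return mean_probs, disagreement
```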
If we're conservative over a million models, how will we ever do anything?
I mean, there can easily be cases where we assign a very low probability to any given "complete" model of a situation, but where we are still able to assign a high probability to many different partial hypotheses. For example, if you randomly sample a building somewhere on earth, then your credence that the building has a particular floor plan might be less than 1 in 1,000,000 for each given floor plan. However, you could still assign a credence of near-1 that the building has a door, and a roof, etc. To give a less contrived example, there are many ways for the stock market to evolve over time. It would be very foolish to assume that it will evolve according to, e.g., your maximum-likelihood model. However, you can still assign a high credence to the hypothesis that it will grow on average. In many (if not all?) cases, such partial hypotheses are sufficient.
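As a toy numerical illustration of this point (all numbers made up), the credence in a partial hypothesis is just the marginal over all the complete hypotheses that satisfy it, so it can be close to 1 even though every individual complete hypothesis is negligible:

```python
import random

random.seed(0)
num_hypotheses = 1_000_000   # "complete" models, e.g. full floor plans

# Assign each complete hypothesis a small posterior weight.
weights = [random.random() for _ in range(num_hypotheses)]
total = sum(weights)
posterior = [w / total for w in weights]            # each is on the order of 1e-6

# A partial hypothesis ("the building has a door") holds in most complete models.
has_door = [random.random() < 0.999 for _ in range(num_hypotheses)]
credence_has_door = sum(p for p, d in zip(posterior, has_door) if d)

print(max(posterior))       # ~2e-6: no single complete model is credible
print(credence_has_door)    # ~0.999: the partial hypothesis is near certain
```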
If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?
In the examples I gave above, the issue is less about the prior and more about the need to keep track of all plausible alternatives (which neural networks do not seem to do, robustly). Using ensembles might help, but in practice this does not seem to work that well.
I again don't see how bayes ensures you have some non-schemers while ensembling doesn't.
I also don't see a good reason to think that a Bayesian posterior over agents should give a large weight to non-schemers. However, that isn't the use-case here (the world model is not meant to be agentic).
Another way to put this, is that all the interesting action was happening at the point where you solved the ELK problem.
So, this depends on how you attempt to create the world model. If you try to do this by training a black-box model to do raw sensory prediction, and then attempt to either extract latent variables from that model, or turn it into an interpretable model, or something like that, then yes, you would probably have to solve ELK, or solve interpretability, or something like that. I would not be very optimistic about that working. However, this is in no way the only option. As a very simple example, you could simply train a black-box model to directly predict the values of all latent variables that you need for your definition of harm. This would not produce an interpretable model, and so you may not trust it very much (unless you have some learning-theoretic guarantee, perhaps), but it would not be difficult to determine if such a model "thinks" that harm would occur in a given scenario. As another example, you could build a world model "manually" (with humans and LLMs). Such a model may be interpretable by default. And so on.
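To make the "directly predict the latent variables" option a bit more concrete, here is a minimal sketch (the latent variables, thresholds, and interfaces are all hypothetical). The point is that the harm predicate itself stays simple and auditable, even if the model producing the latents is a black box:

```python
from dataclasses import dataclass

@dataclass
class PredictedLatents:
    # Hypothetical latent variables that the safety specification is written over.
    human_within_radius: bool
    radiation_dose_sv: float

def harm_predicate(latents: PredictedLatents) -> bool:
    """Simple, auditable definition of harm in terms of the latent variables."""
    return latents.human_within_radius or latents.radiation_dose_sv > 0.1

def predicts_harm(world_model, observation) -> bool:
    # world_model is the black-box predictor trained to output the latents
    # directly; we only read off its outputs, we do not interpret its internals.
    return harm_predicate(world_model.predict_latents(observation))
```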
I thought the plan was to build it with either AI labor or human labor so that it will be sufficiently interpretable. Not to e.g. build it with SGD. If the plan is to build it with SGD and not to ensure that it is interpretable, then why does it provide any safety guarantee? How can we use the world model to define a harm predicate?
The general strategy may include either of these two approaches. I'm just saying that the plan does not definitionally rely on the assumption that the world model is built manually.
Won't predicting safety specific variables contain all of the difficulty of predicting the world?
That very much depends on what the safety specifications are, and how you want to use your AI system. For example, think about the situations where similar things are already done today. You can prove that a cryptographic protocol is unbreakable, given some assumptions, without needing to have a complete predictive model of the humans that use that protocol. You can prove that a computer program will terminate using a bounded amount of time and memory, without needing a complete predictive model of all inputs to that computer program. You can prove that a bridge can withstand an earthquake of such-and-such magnitude, without having to model everything about the earth's climate. And so on. If you want to prove something like "the AI will never copy itself to an external computer", or "the AI will never communicate outside this trusted channel", or "the AI will never tamper with this sensor", or something like that, then your world model might not need to be all that detailed. For more ambitious safety specifications, you might of course need a more detailed world model. However, I don't think there is any good reason to believe that the world model categorically would need to be a "complete" world model in order to prove interesting safety properties.
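For concreteness, the kind of argument I have in mind for a property like "the AI will never copy itself to an external computer" is an ordinary bounded reachability check over a coarse, over-approximating model of the relevant part of the world (the sketch below uses hypothetical interfaces; in practice one would use a proper model checker or proof assistant rather than explicit enumeration):

```python
from collections import deque
from typing import Callable, Hashable, Iterable

def can_reach_unsafe_state(initial_states: Iterable[Hashable],
                           successors: Callable[[Hashable], Iterable[Hashable]],
                           is_unsafe: Callable[[Hashable], bool],
                           max_depth: int) -> bool:
    """Returns True iff the coarse world model considers some unsafe state
    reachable within max_depth steps. The model only needs to over-approximate
    the behaviour that is relevant to is_unsafe (e.g. which channels the AI can
    write to); it does not need to predict everything about the world."""
    frontier = deque((s, 0) for s in initial_states)
    seen = {s for s, _ in frontier}
    while frontier:
        state, depth = frontier.popleft()
        if is_unsafe(state):
            return True
        if depth < max_depth:
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False
```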
From my understanding, the bayesian aspect of this agenda doesn't add much value.
A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since it only maintains a single hypothesis at any given time. However, Bayesian methods may make it less likely that a model will misgeneralise, or should at least give you a way of detecting when this is the case.
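As a toy illustration of the difference (a made-up two-hypothesis example): a Bayesian learner keeps every hypothesis that fits the training data, so it can at least flag the novel inputs on which those hypotheses disagree, whereas a single trained model has effectively committed to one of them:

```python
# Two hypotheses that fit the training data equally well ("it is a tank" vs.
# "the photo was taken in daytime"). A Bayesian posterior keeps both, and is
# therefore uncertain on a new input where they disagree.
def fits(hypothesis, data):
    return float(all(hypothesis(x) == y for x, y in data))

h_tank = lambda x: x["tank"]
h_day = lambda x: x["daytime"]
train = [({"tank": True, "daytime": True}, True),
         ({"tank": False, "daytime": False}, False)]

prior = {"tank": 0.5, "daytime": 0.5}
unnormalised = {"tank": prior["tank"] * fits(h_tank, train),
                "daytime": prior["daytime"] * fits(h_day, train)}
z = sum(unnormalised.values())
posterior = {name: p / z for name, p in unnormalised.items()}

ood_input = {"tank": True, "daytime": False}   # a tank photographed at night
p_positive = (posterior["tank"] * h_tank(ood_input)
              + posterior["daytime"] * h_day(ood_input))
print(posterior)     # {'tank': 0.5, 'daytime': 0.5}
print(p_positive)    # 0.5 -- the posterior flags genuine uncertainty here
```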
I also don't agree with the characterisation that "almost all the interesting work is in the step where we need to know whether a hypothesis implies harm" (if I understand you correctly). Of course, creating a formal definition or model of "harm" is difficult, and creating a world model is difficult, but once this has been done, it may not be very hard to detect if a given action would result in harm. But I may not understand your point in the intended way.
manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)
I don't think this is an accurate description of davidad's plan. Specifically, the world model does not necessarily have to be built manually, and it does not have to be as good at prediction as our AI. The world model only needs to be good at predicting the variables that are important for the safety specification(s), within the range of outputs that the AI system may produce.
Proof checking on this world model also seems likely to be unworkable
I agree that this is likely to be hard, but not necessarily to the point of being unworkable. Similar things are already done for other kinds of software deployed in complex contexts, and ASL-2/3 AI may make this substantially easier.
You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world model would need to do, and how such a world model may be created, is discussed in Section 3.2. My personal opinion is that the creation of such a world model probably would be challenging, but not more challenging than the problems encountered in other alignment research paths (such as mechanistic interpretability, etc). Also note that you can obtain guarantees without assuming that the world model is entirely accurate. For example, consider the guarantees that are derived in cryptography, or the guarantees derived from formal verification of airplane controllers, etc. You could also monitor the environment of the AI at runtime to look for signs that the world model is inaccurate in a certain situation, and if such signs are detected, transition the AI to a safe mode where it can be disabled.
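For concreteness, the runtime monitoring idea is essentially the following loop (all interfaces below are hypothetical placeholders): keep comparing the world model's predictions against what actually happens, and hand control to a safe fallback the moment they diverge, since the formal guarantee is only valid while the world model is accurate:

```python
def run_with_monitor(policy, world_model, env, safe_mode, tolerance: float):
    """Deploy the policy while checking the world model against reality; if the
    model's prediction error ever exceeds `tolerance`, stop relying on the
    guarantees and fall back to a safe mode."""
    obs = env.reset()
    while True:
        action = policy(obs)
        predicted_obs = world_model.predict(obs, action)
        obs, done = env.step(action)
        if world_model.prediction_error(predicted_obs, obs) > tolerance:
            return safe_mode(env)
        if done:
            return None
```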
If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.
I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type of model will learn to exhibit goal-directed behaviour in a given training setup, for example.
The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) is already contained in the loss landscape.
Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a "small" amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I'm not sure).
I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).
I also think that you in practice probably would have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasibly difficult to find (even with the help of powerful AI systems). However, some black-box components might be acceptable (depending on how the AI is used, etc), and it seems like partial successes would be useful even if the full version of the problem isn't solved (at least under the assumption that interpretability is useful, even if the full version of interpretability isn't solved).
I also think there is good reason to believe that quite a lot of the cognition that humans are capable of can be carried out by interpretable programs. For example, any problem where you can "explain your thought process" or "justify your answer" is probably (mostly) in this category. I also don't think that operations of the form "do X, because on average, this works well" necessarily are problematic, provided that "X" itself can be understood. Humans give each other advice like this all the time. For example, consider a recommendation like "when solving a maze, it's often a good idea to start from the end". I would say that this is interpretable, even without a deeper justification for why this is a good thing to do. At the end of the day, all knowledge must (in some way) be grounded in statistical regularities. If you ask a sequence of "why"-questions, you must eventually hit a point where you are no longer able to answer. As long as the resulting model itself can be understood and reasoned about, I think we should consider this to be a success. This also means that problems that can be solved by a large ensemble of simple heuristics arguably are fine, provided that the heuristics themselves are intelligible.
(*) It is also not fully clear to me if it even makes sense to say that a task can't be solved by an interpretable program. On an intuitive level, this seems to make sense. However, I'm not able to map this statement onto any kind of formal claim. Would it imply that there are things which are outside the reach of science? I consider it to at least be a live possibility that anything can be made interpretable.