All of Joar Skalse's Comments + Replies

Joar SkalseΩ110

No, that is not a misinterpretation: I do think that this research agenda has the potential to get pretty close to solving outer alignment. More specifically, if it is (practically) possible to solve outer alignment through some form of reward learning, then I think this research agenda will establish how that can be done (and prove that this method works), and if it isn't possible, then I think this research agenda will produce a precise understanding of why that isn't possible (which would in turn help to inform subsequent research). I don't think this research agenda is the only way to solve outer alignment, but I think it is the most promising way to do it.

Joar SkalseΩ110

I'm not sure -- what significance are you placing on the word "tackle" in this context? I would also not say that the main value proposition of this research agenda lies in identifying the ontology of the reward function --- the main questions for this area of research may even be mostly orthogonal to that question.

2Chris_Leong
I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?
Joar SkalseΩ220

I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).

I also think that you in practice probably would have to allow some uninterpretable components to maintain competitive performance, at least in some domains. One reason for this is of course that there simply might not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may plausibly be infeasi... (read more)

1Caspar Oesterheld
Yeah, I think I agree with this and in general with what you say in this paragraph. Along the lines of your footnote, I'm still not quite sure what exactly "X can be understood" must require. It seems to matter, for example, that to a human it's understandable how the given rule/heuristic or something like the given heuristic could be useful. At least if we specifically think about AI risk, all we really need is that X is interpretable enough that we can tell that it's not doing anything problematic (?).

What I had in mind is a situation where we have access to the latent variables during training, and only use the model to prove safety properties in situations that are within the range of the training distribution in some sense (eg, situations where we have some learning-theoretic guarantees). As for treacherous turns, I am implicitly assuming that we don't have to worry about a treacherous turn from the world model, but that we may have to worry about it from the AI policy that we're verifying.

However, note that even this is meaningfully different from j... (read more)

In practice, I think you are unlikely to end up with a schemer unless you train your model to solve some agentic task (or train it to model a system that may itself be a schemer, such as a human). However, in order to guarantee that, I agree we need some additional property (such as interpretability, or some learning-theoretic guarantee).

2ryan_greenblatt
(I think most of the hard-to-handle risk from scheming comes from cases where we can't easily make smarter AIs which we know aren't schemers. If we can get another copy of the AI which is just as smart but which has been "de-agentified", then I don't think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a "world-model" vs "agent" distinction isn't going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)

I'm not so convinced of this. Yes, for some complex safety properties, the world model will probably have to be very smart. However, this does not mean that you have to model everything -- depending on your safety specification and use case, you may be able to factor out a huge amount of complexity. We know from existing cases that this is true on a small scale -- why should it not also be true on a medium or large scale?

For example, with a detailed model of the human body, you may be able to prove whether or not a given chemical could be harmful to ingest... (read more)

I think the distinction between these two cases often can be somewhat vague.

Why do you think that the adversarial case is very different?

I think you're perhaps reading me as being more bullish on Bayesian methods than I in fact am -- I am not necessarily saying that Bayesian methods in fact can solve OOD generalisation in practice, nor am I saying that other methods could not also do this. In fact, I was until recently very skeptical of Bayesian methods, before talking about it with Yoshua Bengio. Rather, my original reply was meant to explain why the Bayesian aspect of Bengio's research agenda is a core part of its motivation, in response to your remark that "from my understanding, the bay... (read more)

4ryan_greenblatt
Insofar as the hope is:

1. Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
2. Do something else that makes this actually useful for "improving" OOD generalization in some way.

It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn't stated any particular hope for (2) which depends on (1). I agree that if the Bayesian aspect of the agenda did a specific useful thing like '"improve" OOD generalization' or 'allow us to control/understand OOD generalization', then this aspect of the agenda would be useful. However, I think the Bayesian aspect of the agenda won't do this and thus it won't add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this - but I disagree and don't see the story for this. I agree that "actually use Bayesian methods" sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don't think it clearly does. (Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)

1-3 don't seem promising/important to me. (4) would be useful, but I don't see why we needed the bayesian aspect of it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there. To be clear, if the hope is "figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model", then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action was in making interpretable predictors, no bayesian aspect was required.
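[A minimal sketch of the ensembling baseline mentioned here, as a toy 1-D regression with scikit-learn. The dataset, architecture, and the use of ensemble disagreement as an out-of-distribution uncertainty proxy are all illustrative choices, not anyone's actual proposal.]

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# In-distribution training data: y = sin(x) on [-2, 2].
X_train = rng.uniform(-2, 2, size=(200, 1))
y_train = np.sin(X_train).ravel()

# Train a small ensemble of identical networks that differ only in random seed.
ensemble = [
    MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=i).fit(X_train, y_train)
    for i in range(5)
]

def predict_with_uncertainty(X):
    # Ensemble disagreement (std across members) as a crude uncertainty proxy.
    preds = np.stack([m.predict(X) for m in ensemble])
    return preds.mean(axis=0), preds.std(axis=0)

X_in = np.array([[0.5]])   # inside the training range
X_out = np.array([[8.0]])  # far outside the training range
for X in (X_in, X_out):
    mean, std = predict_with_uncertainty(X)
    print(X.ravel()[0], mean[0], std[0])
# Typically the members disagree far more at x = 8.0 than at x = 0.5,
# which is the kind of OOD signal an approximate Bayesian posterior would also provide.
```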

But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.) 

It sounds to me like we agree here, I don't want to put too much weight on "most".

 

Is this true?

It is true in the sense that you don't have any theoretical guarantees, and in the sense that it also often fails to work in practice.

 

Aren't NNs implicitly ensembles of a vast number of models?

They probably are, to some extent. However, in practice, you often find that neural networks make very confident (an... (read more)

2ryan_greenblatt
I agree that you can improve safety by checking outputs in specific ways in cases where we can do so (e.g. requiring the AI to formally verify its code doesn't have side channels). The relevant question is whether there are interesting cases where we can't currently verify the output with conventional tools (e.g. formal software analysis or bridge design), but we can using a world model we've constructed. One story for why we might be able to do this is that the world model is able to generally predict the world as well as the AI.

Suppose you had a world model which was as smart as GPT-3 (but magically interpretable). Do you think this would be useful for something? IMO no, and besides, we don't need the interpretability as we can just train GPT-3 to do stuff. So, it either has to be smarter than GPT-3 or have a wildly different capability profile. I can't really imagine any plausible stories where this world model doesn't end up having to actually be smart, while the world model is also able to massively improve the frontier of what we can check.
2ryan_greenblatt
Worth noting here that I'm mostly talking about Bengio's proposal wrt the bayes-related arguments. And I agree that the world model isn't meant to be a schemer, but it's not as though we can guarantee that without some additional property... (Such as ensuring that the world model is interpretable.)
2ryan_greenblatt
TBC, I would count it as solving an ELK problem if you constructed an interpretable world model which allows you to extract out the latents you want.
2ryan_greenblatt
Won't this depend on the generalization properties of the model? E.g., we can always train a model to predict if a diamond is in the vault on some subset of examples which we can label. From my perspective the core difficulty is either:

* Treacherous turns (or similar high-stakes failures)
* Not being able to properly identify the latents in actual deployment cases

Perhaps you were discussing the case where we magically have access to all the latents and don't need to worry about high-stakes failures? In this case, I agree there are no problems, but the baseline proposal of RLHF while looking at the latents also works perfectly fine.
9ryan_greenblatt
Sure, but why will the bayesian model reliably quantify uncertainty OOD? There is also no guarantee of this (OOD). Whether or not you get reliable uncertainty quantification will depend on your prior. If you have (e.g.) the NN prior, I expect the uncertainty quantification is similar to if you trained an ensemble. E.g., you'll find a bunch of NNs (in the bayesian posterior) which also have the spurious correlation that a trained NN (or ensemble of NNs) would have. If you have some other prior, why can't we regularize our NN to match this? (Maybe I'm confused about this?)

From my understanding, the bayesian aspect of this agenda doesn't add much value.

A core advantage of Bayesian methods is the ability to handle out-of-distribution situations more gracefully, and this is arguably as important as (if not more important than) interpretability. In general, most (?) AI safety problems can be cast as an instance of a case where a model behaves as intended on a training distribution, but generalises in an unintended way when placed in a novel situation. Traditional ML has no straightforward way of dealing with such cases, since i... (read more)

6ryan_greenblatt
I dispute that Bayesian methods will be much better at this in practice.

[ Aside: This seems like about 1/2 of the problem from my perspective. (So I almost agree.) Though, you can shove all AI safety problems into this bucket by doing a maneuver like "train your model on the easy cases humans can label, then deploy into the full distribution". But at some point, this is no longer very meaningful. (E.g. you train on solving 5th grade math problems and deploy to the Riemann hypothesis.) ]

Is this true? Aren't NNs implicitly ensembles of a vast number of models? Also, does ensembling 5 NNs help? If this doesn't help, why does sampling 5 models from the Bayesian posterior help? Or is it that we needed to approximate sampling 1,000,000 models from the posterior? If we're conservative over a million models, how will we ever do anything?

Do they? I'm skeptical on both of these. It maybe helps a little and rules out some unlikely scenarios, but I'm overall skeptical.

Overall, my view on the Bayesian approach is something like:

* What prior were we using for Bayesian methods? If this is just the NN prior, then I'm not really sold we do much better than just training a NN (or an ensemble of NNs). If our prior is importantly different in a way which we think will help, why can't we regularize to train a NN in a normal way which will vaguely reasonably approximate this prior?
* My main concern is that we can get a smart predictive model which understands OOD cases totally fine, but we still get catastrophic generalization for whatever reason. I don't see why bayesian methods help.
* In the ELK case, our issue is that too much of the prior is human imitation or other problematic generalization. (We can ensemble/sample from the posterior and reject things where our hypotheses don't match, but this will only help so much and I don't really see a strong case for bayes helping more than ensembling.)
* In the case of a treacherous turn, it seems like the core concern was t
Joar SkalseΩ110

You can imagine different types of world models, going from very simple ones to very detailed ones. In a sense, you could perhaps think of the assumption that the input distribution is i.i.d. as a "world model". However, what is imagined is generally something that is much more detailed than this. More useful safety specifications would require world models that (to some extent) describe the physics of the environment of the AI (perhaps including human behaviour, though it would probably be better if this can be avoided). More detail about what the world m... (read more)

Joar Skalse*Ω110

If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.  

I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type o... (read more)

You're right, I put the parameters the wrong way around. I have fixed it now, thanks!

I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point.

I think even this would be somewhat inaccurate (in my opinion). If a given parametric Bayesian learning machine does obey (some version of) Occam's razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is about is primarily the relationship between the RLCT and posterior probability, and how th... (read more)

Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).
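[For reference, a rough sketch of the formalisation being referred to -- a paraphrase of Watanabe's free energy asymptotics, not a quote from the post; n is the number of samples, L_n the empirical negative log likelihood, w* an optimum, λ the RLCT, and d the parameter count:

$$F_n \;\approx\; n L_n(w^\ast) + \lambda \log n, \qquad \mathrm{BIC} \;=\; n L_n(w^\ast) + \tfrac{d}{2}\log n,$$

with λ = d/2 in the regular case, which is the sense in which the free energy formula generalises the BIC.]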

Would that not imply that my polynomial example also obeys Occam's razor? 

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something prof... (read more)

I think I recall reading that, but I'm not completely sure.

Note that the activation function affects the parameter-function map, and so the influence of the activation function is subsumed by the general question of what the parameter-function map looks like.

Joar Skalse*Ω230

I'm not sure, but I think this example is pathological.

Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible.

This is the more relevant and interesting kind of symmetry, and it's easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.

This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
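[A minimal worked illustration of that parameterisation-dependence, using the standard one-dimensional SLT example; the specific exponents are chosen here for concreteness. If the population loss near the optimum takes the form K(w) = w^{2k}, the RLCT is λ = 1/(2k), so two parameterisations realising the same optimal function can have

$$K_1(w) = w^2 \;\Rightarrow\; \lambda = \tfrac{1}{2}, \qquad K_2(v) = v^4 \;\Rightarrow\; \lambda = \tfrac{1}{4},$$

i.e. which functions count as "having more local degeneracies" depends on the parameter-function map, not on the functions alone.]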

You can... (read more)
4Jesse Hoogland
Agreed. So maybe what I'm actually trying to get at is a statement about what "universality" means in the context of neural networks. Just as the microscopic details of physical theories don't matter much to their macroscopic properties in the vicinity of critical points ("universality" in statistical physics), and just as the microscopic details of random matrices don't seem to matter for their bulk and edge statistics ("universality" in random matrix theory), many of the particular choices of neural network architecture don't seem to matter for learned representations ("universality" in DL).

What physics and random matrix theory tell us is that a given system's universality class is determined by its symmetries. (This starts to get at why we SLT enthusiasts are so obsessed with neural network symmetries.) In the case of learning machines, those symmetries are fixed by the parameter-function map, so I totally agree that you need to understand the parameter-function map.

However, focusing on symmetries is already a pretty major restriction. If a universality statement like the above holds for neural networks, it would tell us that most of the details of the parameter-function map are irrelevant.

There's another important observation, which is that neural network symmetries leave geometric traces. Even if the RLCT on its own does not "solve" generalization, the SLT-inspired geometric perspective might still hold the answer: it should be possible to distinguish neural networks from the polynomial example you provided by understanding the geometry of the loss landscape. The ambitious statement here might be that all the relevant information you might care about (in terms of understanding universality) is already contained in the loss landscape. If that's the case, my concern about focusing on the parameter-function map is that it would pose a distraction. It could miss the forest for the trees if you're trying to understand the structure that develops and phe

Will do, thank you for the reference!

Yes, I completely agree. The theorems that have been proven by Watanabe are of course true and non-trivial facts of mathematics; I do not mean to dispute this. What I do criticise is the magnitude of the significance of these results for the problem of understanding the behaviour of deep learning systems.

Thank you for this -- I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between "good priors" and "a prior towards low Kolmogorov complexity", but this does skim past a lot of nuance. I do also very much not want to say that the DNN prior is exactly towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.

Yes, I mostly just mean "low test error". I'm assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
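[To spell the assumed link out slightly, in standard Solomonoff/Levin notation -- this is just the sense of "similar to the Solomonoff prior" used above, not a new claim: the prior weight of a data generating function f is roughly

$$m(f) \;=\; \sum_{p\,:\,U(p)=f} 2^{-|p|} \;\approx\; 2^{-K(f)},$$

so most of the prior mass sits on functions of low Kolmogorov complexity K(f), and the empirical assumption is that real-world data generating processes are distributed in roughly this way.]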

I think that it gives us an adequate account of generalisation in the limit of infinite data (or, more specifically, in the case where we have enough data to wash out the influence of the inductive bias); this is what my original remark was about. I don't think classical statistical learning theory gives us an adequate account of generalisation in the setting where the training data is small enough for our inductive bias to still matter, and it only has very limited things to say about out-of-distribution generalisation.

3Alexander Gietelink Oldenziel
I see, thanks!

The assumption that small neural networks are a good match for the actual data generating process of the world, is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, if we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).

I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well" which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.

The title of the post is Why Neural Networks obey Occam's Razor! It also cites Zhang et al, 2017, and immediately after this says that SLT can help explain why neural networks have the capacity to generalise well. This gives the impression that the post is intended to give a solution to problem (ii) ... (read more)

Well neural networks do obey Occam's razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam's razor).

I think that expression of Jesse's is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) ... (read more)

5Liam Carroll
I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.

At the time of writing, basically nobody knew anything about SLT, so I think it was warranted to err on the side of grabbing attention in the introductory paragraphs and then explaining in detail further on with "we can now understand why singular models have the capacity to generalise well", instead of caveating the whole topic out of existence before the reader knows what is going on.

As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that. My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice (e.g. the one-node subnetwork preferred by the posterior compared to the two-node network in DSLT4), but I don't know enough about Kolmogorov complexity to say much more than that.

Anyway I'm guessing you're probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).

Yes, absolutely. However, I also don't think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like "overparameterised" no longer seem very significant.

Maybe that has significant overlap with the cr... (read more)
4Daniel Murfet
Seems reasonable to me!

A few things:

1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine. 
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all assump... (read more)

4Daniel Murfet
I don't understand the strong link between Kolmogorov complexity and generalisation you're suggesting here. I think by "generalisation" you must mean something more than "low test error". Do you mean something like "out of distribution" generalisation (whatever that means)?

To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc). 

I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.
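[Spelled out, as a standard definition included only for concreteness, with hypothesis class F, loss ℓ, and n training points (x_i, y_i):

$$\hat{f} \;\in\; \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big),$$

and an approximate empirical risk minimiser is one whose training loss is within some ε of this minimum.]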

The bounds are not exactly vacuous -- in fact, they are (in a sense) tight. However, they concern a somewhat adversarial setting, where the data distribution may be selected arbitrarily (including by making it maximally opposed to the inductive bias of the learning algorithm). This means that the bounds end up being much larger than what you would typically observe in practice, if you give typical problems to a learning algorithm whose inductive bias is attuned to the structure of "typical" problems. 

If you move from this adversarial setting to a more... (read more)
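[As a toy illustration of how large the distribution-free bound gets -- the numbers below are purely hypothetical, and the sample-complexity form n ≈ (d + ln(1/δ))/ε² is the standard agnostic PAC expression with constants dropped:]

```python
import math

def pac_sample_bound(vc_dim: int, epsilon: float, delta: float) -> float:
    """Order-of-magnitude sample size from the agnostic PAC bound,
    n ~ (d + ln(1/delta)) / epsilon^2, with constants omitted."""
    return (vc_dim + math.log(1.0 / delta)) / epsilon**2

# Hypothetical numbers: a hypothesis class with VC dimension 10^7
# (very roughly, a large over-parameterised network), target excess
# error 1%, failure probability 1%.
n = pac_sample_bound(10_000_000, epsilon=0.01, delta=0.01)
print(f"{n:.1e} samples")
# ~1.0e+11 -- vastly more data than such models need in practice on
# "typical" (non-adversarial) problems, which is the gap described above.
```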

I already posted this in response to Daniel Murfet, but I will copy it over here:

For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε... (read more)

2Cleo Nardo
The impression I got was that SLT is trying to show why (transformers + SGD) behaves anything like an empirical risk minimiser in the first place. Might be wrong though.

Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?

This is basically a justification of something like your point 1, but AFAICT it's closer to a proof in the SLT setting than in your setting.

I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like "the ground truth function is generated by a small neural net" and "learning is done in a Bayesian way", etc.

2DanielFilan
No? It amounts to assuming that smaller neural networks are a better match for the actual data generating process of the world.

In your example there are many values of the parameters that encode the zero function

Ah, yes, I should have made the training data be (1,1), rather than (0,0). I've fixed the example now!

Is that a fair characterisation of the argument you want to make?

Yes, that is exactly right!

Assuming it is, my response is as follows. I'm guessing you think x^2 is simpler than x^4 because the former function can be encoded by a shorter code on a UTM than the latter.

The notion of complexity that I have in mind is even more pre-theoretic than that; it'... (read more)

5Daniel Murfet
Re: the articles you link to. I think the second one by Carroll is quite careful to say things like "we can now understand why singular models have the capacity to generalise well", which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.

I agree that Jesse's post has a title, "Neural networks generalize because of this one weird trick", which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However, the actual article is more nuanced, saying things like "SLT seems like a promising route to develop a better understanding of generalization and the limiting dynamics of training". Jesse gives a long list of obstacles to walking this route. I can't find anything in the post itself to object to. Maybe you think its optimism is misplaced, and fair enough.

So I don't really understand what claims about inductive bias or generalisation behaviour in these posts you think are invalid?

I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don't think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?


That seems probable. Maybe it's useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we're on the same page. When people refer to the "generalisation puzzle" in... (read more)

For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1 − δ learn a function whose (true) generalisation error (under D) is at most ε worse than the best function which A is able to express (in terms ... (read more)
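[A compact restatement of the bound just quoted, in standard agnostic PAC notation; the inline symbols above are a reconstruction of formatting that was stripped:

$$n \ge O\!\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right) \;\Longrightarrow\; \Pr\!\Big[\, L_{\mathcal{D}}(\hat{f}) \le \min_{f \in \mathcal{F}} L_{\mathcal{D}}(f) + \epsilon \,\Big] \ge 1 - \delta,$$

where d is the VC dimension of the hypothesis class F, L_D the true risk under the data distribution D, and f̂ the function returned by the empirical risk minimiser.]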

4Alexander Gietelink Oldenziel
The linked post you wrote about classical learning theory states that the bounds PAC gives are far looser than what we see in practice for neural networks. In the post you sketch some directions in which tighter bounds may be proven. It is my understanding that these directions have not been pursued further. Given all that, "fully adequate account of generalization" seems like an overstatement, wouldn't you agree? At best we can say that PAC gives a nice toy model for thinking about notions like generalization and learnability, as far as I can tell. Maybe I'm wrong (I'm not familiar with the literature) and I'd love to know more about what PAC & classical learning theory can tell us about neural networks.

I'm going to make a few comments as I read through this, but first I'd like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn't have done otherwise.

Thank you for the detailed responses! I very much enjoy discussing these topics :)

My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space

My intuitions around the RLCT are very much geometrically informed, and I do think of it as being a kind of flatness meas... (read more)

I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being among one of the main question marks I currently see.

Yes, I agree with this. I think my main objections are (1) the fact that it mostly abstracts away from the parameter-function map, and (2) the infinite-data limit.

My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard.

I largely agree, though depends somewhat on what your aims are. My point there was ma... (read more)

4Alexander Gietelink Oldenziel
"My point there was mainly that theorems about generalisation in the infinite-data limit are likely to end up being weaker versions of more general results from statistical and computational learning theory." What general results from statistical and computational learning theory are you referring to here exactly?

That's interesting, thank you for this!

Yes, I meant specifically on LW and in the AI Safety community! In academia, it remains fairly obscure.

I think this is precisely what SLT is saying, and this is nontrivial!

It is certainly non-trivial, in the sense that it takes many lines to prove, but I don't think it tells you very much about the actual behaviour of neural networks.

Note that loss landscape considerations are more important than parameter-function considerations in the context of learning.

One of my core points is, precisely, to deny this claim. Without assumptions about the parameter function map, you cannot make inferences from the characteristics of the loss landscape to conclusions abou... (read more)

My point is precisely that it is not likely to be learned, given the setup I provided, even though it should be learned.

 

How am I supposed to read this?

What most of us need from a theory of deep learning is a predictive, explanatory account of how neural networks actually behave. If neural networks learn functions which are RLCT-simple rather than functions which are Kolmogorov-simple, then that means SLT is the better theory of deep learning.

I don't know how to read "x^4 has lower RLCT than x^2 despite x^2 being k-simpler" as a critique of SLT unless there is an implicit assumption that neural networks do in fact find x^2 rather than x^4.

1Oliver Sourbut
Thanks, amended the link in my comment to point to this updated version

 which animals cannot do at all, they can't write computer code or a mathematical paper


This is not obvious to me (at least not for some senses of the word "could"). Animals cannot be motivated into attempting to solve these tasks, and they cannot study maths or programming. If they could do those things, then it is not at all clear to me that they wouldn't be able to write code or maths papers. To make this more specific; insofar as humans rely on a capacity for general problem-solving in order to do maths and programming, it would not surprise me if ... (read more)

So, the claim is (of course) not that intelligence is zero-one. We know that this is not the case, from the fact that some people are smarter than other people. 

As for the other two points, see this comment and this comment.

So, this model of a takeoff scenario makes certain assumptions about how intelligence works, and these assumptions may or may not be correct. In particular, it assumes that the initial AI systems are very far from being algorithmically optimal. We don't know whether or not this will be the case; that is what I am trying to highlight.

The task of extracting knowledge from data is a computational task, which has a certain complexity-theoretic hardness. We don't know what that hardness is, but there is a lower bound on how efficiently this task can be done. Si... (read more)

1mishka
Right. But here we are talking about a specific class of tasks (which animals cannot do at all; they can't write computer code or a mathematical paper; so, evolutionarily speaking, humans are new at that, and probably not anywhere close to any equilibrium, because this class of tasks is very novel on the evolutionary timescale).

Moreover, we know a lot about human performance at those tasks, and it's abysmal, even for top humans, and for AI research as a field. For example, there was a paper in Nature in 2000 explaining that ReLU induce semantically meaningful sparseness. The field ignored this for over a decade, then rediscovered in 2009-2011 that ReLU were great, and by 2015 ReLU became dominant. That's typical; there are plenty of examples like that (obvious discoveries not made for decades (e.g. the ReLU discovery in question was really due in the early 1970s), then further ignored for a decade or longer after the initial publication before being rediscovered and picked up). AI could simply try various things lost in the ignored old papers for a huge boost (there is a lot of actionable, pretty strong stuff in those; not everything gets picked up, like ReLU; a lot remains published, but ignored by the field).

Anyone who has attended an AutoML conference recently knows that the field of AutoML is simply too big to deal with (too many promising methods for neural architecture search, for hyperparameter optimization, for metalearning of optimization algorithms; so we have all these things in metalearning and in "AI-generating algorithms" which can provide a really large boost over the status quo if done correctly, but which are too difficult to fully take advantage of because of human cognitive limitations, as the whole field is just "too large a mess to think effectively about").

So it seems that, at least, there is quite a bit of room for a large initial boost over the current human-equivalent capacity. If one starts with a level of top AI researcher at a good company today, t

I don't have any good evidence that humans raised without language per se are less intelligent (if we understand "intelligence" to refer to a general ability to solve new problems). For example, Genie was raised in isolation for the first 13 years of her life, and never developed a first language. Some researchers have, for various reasons, guessed that she was born with average intelligence, but that she, as a 14-year old, had a mental age "between a 5- and 8-year-old". However, here we have the confounding factor that she also was severely abused, and th... (read more)

Joar Skalse*Ω221

I think the broad strokes are mostly similar, but that a bunch of relevant details are different.

Yes, a large collective of near-human AI that is allowed to interact freely over a (subjectively) long period of time is presumably roughly as hard to understand and control as a Bostrom/Yudkowsky-esque God in a box. However, in this scenario, we have the option to not allow free interaction between multiple instances, while still being able to extract useful work from them. It is also probably much easier to align a system that is not of overwhelming intellige... (read more)

Note that this proposal is not about automating interpretability.

The point is that you (in theory) don't need to know whether or not the uninterpretable AGI is safe, if you are able to independently verify its output (similarly to how I can trust a mathematical proof, without trusting the mathematician).

Of course, in practice, the uninterpretable AGI presumably needs to be reasonably aligned for this to work. You must at the very least be able to motivate it to write code for you, without hiding any trojans or backdoors that you are not able to detect.

However, I think that this is likely to be much easier than solving t... (read more)
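[A toy sketch of the "trust the output, not the producer" pattern described here. Everything below is a hypothetical illustration: the "untrusted" solver stands in for an uninterpretable system whose internals we never inspect, and the checker is small enough to audit.]

```python
def untrusted_factor(n: int) -> tuple[int, int]:
    """Stand-in for an untrusted, uninterpretable producer: we make no
    assumptions about how it works, we only look at what it outputs."""
    for p in range(2, int(n**0.5) + 1):
        if n % p == 0:
            return p, n // p
    return 1, n  # no non-trivial factorisation found

def trusted_verifier(n: int, factors: tuple[int, int]) -> bool:
    """Small, fully auditable check: accept the answer only if it is
    demonstrably correct, regardless of who or what produced it."""
    p, q = factors
    return p > 1 and q > 1 and p * q == n

n = 52_183 * 71_143
answer = untrusted_factor(n)
if trusted_verifier(n, answer):
    print("accepted:", answer)  # correctness was checked independently
else:
    print("rejected: output failed independent verification")
```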

  1. This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn't really a problem for the proposal here. The point isn't that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
  2. This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.
2beren
Maybe I'm being silly but then I don't understand the safety properties of this approach. If we need an AGI based on uninterpretable DL to build this, then how do we first check if this AGI is safe?

Yes, I agree with this. I mean, even if we assume that the AIs are basically equivalent to human simulations, they still get obvious advantages from the ability to be copy-pasted, the ability to be restored to a checkpoint, the ability to be run at higher clock speeds, and the ability to make credible pre-commitments, etc etc. I therefore certainly don't think there is any plausible scenario in which unchecked AI systems wouldn't end up with most of the power on earth. However, there is a meaningful difference between the scenario where their advantages ma... (read more)

3johnswentworth
I think this scenario is still strategically isomorphic to "advantages mainly come from overwhelmingly great intelligence". It's intelligence at the level of a collective, rather than the individual level, but the conclusion is the same. For instance, scalable oversight of a group of AIs which is collectively far smarter than any group of humans is hard in basically the same ways as oversight of one highly-intelligent AI. Boxing the group of AIs is hard for the same reasons as boxing one. Etc.

No, I don't have any explicit examples of that. However, I don't think that the main issue with GOFAI systems necessarily is that they have bad performance. Rather, I think the main problem is that they are very difficult and laborious to create. Consider, for example, IBM Watson. I consider this system to be very impressive. However, it took a large team of experts four years of intense engineering to create Watson, whereas you probably could get similar performance in an afternoon by simply fine-tuning GPT-2. However, this is less of a problem if you can... (read more)

To clarify, the proposal is not (necessarily) to use an LLM to create an interpretable AI system that is isomorphic to the LLM -- their internal structure could be completely different. The key points are that the generated program is interpretable and trustworthy, and that it can solve some problem we are interested in. 

What is the exact derivation that gives you claim (1)?

3Ege Erdil
Check the Wikipedia section for the stationary distribution of the overdamped Langevin equation. I should probably clarify that it's difficult to have a rigorous derivation of this claim in the context of SGD in particular, because it's difficult to show absence of heteroskedasticity in SGD residuals. Still, I believe that this is probably negligible in practice, and in principle this is something that can be tested by experiment.
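[For reference, the standard fact being pointed to, stated for the continuous-time idealisation -- which is where the caveat about SGD residuals enters: the overdamped Langevin dynamics

$$d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t$$

has stationary distribution

$$p(\theta) \propto \exp\!\big(-\beta L(\theta)\big),$$

a Gibbs distribution over the loss; treating SGD as an approximation of this process is what underlies the claim, modulo the heteroskedasticity caveat above.]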