This post is an attempt at reformulating some of the points I wanted to make in “Safe exploration and corrigibility” in a clearer way. This post is standalone and does not assume that post as background.

In a previous comment thread, Rohin argued that safe exploration is currently defined as being about the agent not making “an accidental mistake.” I think that definition is wrong, at least in that it both doesn't make much sense and doesn't describe how I actually expect current safe exploration work to be useful.

First, what does it mean for a failure to be an “accident?” This question is simple from the perspective of an engineer outside the whole system—any unintended failure is an accident, which encapsulates the majority of AI safety concerns (i.e. “accident risk”). But that's clearly not what the term “accidental mistake” is pointing at in this context—rather, the question here is what counts as an accident from the perspective of the model. Intuitively, an accident from the perspective of the model should be some failure that the model didn't intend or wouldn't retroactively endorse. But that sort of definition only makes sense for highly coherent mesa-optimizers that actually have some notion of intent. Maybe instead we should be thinking of this from the perspective of the base optimizer/loss function? That is, maybe a failure is an accidental failure if the loss function wouldn't retroactively endorse it (e.g. the model got a very low reward for making the mistake). By this definition, however, every generalization failure is an accidental failure, in which case safe exploration would just be the problem of generalization.

Of all of these definitions, the one that defines an accidental failure from the perspective of the model (a failure that the model didn't intend or wouldn't endorse) seems the most sensible to me. Even assuming that your model is a highly coherent mesa-optimizer such that this definition makes sense, however, I still don't think it describes current safe exploration work, and in fact I don't think it's even really a safety problem. The problem of producing models which don't make mistakes from the perspective of their own internal goals is precisely the problem of making powerful, capable models—that is, it's precisely the problem of capability generalization. Thus, to the extent that it's reasonable to say this for any ML problem, the problem of accidental mistakes under this definition is just a capabilities problem. However, I don't think that at all invalidates the utility of current safe exploration work, as I don't think that current safe exploration work is actually best understood as avoiding “accidental mistakes.”

If safe exploration work isn't about avoiding accidental mistakes, however, then what is it about? Well, let's take a look at an example. Safety Gym has a variety of different environments containing both goal states that the agent is supposed to reach and unsafe states that the agent is supposed to avoid. From OpenAI's blog post: “If deep reinforcement learning is applied to the real world, whether in robotics or internet-based tasks, it will be important to have algorithms that are safe even while learning—like a self-driving car that can learn to avoid accidents without actually having to experience them.” Why wouldn't this happen naturally, though—shouldn't an agent in a POMDP always want to be careful? Well, not quite. When we do RL, there are really two different forms of exploration happening:[1]

  • Within-episode exploration, where the agent tries to identify what particular environment/state it's in, and
  • Across-episode exploration, which is the problem of making your agent explore enough to gather all the data necessary to train it properly.

In your standard episodic POMDP setting, you get within-episode exploration naturally, but not across-episode exploration, which you have to explicitly incentivize.[2] Because we have to explicitly incentivize across-episode exploration, however, it can often lead to behaviors which are contrary to the goal of actually trying to achieve the greatest possible reward in the current episode. Fundamentally, I think current safe exploration research is about trying to fix that problem—that is, it's about trying to make across-episode exploration less detrimental to reward acquisition. This sort of problem is most important in an online learning setting where bad across-episode exploration could lead to catastrophic consequences (e.g. crashing an actual car to get more data about car crashes).
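To make the distinction concrete, here is a minimal sketch (hypothetical code, not taken from Safety Gym or any particular paper) of epsilon-greedy action selection. The greedy branch just exploits the agent's current value estimates; the epsilon branch is across-episode exploration that the training procedure injects so that future updates see more of the state space, and a crude "safe exploration" tweak is to let engineer-supplied knowledge mask out actions known to be catastrophic before that injected randomness can pick them:

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(q_values, epsilon, unsafe_mask=None):
    """Epsilon-greedy selection with an optional engineer-supplied safety mask.

    The greedy branch exploits the current value estimates (any
    information-gathering that the current episode's reward itself
    incentivizes would already be baked into those estimates); the
    epsilon branch is the injected, across-episode exploration that
    nothing in the current episode's reward asks for.  unsafe_mask is a
    stand-in for engineer knowledge about catastrophic actions.
    """
    q = np.asarray(q_values, dtype=float)
    allowed = np.ones_like(q, dtype=bool) if unsafe_mask is None else ~np.asarray(unsafe_mask)
    if rng.random() < epsilon:
        # injected exploration, restricted to actions the engineer allows
        return int(rng.choice(np.flatnonzero(allowed)))
    # exploit current estimates among allowed actions
    return int(np.argmax(np.where(allowed, q, -np.inf)))

# Example: action 2 is known (by the engineer) to crash the robot.
print(select_action([0.1, 0.5, 2.0], epsilon=0.1, unsafe_mask=[False, False, True]))
```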

Thus, rather than define safe exploration as “avoiding accidental mistakes,” I think the right definition is something more like “improving across-episode exploration.” However, I think that this framing makes clear that there are other types of safe exploration problems—that is, there are other problems in the general domain of making across-episode exploration better. For example, I would love to see an exploration of how different across-episode exploration techniques impact capability generalization vs. objective generalization—that is, when is across-episode exploration helping you collect data which improves the model's ability to achieve its current goal versus helping you collect data which improves the model's goal?[3] Because across-episode exploration is explicitly incentivized, it seems entirely possible to me that we'll end up getting the incentives wrong somehow, so it seems quite important to me to think about how to get them right—and I think that the problem of getting them right is the right way to think about safe exploration.
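As one concrete illustration of what "explicitly incentivized" means here (a toy sketch with made-up names, not any specific method), across-episode exploration is often encouraged by adding something like a count-based novelty bonus to the reward, and the coefficient on that bonus is exactly the kind of incentive we could get wrong:

```python
def shaped_reward(base_reward, visit_count, bonus_coef=0.1):
    """Base reward plus a simple count-based novelty bonus.

    The bonus term is where across-episode exploration gets explicitly
    incentivized: it pays the agent for visiting rarely-seen states so
    that training data covers more of the environment.  bonus_coef is a
    knob we set by hand; set it too high and the shaped objective pulls
    the agent away from the base reward it is supposed to be acquiring.
    """
    return base_reward + bonus_coef / (1 + visit_count) ** 0.5
```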


  1. This terminology is borrowed from Rohin's first comment in the same comment chain I mentioned previously. ↩︎

  2. With some caveats—in fact, I think a form of across-episode exploration will be instrumentally incentivized for an agent that is aware of the training process it resides in, though that's a bit of a tricky question that I won't try to fully address now (I tried talking about this somewhat in “Safe exploration and corrigibility,” though I don't think I really succeeded there). ↩︎

  3. This is what I somewhat confusingly called the “objective exploration problem” in “Safe exploration and corrigibility.” ↩︎


(disclaimer: I worked on Safety Gym)

I think I largely agree with the comments here, and I don't really have attachment to specific semantics around what exactly these terms mean. Here I'll try to use my understanding of evhub's meanings:

First: a disagreement on the separation.

A particular prediction I have now, though weakly held, is that episode boundaries are weak and permeable, and will probably be obsolete at some point. There's a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent and we don't have episode boundaries.

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

Second: a learning efficiency perspective.

Another prediction I have now (also weakly held) is that we'll have a smooth spectrum between "things we want" and "things we don't want," with lots of stuff in between, instead of a sharp boundary between "accident" and "not accident".

A small contemporary example is in robotics: often accidents are "robot arm crashes into the table", but a non-accident we still want to avoid is "motions which incur more long-term wear than alternative motions".

An important goal of safe exploration is to learn this quality with high efficiency. A joke about current methods is that we learn not to run into a wall by running into a wall millions of times.

With this in mind, I think for the moment sample efficiency gains are probably also safe exploration gains, but at some point we might have to make a tradeoff, increasing the learning efficiency of safety qualities at the cost of the learning efficiency of other qualities.
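(To make the crash/wear spectrum above concrete, here is a toy per-step cost, with names and numbers made up purely for illustration, in which the hard "accident" term and the soft "wear" term are just components of one signal:)

```python
def step_cost(collision, joint_torques, wear_coef=0.01):
    """Toy per-step cost mixing a hard and a soft safety term.

    `collision` captures the classic accident ("robot arm crashes into
    the table"), while the torque-dependent term stands in for the
    smoother end of the spectrum ("motions which incur more long-term
    wear").  Both are just terms in one cost signal, which is part of
    why the accident / non-accident boundary blurs.
    """
    crash_cost = 100.0 if collision else 0.0
    wear_cost = wear_coef * sum(t * t for t in joint_torques)
    return crash_cost + wear_cost
```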

evhub:

Hey Aray!

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

I agree with this. I jumped the gun a bit by not really making the distinction clear in my earlier post “Safe exploration and corrigibility,” which I think made it a bit confusing, so I went heavy on the distinction here—perhaps heavier than I actually think is warranted.

The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility”: by default you only end up with capability exploration, not objective exploration—that is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that it helps its current goal, not to the extent that it helps it change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in such a way that helps you collect data on what its goal is and how to change it.

A particular prediction I have now, though weakly held, is that episode boundaries are weak and permeable, and will probably be obsolete at some point. There's a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent and we don't have episode boundaries.
Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

My main reason for making the separation is that in every deep RL algorithm I know of there is exploration-that-is-incentivized-by-gradient-descent and exploration-that-is-not-incentivized-by-gradient-descent, and it seems like these should be distinguished. Currently, due to episode boundaries, these cleanly correspond to within-episode and across-episode exploration respectively, but even if episode boundaries become obsolete I expect the question of "is this exploration incentivized by the (outer) optimizer" to remain relevant. (Perhaps we could call this outer and inner exploration, where outer exploration is the exploration that is not incentivized by the outer optimizer.)

I don't have a strong opinion on whether "safe exploration" should refer to just outer exploration or both outer and inner exploration, since both options seem compatible with the existing ML definition.

In a previous comment thread, Rohin argued that safe exploration is best defined as being about the agent not making “an accidental mistake.”

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as being about the agent not making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Like you, I do not think this definition makes sense in the context of powerful AI systems, because it is evaluated from the perspective of an engineer outside the whole system. However, it makes a lot of sense for current ML systems, where the concern is e.g. training self-driving cars without ever having a single collision. You can solve the problem by using the engineer's knowledge to guide the training process. (See e.g. Parenting: Safe Reinforcement Learning from Human Input, Trial without Error: Towards Safe Reinforcement Learning via Human Intervention, Safe Reinforcement Learning via Shielding, Formal Language Constraints for Markov Decision Processes (specifically the hard constraints).)
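(As a toy sketch of what "using the engineer's knowledge to guide the training process" can look like, assuming a generic gym-style environment interface and not claiming to be the algorithm from any of the papers above: an engineer-supplied check can veto actions before the real environment ever executes them.)

```python
class ShieldedEnv:
    """Wraps an environment so that an engineer-supplied is_unsafe(state, action)
    check vetoes actions before they are executed, loosely in the spirit of
    shielding / human-intervention setups.  Toy sketch only."""

    def __init__(self, env, is_unsafe, fallback_action):
        self.env = env
        self.is_unsafe = is_unsafe
        self.fallback_action = fallback_action
        self._state = None

    def reset(self):
        self._state = self.env.reset()
        return self._state

    def step(self, action):
        if self.is_unsafe(self._state, action):
            action = self.fallback_action  # engineer's knowledge overrides the agent
        self._state, reward, done, info = self.env.step(action)
        return self._state, reward, done, info
```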

Fundamentally, I think current safe exploration research is about trying to fix that problem—that is, it's about trying to make across-episode exploration less detrimental to reward acquisition.

Note that this also describes "prevent the agent from making accidental mistakes". I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective, rather than from the engineer's perspective, but I think literally none of the algorithms in the four papers I cited above, or the ones in Safety Gym, could reasonably be said to be improving exploration from the agent's perspective and not the engineer's perspective. I'd be interested in an example of an algorithm that improves across-episode exploration from the agent's perspective, along with an explanation of why the improvements are from the agent's perspective rather than the engineer's.

evhub:

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as being about the agent not making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Ah, I see—thanks for the correction. I changed “best” to “current.”

I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective

No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).

EDIT:

That being said, it's a little bit tricky in some of these safe exploration setups to draw the line between what's part of the base objective and what's not. For example, I would generally include the constraints in constrained optimization setups as just being part of the base objective, only specified slightly differently. In that context, constrained optimization is less of a safe exploration technique and more of a reward-engineering-y/outer alignment sort of thing, though it also has a safe exploration component to the extent that it constrains across-episode exploration.
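For example, in a Lagrangian-style constrained setup the objective might look roughly like the following sketch (a hypothetical function, not any particular implementation), and the same penalty term can be read either as part of the base objective or as a restriction on what the agent will do while exploring:

```python
def lagrangian_objective(expected_return, expected_cost, cost_limit, lam):
    """Constrained RL folded into a single unconstrained objective:

        maximize  E[return] - lam * (E[cost] - cost_limit)

    Viewed one way, the penalty term is just part of the base objective
    (a reward-engineering / outer alignment move); viewed another way,
    it constrains across-episode exploration by making costly states
    unattractive to visit.  In practice lam would itself be adapted,
    e.g. increased whenever the cost constraint is violated.
    """
    return expected_return - lam * (expected_cost - cost_limit)
```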

Note that when across-episode exploration is learned, the distinction between safe exploration and outer alignment becomes even more muddled, since then all the other terms in the loss will implicitly serve to check the across-episode exploration term, as the agent has to figure out how to trade off between them.[1]


  1. This is another one of the points I was trying to make in “Safe exploration and corrigibility” but didn't do a great job of conveying properly. ↩︎

No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).

Wait, then how is "improving across-episode exploration" different from "preventing the agent from making an accidental mistake"? (What's a situation that counts as one but not the other?)

evhub:

Like I said in the post, I'm skeptical that “preventing the agent from making an accidental mistake” is actually a meaningful concept (or at least, it's a concept with many possible conflicting definitions), so I'm not sure how to give an example of it.

Maybe pedantic, but couldn't we just look at the decision process as a sequence of episodes from the POMDP, and formulate the problem in terms of the regret incurred by our learning algorithm in this decision process? In particular, if catastrophic outcomes (i.e., ones which dominate the total regret) are possible, then a low-regret learning algorithm will have to be safe while still gathering some information that helps in future episodes. (On this view, the goal of safe exploration research is the same as the goal of learning generally: design low-regret learning algorithms. It's just that the distribution of rewards in some cases implies that low-regret learning algorithms have to be "safe" ones.)
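For concreteness, one minimal way to write that down (my own conventions, not taken from the comment):

```latex
% Regret over N episodes, with \pi_k the policy used in episode k and
% V^* the optimal expected per-episode return:
\mathrm{Regret}(N) \;=\; \sum_{k=1}^{N} \left( V^{*} - \mathbb{E}\left[ V^{\pi_k} \right] \right)
% If a single catastrophic episode contributes a loss large enough to dominate
% this sum, then any algorithm achieving low Regret(N) must avoid such episodes
% while still exploring enough to shrink the ordinary per-episode gap.
```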