From your paper:
It is interesting to note that AIXI, a Bayes-optimal reinforcement learner in general environments, is not asymptotically optimal [Orseau, 2010], and indeed, may cease to explore [Leike et al., 2015]. Depending on its prior and its past observations, AIXI may decide at some point that further exploration is not worth the risk. Given our result, this seems like reasonable behavior.
Given this, why is your main conclusion "Perhaps our results suggest we are in need of more theory regarding the 'parenting' of artificial agents" instead of "We should use Bayesian optimality instead of asymptotic optimality"?
The simplest version of the parenting idea includes an agent which is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say about how much a Bayesian reasoner will explore, or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior.) There's still a fundamental trade-off between learning and staying safe, so while the Bayes-optimal agent does not do as bad a job of picking a point on that trade-off as the asymptotically optimal agent does, that doesn't quite allow us to say that it will pick the right point on the trade-off. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.
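(For reference, roughly the standard notion, in general-RL notation rather than anything taken from the paper: a Bayes-optimal policy for a prior $w$ over the environment class $\mathcal{M}$ is any

$$\pi^{*}_{w} \in \arg\max_{\pi} \sum_{\nu \in \mathcal{M}} w(\nu)\, V^{\pi}_{\nu},$$

where $V^{\pi}_{\nu}$ is the expected discounted reward of policy $\pi$ in environment $\nu$; every question about how much such an agent explores bottoms out in the choice of $w$.)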
And I'd say it's a conclusion, but not the main one.
This question is probably stupid, and also kinda generic (it applies to many other papers besides this one), but forgive me for asking it anyway.
So, I'm trying to think through how this kind of result generalizes beyond MDPs. In my own life, I don't go wandering around an environment looking for piles of cash that got randomly left on the sidewalk. My rewards aren't random. Instead, I have goals (or more generally, self-knowledge of what I find rewarding), and I have abstract knowledge constraining the ways in which those goals will or won't happen.
Yes, I do still have to do exploration—try new foods, meet new people, ponder new ideas, etc.—but because of my general prior knowledge about the world, this exploration kinda feels different from the kind of exploration that they talk about in MDPs. It's not really rolling the dice; I generally have a pretty good idea of what to expect, even if it's still a bit uncertain along some axes.
So, how do you think about the generalizability of these kinds of MDP results?
(I like the paper, by the way!)
Well, nothing in the paper has to do with MDPs! The results are for general computable environments. Does that answer the question?
Hmm, I think I get it. Correct me if I'm wrong.
Your paper is about an agent which can perform well in any possible universe. (That's the "for all ν in ℳ"). That includes universes where the laws of physics suddenly change tomorrow. But in real life, I know that the laws of physics are not going to change tomorrow. Thus, I can get optimal results without doing the kind of exhaustive exploration that your paper is talking about. Agree or disagree?
Certainly for the true environment, the optimal policy exists and you could follow it. The only thing I'd say differently is that you're only pretty sure the laws of physics won't change tomorrow. But more realistic forms of uncertainty doom us to either forgo knowledge (and potentially good policies) or destroy ourselves. If one slowed down science in certain areas for reasons along the lines of the vulnerable world hypothesis, that would be taking the "safe stance" in this trade-off.
A decent intuition might be to think about what exploration looks like in human children. Children under the age of 5 but old enough to move about on their own—so toddlers, not babies or "big kids"—face a lot of dangers in the modern world if they are allowed to run their natural exploration algorithm. Heck, I'm not even sure this is a modern problem: beyond the electrical sockets and moving vehicles that toddlers don't understand and need to be protected from exploring, they also have to be protected from more traditional dangers they would otherwise definitely check out, like dangerous plants and animals. Of course, since toddlers grow up into powerful adult humans, this is some evidence that, even with these protections, their exploration is enough for them to become capable of functioning in society.
Obviously there are a lot of caveats to taking this idea too seriously, since I've ignored issues related to human development, but I think it points toward something everyday that reflects this result.
Planned summary for the Alignment Newsletter:
In environments without resets, an _asymptotically optimal_ agent is one that eventually acts optimally. (It might be the case that the agent first hobbles itself in a decidedly suboptimal way, but _eventually_ it will be rolling out the optimal policy _given_ its current hobbled position.) This paper points out that such agents must explore a lot: after all, it's always possible that the very next timestep will be the one where chopping off your arm gives you maximal reward forever -- how do you _know_ that's not the case? Since it must explore so much, it is extremely likely that it will fall into a "trap", where it can no longer get high reward: for example, maybe its actuators are destroyed.
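(For reference, the usual formalization is roughly the following, though not necessarily the paper's exact statement: a policy $\pi$ is asymptotically optimal in an environment class $\mathcal{M}$ if for every $\nu \in \mathcal{M}$,

$$V^{*}_{\nu}(h_{<t}) - V^{\pi}_{\nu}(h_{<t}) \to 0 \quad \text{as } t \to \infty,$$

where $h_{<t}$ is the history the agent has actually generated by time $t$; "almost surely" gives the strong version and "in expectation" the weak version.)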
More formally, the paper proves that when an asymptotically optimal agent acts, for any event, either that event occurs, or after some finite time there is no recognizable opportunity to cause the event to happen, even with low probability. Applying this to the event "the agent is destroyed", we see that either the agent is eventually destroyed, or it becomes _physically impossible_ for the agent to be destroyed, even by itself -- given that the latter seems rather unlikely, we would expect that eventually the agent is destroyed.
The authors suggest that safe exploration is not a well-defined problem, since you never know what's going to happen when you explore, and they propose that instead agents should have their exploration guided by a mentor or <@parent@>(@Parenting: Safe Reinforcement Learning from Human Input@) (see also <@delegative RL@>(@Delegative Reinforcement Learning@), [avoiding catastrophes via human intervention](https://arxiv.org/abs/1707.05173), and [shielding](https://arxiv.org/abs/1708.08611) for more examples).
Planned opinion:
In my opinion on <@Safety Gym@>, I mentioned how a zero-violations constraint for safe exploration would require a mentor or parent that already satisfied the constraint; so in that sense I agree with this paper, which is simply making that statement more formal and precise.
Nonetheless, I still think there is a meaningful notion of exploration that can be done safely: once you have learned a good model that you have reasonable confidence in, you can find areas of the model in which you are uncertain, but you are at least confident that it won't have permanent negative repercussions, and you can explore there. For example, I often "explore" what foods I like, where I'm uncertain of how much I will like the food, but I'm quite confident that the food will not poison and kill me. (However, this notion of exploration is quite different from the notion of exploration typically used in RL, and might better be called "model-based exploration" or something like that.)
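As a toy sketch of the kind of procedure I have in mind (hypothetical names and interface, nothing from the paper):

```python
class ToyModel:
    """Hypothetical learned model: maps an option to
    (expected_reward, reward_uncertainty, probability_of_irreversible_harm)."""
    def __init__(self, table):
        self.table = table

    def predict(self, option):
        return self.table[option]


def choose_exploration(options, model, harm_threshold=0.01):
    """Explore only among options the model is confident won't cause permanent
    harm, picking the one whose reward the model is most uncertain about."""
    safe = [o for o in options if model.predict(o)[2] < harm_threshold]
    if not safe:
        return None  # nothing looks safely explorable; defer to a mentor/parent
    return max(safe, key=lambda o: model.predict(o)[1])


model = ToyModel({
    "new_thai_place":    (0.6, 0.40, 0.001),  # unsure how good, clearly not lethal
    "familiar_sandwich": (0.5, 0.05, 0.001),  # already well understood
    "random_mushroom":   (0.7, 0.90, 0.30),   # very informative, possibly deadly
})
print(choose_exploration(list(model.table), model))  # -> new_thai_place
```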
It seems like the upshot is that even weak optimality is too strong, since it has to try everything once. How does one make even weaker guarantees of good behavior that are useful in proving things, without just defaulting to expected utility maximization?
How does one make even weaker guarantees of good behavior
I don't think there's really a good answer. Section 6 Theorem 4 is my only suggestion here.
After a bit more thought, I've learned that it's hard to avoid ending up back at EU maximization - it basically happens as soon as you require that strategies be good not just on the true environment, but on some distribution of environments that reflects what we think we're designing an agent for (or the agent's initial state of knowledge about states of the world). And since that requirement is such an effective tool for penalizing the "just pick the absolute best answer" strategy, it's hard for me to avoid circling back to it.
Here's one possible option, though: look for strategies that are too simple to encode the one best answer in the first place. If the absolute best policy has K-complexity of 10^3 (achievable in the real world by strategies being complicated, or in the multi-armed bandit case by just having 2^1000 possible actions) and your agent is only allowed to start with 10^2 symbols, this might make things interesting.
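(A quick counting check of why this bites, treating "symbols" as bits for simplicity: there are at most

$$\sum_{k=0}^{100} 2^{k} = 2^{101} - 1$$

programs of length at most $10^2$ bits, versus $2^{1000}$ arms, so all but about a $2^{-899}$ fraction of arms cannot even be singled out by any policy in the restricted class; such an agent is forced to exploit structure rather than memorize the single best answer.)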
Maybe optimality relative to the best performer out of some class of algorithms that doesn't include "just pick the absolute best answer?" You basically prove that in environments with traps, anything that would, absent traps, be guaranteed to find the absolute best answer will instead get trapped. So those aren't actually very good performers.
I just can't come up with anything too clever, though, because the obvious classes of algorithms, like "polynomial time," include the ability to just pick the absolute best answer by luck.
Here's a new(ish) paper that I worked on with Marcus Hutter.
If an agent is exploring enough to be guaranteed strong performance in the limit, that much exploration is enough to kill it (if the environment is minimally difficult and dangerous). It's nothing too surprising, but if you're making a claim along these lines about exploration being dangerous, and you need something to cite, this might work.
My attitude towards safe exploration is: exploration isn't safe. Don't do it. Have a person or some trusted entity do it for you. The paper can also be read as a justification of that view.
Obviously, there are many more details in the paper.