Three remarks:
Roughly speaking; more precisely, it seems to be related to average-case complexity separations (e.g. one-way functions).
I still think that the hot stove example is a real problem, although maybe unavoidable. My example starts with "I learned that the hot stove always burns my hand." This is not the exploration part anymore, the agent already observed the stove burning its hand many times. Normally, this would be enough to never touch the hot stove again, but if some unexplained nice things happen in the outside world, there is suddenly no guarantee that the IB agent doesn't start touching the stove again. Maybe this is unavoidable, but I maintain it's a weird behavior pattern that the outside weather might make you lose the guarantee to not touch the stove you already learned is burning. I think it's somewhat different than the truly unavoidable thing, that every agent needs to do some exploration and it sometimes leads to them burning their hand.
Well, it's true that the IB regret bound doesn't imply not touching the hot stove, but: when I think of a natural example of an algorithm satisfying an IB regret bound, it is something like UCB: every once in a while it chooses a hypothesis which seems plausible given previous observations and follows its optimal policy. There is no reason such an algorithm would touch the hot stove, unless there is some plausible hypothesis according to which it is beneficial to touch the hot stove... assuming the optimal policy it follows is "reasonable": see the definition of "robustly optimal policy" below.
The interesting question is: can we come up with a natural formal desideratum, strictly stronger than IB regret bounds, that would rule out touching the hot stove? Some ideas:
Actually, this desideratum makes no sense. Because, if you have two ultra-POMDPs Θ1 and Θ2 s.t. Θ2 is a refinement of Θ1, then they can be true hypotheses simultaneously, but in general there is no policy optimal for both (much less robustly optimal). We could require that, if Θ is a true hypothesis, then the algorithm converges to a robustly optimal policy for some refinement of Θ. This doesn't imply low regret, but we can make it an independent requirement. However, I'm not sure how useful that is. Let's look at a simple example.
There are two observations and two actions . Seeing observation incurs a loss of and seeing observation incurs a loss of . Taking action incurs a loss of and taking action incurs a loss of . These losses add up straightforwardly. For the maximal ignorance hypothesis , "always take action " is the only robustly optimal policy. However, "take action unless you observed , in which case take action " is also optimal (for sufficiently large). Worse, is robustly optimal for some refinements of : maybe taking action causes you to get observation instead of .
Instead of the maximal ignorance hypothesis, we can consider the law which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all possible observation sequences). Now the constant policy above is the sole robustly optimal policy of any refinement of this law! However, to express this as an ultra-POMDP we need to introduce the notion of "Murphy observations" (i.e. Murphy doesn't see the state directly, just like the agent). I'm not sure about the consistency of the robustness desideratum in this case.
Intuitively, I'm guessing it is consistent if we allow randomization that is dogmatically unobservable by Murphy but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios by some more complicated mechanism). One possible interpretation is: Maybe sufficiently advanced agents are supposed to be superrational even in population games. If so, we cannot (and should not) require them to avoid dominated strategies.
More generally, a tentative takeaway is: Maybe the IB agent has to sometimes touch the hot stove, because there is no such thing as "I learned that the hot stove always burns my hand". For a sufficiently rich prior, it is always possible there is some way in which touching the stove beneficially affects your environment (either causally or acausally) so you need to keep exploring. Unless you already have a perfect model of your environment: but in this case you won't touch the stove when pleasantly surprised, because there are no surprises!
Introduction
I wrote this post during my SERI MATS scholarship, explaining the part of infra-Bayesianism that I understood the most deeply. Compared to some previous distillations, I don't go deeply into the mathematical machinery built around infra-Bayesianism; instead, I try to evaluate what results infra-Bayesianism has produced, compared to classical learning theory, on the topic of performance guarantees in general environments, which was the main motivation behind the development of infra-Bayesianism.
The theorems appearing in the post are not original to me, and can be found in either the broader academic literature (concerning classical learning theory) or Vanessa's writings (concerning infra-Bayesianism). I included proofs in the text when I found them useful for general understanding, and moved them to the Appendix when I thought them simple and important enough to write down, but not interesting enough to break the flow of text with them. The later examples investigating the limits of infra-Bayesian learning are my own.
My conclusion is that the general performance guarantees we can currently prove about infra-Bayesian learning agents are pretty weak, but it seems plausible that the theory gets us closer to tackling the question.
For my general thoughts and doubts about infra-Bayesianism, see my other, less technical post: A mostly critical review of infra-Bayesianism.
Classical reinforcement learning
Before we go into the infra-Bayesian explainer, or before we can even really state what the motivating problem of non-realizability is, it's important to get at least a superficial overview of classical reinforcement learning theory, which has been studied in academia for a long time. My main source for this was the book Bandit Algorithms by Lattimore and Szepesvári.
In a simple model of reinforcement learning, the agent is placed in an environment, and on every turn, the agent chooses an action from its finite action set A, and after that, the environment gives an observation from the finite observation set O. There is a bounded loss function ℓ that is known to the agent in advance, assigning real-valued losses to finite histories. Usually, we will work with ℓ:A×O→R loss functions. This choice is motivated by game theory, where the learning process would be a repeated game, and in each turn, the loss is determined by the agent's action and the opponent's action (the observation).
This definition of loss function is already pretty general, because for any loss function that can only take finitely many values, we can encode the loss in the observation itself, which gives the loss function the desired ℓ:A×O→R form. It's important that we required the loss function to be bounded, otherwise we could encounter inconvenient St. Petersburg paradox-style scenarios.
The agent has to take repeated actions either until a time horizon T or indefinitely. If the agent's lifespan is infinite, there needs to be a way to compare overall results instead of just saying "there is infinite loss accumulated anyway", so we introduce a time discount rate γ∈(0,1). We will concentrate on this infinite, time discounted version.
Notation (ΔX). For a space X, the notation ΔX refers to the space of probability distributions over X.
An environment is a function e:(A×O)∗×A→ΔO that assigns a probability distribution over observations to every finite history ending with the agent taking an action.
The agent has a policy π:(A×O)∗→ΔA that assigns a probability distribution over actions to every finite history ending with an observation.
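To make these type signatures concrete, here is a minimal sketch of the action/observation interaction loop, with toy instances of an environment and a policy (all names are mine, for illustration only):

```python
import random

A = ["X", "Y"]  # finite action set
O = [0, 1]      # finite observation set

def environment(history):
    """An environment maps a history ending in an action to a distribution over O.
    This toy one is bandit-like: the distribution depends only on the last action."""
    last_action = history[-1]
    return {0: 0.8, 1: 0.2} if last_action == "X" else {0: 0.5, 1: 0.5}

def policy(history):
    """A policy maps a history ending in an observation to a distribution over A.
    This trivial one always plays X."""
    return {"X": 1.0, "Y": 0.0}

def sample(dist, rng):
    xs, ps = zip(*dist.items())
    return rng.choices(xs, weights=ps)[0]

def interact(policy, environment, turns, seed=0):
    """Run the repeated action/observation loop and return the full history."""
    rng = random.Random(seed)
    history = []
    for _ in range(turns):
        a = sample(policy(history), rng)
        history.append(a)
        o = sample(environment(history), rng)
        history.append(o)
    return history
```

Running `interact(policy, environment, 5)` produces a history of alternating actions and observations, the shared object that both e and π are defined on.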
A policy is good if it ensures that the agent is successful in a range of environments.
Formally, the agent has a hypothesis class H made of environments (H is usually defined to be finite or countably infinite). It wants to use a policy π such that for every environment e∈H, the policy π has low regret with respect to environment e.
Definition 1 (Expected regret).
The expected regret of policy π in environment e with respect to loss function Lγ is defined as the expected loss under policy π minus the minimum expected loss over all policies π∗ in environment e with respect to loss function Lγ:
ER(π,e,Lγ) = Eπ,e(Lγ) − min_{π∗} Eπ∗,e(Lγ).

Here Lγ denotes the overall loss with discount rate γ:
Definition 2 (Lγ).
Lγ = (1−γ) ∑_{t=0}^∞ γ^t ℓ(t),

where ℓ(t) is the agent's loss in turn t.
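As a sanity check on the normalization: the (1−γ) factor makes the discounted loss of a constant per-turn loss c equal to exactly c, since it cancels the geometric series. A small numerical sketch (function name is mine):

```python
def discounted_loss(losses, gamma):
    """L_gamma computed over a finite list of per-turn losses ℓ(0), ℓ(1), ..."""
    return (1 - gamma) * sum((gamma ** t) * l for t, l in enumerate(losses))

# A constant per-turn loss of 1 over a long horizon:
# (1 - gamma) * sum of gamma^t = 1 - gamma^N, which is approximately 1.
approx_one = discounted_loss([1.0] * 100000, gamma=0.999)
```

This normalization is what lets us compare losses across different discount rates on the same 0-to-1 scale.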
Intuitively, an agent with a good policy takes some time to explore the environment; then, as the agent becomes reasonably certain which environment it is in, it starts exploiting the environment, although potentially still taking exploratory steps occasionally.
As the discount rate γ goes to 1, the agent has more and more time to freely explore its environment, because 1000 rounds of exploration is still not too costly compared to the overall utility. So hopefully it will be true that whichever environment e∈H the agent is placed in, it will have time to decide which environment it is in, with a very low probability of mistake, then start to play an almost-optimal strategy for that environment, thus ensuring that the regret is low. If this holds, we call the hypothesis class H learnable. Formally,
Definition 3 (Learnability).
A hypothesis class H is said to be learnable if there exists a family of policies {πγ} such that for every environment e∈H,

lim_{γ→1} ER(πγ,e,Lγ) = 0.

It's important to note that not every hypothesis class is learnable. For example, take a hypothesis class that contains these two environments: in environment E, if the agent makes move X in the first turn, it goes to Heaven (receives minimal loss in every turn from then on), otherwise it goes to Hell (receives maximal loss in every turn from then on). Environment F is just the reverse. Then it's impossible to construct a policy that has low regret with respect to both environments E and F.
This is a classical example of a trap: an action that has irrevocable consequences. Generally speaking, neither classical nor infra-Bayesian learning theory has a good model of what agents are supposed to do in an environment that might contain traps. This is a big open question that seriously limits our mathematical formalism of real-world agency (as the real world is full of traps, most notably death). But this is an independent concern that infra-Bayesianism doesn't try to address, so we are only looking at learnable hypothesis classes now.
Fortunately, some natural hypothesis classes turn out to be learnable, as we will discuss later. The heuristic is that if no mistake committed in a short period of time is too costly according to any environment in the class, then the class will be learnable.
Before we look at examples, it's useful to establish the following simple theorem:
Theorem 4.
If every finite subset of the countable hypothesis class H is learnable, then H itself is learnable.
Now we can look at examples of learnable hypothesis classes:
Definition 5 (Bandit environment).
A bandit is an environment in which, in every turn, the probability distribution of the observation only depends on the action taken in that turn. A bandit environment can be characterized by a function B:A→ΔO.
The "bandit" name comes from the illustration that the agent is placed in a casino with various slot machines, aka "one-armed bandits" and the action is choosing the slot machine, and each machine has a probability distribution according to which the agent gets an observation (the amount of money falling out of the machine).
Theorem 7. A class H of bandit environments is always learnable.
Proof. We will only prove the statement for countable classes H, but it's not hard to extend the proof further. By Theorem 4, it's enough to prove the statement for finite classes to obtain a proof for every countable class.
For a finite hypothesis class H, it's enough to just always play the action that produced the lowest loss on average in the past, and occasionally do exploratory moves and try the other actions too, but with a frequency decreasing towards 0.
As γ→1, and the agent lives longer in expectation, the action producing the smallest loss in expectation will become the one with the smallest average loss, due to the law of large numbers. Then, the agent will mostly play that optimal action, so its average regret goes to 0. ◻
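The proof sketch above, "follow the empirically best action, but explore with a frequency decreasing towards 0", can be written out directly. This is my own toy illustration (an ε-greedy-style rule, not an optimized bandit algorithm from the literature):

```python
import random

def egreedy_bandit(bandit_losses, turns, seed=0):
    """Play the action with the lowest average observed loss, exploring
    uniformly at random with frequency t^(-1/2), which decreases to 0.
    bandit_losses: dict mapping each action to a loss-sampling function."""
    rng = random.Random(seed)
    actions = list(bandit_losses)
    totals = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    history = []
    for t in range(1, turns + 1):
        if rng.random() < t ** -0.5 or 0 in counts.values():
            a = rng.choice(actions)  # exploratory move
        else:
            # exploit: the empirical leader so far
            a = min(actions, key=lambda x: totals[x] / counts[x])
        loss = bandit_losses[a](rng)
        totals[a] += loss
        counts[a] += 1
        history.append((a, loss))
    return history

# Action "X" has expected loss 0.2, action "Y" has expected loss 0.8.
hist = egreedy_bandit(
    {"X": lambda rng: float(rng.random() < 0.2),
     "Y": lambda rng: float(rng.random() < 0.8)},
    turns=5000,
)
late_plays = [a for a, _ in hist[-1000:]]
```

After a few thousand turns, the law of large numbers has identified "X" as the better arm, so almost all late plays are "X", matching the proof's intuition that the average regret vanishes.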
We could be interested not just in learnability, but in what regret bounds are achievable given a time discount rate γ or time horizon T. For this, we would need to find the optimal rate to decrease the frequency of exploratory steps. There is a nice answer to this question in the literature, but we don't go into details here, as the exact regret bounds are not crucial for the main topic of this post.
A broader hypothesis class we will use is finite-state communicating POMDPs.
(We could also look at Markov Decision Processes (MDPs) as an intermediate step, but we are ignoring them for this post.)
Definiton 8 (POMDP).
A Partially Observable Markov Decision Process consists of a set of hidden states S, a stochastic transition function f:S×A→ΔS (where A is the set of the agent's possible actions), and an observation function F:S→O.
Definition 9 (Finite-state communicating POMDP).
A POMDP in which S is finite and has the property that for every s1,s2∈S, an agent can have a policy π such that starting from state s1 and following policy π, it eventually gets to state s2 with probability 1.
Theorem 10. A hypothesis class made of finite-state communicating POMDPs is always learnable.[1]
There are some learnable hypothesis classes even broader than this, but they are enough for our purpose now, as many systems can be modelled as finite-state communicating POMDPs: the world can be in one of a number of states, you can't see everything in the world, but you have some limited observations and corresponding rewards, and you can make some reversible changes in the world.
Bayes-optimal policies
Instead of ensuring that the regret is low with respect to every individual hypothesis in the class, the agent can aim for the weighted average of regrets being low.
Definition 11 (Bayesian regret). Given a prior distribution ζ on the hypothesis class H, the Bayesian regret of a policy π is

Ee∼ζ ER(π,e,Lγ).

Theorem 12.
Given a countable hypothesis class H and a non-dogmatic prior ζ on H (non-dogmatic meaning that ζ(h)>0 for all h∈H), and a family of policies πγ, then
lim_{γ→1} Ee∼ζ ER(πγ,e,Lγ) = 0

is equivalent to

lim_{γ→1} ER(πγ,e,Lγ) = 0 for all e∈H,

that is, the class H being learnable and {πγ} being a good learning algorithm.
This result might suggest that Bayesian regret doesn't give us much value over considering regret with respect to individual hypotheses, but we will see that it has important applications.
Most importantly, now that (given a prior distribution) we can assign a single quantity to the policy's performance, we can look at the policy that minimizes the Bayesian regret:
Definition 13 (Bayes-optimal policy).
Given a prior distribution ζ over the hypothesis class H, the policy π is said to be Bayes-optimal if it minimizes the Bayesian regret. This is equivalent to the policy minimizing

Ee∼ζ Eπ,e(Lγ).

Using the previous result, we can prove the following, rather unsurprising statement:
Theorem 14.
Given a learnable, countable hypothesis class H and a non-dogmatic prior ζ on H, if we take the Bayes-optimal policy πγ for every γ∈(0,1), then this family of policies learns the hypothesis class H.
As we will see later, it's often easier to prove nice properties about Bayes-optimal policies than just policies with low regret bound in general.
However, while there are some famous algorithms in the literature that ensure low regret bounds (like UCB, UCRL, Exp3), it's usually computationally intractable to find the Bayes-optimal policy. In general, it's more reasonable to expect that a real-life algorithm will achieve some low regret bounds than that it will be literally the best, Bayes-optimal policy.
Thus, both in the academic literature and in our work, we usually want to prove more general theorems in the form "Every policy that has low regret bounds with respect to this hypothesis class, has that property" instead of the much narrower claim of "The Bayes-optimal policy has that property". Still, when we can't prove something stronger, we sometimes fall back to investigating the Bayes-optimal policy.
Non-realizability and classical learning theory
The big problem of classical learning theory is that we have no guarantees that a policy selected for having low regret with respect to the environments in its hypothesis class will also behave reasonably when placed in an environment that is not contained in its hypothesis class. (An environment not in the hypothesis class is called non-realizable.)
Unfortunately, however broad a hypothesis class the agent has, we can't expect it to have a model of itself or an equally complex agent encoded in any of its hypotheses about the world. This is true both for practical reasons (it can't contain in its head a model of the world bigger than itself) and for theoretical reasons (if the agent could have a model including itself, it could have a policy of doing the opposite of what its model predicts, which would lead to an obvious paradox). Also, even excluding these situations, the world is just very big, and it's unrealistic to assume that an AI could have the "true state of the world" fully modeled as one of its hypotheses. So it would be nice if we had some guarantees about the agent's performance in environments that are not directly encoded as one of its hypotheses.
A naive hope would be that maybe it's not a big problem, and if an algorithm learns a broad enough hypothesis class, then it necessarily generalizes to do reasonably well even outside the class. Let's look at some concrete examples showing that this is not the case.
The 010101... failure mode
The agent's loss function is this: any time it plays action X, it gets 0 loss, any time it plays action Y, it gets 1 loss, regardless of the observations. The environment is oblivious, so the observations don't depend on the agent's actions. Obviously, the right choice would be to always play X.
Suppose the agent's hypothesis class consists only of bandits. Then, surprisingly, the following policy guarantees extremely low expected regret with respect to every environment in the hypothesis class:
"Start with action Y, and play action Y as long as the observation history looks like 010101... After the first deviation from this sequence, take action X ever after."
As the pattern 010101... is very unlikely to continue for long under any bandit environment, this policy has very low expected regret. In particular, the bandit with respect to which this policy has the highest expected regret (as γ goes to 1) is the one in which, after an action Y, the observations 0 and 1 both have probability 1/2. In that environment, the expected regret is

(1−γ) ∑_{k=0}^∞ γ^k (1/2)^k = (1−γ)/(1−γ/2),

which goes to 0 as γ goes to 1.
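We can verify this geometric-series calculation numerically (my own sanity check): each extra turn of playing Y is only reached with probability (1/2)^k in the worst bandit, and the truncated sum matches the closed form.

```python
def expected_regret_0101_policy(gamma, terms=10000):
    """Expected regret of the '010101... failure mode' policy in the worst bandit:
    the k-th turn of playing Y is reached with probability (1/2)^k and costs
    a discounted regret of gamma^k per unit loss."""
    return (1 - gamma) * sum((gamma * 0.5) ** k for k in range(terms))
```

The sum equals (1−γ)/(1−γ/2) up to a negligible truncation error, and for γ = 0.999 the regret is already around 0.002.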
It's good to note that the failure mode doesn't need to be just one specific history, it can leave some wiggle room, like the following policy:
"Always play X except if in the history so far the difference between the number of 1s and 0s is at most √n (where n is the length of your history so far) and there were at most √n instances when two observations coming after each other were equal. In that case, play Y."
The probability of observing a long sequence like that is still vanishingly unlikely in any bandit environment, so the policy is allowed to have this broader failure mode too. This means that if the agent is placed in an environment that mostly produces a 01010... sequence, then postulating a little random noise in the environment doesn't save the agent from predictable failure.
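The trigger condition of the quoted "wiggle room" policy can be written out explicitly; this is my own rendering of the rule in code:

```python
import math

def plays_Y(observations):
    """True iff the quoted 'wiggle room' condition holds, i.e. the history
    still looks like a noisy 0101... alternation: the counts of 1s and 0s
    differ by at most sqrt(n), and at most sqrt(n) adjacent pairs are equal."""
    n = len(observations)
    diff = abs(observations.count(1) - observations.count(0))
    equal_adjacent = sum(
        1 for i in range(n - 1) if observations[i] == observations[i + 1]
    )
    return diff <= math.sqrt(n) and equal_adjacent <= math.sqrt(n)
```

A perfectly alternating history satisfies the condition forever, while a typical history from any bandit environment violates it quickly, which is exactly why this failure mode is cheap inside the hypothesis class but fatal outside it.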
I somewhat feel that this is a contrived counter-example: why would an agent have this specific failure mode? I will need to look into Selection theorems in the future; according to my understanding, that agenda is about figuring out which algorithms are likely to be selected during real-life training, and formalizing the intuition of "Come on, why would we get an algorithm that has this specific failure mode?" But this is not our agenda; we are looking for provable guarantees, and this example already refutes the naive hope that low regret inside the hypothesis class would always lead to good performance outside the class.
The square-free failure mode
Sure, the 010101... environment can fool an agent that only prepares for bandit environments in its hypothesis class. But maybe if the hypothesis class is broad enough, then a policy that learns the class is really guaranteed to behave reasonably outside the class too! In particular, there is a simple finite-state communicating POMDP that produces the 010101... sequence of observations, so if the agent's hypothesis class is the set of finite-state communicating POMDPs, then a policy that learns it does well in the 010101... environment.
Unfortunately, just broadening the hypothesis class a little doesn't solve the problem: the policy can always have a failure mode, a specific sequence of observations that is very unlikely to happen according to any of the hypotheses in the class, but might easily be encountered in the wild. In the case of a hypothesis class made of finite-state communicating POMDPs, all hypotheses suggest that there is a vanishingly low probability of observing the sequence which is 1 for square-free indices and 0 otherwise, so the policy having a failure mode for that sequence doesn't cause much expected regret.
It's important to emphasize that the general problem is not "Well, maybe the agent will encounter a specific random string of observations in the wild and it happens to exactly match its failure mode." No, if the failure mode string was truly random and unexceptional, then there wouldn't be a significant risk of the true environment producing it either. The real problem is that the failure mode can also be a highly non-random string that could in principle be very predictable from the real environment (like the square-free numbers can be a naturally occurring sequence), but as the agent doesn't have the true environment in its hypothesis class, it can easily happen that a string which is very natural in the true environment is considered very unlikely according to all the agent's hypotheses, and therefore the agent is allowed to have it as a failure mode.
Bayes-optimal policies
Let's be less ambitious then: If a policy doesn't just have low regret on a hypothesis class, but it's the single best policy given a prior (it's Bayes-optimal), then is it enough to guarantee good performance outside the class?
Sometimes, sort of! Let's assume for now that we only care about oblivious environments; that is, we can assume that the agent's observations are not influenced at all by its actions. Let's also assume that the agent knows this: its hypothesis class includes only oblivious environments. Further, suppose that there is at least one environment with non-zero prior in the hypothesis class according to which every finite sequence of observations has at least a tiny chance of happening. (For example, the environment in which observations are independently random with uniform distribution over O.) In that case, a Bayes-optimal policy never performs dominated actions: that is, if there are actions a and b with ℓ(a,o)<ℓ(b,o) for all possible observations o, then the agent never takes the action b that is always worse. This is easy to prove: since changing an action from b to a in any situation only decreases the expected loss, the Bayes-optimal policy can never play b.
This is not a very impressive result in itself, especially since, even if the agent is actually placed in an oblivious environment, if its hypothesis class contains non-oblivious environments too, then it's suddenly not clear whether the agent will learn not to play dominated actions.
Bayes-optimal policy being deceived by Morse-codes
The agent's loss function is this: any time it plays action X, it gets 0 loss; any time it plays action Y, it gets 1 loss; and in addition to that, any time the observation is 1, it gets 10 loss, and any time the observation is 0, it gets 0 loss. The environment is oblivious, so the observations don't depend on the agent's actions. Obviously, the right choice would be to always play X.
However, the agent doesn't know for certain that the environment is oblivious, it has some prior probability on hypotheses according to which action Y makes observation 0 more likely, in which case it could be better to play Y.
So the agent does some experimentation, but doesn't find any statistical correspondence between its actions and the observations (as the environment really is oblivious). But before it could pivot to just playing action X, it realizes that the observations so far are repetitions of this Morse-code: "Your actions didn't matter so far. But if you play action Y until the 1000th turn, then you will be rewarded with 1000 turns of observation 0!"
The agent dutifully plays Y until the 1000th turn. Then the promised reward of long strings of 0s doesn't arrive, instead a new Morse-code starts repeating: "Sorry, I lied last time, but I'm telling the truth now, promise! If you play Y until the 10000th turn, you will be rewarded with 10000 observations 0!" And so on.
Will the agent fall for that? Depends on the prior. It can have a prior according to which, for every n, a hypothesis of the form "I get lots of Morse-code warning signs until turn 10^n, and if I comply, I get the promised reward of 10^n turns of observation 0" has relatively high prior probability, while the Morse-code appearing has low probability according to all other hypotheses. In that case, seeing the Morse-code can make the agent comply again and again.
After some time, the promise of future reward becomes less appealing because of the time discount rate. When exactly? The agent needs to take an action now, at time t, that causes 1 loss now, but will lead to 10 reward at time 10t if the "Morse-code tells the truth" hypothesis is true, which can have probability at least 1/2 given the right prior, because every other hypothesis is very unlikely after observing the repetition of the Morse-code a few times. The agent takes the loss now if γ^{9t} > 1/5, in other words if t < −ln5/(9 lnγ).
For γ close to 1, lnγ is approximately γ−1. So the agent will take the bait until approximately turn t = ln5/(9(1−γ)), which is a ln5/9 ≈ 0.18 fraction of its expected lifespan, so falling for the trap this long causes a non-negligible regret.
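A quick numerical check of this threshold (my own): the agent pays 1 now for a believed-with-probability-1/2 payoff of 10 at time 10t, so the bait is worth taking exactly when 5·γ^(9t) > 1.

```python
import math

def takes_bait(t, gamma):
    """Is it worth paying 1 at time t for a promised 10 at time 10t,
    believed with probability 1/2? Compare 0.5 * 10 * gamma**(10*t)
    against gamma**t, i.e. check 5 * gamma**(9*t) > 1."""
    return 5 * gamma ** (9 * t) > 1

gamma = 0.999
# The approximate cutoff ln(5) / (9 * (1 - gamma)), close to -ln(5) / (9 * ln(gamma)).
threshold = math.log(5) / (9 * (1 - gamma))
```

For γ = 0.999 the cutoff is near turn 179, i.e. roughly 18% of the expected lifespan 1/(1−γ) = 1000, matching the ln5/9 fraction in the text.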
We already knew that if the hypothesis class contains traps (possibilities that a finite string of actions has indefinite, irrevocable consequences), then the agent can fail to learn the hypothesis class. But note that this is not the case here: none of the considered hypotheses are traps, as none of them assumes irrevocable consequences of actions. A particular hypothesis just posits that some actions have consequences that last 10^n turns into the future; that's not really a trap, as the effect of any such fixed-length consequence becomes negligible as γ→1.
So the hypothesis class H itself is learnable, but the environment e that repeatedly produces false Morse-codes is not in H, so we have no guarantee that a Bayes-optimal policy of H would work well in e, and indeed, for certain priors over H, the Bayes-optimal policy will play the suboptimal action Y for most of its life when placed in environment e.
Can Solomonoff predict the even bits?
It seems that the previous example needed a contrived prior. Can similar anomalies occur with a more natural, Solomonoff prior? Intuitively, it feels less likely: the hypotheses predicting Morse-codes warning of true consequences shouldn't have too big prior probabilities, and more importantly, they shouldn't have a bigger prior probability than hypotheses conveying exactly the wrong message: maybe you will get a reward of a long string of 0s if you do the opposite of the message and play X!
Lattimore, Hutter and Gavane investigate a closely related question in their 2011 paper Universal Prediction of Selected Bits. They take a simpler case than the one we looked at: The agent already knows that the environment is oblivious, it's a simple sequence prediction task. The agent starts with a Solomonoff prior over computable binary sequences, then it updates on the bits observed so far, and based on this, makes a prediction for the next bit in every step.
The sequence it observes has 0 on every even bit, however the odd bits follow an arbitrary pattern that's not even necessarily computable. Still, any human would figure out the simple rule about the even bits, so we would hope that Solomonoff induction, which is supposed to be the Idealized Sequence Predictor, can also do that.
Let pi be the probability the agent gives for bit 2i being 0 after it has observed all the previous bits. Is it true that for all sequences in which every even bit is 0,
lim_{i→∞} pi = 1?

I could have imagined the answer being "No": the even bits are always 0, while the odd bits always contain a Morse-code message like "The bit 2n will be an exception!", always changing the prediction to a bigger n immediately after the last prediction failed. Maybe something like this could deceive Solomonoff induction into occasionally predicting a low probability of 0 on these prophesied 2n bits, in which case the limit wouldn't be 1.
Long story short, it follows as a special case from Theorem 10 of the paper that Solomonoff[2] can't be fooled like that, and it does in fact always give close to 1 probability of 0 on the even bits, given long enough time.
Okay, and what happens if the even bits are not deterministically 0, but have value 1 with probability 20%, and have value 0 with probability 80%, independently of each other and of the odd bits? Will pi converge to 80% with probability 1 for any given sequence on the odd bits?
The answer is that we don't know; this is a special case of Open Question 1 from the paper. My personal conjecture is yes, but it's probably hard to prove, and it remains a gap in our knowledge about Solomonoff induction's performance in non-realizable settings.
Would an expected utility maximizer using Solomonoff induction also learn to always play X in the previous example? The example requires the agent to have non-oblivious environments in its hypothesis class, but there still shouldn't be traps according to any of its hypotheses, otherwise the class wouldn't be learnable. So we should restrict the Solomonoff prior to a set of interactive Turing machines that don't contain traps, for the question to make sense at all. I haven't thought deeply about what would be natural choices for such sets, and I remain genuinely unsure whether a Solomonoff-based expected utility maximizer would learn to always play X in every oblivious environment. Lattimore, Hutter and Gavane's similar result indicates yes, but the possibility of large future rewards might make a big difference between interactive expected utility maximization and simple sequence prediction.
Infra-Bayesian learning theory
The creation of infra-Bayesianism was motivated by these examples. When we are constructing a universal model of learning agents, it shouldn't be a hard theorem that it is able to learn that every second bit is 0. This should be obvious! After all, that's how humans learn about the world: we don't have a full probability distribution over how the world can be like to the atomic level, but we do learn a few regularities about our environment, like every time we touch the hot stove, it burns our hand. After we learn this law about the world, we might still have huge uncertainties about the rest of the universe, but we still stop touching that damn stove.
Motivated by this intuition, the hypothesis class of an IB learning agent is not made of environments, but laws. (Important note: laws were named belief functions in the original Infra-Bayesianism sequence, but were later renamed.) A law describes some regularities of the environment, while the agent can remain in Knightian uncertainty about everything else. For simplicity, we will focus on so-called crisp infra-POMDP laws, which is a general enough definition for our purposes now.
First, a quick notation:
Notation 15 (□X).
For a space X, the notation □X denotes the space of closed convex subsets of ΔX (the space of probability distributions over X).
Definition 16 (Crisp infra-POMDP laws).
A crisp infra-POMDP law Θ is specified by a space of states S, an initial state s0∈S, an observation function b:S→O and a transition kernel T:S×A→□S. The law is that in every turn, the distribution of the next state must be in the closed convex set T(s,a)∈□S, where s is the current state and a is the action the agent just took. The agent can't observe which state it's in, but after every turn, it observes b(s′)∈O, where s′ is the current state at the end of the turn.
From now on, whenever we write "law" we mean "crisp infra-POMDP law", at least in this post.
Given a law Θ, the agent tries to minimize its loss in the worst case scenario allowed by Θ. We personify the "worst case scenario" as the malevolent deity, Murphy. We assume that Murphy knows the policy π of our agent, and in every turn, chooses a distribution from T(s,a) in a way that maximizes the expected loss of the agent over its lifespan, given the agent's policy.
Definition 17 (Expected loss with respect to a law).
Given an infra-POMDP law Θ,
EΘ(π)(Lγ)=maxψE(π,ψ,Lγ), where ψ:S×V→ΔS with ψ(s,v)∈T(s,a) is a counter-policy of Murphy to the agent's policy π: given the history so far and the state they are in, Murphy decides which distribution inside T(s,a) the next state should be sampled from. (V=(A×O)∗×A denotes the set of finite histories.)
So given a law Θ, our agent wants to choose the policy that minimizes this worst-case EΘ(π)(Lγ).
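To make this worst-case semantics concrete, here is a minimal sketch (the state space, losses, and policy are all made up for illustration, not from the post). Each credal set is given by its finitely many extreme points, and Murphy's best counter-play is computed by recursion over a finite horizon; it suffices for Murphy to check extreme points, because expected loss is linear in the chosen distribution:

```python
import functools

GAMMA = 0.9
HORIZON = 20

# T[s][a] = extreme points of the credal set T(s, a), each a dict state -> prob.
# Toy law: action "X" freezes the state, action "Y" lets Murphy move it freely.
T = {
    0: {"X": [{0: 1.0}], "Y": [{0: 1.0}, {1: 1.0}]},
    1: {"X": [{1: 1.0}], "Y": [{0: 1.0}, {1: 1.0}]},
}
LOSS = {0: 0.0, 1: 1.0}  # loss incurred on entering each state

def policy(t):
    # A fixed, history-independent agent policy, just for illustration.
    return "Y" if t < 5 else "X"

@functools.lru_cache(maxsize=None)
def worst_case_loss(s, t):
    """Discounted loss under Murphy's best counter-play from state s at turn t.
    Expected loss is linear in the chosen distribution, so Murphy only needs
    to consider the extreme points of the credal set."""
    if t == HORIZON:
        return 0.0
    a = policy(t)
    return max(
        sum(p * (LOSS[s2] + GAMMA * worst_case_loss(s2, t + 1))
            for s2, p in dist.items())
        for dist in T[s][a]
    )

# Murphy pushes the agent into the lossy state during the Y-turns and X keeps
# it there, so this equals sum(GAMMA**t for t in range(HORIZON)).
print(worst_case_loss(0, 0))
```

In this toy law the agent's early Y-turns hand Murphy full control, and the recursion recovers exactly the discounted all-ones loss stream.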
How does learning happen?
Analogously with the classical reinforcement learning agent, the infra-Bayesian RL agent also has a hypothesis class, but not made of environments but of laws. It aims to have a policy that has low infra-Bayesian regret with respect to every law in its hypothesis class.
Definition 18 (Regret in infra-Bayesianism).
A policy π's regret with respect to a law Θ, given the loss function Lγ is
R(π,Θ,Lγ)=EΘ(π)(Lγ)−minπ∗EΘ(π∗)(Lγ). Analogously with the classical case, we can also define learnability:
Definition 19 (Infra-Bayesian learnability).
A hypothesis class H is said to be learnable if there exists a family of policies {πγ}, such that for every law Θ∈H,
limγ→1R(πγ,Θ,Lγ)=0. The analog of Theorem 4 applies to infra-Bayesian hypothesis classes too: if every finite subset of a countable H is learnable, then H itself is learnable. The proof is exactly the same.
Again, not every hypothesis class is learnable: every environment is a law, so the previous counter-example with the two environments sending the agent to Heaven or Hell based on its first move is a good counter-example here too. But again, many natural hypothesis classes turn out to be learnable. And compared to classical learning theory, we can have more hope that a policy that has low regret with respect to some infra-Bayesian hypotheses will behave reasonably in every environment. After all, if it's prepared to do well in a world governed by a malevolent deity, then a real environment shouldn't be worse.
To understand these concepts better, let's restrict our attention to a special type of laws: crisp infra-bandit laws.
Definition 20 (Infra-bandit).
A crisp infra-bandit is a function B:A→□O.
An example of a crisp infra-bandit is: "If the agent takes action X, then the probability of observation 1 in that turn is between 17% and 31%. If the agent takes action Y, then the probabilities of observations 2 and 3 are equal in that turn." We can check that in this example, B(X) and B(Y) are both closed convex sets of distributions.
Moreover, it's easy to see that every infra-bandit law is an infra-POMDP: Take an S such that b(s) is a bijection between S and O and let T(s,a)=B(a) for all s∈S and a∈A.
Given an infra-bandit law, the agent needs to prepare for the worst case scenario, that is, playing against Murphy, who is only constrained by the infra-bandit law.
After our agent plays action a, Murphy can choose a distribution from B(a) and then an observation is sampled from that distribution. The agent wants to choose a policy that keeps its loss low when playing against Murphy.
In the example above, if observation 1 implies a huge loss compared to 2 and 3, then the agent should take action X, because then Murphy will be constrained to pick a distribution in which the probability of 1 is at most 31%. On the other hand, if the agent took action Y, then Murphy would choose the distribution in which observation 1 has probability 1.
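This calculation can be sketched in code. The credal sets below come from the example, while the concrete losses (10 for observation 1, 1 for observations 2 and 3) are made up for illustration; the sketch computes each action's worst case ca and the robust choice:

```python
# Hypothetical losses: observation 1 is the bad one.
LOSS = {1: 10.0, 2: 1.0, 3: 1.0}

# Credal sets as extreme points (distributions over observations):
# B(X): P(1) in [0.17, 0.31], remainder split freely between 2 and 3.
# B(Y): P(2) = P(3), with P(1) otherwise unconstrained.
B = {
    "X": [{1: 0.17, 2: 0.83, 3: 0.0}, {1: 0.17, 2: 0.0, 3: 0.83},
          {1: 0.31, 2: 0.69, 3: 0.0}, {1: 0.31, 2: 0.0, 3: 0.69}],
    "Y": [{1: 1.0, 2: 0.0, 3: 0.0}, {1: 0.0, 2: 0.5, 3: 0.5}],
}

def murphy_loss(action):
    # Murphy maximizes expected loss; by linearity, extreme points suffice.
    return max(sum(p * LOSS[o] for o, p in dist.items()) for dist in B[action])

c = {a: murphy_loss(a) for a in B}
robust_action = min(c, key=c.get)
print(c, robust_action)   # worst cases ~3.79 and 10.0; robust choice is "X"
```

Under X, Murphy's best he can do is put the full 31% on the bad observation; under Y, nothing stops him from putting all the mass on it, so the robust agent plays X.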
It's important to note that Murphy sees the full policy of the agent, so Murphy is not required to always shortsightedly choose the distribution that will cause the biggest expected loss in that particular turn: knowing the agent's learning algorithm, Murphy can choose not to cause maximal loss for some turns if it provokes the agent to learn a wrong lesson and start acting in a way such that Murphy will be able to cause him larger harm in the future. So the agent needs to prepare a policy that's not susceptible to this kind of deception.
Theorem 21. A hypothesis class H made of infra-bandits is always learnable.
Proof. Again, we only prove it for countable hypothesis classes, and because of the infra-Bayesian version of Theorem 4, it's equivalent to proving the learnability of finite H.
To have the regret bound go to 0 as γ goes to 1, it's enough to always play the action that has had the lowest average loss so far, and occasionally explore all the other actions, with frequency decreasing to 0 but never disappearing. Call this policy π.
Given an infra-bandit B∈H, the best policy would be to always play the action a for which
ca=maxθ∈B(a)Eo∼θℓ(o,a) is minimal, that is, the action given which Murphy can cause the smallest loss. If we call π∗ the policy of always playing action a, then for every γ, the IB expected loss is
EB(π∗)(Lγ)=ca. On the other hand, for any ε>0, if the agent follows policy π and Murphy really is constrained by law B, then as γ→1, the agent will observe with probability approaching 1 that its average loss when playing action a is at most ca+ε. So whatever trick Murphy is playing, the action the agent plays on its non-exploratory turns will always have an average loss so far of at most ca+ε: if its average loss were higher than that, the agent would just switch to playing action a. This means that as γ→1, the agent's average loss will be at most ca, so
limγ→1EB(π)(Lγ)=ca. This means that
limγ→1R(π,B,Lγ)=0 for every B∈H, so the hypothesis class is learnable. ◻
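To illustrate the proof's mechanism, here is a rough simulation (my own toy numbers, not from the post) of a follow-the-leader policy with decaying exploration, playing against one particular deceptive Murphy who makes the worse action look free for a while and then turns it maximally bad:

```python
import random

random.seed(0)

ACTIONS = ["a", "b"]
C = {"a": 0.3, "b": 0.8}   # worst-case per-turn loss of each action (made up)

def murphy(action, t):
    """A deceptive Murphy: makes action "b" look free for 1000 turns, then
    inflicts its worst-case loss; "a" always sits at its worst-case level."""
    if action == "a":
        return C["a"]
    return 0.0 if t < 1000 else C["b"]

T = 200_000
total = 0.0
counts = {x: 0 for x in ACTIONS}
sums = {x: 0.0 for x in ACTIONS}

for t in range(T):
    if random.random() < 1 / (1 + t) ** 0.5:   # exploration, decaying to 0
        act = random.choice(ACTIONS)
    else:                                       # follow the leader
        act = min(ACTIONS, key=lambda x: sums[x] / max(counts[x], 1))
    loss = murphy(act, t)
    counts[act] += 1
    sums[act] += loss
    total += loss

print(total / T)   # settles near C["a"] = 0.3 despite the early deception
```

Once "b" turns bad, its running average quickly rises above 0.3 and the agent abandons it except for the vanishing exploration turns, so the deception only costs a negligible fraction of the lifetime loss.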
We could be interested not just in learnability, but in what regret bounds are achievable given a time discount rate γ or time horizon T. For this, we would need to find the optimal rate to decrease the frequency of exploratory steps. Vanessa investigated this in a yet-unpublished paper and got similar regret bounds to the ones we have for classical bandits. However, for this post, we don't inquire further than learnability, as the focus of the post is on how much we can prove behavioral guarantees for learning policies in general environments, and my current understanding is that the exact regret bounds don't matter for the examples this post will examine.
Infra-bandits were an easy-to-understand example of infra-Bayesian learnability, but Vanessa also proved a more general result:
Theorem 22. A hypothesis class H made of finite-state, communicating crisp infra-POMDP laws is learnable.
The definition of communicating is the same as before: for any two states, the agent has a policy which ensures that he will get from one state to the other with probability 1, no matter what Murphy is doing (within the bounds of the law).
Infra-Bayes-optimal policies
Before we move on to examples and counterexamples, we define a few more concepts analogous to the ones in classical theory.
Definition 23 (Bayesian regret in infra-Bayesianism).
Given a prior distribution ζ on the hypothesis class H, the Bayesian regret of a policy π is
EΘ∼ζR(π,Θ,Lγ). The analog of Lemma 12 is true in the infra-Bayesian setting too, with the exact same proof.
Also, analogously with the classical setting, we can look at the policy with the lowest Bayesian regret:
Definition 24 (Infra-Bayes-optimal policy).
Given a hypothesis class H made of laws and given a prior distribution ζ over H, the policy π is said to be infra-Bayes-optimal if it minimizes
EΘ∼ζEΘ(π)(Lγ). Theorem 14 is also true here, with the exact same proof: an infra-Bayes-optimal policy with a non-dogmatic prior over a learnable hypothesis class H will learn the class H.
Similarly to the classical case, infra-Bayes-optimal policies are more likely to behave reasonably in general environments, but unfortunately it's usually computationally intractable to find them, so we would prefer to prove guarantees about policies that are not required to be optimal, just to have low regret for every law in the class.
Non-realizability and infra-Bayesian learning theory
So, how well does infra-Bayesianism do in general environments? Let's go through the previously listed examples.
The 010101... failure mode
We assume that the agent's hypothesis class consists of infra-bandits.
The agent's loss function is this: if the observation is 1, he gets 10 loss; if the observation is 0, he gets 0 loss. On top of that, if he plays action X, he gets 0 loss, and if he plays action Y, he gets 1 loss.
The environment is oblivious, the observations don't depend on the agent's actions. So we would hope that the agent will learn almost never to play Y, which is a dominated action.
On the other hand, the agent can still have the following policy π, which contains a failure mode:
"Start with action Y, and play action Y as long as the observation history looks like 010101... After the first deviation from this sequence, take action X ever after."
Take an infra-bandit law Θ∈H.
R(π,Θ,Lγ)=EΘ(π)(Lγ)−minπ∗EΘ(π∗)(Lγ), where EΘ(π)(Lγ) is the expected loss if Murphy, knowing the agent's policy, chooses the distribution in every turn, constrained only by the law Θ, in a way that maximizes the agent's overall expected loss throughout his life.
The infra-bandit law Θ assigns to each action a closed convex set of probability distributions on the observations {0,1}. These distributions can be described with one variable p∈[0,1], the probability of observation 1, because the probability of observation 0 is then just 1−p. So a closed convex set of distributions is just an interval [a,b]⊂[0,1] that determines that p must be in [a,b].
If Θ(Y)=[a,b], then however Murphy chooses the distributions in each turn, by the 2n-th turn the 0101... failure mode history's probability will be at most b^n(1−a)^n. If b<99/100, that is, if the maximum probability Murphy can assign to 1 is at most 99/100, then the failure mode history surviving for long is vanishingly unlikely, and the expected regret compared to the optimal policy of always playing X is at most
(1−γ)∑_{k=0}^∞ (99/100)^k (γ^(2k)+γ^(2k+1)) = (1−γ²)/(1−(99/100)γ²), which goes to 0 as γ goes to 1.
And if b>99/100, then Murphy has no reason to try tricking the agent into losing 1 each turn by playing a suboptimal strategy at the price of being generous on every second turn: no, the loss is maximized when Murphy plays 1 with the maximal possible probability in every turn and makes the agent lose 10. So the failure mode in the policy doesn't affect EΘ(π)(Lγ) much, and π's regret will be low under every law in the hypothesis class, even though the policy is stupid.
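As a sanity check on the bound above, a truncation of the sum can be compared against the closed form (a throwaway sketch, not part of the argument):

```python
def regret_partial(gamma, terms=10_000):
    # truncated version of the sum bounding the regret
    return (1 - gamma) * sum(
        (99 / 100) ** k * (gamma ** (2 * k) + gamma ** (2 * k + 1))
        for k in range(terms)
    )

def regret_closed(gamma):
    # the claimed closed form
    return (1 - gamma ** 2) / (1 - (99 / 100) * gamma ** 2)

for gamma in (0.9, 0.99, 0.999):
    print(gamma, regret_partial(gamma), regret_closed(gamma))
# the two columns agree, and both shrink toward 0 as gamma -> 1
```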
The square-free failure mode
If we broaden the hypothesis class to finite-state communicating crisp infra-POMDPs, then there will be a simple law that predicts the 010101... sequence, so the agent can't have a failure mode on that sequence anymore. But no matter how broad a hypothesis class we take, there will be sequences that are not directly encoded in any of the hypotheses, and the agent might be placed in an environment which produces such a sequence. This is the core of the non-realizability problem.
Then, the agent will be unprepared to respond to this sequence: according to all its hypotheses, this sequence can only occur if Murphy has very strong control over which bits appear when. But then why wouldn't Murphy just use that control to make as many observations 1 as possible, instead of using his power to trick the agent into making suboptimal moves that are relatively harmless compared to observing more 1s?
Still, this is an improvement compared to performance guarantees provided by classical learning theory. For example, let's assume that the agent's hypothesis class is made of finite-state communicating crisp infra-POMDP laws, and it encounters the sequence that's 1 on square-free indices and 0 otherwise. This was our previous example that a classical learning agent with a hypothesis class made of finite-state POMDPs couldn't handle.
Fortunately, there is a law Θ that describes this sequence in large part: "Every bit whose index is divisible by 4 or by 9 must be 0, and Murphy is unconstrained on the other bits." This is a law that can easily be formalized as a finite-state communicating crisp infra-POMDP. If Θ is in the infra-Bayesian agent's hypothesis class, it is suddenly not allowed to have a failure mode where it plays action Y if it sees the sequence of square-free numbers: if the agent had this failure mode, then under law Θ, it would be worth it for Murphy to produce the square-free sequence, thus tricking the agent into playing Y and losing 1 in every turn. Murphy's other option would be to play observation 1 on every index not divisible by 4 or 9, but this wouldn't be much better for him than playing 1 only on the square-free indices: the difference is just a 1−6/π²−1/4−1/9+1/36≈0.059 portion of the bits, so the forgone average loss of about 0.59 is smaller than the 1 per turn Murphy gains by tricking the agent.
(Fun fact: the portion of square-free numbers from 1 to N goes to 6/π² as N goes to infinity.)
This means that if the agent has law Θ in its hypothesis class, then it must be prepared for this, which means it cannot have this exploitable failure mode.
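The density bookkeeping in this argument is easy to verify numerically (a throwaway sketch; the variable names are mine):

```python
N = 100_000

# Sieve out the numbers divisible by a square.
is_squarefree = [True] * (N + 1)
d = 2
while d * d <= N:
    for m in range(d * d, N + 1, d * d):
        is_squarefree[m] = False
    d += 1

sf_density = sum(is_squarefree[1:]) / N
allowed_density = sum(n % 4 != 0 and n % 9 != 0 for n in range(1, N + 1)) / N

print(sf_density)                     # ~ 6/pi^2 ~ 0.608
print(allowed_density)                # ~ 1 - 1/4 - 1/9 + 1/36 ~ 0.667
print(allowed_density - sf_density)   # ~ 0.059: Murphy's forgone 1-bits
```

So producing the square-free sequence only costs Murphy about 0.059 of the 1-bits he is allowed under Θ, which is why tricking the agent would be worth it if the failure mode existed.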
How far does this logic go? If I understand correctly, this is the main hope of infra-Bayesian learning: that we can get a relatively narrow hypothesis class of laws that fits into the agent's head, but which is still broad enough that every environment we can expect the agent to encounter (what does that even mean?) is approximated well enough by one of the laws Θ that the agent can't have a failure mode on the sequence produced by the environment, because otherwise it would have non-negligible regret with respect to Θ (if there were a failure mode for the sequence, it would be worth it for Murphy, constrained by Θ, to produce the sequence).
Whether this is the case in real life (again, what does this really mean?) and whether we can formalize this further is an open question, but this is a promising direction.
Infra-Bayes-optimal policies might also be deceived by Morse-codes
Just as in the classical case, imagine that the agent has a contrived prior that gives high probability to laws of the form "Murphy is required to advertise in Morse code that he will give lots of observations 0 after the 10^n-th term if I now play Y for a while, and then Murphy is required to keep his promise", but there are no other laws in the hypothesis class that include constraints requiring Murphy to play Morse code (or these laws just have very low prior).
Then, after observing the Morse-codes, the agent will think "If Murphy is not required to do this, then the only way he can create such a sequence is by having strong control over the bits, but then why wouldn't he use that to just show a lot of 1s? The only way this makes sense is under the hypothesis that Murphy is required to do this, in which case I should play Y, because then Murphy will be required to play a long string of 0s in the future."
If we define a Solomonoff-style simplicity prior over infra-Bayesian laws, this more natural prior might get rid of this problem. But it might get rid of the problem in classical learning theory too (in fact, we know that it can predict the even bits), and I feel that the two problems are actually very similar, although the reasoning in the previous section might give us more hope in the infra-Bayesian case.
A day in the life of an IB agent
The original intent of infra-Bayesian learning was that the agent doesn't need to learn every detail about its full environment to understand a simple law of nature, like that the hot stove burns his hands, so he should stop touching it.
Instead, what we got is this weird decision mechanism:
"I learned that the hot stove always burns my hand. Today it's raining outside and I just stubbed my toe, so everything in the environment that I don't have a model about is very bad, which matches my expectations that an evil deity controls the universe."
"Oh weird, the rain just stopped. Aha, I see, this is just the evil deity trying to confuse me and trick me into touching the stove. But I'm not falling for this!"
"Oh, I just found a shiny coin in my pocket. This is really unexpected: the sunshine and the coin together are worth more to me than not burning my hand, so this can't just be for tricking me. I don't understand why an evil god would let such good things happen to a person, and I'm really confused and don't have a plan for this unlikely situation. I might as well celebrate by touching the hot stove."
This story is somewhat unfair, because this problem only occurs if enough good things happen to the agent that are not explained by any of the laws in its hypothesis class: as we have seen with the square-free example, if the agent knows a law that can model most of the good things happening, then the agent will just conclude that Murphy is mostly constrained by this law, and the little additional goodness not explained by the law is just for tricking it into touching the stove, which dissolves the agent's confusion and saves it from burning its hand. However, it's still unclear what would be a good hypothesis class of laws that can explain every good event with a close enough approximation.
A proposed solution
The best idea I've heard for getting a simple hypothesis class that solves this problem is a class of laws of this form: "The hot stove always burns my hand anytime I touch it. The non-stove-related events in the world cause me on average at most c loss." Having laws like this for a lot of different values of c seems like it could solve the problem: if unmodeled things are not going the worst possible way, then the agent doesn't get confused, but concludes that the law requires Murphy to constrain his evilness in general.
Unfortunately, I don't believe there is such a simple way out. How would we define in a law that "I get on average at most c loss"? Average over which time frame?
We can indeed make laws specifying concrete, finite time frames, like "In every 100 turns, there must be at least 35 positive observations". But then the agent might just get into an environment in which longer and longer strings of 0s and 1s follow each other, which falsifies all these laws, in which case we are back to the point where nothing holds the agent back from touching the stove.
The other possibility would be a law saying that throughout the agent's whole life, the overall loss must be at most c, calculated with γ time discount rate.
Unfortunately, you can't define a law like that. Laws are not supposed to include γ, which is a moving variable; having it encoded in the hypotheses would break the definition of learnability.
Let's take an example of what could happen if we allowed laws including γ:
"The law is that a stove can only burn you at most 1/(2(1−γ)) times; after that, it will turn things to gold."
How is an agent with γ time discount (which roughly corresponds to a lifespan of 1/(1−γ) turns) supposed to test this hypothesis? It can only do so by spending half its life touching hot stoves. If the hypothesis is false, this is a pretty bad deal. On the other hand, if it is right, then not trying it would mean missing out on giant heaps of gold.
If the law specified 10000 instead of 1/(2(1−γ)) turns, then as γ goes to 1, the hypothesis would become testable, which could make a hypothesis class containing it learnable. But if we allowed incorporating γ into the hypotheses themselves, learnability would break down.
I don't see a third way a law of "average c loss" could be implemented, so while I still think it's an interesting idea to potentially be used later, it doesn't solve the problem in itself.
Conclusion
Infra-Bayesianism is better equipped than classical learning theory to provide provable performance guarantees in general environments, but it still probably needs a very broad hypothesis class of very detailed laws that describe the world in close enough detail that the agent doesn't get confused when good things happen, and provably avoids failure modes in those cases. I don't have a clear idea of exactly how all-encompassing these laws would need to be, but this is at least some improvement over classical learning theory, which only functions well if the environment's full model is included in the hypothesis class. I think the natural next step would be to describe better what kind of hypothesis class of laws is detailed enough to give good performance guarantees, but still realistic to fit into the head of an agent smaller than the environment. I feel skeptical about getting meaningful results on this question, as I believe the word "realistic" hides a lot here, and my suspicion is that our theorizing is not really useful for tackling that. But this topic belongs to my other post.
Appendix: Technical proofs
Theorem 4.
If every finite subset of the countable hypothesis class H is learnable, then H itself is learnable.
Proof. As H is countable, we can order its elements. Let Hn denote the set of the first n elements of H. As every Hn is a learnable hypothesis class, there exists a corresponding family of policies {πγn} for every n.
Let f:(0,1)→Z+ be the function such that f(γ) is defined as the biggest number n such that
ER(πγn,e,Lγ)<1/n for every e∈Hn.
If no such n exists, let f(γ)=1. If all positive integers n satisfy the condition, let f(γ)=⌊1/(1−γ)⌋.
We prove by contradiction that f(γ) goes to infinity as γ goes to 1. Assume this is not the case: there is an M and a sequence of values γi→1 such that f(γi)<M. This would mean that for all i, there exists an ei∈HM such that ER(πγiM,ei,Lγi)≥1/M. As there are only finitely many environments in HM, there must be one that appears infinitely many times in the ei sequence. Then this environment e would contradict the definition of HM's learnability, as
limγ→1ER(πγM,e,Lγ)=0 would not hold. Contradiction.
We can now construct a family of policies {πγ} that learns the whole class H:
πγ=πγf(γ). For every environment eM∈H and ε>0, there exists a point r∈(0,1) such that f(γ)>max{M,1/ε} for γ>r. This means that for γ>r, we get
ER(πγf(γ),eM,Lγ)<ε, because eM∈Hf(γ) and 1/f(γ)<ε.
Thus, we proved for all e∈H that
limγ→1ER(πγ,e,Lγ)=0.◻
Theorem 12.
Given a countable hypothesis class H and a non-dogmatic prior ζ on H (non-dogmatic meaning that ζ(h)>0 for all h∈H), and a family of policies πγ, then
limγ→1Ee∼ζER(πγ,e,Lγ)=0 is equivalent to
limγ→1ER(πγ,e,Lγ)=0 ∀e∈H, that is, the class H being learnable and {πγ} being a good learning algorithm.
Proof. Suppose that
limγ→1Ee∼ζER(πγ,e,Lγ)=0. For any hypothesis e∈H,
ER(πγ,e,Lγ)≤(1/ζ(e))·Ee∼ζER(πγ,e,Lγ), so as the latter goes to 0, the former goes to 0 too, because we had the condition that ζ(e)>0.
Now let's suppose that
limγ→1ER(πγ,e,Lγ)=0 ∀e∈H. As the loss acquirable in one turn is bounded by b, the overall loss with discount rate γ is at most (1−γ)∑_{k=0}^∞ bγ^k=b. This means that for every policy and environment, R(πγ,e,Lγ)≤b.
Let H={e1,e2,…}. Because ∑_{i=1}^∞ ζ(ei)=1, for every ε>0 there exists an N such that ∑_{i=N+1}^∞ ζ(ei)<ε/(2b).
For the first N (finitely many!) hypotheses, there exists a δ such that if γ>1−δ, then
ER(πγ,ei,Lγ)<ε/2 ∀1≤i≤N. So for any ε>0 there exists a δ such that if γ>1−δ, then
Ee∼ζER(πγ,e,Lγ)=∑_{i=1}^N ζ(ei)ER(πγ,ei,Lγ)+∑_{i=N+1}^∞ ζ(ei)ER(πγ,ei,Lγ)≤ε/2+(ε/(2b))·b=ε. With this we proved that
limγ→1Ee∼ζER(πγ,e,Lγ)=0.◻
Theorem 14.
Given a learnable, countable hypothesis class H and a non-dogmatic prior ζ on H, if we take the Bayes-optimal policy πγ for every γ∈(0,1), then this family of policies learns the hypothesis class H.
Proof. Because H is learnable, there exists a family of policies {πγ∗} such that
limγ→1ER(πγ∗,e,Lγ)=0 ∀e∈H. By Theorem 12, this is equivalent to
limγ→1Ee∼ζER(πγ∗,e,Lγ)=0. πγ is the Bayes-optimal policy for every γ∈(0,1) given the prior ζ; in particular, it accumulates at most as much loss in expectation as πγ∗, and hence at most as much Bayesian regret. Thus,
limγ→1Ee∼ζER(πγ,e,Lγ)=0. By Theorem 12, this is equivalent to
limγ→1ER(πγ,e,Lγ)=0∀e∈H.Thus, the Bayes-optimal family of policies {πγ} learns the hypothesis class H. ◻
See "Reinforcement learning in POMDPs without resets" (Even-Dar et al. 2005), Theorem 4.1.
A normalized version of Solomonoff, to be more precise.