Review

Introduction

I wrote this post during my SERI MATS scholarship, explaining the part of infra-Bayesianism that I understood the most deeply. Compared to some previous distillations, I don't go deeply into the mathematical machinery built around infra-Bayesianism, instead I try to evaluate what results infra-Bayesianism produced compared to classical learning theory in the topic of performance guarantees in general environments, which was the main motivation behind the development of infra-Bayesianism. 

The theorems appearing in the post are not original to me, and can be found in either the broader academic literature (concerning classical learning theory) or Vanessa's writings (concerning infra-Bayesianism). I included proofs in the text when I found them useful for general understanding, and moved them to the Appendix when I thought them simple and important enough to write down, but not interesting enough to break the flow of text with them. The later examples investigating the limits of infra-Bayesian learning are my own.

My conclusion is that the general performance guarantees we can currently prove about infra-Bayesian learning agents are pretty weak, but it seems plausible that the theory gets us closer to tackling the question.

For my general thoughts and doubts about infra-Bayesianism, see my other, less technical post: A mostly critical review of infra-Bayesianism.

Classical reinforcement learning

Before we go into the infra-Bayesian explainer, or before we could even really state what the motivating problem of non-realizability is, it's important to get at least a superficial overview of classical reinforcement learning theory that is studied in academia since a long time. My main source for this was the book Bandit Algorithms by Lattimore and Szepesvári.

In a simple model of reinforcement learning, the agent is placed in an environment, and on every turn, the agent chooses an action from its finite action set , and after that, the environment gives an observation from the finite observation set . There is a bounded loss function  that is known to the agent in advance, assigning real-valued losses to finite histories. Usually, we will work with  loss functions. This choice is motivated by game theory, where the learning process would be a repeated game, and in each turn, the loss is determined by the agent's action and the opponent1s action (the observation). 

This definition of loss function is already pretty general, because for any loss function that can only take finitely many values, we can encode the loss in the observation itself, which gives the loss function the desired    form. It's important that we required the loss function to be bounded, otherwise we could encounter inconvenient St. Petersburg paradox-style scenarios.

The agent has to take repeated actions either until a time horizon  or indefinitely. If the the agent's lifespan is infinite, there needs to be a way to compare overall results instead of just saying "there is infinite loss accumulated anyway", so we introduce a time discount rate . We will concentrate on this infinite, time discounted version.

Notation (). For a space , the notation  refers to the space of probability distributions over .

An environment is   that assigns a probability distribution over observations to every finite history ending with the agent taking an action.

The agent has a policy    that assigns a probability distribution over actions to every finite history ending with an observation. 

A policy is good if it ensures that the the agent is successful in a range of environments.

Formally, the agent has a hypothesis class  made of environments ( is usually defined to be finite or countably infinite). It wants to use a policy  such that for every environment , the policy  has low regret with respect to environment .

Definiton 1 (Expected regret). 

The expected regret of policy  in environment  with respect to loss function  is defined as the expected loss under policy  minus the minimum expected loss over all policies  in environment  with respect to loss function 

Here  denotes the overall loss with  discount rate:

Definiton 2 (). 

where  is the agent's loss in turn . 

Intuitively, an agent with a good policy takes some time to explore the environment, then as the agent becomes reasonably certain which environment he is in, it starts exploiting the environment, although potentially still taking exploratory steps occasionally.

As the discount rate  goes to , the agent has more and more time to freely explore its environment, because  rounds of exploration is still not too costly compared to the overall utility, so hopefully it will be true that whichever environment  the agent is placed in, it will have time to decide which environment he is in, with a very low probability of mistake, then start to play an almost-optimal strategy for that environment, thus ensuring the regret to be low. If this holds, we call the hypothesis class  learnable. Formally,

Definiton 3 (Learnability). 

A hypothesis class  is said to be learnable, if there exists a family of policies , such that for every environment ,

It's important to note that not every hypothesis class is learnable. For example, take a hypothesis class that contains these two environments: In environment , if the agents makes move  in the first turn, it goes to Heaven (receives minimal loss in every turn then on), otherwise it goes to Hell (receives maximal loss in every turn then on). Environment  is just the reverse. Then it's impossible to construct a policy that has low regret with respect to both environments  and .

This is a classical example of a trap: an action that has irrevocable consequences. Generally speaking, neither classical nor infra-Bayesian learning theory has a good model of what agents are supposed to do in an environment that might contain traps. This is a big open question that seriously limits our mathematical formalism of real-world agency (as the real world is full of traps, most notably death). But this is an independent concern that infra-Bayesianism doesn't try to address, so we are only looking at learnable hypothesis classes now.

Fortunately, some natural hypothesis classes turn out to be learnable, as we will discuss later. The heuristics is that if no mistake committed in a short period of time is too costly, according to any environment in the class, then the class will be learnable.

Before we look at examples, it's useful to establish the following simple theorem:

Theorem 4.

If every finite subset of the countable hypothesis class  is learnable, then  itself is learnable. 

Now we can look at examples of learnable hypothesis classes:

Definiton 5 (Bandit environment). $\$

A bandit is an environment in which, in every turn, the probability distribution of the observation only depends on the action in that turn. A bandit environment can be characterized by a function  

The "bandit" name comes from the illustration that the agent is placed in a casino with various slot machines, aka "one-armed bandits" and the action is choosing the slot machine, and each machine has a probability distribution according to which the agent gets an observation (the amount of money falling out of the machine).

Theorem 7. A class  of bandit-environments is always learnable. 

Proof. We will only prove the statement only for countable class  but it's not hard to expand to proof further. By Theorem 4, it's enough to prove for finite classes to have a proof for every countable class. 

For a finite hypothesis class    it's enough to just always play the action that produced the lowest loss on average in the past, and occasionally do exploratory moves and try the other actions too, but with a frequency decreasing towards .

As , and the agent lives longer in expectation, the action producing the smallest loss in expectation will become the one with the smallest average loss, due to the law of large numbers. Then, the agent will mostly play that optimal action, so it's average regret goes to . ◻ 

We could be interested not just in learnability, but in what regret bounds are achievable given a time discount rate  or time horizon . For this, we would need to find the optimal rate to decrease the frequency of exploratory steps. There is a nice answer to this question in the literature, but we don't go into details here, as the exact regret bounds are not crucial for the main topic of this post.

A broader hypothesis class we will use is finite-state communicating POMDPs.

(We could also look at Markov Decision Processes (MDPs) as an intermediate step, but we are ignoring them for this post.)

Definiton 8 (POMDP). 

A Partially Observed Markov Decision Process consists of a set of hidden states , a stochastic transition function  (where  is the set of the agent's possible actions), and an observation function . 

Definiton 9 (Finite state communicating POMDP).

A POMDP in which  is finite and has the property that for every , an agent can have a policy  such that starting from state  and following policy , it eventually gets to state  with probability . 

Theorem 10. A hypothesis class made of finite-state communicating POMDPs is always learnable.[1] 

There are some learnable hypothesis classes even broader than this, but they are enough for our purpose now, as many systems can be modelled as finite state communicating POMDPs: the world can be in a number of ways, you can't see everything in the world, but have some limited observations and corresponding rewards, and you can make some reversible changes in the world.

Bayes-optimal policies

Instead of ensuring that the regret is low with respect to every individual hypothesis in the class, the agent can aim for the weighted average of regrets being low.

Definiton 11 (Bayesian regret). Given a prior distribution  on the hypothesis class , the Bayesian regret of a policy  is

Theorem 12.

Given a countable hypothesis class  and a non-dogmatic prior  on  (non-dogmatic meaning that  for all ), and a family of policies , then

is equivalent to

that is, the class  being learnable and  being a good learning algorithm. 

This result might suggest that Bayesian regret doesn't give us much value over considering regret with respect to individual hypotheses, but we will see that it has important application.

Most importantly, now that (given a prior distribution), we can assign a single quantity to the policy's performance, we can look at the policy that minimizes the Bayesian regret:

Definiton 13 (Bayes-optimal policy).

Given a prior distribution  over the hypothesis class , the policy  is said to be Bayes-optimal if it minimizes the Bayesian regret. This is equivalent with the policy minimizing

Using the previous result, we can prove the following, rather unsurprising statement:

Theorem 14.

Given a learnable, countable hypothesis class  and a non-dogmatic prior  on , if we take the Bayes-optimal policy  for every , then this family of policies learn the hypothesis class . 

As we will see later, it's often easier to prove nice properties about Bayes-optimal policies than just policies with low regret bound in general.

However, while there are some famous algorithms in the literature that ensure low regret bounds (like UCB, UCRL, Exp3), it's usually computationally intractable to find the Bayes-optimal policy. In general, it's more reasonable to expect that a real-life algorithm will achieve some low regret bounds than that it will be literally the best, Bayes-optimal policy.

Thus, both in the academic literature and in our work, we usually want to prove more general theorems in the form "Every policy that has low regret bounds with respect to this hypothesis class, has that property" instead of the much narrower claim of "The Bayes-optimal policy has that property". Still, when we can't prove something stronger, we sometimes fall back to investigating the Bayes-optimal policy.

Non-realizability and classical learning theory

The big problem of classical learning theory is that we have no guarantees that a policy selected for having low regret with respect to the environments in its hypothesis class, will also behave reasonably when placed in an environment that is not contained in its hypothesis class. (An environment not in the hypothesis class is called non-realizable.)

Unfortunately, however broad hypothesis class the agent has, we can't expect it to have a model of itself or an equally complex agent encoded in any of its hypotheses about the world. This is true for both practical reasons (it can't contain in its head a model of the world bigger than itself) and for theoretical reasons (if the model could have a model including itself, it could have a policy of doing the opposite of what its model predicts, which would lead to an obvious paradox). Also, even excluding these situations, the world is just very big, and it's unrealistic to assume that an AI could have the "true state of the world" fully modeled as one of its hypotheses. So it would be nice if we had some guarantees about the agent's performance in environment that are not directly encoded as one of its hypotheses.

A naive hope would be that maybe it's not a big problem, and if an algorithm learns a broad enough hypothesis class, then it necessarily generalizes to do reasonably well even outside the class. Let's look at some concrete examples that it's not the case.

The  failure mode

The agent's loss function is this: any time it plays action , it gets  loss, any time it plays action , it gets  loss, regardless of the observations. The environment is oblivious, so the observations don't depend on the agent's actions. Obviously, the right choice would be to always play .

Suppose the agent's hypothesis class consists only of bandits. Then, surprisingly, this learning algorithm will guarantee extremely low expected regret with respect to every environment in the hypothesis class:

"Start with action , and and play action  as long as the observation history looks like  After the first deviation from this sequence, take action  ever after."

As the pattern  is very unlikely to continue for long under any bandit environment, this policy has very low expected regret. In particular, the bandit with respect to which this policy has the highest expected regret (if  goes to ) is the one in which, after an action , the observations  and  both have . In that environment, the expected regret is

which goes to  as  goes to .

It's good to note that the failure mode doesn't need to be just one specific history, it can leave some wiggle room, like the following policy:

"Always play  except if in the history so far the difference between the number of s and s is at most  (where  is the length of your history so far) and there were at most  instances when two observations coming after each other were equal. In that case, play ."

The probability of observing a long sequence like that is still vanishingly unlikely in any bandit environment, so the policy is allowed to have this broader failure mode too. This means that if the agent is placed in an environment that mostly produces a  sequence, then postulating a little random noise in the environment doesn't save the agent from predictable failure.

I somewhat feel that this is a contrived counter-example: why would and agent have this specific failure mode? I will need to look into Selection theorems in the future, according to my understanding, that agenda is about figuring out which algorithms are likely to selected during a real-life training, and formalize the intuition of "Come on, why would we get an algorithm that has this specific failure mode?" But this is not our agenda, we are looking for provable guarantees, and this example already refutes the naive hope that low regret inside the hypothesis class would always lead to good performance outside the class.

The square-free failure mode

Sure, the  environment can fool an agent that only prepares for bandit environments in its hypothesis class. But maybe if the hypothesis class is broad enough, then a policy that learns the class is really guaranteed to behave reasonably outside the class too! In particular, there is a simple finite-state communicating POMDP that produces the  sequence of observations, so if the agent's hypothesis class is the set of finite-state communicating POMDPs, then a policy that learns it does well in the  environment.

Unfortunately, just broadening the hypothesis class a little doesn't solve the problem: the policy can always have a failure mode, a specific sequence of observations that is very unlikely to happen according to any of the hypotheses in the class, but might easily be encountered in the wild. In the case of a hypothesis class made of finite-state communicating POMDPs, all hypotheses suggest that there is a vanishingly low probability of observing the sequence which is  for square-free indices and  otherwise, so the policy having a failure mode for that sequence doesn't cause much expected regret.

It's important to emphasize that the general problem is not "Well, maybe the agent will encounter a specific random string of observations in the wild and it happens to exactly match its failure mode." No, if the failure mode string was truly random and unexceptional, then there wouldn't be a significant risk of the true environment producing it either. The real problem is that the failure mode can also be a highly non-random string that could be in principle be very predictable from the real environment (like the square-free numbers can be a naturally occurring sequence), but as the agent doesn't have the true environment in its hypothesis class, it can easily happen that a string which is very natural in the true environment is considered very unlikely according to all the agent's hypotheses, therefore the agent is allowed to have it as a failure mode.

Bayes-optimal policies

Let's be less ambitious then: If a policy doesn't just have low regret on a hypothesis class, but it's the single best policy given a prior (it's Bayes-optimal), then is it enough to guarantee good performance outside the class?

Sometimes, sort of! Let's assume that for now that we only care about oblivious environments, that is, we can assume that the agent's observations are not influenced at all by its actions. Let's also assume that the agent knows this: its hypothesis class includes only oblivious environments. Further, suppose that there is at least one environment with non-zero prior in the hypothesis class according to which every finite sequence of observations has at least a tiny chance of happening.  (For example the environment in which observations are independently random with uniform distribution over . ) In that case, a Bayes-optimal policy never performs dominated actions: That is, if there are actions  and  and  for all possible observations , then the agent never takes the action  that is always worse. This is easy to prove, as changing an action from  to  in any situation only decreases the expected loss, the Bayes-optimal policy can never play .

This is not a very impressive result in itself, especially that even if the agent is actually placed in an oblivious environment, if its hypothesis class contains non-oblivious environments too, then it's suddenly not clear whether the agent will learn not to play dominated actions.

Bayes-optimal policy being deceived by Morse-codes

Let the agent's loss function is this: any time it plays action , it gets  loss, any time it plays action , it gets  loss, and in addition to that, any time the observation is , it gets  loss, and any time the observation is , it gets  loss. The environment is oblivious, so the observations don't depend on the agent's actions. Obviously, the right choice would be to always play .

However, the agent doesn't know for certain that the environment is oblivious, it has some prior probability on hypotheses according to which action  makes observation  more likely, in which case it could be better to play .

So the agent does some experimentation, but doesn't find any statistical correspondence between its actions and the observations (as the environment really is oblivious). But before it could pivot to just playing action , it realizes that the observations so far are repetitions of this Morse-code: "Your actions didn't matter so far. But if you play action  until the th turn, then you will be rewarded with  turns of observation !"

The agent dutifully plays  until the th turn. Then the promised reward of long strings of s doesn't arrive, instead a new Morse-code starts repeating: "Sorry, I lied last time, but I'm telling the truth now, promise! If you play  until the th turn, you will be rewarded with  observations !" And so on.

Will the agent fall for that? Depends on the prior. It can have a prior according to which, for every , a hypothesis of the form "I get lots of Morse-code warning signs until turn , and if I comply, I get the promised reward of  " has relatively high prior probability, while the Morse-code appearing has low probability according to all other hypotheses. In that case, seeing the Morse-code can make the agent comply again and again. 

After some time, the promise of future reward becomes less appealing because the time discount rate. When exactly? The agent needs to take an action now at time  that causes  loss now, but will lead to  reward at time  if the "Morse-code tells the truth" hypothesis is true, which can have probability at least  given the right prior, because every other hypothesis is very unlikely after observing the repetition of the Morse-code a few times. The agent takes the loss now if  in other words 

For  close to   is approximately . So the agent will take the bait until approximately turn  which is the  ratio of its expected lifespan, so falling for the trap this long causes a non-negligible regret.

We already knew that if the hypothesis class contains traps (possibilities that a finite string of actions have indefinite, irrevocable consequences), then the agent can fail to learn the hypothesis class. But note that it's not the case here: none of the considered hypotheses are traps, as none of them assumes irrevocable consequences of actions, a particular hypothesis just posits that some actions have consequences that last  turns in the future, that's not really a trap, as , it would become negligible.

So the hypothesis class  itself is learnable, but the environment  that repeatedly produces false Morse-codes is not in , so we have no guarantee that a Bayes-optimal policy of  would work well in , and indeed, for certain priors over , the Bayes-optimal policy will play the suboptimal action  for most of its life when placed in environment .

Can Solomonoff predict the even bits?

It seems that the previous example needed a contrived prior. Can similar anomalies occur with a more natural, Solomonoff prior? Intuitively it feels less likely: the hypotheses predicting Morse-codes warning of true consequences shouldn't have too big prior probabilities, and more importantly, it shouldn't have a bigger prior probability then them conveying exactly the wrong message: maybe you will get a reward of long string of s if you do the opposite of the message and play !

Lattimore, Hutter and Gavane investigate a closely related question in their 2011 paper Universal Prediction of Selected Bits. They take a simpler case than the one we looked at: The agent already knows that the environment is oblivious, it's a simple sequence prediction task. The agent starts with a Solomonoff prior over computable binary sequences, then it updates on the bits observed so far, and based on this, makes a prediction for the next bit in every step.

The sequence it observes has  on every even bit, however the odd bits follow an arbitrary pattern that's not even necessarily computable. Still, any human would figure out the simple rule about the even bits, so we would hope that Solomonoff induction, which is supposed to be the Idealized Sequence Predictor, can also do that.

Let  be the probability the agent gives for bit  being  after it observed all the previous bits. Is it true, that for all sequences in which every even bit is ,

I could have imagined the answer being "No": the even bits are always , while the odd bits always contain the Morse-code like "The bit  will be an exception!", always changing the prediction to a bigger  immediately after the last prediction failed. Maybe something like this could deceive the Solomonoff induction to occasionally, on these prophesied  bits, predict low probability for , in which case the limit wouldn't be .

Long story short, it follows as a special case from Theorem 10 of the paper that Solomonoff[2] can't be fooled like that, and it does in fact always give close to  probability of  on the even bits, given long enough time.

Okay, and what happens if the even bits are not deterministically , but have value  with probability , and have value  with probability , independently of each other and of the odd bits? Will  converge to  with probability  for any given sequence on the odd bits?

The answer is that we don't know, this is a special case of Open Question 1 from the paper. My personal conjecture is yes, but it's probably hard to prove and remains a a gap in our knowledge about Solomonoff induction's performance in non-realizable settings.

Would an expected utility maximizer using Solomonoff induction also learn to always play  in the previous example? The example requires the agent to have non-oblivious environments in its hypothesis class, but there still shouldn't be traps according to any of its hypothesis, otherwise the class wouldn't be learnable. So we should restrict the Solomonoff prior to a set of interactive Turing machines that don't contain traps, for the question to make sense at all. I haven't thought deeply about what would be natural choices for such sets, and I remain genuinely unsure whether a Solomonoff based expected utility maximizer would learn to always play  in every oblivious environment. Hutter, Lattimore and Gavena's similar result indicates yes, but the possibility of large future rewards might make a big difference between interactive expected utility maximization and simple sequence prediction.

Infra-Bayesian learning theory

The creation of infra-Bayesianism was motivated by these examples. When we are constructing a universal model of learning agents, it shouldn't be a hard theorem that it is able to learn that every second bit is . This should be obvious! After all, that's how humans learn about the world: we don't have a full probability distribution over how the world can be like to the atomic level, but we do learn a few regularities about our environment, like every time we touch the hot stove, it burns our hand. After we learn this law about the world, we might still have huge uncertainties about the rest of the universe, but we still stop touching that damn stove.

Motivated by this intuition, the hypothesis class of an IB learning agent is not made of environments, but laws. (Important note: laws were named belief functions in the original Infra-Bayesianism sequence, but were later renamed.) A law describes some regularities of the environment, while the agent can remain in Knightian uncertainty about everything else. For simplicity, we will focus on so-called crisp infra-POMDP laws, which is a general enough definition for our purposes now.

First, a quick notation:

Notation 15 (). 

For a space , the notation  denotes the space of closed convex subsets of  (the space of probability distributions over ). 

Definiton 16 (Crisp infra-POMDP laws).

A crisp infra-POMDP law  is specified by a space of states , an initial state , an observation function  and a transition kernel . The law is, that in every turn, the distribution of the next state must be in the closed convex set  where  is the current state and  is the action of the agent that was taken just now. The agent can't observe which state it's in, but after every turn, it observes  where  is the current state at the end of the turn. 

From now on, whenever we write "law" we mean "crisp infra-POMDP law", at least in this post.

Given a law , the agent tries to minimize its loss in the worst case scenario allowed by . We personify the "worst case scenario" as the malevolent deity, Murphy. We assume that Murphy knows the policy  of our agent, and in every turn, chooses a distribution from  in a way that maximizes the expected loss of the agent over its lifespan, given the agent's policy.

Definiton 17 (Expected loss with respect to a law).

Given an infra-POMDP law ,

where  is a counter-policy of Murphy to the the agent's  policy: Murphy can decide given the history so far and the state they are in, which distribution inside  should the next state be sampled from. ( denotes the set of finite histories). 

So given a law , our agents wants to choose the policy that minimizes this worst-case 

How does learning happen?

Analogously with the classical reinforcement learning agent, the infra-Bayesian RL agent also has a hypothesis class, but not made of environments but of laws. It aims to have a policy that has low infra-Bayesian regret with respect to every law in its hypothesis class.

Definiton 18 (Regret in infra-Bayesianism).

A policy 's regret with respect to a law , given the loss function  is

Analogously with the classical case, we can also define learnability:

Definiton 19 (Infra-Bayesian learnability). 

A hypothesis class  is said to be learnable, if there exists family of policies , such that for every law ,

The analog of Theorem 4 applies to infra-Bayesian hypothesis classes too: if every finite subset of a countable  is learnable, then  itself is learnable. The proof is exactly the same.

Again, not every hypothesis class is learnable, because every environment is a law, so the previous counter-example with the two environments sending the agent to Heaven or Hell based on its first move is a good counter-example here too. But again, many natural hypothesis classes turn out to be learnable. And compared to classical learning theory, we can have more hope that a policy that has low regret with respect to some infra-Bayesian hypotheses, will behave reasonably in every environment. After all, if it's prepared to do well in a world governed by a malevolent deity, then a real environment shouldn't be worse.

To understand these concepts better, let's restrict our attention to a special type of laws: crisp infra-bandit laws.

Definiton 20 (Infra-bandit). 

A crisp infra-bandit is a function . 

An example of crisp infra-bandit is: "If the agent takes action , then the probability of of observation  in that turn is between  and . If the agent takes action , then the probability of observation  and  are equal in that turn." We can check that in this example,  and  are both closed convex sets of distributions.

Moreover, it's easy to see that every infra-bandit law is an infra-POMDP: Take an  such that  is a bijection between  and  and let  for all  and .

Given an infra-bandit law, the agent needs to prepare for the worst case scenario, that is, playing against Murphy, who is only constrained by the infra-bandit law.

After our agent plays action , Murphy can choose a distribution from  and then an observation is sampled from that distribution. The agent wants to choose a policy that keeps its loss low when playing against Murphy.

In the example above, if observation  implies a huge loss compared to  and , then the agent should take action , because then Murphy will be constrained to pick a distribution in which the probability of  is at most . On the other hand, if the agent took action , then Murphy would choose the distribution in which observation  has probability .

It's important to note that Murphy sees the full policy of the agent, so Murphy is not required to always shortsightedly choose the distribution that will cause the biggest expected loss in that particular turn: knowing the agent's learning algorithm, Murphy can choose not to cause maximal loss for some turns if it provokes the agent to learn a wrong lesson and start acting in a way such that Murphy will be able to cause him larger harm in the future. So the agent needs to prepare a policy that's not susceptible to this kind of deception.

Theorem 21. A hypothesis class  made of infra-bandits is always learnable. 

Proof. Again, we only prove it for countable hypothesis classes, and because of the infra-Bayesian version of Theorem 4, it's equivalent to proving the learnability of finite .

To have the regret bound going to  as  goes to , it's enough to just always play the action that had the lowest average loss so far, and occasionally explore all the other other actions, with frequency decreasing to  but never disappearing. Call this policy .

Given an infra-bandit  the best policy would be to always play the action  for which 

 is minimal, that is, Murhpy can cause the smallest loss given this action. If we call  the policy of always playing action , then for every , the IB expected loss is

On the other hand, for any , if the agent follows policy  and Murphy really is constrained by law , then as , the agent will observe with probability approaching  that its average loss when playing action  is at most . So whatever trick Murphy is playing, the action the agent is playing on its non-exploratory turns will always have an average loss so far at most . If it was higher than that, the agent would just switch to playing action . This means that as , the agent's average loss will be at most , so

This means that that

for every , so the hypothesis class is learnable. ◻ 

We could be interested not just in learnability, but in what regret bounds are achievable given a time discount rate  or time horizon . For this, we would need to find the optimal rate to decrease the frequency of exploratory steps. Vanessa investigated this in a yet-unpublished paper and got similar regret bounds to the ones we have for classical bandits. However, for this post, we don't inquire further than learnability, as the focus of the post is on how much we can prove behavioral guarantees for learning policies in general environments, and my current understanding is that the exact regret bounds don't matter for the examples this post will examine.

Infra-bandits was an easy-to-understand example for infra-Bayesian learnability, but Vanessa also proved a more general result:

Theorem 22. A hypothesis class  made of finite-state, communicating crisp infra-POMDP laws is learnable. 

The definition of communicating is the same as before: for any two states, the agent has a policy which ensures that he will get from one state to the other with probability , no matter what Murphy is doing (within the bounds of the law).

Infra-Bayes-optimal policies

Before we move on to examples and counterexamples, we define a few more concepts analogous to the ones in classical theory.

Definiton 23 (Bayesian regret in infra-Bayesianism). 

Given a prior distribution  on the hypothesis class , the Bayesian regret of a policy  is

The analog of Lemma 12 is true in the infra-Bayesian setting too, with the exact same proof.

Also, analogously with the classical setting, we can look at the policy with the lowest Bayesian regret:

Definiton 24 (Infra-Bayes-optimal policy).

Given a hypothesis class  made of laws and given a prior distribution  over , the policy  is said to be infra-Bayes-optimal if it minimizes

Theorem 14 is also true here, with the exact same proof: an infra-Bayes-optimal policy with a non-dogmatic prior over a learnable hypothesis class  will learn the class .

Similarly to the classical case, infra-Bayes-optimal policies are more likely to behave reasonably in general environments, but unfortunately it's usually computationally intractable to find them, so we would prefer to prove guarantees about policies that are not required to be optimal, just to have low regret for every law in the class.

Non-realizability and infra-Bayesian learning theory

So, how well does infra-Bayesianism do in general environments? Let's go through the previously listed examples.

The  failure mode

We assume that the agent's hypothesis class consists of infra-bandits.

The agent's loss function is this: if its opponent plays action , he gets  loss, if the opponent plays action , he gets  loss. On top of that, if he plays action , he gets  loss and if he plays action , he gets  loss.

The environment is oblivious, the observations don't depend on the agent's actions. So we would hope that the agent will learn almost never to play , which is a dominated action.

On the other hand, the agent can still have this the policy  below that contains a failure mode:

"Start with action , and play action  as long as the observation history looks like  After the first deviation from this sequence, take action  ever after."

Take an infra-bandit law .

where  is the expected loss if Murphy, knowing the agent's policy, chooses the distribution in every turn, constrained only by the law , in a way that it maximizes the agent's overall expected loss through his life.

The infra-bandit law  assigns a closed convex set of probability distributions on the observations . These distributions can be described with one variable  the probability of observation , because then the probability of observation  is just . So a closed convex set of distributions is just  interval that determines that must be in 

If  then however Murphy chooses the distributions in each turn, by the th turn, the  failure mode history's probability will be at most . If  that is, at most  is the maximum probability Murphy can assign to , then the failure mode history staying up for long is vanishingly unlikely, and the expected regret compared to the optimal policy of always playing  is at most

which goes to  as  goes to .

And if  then Murphy has no reason to try tricking the agent into losing  each turn by playing a suboptimal strategy at the price of Murphy being generous on every second turn: no, the loss is maximized when Murphy plays  with the maximal possible probability at every turn and makes the agent lose . So the failure mode in the policy doesn't affect  much, so 's regret will be low under every law in the hypothesis class, even though it's stupid.

The square-free failure mode

If we broaden the hypothesis class to finite-state communicating crisp infra-POMDPs, then there will be a simple law that predicts the  sequence, so the agent can't have a failure mode on that sequence anymore. But no matter how broad hypothesis class we take, there will be sequences that are not directly encoded in any of the hypotheses, and the agent might be placed in an environment which produces that sequence. This is the core of the non-realizability problem.

Then, the agent will be unprepared to respond to this sequence: according to all its hypotheses, this sequence can only occur if Murphy has a very strong control over which bits should appear when. But then why wouldn't Murphy just use his control to make as many observations s as possible, instead of using its power to trick the agent into making suboptimal moves that are relatively harmless compared to observing more s?

Still, this is an improvement compared to performance guarantees provided by classical learning theory. For example, let's assume that the agent's hypothesis class is made of finite-state communicating crisp infra-POMDP laws, and it encounters the sequence that's   on square-free indices and  otherwise. This was our previous example that a classical learning agent with a hypothesis class made of finite-state POMDPs couldn't handle. 

Fortunately, there is a law   that describes this sequence in a large part: "Every fourth and ninth bit must be , and Murphy is unconstrained on the other bits." This is a law that can be easily formalized as a finite-state communicating crisp infra-POMDP. If  is in the infra-Bayesian agent's hypothesis class, it is suddenly not allowed to have a failure mode where it plays action  if it sees the sequence of square-free numbers: if the agent had this failure mode, then under law  it would be worth it for Murphy to produce the square-free sequence, thus tricking the agent to play   and lose 1 in every turn. The other option of Murphy would be to just play observation  on every index not divisible by  or  but this wouldn't be much better for him than playing  only on the square-free indices, the difference is just the  portion of bits,  so the average loss  is smaller than the one Murphy can get by tricking the agent. 

(Fun fact: the portion of square-free numbers from  to   goes to  as  goes to infinity.)

This means that if the agent has law  in its hypothesis class, then it must be prepared for this, which means it cannot have this exploitable failure mode.

How far does this logic go? If I understand correctly, this is the main hope of infra-Bayesian learning: that we can get a relatively narrow hypothesis class of laws that fits into the agent's head, but which is still broad enough that every environment we can expect the agent to encounter (what does that even mean?) is approximated well enough by one of the laws that the agent can't have a failure mode in the sequence produced by the environment, because otherwise it would have non-negligible regret with respect to  (because if there was a failure mode for the sequence, it would be worth it to produce the sequence for Murphy constrained by .)

Whether this is the case in real life (again, what does this really mean?) and whether we can formalize this further is an open question, but this is a promising direction.

Infra-Bayes-optimal policies might also be deceived by Morse-codes

Just as in the classical case, imagine that the agent has a contrived prior that gives high probability to laws in the form "Murphy is required to advertise in Morse-code that he will give lots of observations  after the th term if I now play  for a while, and then Murphy is required to keep its promise" but there aren't other laws in the hypothesis class that include constraints on Murphy that would require him to play Morse-codes (or these laws just have very low prior).

Then, after observing the Morse-codes, the agent will think "If Murphy is not required to do this, then the only way he can create such a sequence is by having strong control over the bits, but then why wouldn't he use that to just show a lot of s? The only way this makes sense is under the hypothesis that Murphy is required to do this, in which case I should play , because then Murphy will be required to play a long string of s in the future." 

If we define a Solomonoff-style simplicity prior over infra-Bayesian laws, this more natural prior might get rid of this problem. But it might get rid of the problem in classical learning theory too (in fact, we know that it can predict the even bits), and I feel that the two problems are actually very similar, although the reasoning in the previous section might give us more hope in the infra-Bayesian case.

A day in the life of an IB agent

The original intent of infra-Bayesian learning was that the agent doesn't need to learn every detail about its full environment to understand a simple law of nature, like that the hot stove burns his hands, so he should stop touching it.

Instead, what we got is this weird decision mechanism:

"I learned that the hot stove always burns my hand. Today it's raining outside and I just stubbed my toe, so everything in the environment that I don't have a model about is very bad, which matches my expectations that an evil deity controls the universe."

"Oh weird, the rain just stopped. Aha, I see, this is just the evil deity trying to confuse me and trick me into touching the stove. But I'm not falling for this!"

"Oh, I just found a shiny coin in my pocket. This is really unexpected, as the sunshine and the coin together are worth more for me than not burning my hand, so it can't just be for tricking me. I don't understand why and evil god would let so good things happen to a person, and I'm really confused and don't have a plan for this unlikely situation. I might as well celebrate by touching the hot stove."

This story is somewhat unfair, because this problem only occurs if enough good things happen to the agent that are not explained by any of the laws in it hypothesis class: as we have seen with the square-free example, if the agent knows a law that can model most of the good things happening, then the agent will just conclude that Murphy is mostly constrained by this law, and the little additional goodness not explained by the law is just for tricking it to touch the stove, which dissolves the agent's confusion, and saves it from burning its hand. However it's still unclear what would be a good hypothesis class of laws that can explain every good event with close enough approximation.

A proposed solution

The best idea I heard for getting a simple hypothesis class for solving this problem is a class of laws in this form: "The hot stove always burns my hand anytime I touch it. The not stove-related event in the world on average cause at most loss  to me." Having laws like this for a lot a of different values of  seems like it could solve the problem: if unmodeled things are not going the worst possible way, then the agent doesn't get confused, but thinks that the law requires Murphy to constrain its evilness in general.

Unfortunately, I don't believe there is such a simple way out. How would we define in a law that "I get on average at most  loss"? Average on which time frame?

We can indeed make laws specifying concrete, finite time-frames, like "In every  turn, there must be at least  positive observations". But then the agent might just get into an environment in which longer and longer strings of s and s follow each other, which falsifies all these laws, in which case we are back to the point where nothing is holding back the agent from touching the stove.

The other possibility would be a law saying that throughout the agent's whole life, the overall loss must be at most , calculated with  time discount rate.

Unfortunately, you can't define a law like that. Laws are not supposed to include , which is a moving variable, it being encoded in the hypotheses would break the definition of learnability.

Let's take an example what could happen if we allowed laws including :

"The law is that a stove can only burn you at most  times, after that it will turn things to gold."

How can an agent with  time discount (this is similar to having a lifespan of  turns) supposed to test this hypothesis? It can only do so by spending half its life touching hot stoves. If the hypothesis was false, this is a pretty bad deal. On the other hand, if it was right, then not trying it would mean missing out on giant heaps of gold.

If the law specified  instead of  turns, then as  goes to , the hypothesis would be testable, which could make the hypothesis class containing it learnable. But if we allowed incorporating  in the hypotheses themselves, learnability breaks down.

I don't see a third way how a law of "average  loss" could be implemented, so while I still think it's an interesting idea to be potentially used later, it doesn't solve the problem in itself.

Conclusion

Infra-Bayesianism is better equipped than classical learning theory in providing provable performance guarantees in general environments, but it still probably needs a very broad hypothesis class of very detailed laws that can describe the world in close-enough detail that the agent doesn't get confused when good things are happening, and provably avoids failure modes in those cases. I don't have a clear idea of exactly how all-encompassing these laws would need to be, but it's at least some improvement over classical learning theory that only functions well if the environment's full model is included in the hypothesis class. I think the natural next step would be to describe better what kind of hypothesis class of laws is detailed enough the give good performance guarantees, but still realistic to fit into the head of an agent smaller than the environment. I feel skeptical about getting meaningful results to this question, as I believe that the word "realistic" hides a lot here and my suspicion is that our theorizing is not really useful to tackle that. But this topic belongs to my other post

Appendix: Technical proofs

Theorem 4.

If every finite subset of the countable hypothesis class  is learnable, then  itself is learnable. 

Proof. As  is countable, we can order its elements. Let  denote the set first  elements of . As every  is a learnable hypothesis class, there exists a corresponding  family of policies for all .

Let  be the function such that  is defined as the biggest number  such that 

 for every .

If no such  exists, then let . If all positive integers  satisfy the condition, that let 

We prove by contradiction that  goes to infinity as  goes to . Assume that it's not the case: there is an  such that there is a sequence of values  such that . This would mean that for all , there exists an , such that . As there are only finitely many environments in , there must be one that that appears infinitely many times in the  sequence. Then this environment  would contradict the definition of 's learnablility, as

would not hold. Contradiction.

We can now construct a family of policies  that learns the whole class :

For every environment  and , there exists a point  such that  for . This means that for , we get 

because  and 

Thus, we proved for all  that

 ◻ 

Theorem 12.

Given a countable hypothesis class  and a non-dogmatic prior  on  (non-dogmatic meaning that  for all ), and a family of policies , then

is equivalent to

that is, the class  being learnable and  being a good learning algorithm. 

Proof. Suppose that

For any hypothesis ,

so as the latter goes to , the former goes to  too, because we had the condition that 

Now let's suppose that

As the loss acquirable in one turn is bounded by , the overall loss with discount rate  is at most  This means that for every policy and environment, .

Let  Because , for every  there exists an  such that if 

For the first  (finitely many!) hypotheses there exists a  that if , then

So for any  there exists a  such that if , then

With this we proved that

 ◻

Theorem 14.

Given a learnable, countable hypothesis class  and a non-dogmatic prior  on , if we take the Bayes-optimal policy  for every , then this family of policies learn the hypothesis class . :::

Proof. Because  is learnable, there exists a family of policies  such that

By Theorem 12, this is equivalent to

 is the Bayes-optimal policy for every , given the prior , in particular, it accumulates at most as much loss in expectation as . Thus,

By Theorem 12, this is equivalent to

Thus, the Bayes-optimal family of policies  learns the hypothesis class . ◻ 

  1. ^

    See "Reinforcement learning in POMDPs without resets" (Even-dar at al 2005), theorem 4.1.

  2. ^

    A normalized version of Solomonoff, to be more precise.

New Comment
4 comments, sorted by Click to highlight new comments since:

Three remarks:

  • "I might as well celebrate by touching the hot stove" is a misinterpretation of what happens. The reason the agent might touch the hot stove is because it's exploring, i.e. investigating hypotheses according to which touching the hot stove is sometimes good. This is obviously necessary if you want to have learnability. The "hot stove" is just an example chosen to predispose us to interpret the behavior as incorrect. But, for example, scientists playing with chemicals might have also looked stupid to other people in the past, and yet it let to discovering chemistry and creating fertilizers, medications, plastics etc.
  • The existence of a prior that's simultaneously rich and tractable seems to me like a natural assumption. Mathematically, the bounded Solomonoff prior "almost" works: it is "only" intractable because [1] i.e. for some very complicated reason that might easily fail in some variation, and in a universe with oracles it just works. And, emprically we know that simple algorithms (deep learning) can have powerful cross-domain abilities. It seems very likely there is a mathematical reason for it, which probably takes the form of such a prior.
  • Another way to effectively consider a larger space of hypotheses is Turing RL, where you're essentially representing hypotheses symbolically: something that people certainly seem to use quite effectively.

  1. Roughly speaking, more precisely it seems to be related to average-case complexity separations (e.g. one-way functions). ↩︎

I still think that the hot stove example is a real problem, although maybe unavoidable. My example starts with "I learned that the hot stove always burns my hand." This is not the exploration part anymore, the agent already observed the stove burning its hand many times. Normally, this would be enough to never touch the hot stove again, but if some unexplained nice things happen in the outside world, there is suddenly no guarantee that the IB agent doesn't start touching the stove again. Maybe this is unavoidable, but I maintain it's a weird behavior pattern that the outside weather might make you lose the guarantee to not touch the stove you already learned is burning. I think it's somewhat different than the truly unavoidable thing, that every agent needs to do some exploration and it sometimes leads to them burning their hand.

Well, it's true that the IB regret bound doesn't imply not touching the hot stove, but. When I think of a natural example of an algorithm satisfying an IB regret bound, it is something like UCB: every once in a while it chooses a hypotheses which seems plausible given previous observations and follows its optimal policy. There is no reason such an algorithm would touch the hot stove, unless there is some plausible hypothesis according to which it is beneficial to touch the hot stove... assuming the optimal policy it follows is "reasonable": see definition of "robustly optimal policy" below.

The interesting question is, can we come up with a natural formal desideratum strictly stronger than IB regret bounds that would rule out touching the hot stone. Some ideas:

  • Let's call a policy "robustly optimal" for an ultra-POMDP if it is the limit of policies optimal for this ultra-POMDP with added noise (compare to the notion of "trembling game equilibrium").
  • Require that your learning algorithm converges to the robustly optimal policy for the true hypothesis. Converging to the exact optimal policy is a desideratum that is studied in the RL theory literature (under the name "sample complexity").
  • Alternatively, require that your learning algorithm converges to the robustly optimal policy for a hypothesis close to the true hypothesis under some metric (with the distance going to 0).
  • Notice that this desideratum implies an upper regret bound, so it is strictly stronger.
  • Conjecture: the robustly Bayes-optimal policy has the property above, or at least has it whenever it is possible to have it.

Actually, this desideratum makes no sense. Because, if you have two ultra-POMDPs and s.t. is a refinenement of , then they can be true hypotheses simultaneously but in general there is no policy optimal for both (much less robustly optimal). We could require that, if is a true hypothesis then the algorithm converges to a robustly optimal policy for some refinement of . This doesn't imply low regret, but we can make it an independent requirement. However, I'm not sure how useful that is. Let's look at a simple example.

There are two observations and two actions . Seeing observation incurs a loss of and seeing observation incurs a loss of . Taking action incurs a loss of and taking action incurs a loss of . These losses add up straightforwardly. For the maximal ignorance hypothesis , "always take action " is the only robustly optimal policy. However, "take action unless you observed , in which case take action " is also optimal (for sufficiently large). Worse, is robustly optimal for some refinements of : maybe taking action causes you to get observation instead of .

Instead of , we can consider the law which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all sequences in ). Now is the sole robust policy of any refinement of ! However, to express this as an ultra-POMDP we need to introduce the notion of "Murphy observations" (i.e. Murphy doesn't see the state directly, just like the agent). I'm not sure about the consistency of robustness desideratum in this case.

Intuitively, I'm guessing it is consistent if we allow randomization that is dogmatically unobservable by Murphy but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios by some more complicated mechanism). One possible interpretation is: Maybe sufficiently advanced agents are supposed to be superrational even in population games. If so, we cannot (and should not) require them to avoid dominated strategies.

More generally, a tentative takeaway is: Maybe the IB agent has to sometimes touch the hot stove, because there is no such thing as "I learned that the hot stove always burns my hand". For a sufficiently rich prior, it is always possible there is some way in which touching the stove beneficially affects your environment (either causally or acausally) so you need to keep exploring. Unless you already have a perfect model of your environment: but in this case you won't touch the stove when pleasantly surprised, because there are no surprises!