So to sum up so far, the basic idea is to shoot for a specific expected value of something by stochastically combining policies that have expected values above and below the target. The policies to be combined should be picked from some "mostly safe" distribution rather being whatever policies are closest to the specific target, because the absolute closest policies might involve inner optimization for exactly that target, when we really want "do something reasonable that gets close to the target."
And the "aspiration updating" thing is a way to track which policy you think you're shooting for, in a way that you're hoping generalizes decently to cases where you have limited planning ability?
Exactly! Thanks for providing this concise summary in your words.
In the next post we generalize the target from a single point to an interval to get even more freedom that we can use for increasing safety further.
In our current ongoing work, we generalize that further to the case of multiple evaluation metrics, in order to get closer to plausible real-world goals, see our teaser post.
Summary. In this post, we present the formal framework we adopt during the sequence, and the simplest form of the type of aspiration-based algorithms we study. We do this for a simple form of aspiration-type goals: making the expectation of some variable equal to some given target value. The algorithm is based on the idea of propagating aspirations along time, and we prove that the algorithm gives a performance guarantee if the goal is feasible. Later posts discuss safety criteria, other types of goals, and variants of the basic algorithm.
Assumptions
In line with the working hypotheses stated in the previous post, we assume more specifically the following in this post:
First challenge: guaranteeing the fulfillment of expectation-type goals
The challenge in this post is to design a decision algorithm for tasks where the agent's goal is to make the expected (!) Total equal (!) a certain value E∈R which we call the aspiration value. [2] This is a crucial difference from a "satisficing" approach that would aim to make expected Total at least as large as E and would thus still be happy to maximize Total. Later we consider other types of tasks, both less restrictive ones (including those related to satisficing) and more specific ones that also care about other aspects of the resulting distribution of Total or states.
It turns out that we can guarantee the fulfillment of this type of goal under some weak conditions!
Notice that a special case of such expectation-type goals is making sure that the probability of reaching a certain set of acceptable terminal states equals a certain value, because we can simply assume that each such acceptable terminal state gives 1 Delta and all others give zero Delta. We will come back to that special case in the next post when discussing aspiration intervals.
Example: Shopping for apples
Possible generalizations (can be skipped safely)
In later posts, we will generalize the above assumptions in the following ways:
Notation
We focus on a single episode for a specific task.
Environment. We assume the agent's interaction with the environment consists of an alternating sequence of discrete observations and actions. As usual, we formalize this by assuming that after each observation, the agent chooses an action a and then "calls" a (potentially stochastic) function step(a) provided by the environment that returns the next observation.[3]
World model. The episode's world model, M, is a finite, acyclic MDP. The model's discrete time counter, t, advances whenever the agent makes an observation. From the sequence of observations made until time t, the world model constructs a representation of the state of the world, which we call the model state or simply the state and denote by st, and which also contains information about the time index t itself.[4] The set of all possible (model) states is a finite set S. A subset S⊤ of states is considered terminal. If the state is terminal, the episode ends without further action or Delta, which represents the fact that the agent becomes inactive until given a new task. Otherwise, the agent chooses an action at∈A. The world model predicts the consequences of each possible action by providing a probability distribution PM(st+1|st,at) for the successor state st+1. It also predicts the expected Delta for the task-relevant evaluation metric as a function of the state, action, and successor state: Eδt+1=Eδ(st,at,st+1).
The following two graphs depict all this. Entities occurring in the world model are blue, those in the real world red, green is what we want to design, and dotted things are unknown to the agent:
We hide the process of constructing the next state from the previous state, action, and next observation by simply assuming that the agent can call a version of the function step that is given the current state and action and returns the successor state constructed from the next observation: st+1←step(st,at).
Goal. The goal is given by an aspiration value E(s0)∈R. The task is to choose actions so that the expected Total,
Eτ=E(s0,a0,s1,a1,…)∑tEδ(st,at,st+1),equals E(s0).
Auxiliary notation for interval arithmetic. We will use the following abbreviations:
(interpolation between x and z),
(relative position of y in interval [x,z], with the convention that 00=12),
("clipping'' y to interval [x,z]).
Sequential decisions based on propagated aspirations
Main idea
Our agent will achieve the goal by
For both aspiration propagation and decision making, the agent uses some auxiliary quantities that it computes upfront at the beginning of the episode from the world model as follows.
Feasibility intervals
Similar to what is done in optimal control theory, the agent computes[5] the V- and Q-functions of the hypothetical policy that would maximize expected Total, here denoted ¯¯¯¯V and ¯¯¯¯Q, by solving the respective Bellman equations
¯¯¯¯V(s)=maxa∈A¯¯¯¯Q(s,a),¯¯¯¯Q(s,a)=Es′∼PM(⋅|s,a)(Eδ(s,a,s′)+¯¯¯¯V(s′)),with ¯¯¯¯V(s)=0 for terminal states s∈S⊤. It also computes the analogous quantities for the hypothetical policy that would minimize expected Total, denoted V–– and Q––:
V––(s)=mina∈AQ––(s,a),Q––(s,a)=Es′∼PM(⋅|s,a)(Eδ(s,a,s′)+V––(s′)),with V––(s)=0 for terminal states s∈S⊤. These define the state's and action's feasibility intervals,
V(s)=[V––(s),¯¯¯¯V(s)],Q(s,a)=[Q––(s,a),¯¯¯¯Q(s,a)].(F)The eventual use of these intervals will be to rescale aspirations from step to step. Before we come to that, however, we can already prove a first easy fact about goals of the type "make sure that expected Total equals a certain value":
Lemma: Trivial guarantee
If the world model predicts state transitions correctly, then there is a decision algorithm that fulfills the goal Eτ=E(s0) if and only if the episode's starting aspiration E(s0) is in the initial state's feasibility interval V(s0).
Proof. The values ¯¯¯¯V(s0) and V––(s0) are, by definition, the expected Total of the maximizing resp. minimizing policy, and hence it is clear that there cannot be a policy which attains E(s0) in expectation if E(s0) is larger than ¯¯¯¯V(s0) or smaller than V––(s0).
Conversely, assuming that E(s0) lies inside the interval V(s0), the following procedure fulfills the goal:
This fulfills the goal, since the correctness of the model implies that, when using ¯¯¯π or π––, we actually get an expected Total of ¯¯¯¯V(s0) resp. V––(s0). □
Of course, using this "either maximize or minimize the evaluation metric" approach would be catastrophic for safety. For example, if we tasked an agent with restoring Earth's climate to a pre-industrial state, using as our evaluation metric the global mean temperature, this decision algorithm might randomize, with carefully chosen probability, between causing an ice age and inducing a runaway greenhouse effect! This is very different from what we want, which is something roughly similar to pre-industrial climate.
Another trivial idea is to randomize in each time step t between the action with the largest ¯¯¯¯Q(st,a) and the one with the smallest Q––(st,a), using a fixed probability p′ resp. 1−p′. Since expected Total is a continuous function of p′ which varies between V––(s0)and ¯¯¯¯V(s0), by the Intermediate Value Theorem there exists some value of p′ for which this algorithm gives the correct expected Total; however, it is unclear how to compute the right p′ in practice.
If the episode consists of many time steps, this method might not lead to extreme values of the Total, but it would still make the agent take an extreme action in each time step. Intuition also suggests that the agent's behavior would be less predictable and fluctuate more than in the first version, where it consistently maximizes or minimizes after the initial randomization, and that this is undesirable.
So let us study more intelligent ways to guarantee that Eτ=E(s0).
Decision Algorithm 1: steadfast action-aspirations, rescaled state-aspirations
In order to avoid extreme actions, our actual decision algorithm chooses "suitable" intermediate actions which it expects to allow it to fulfill the goal in expectation. When in state s, it does so by
More precisely: First compute (or learn) the functions V––,¯¯¯¯V,Q––, and ¯¯¯¯Q. Then, given state s and a feasible state-aspiration E(s)∈V(s),
and the successor state's state-aspiration E(s′)=V––(s′):λ:¯¯¯¯V(s′)∈V(s′).
If we add the state- and action aspirations as entities to the diagram of Fig. 1, we get this:
Example: Shopping for apples, revisited with math
We return to the apple-shopping scenario mentioned above, which we model by the following simple state-action diagram:
Our agent starts at home (state s) and wishes to obtain a certain number of apples, which are available at a market (state m). It can either walk to the market (action a), which will certainly succeed, or take public transportation (action b), which gives a 2/3 chance of arriving successfully at the market and a 1/3 chance of not reaching the market before it closes and returning home empty-handed. Of course, the agent can also decide to take the null action (action c) and simply stay home the entire day doing nothing.
Once it reaches the market m, the agent can buy either one or two packs of three apples (actions m1 and m2, respectively) before returning home at the end of the day (state t).
To apply Algorithm 1, we first compute the Q- and V-functions for the maximizing and minimizing policies. Since there are no possible cycles in this environment, straightforwardly unrolling the recursive definitions from the back ("backward induction") yields:
(The asterisks* mark which actions give the V values)
Suppose that the agent is in the initial state s and has the aspiration E(s)=2.5.
Suppose we arbitrarily choose actions c (do nothing) and a (walk to the market).
Suppose now that we started with the same initial aspiration E(s)=2.5, but instead chose action b as our over-achieving action in step 2. In this case, algorithm execution would go as follows:
If we ended up in state m, our new state-aspiration is then E(m)=V––(m):λ:¯¯¯¯V(m)=3.75; if we ended up in state t, the state-aspiration is E(t)=0.
These examples demonstrate two cases:
There is one last case, where both feasibility intervals contain E(s); this is the case, for example, if we choose E(s)=3.5 in the above environment. Execution then proceeds as follows:
Now that we have an idea of how the decision algorithm works, it is time to prove its correctness.
Theorem: Algorithm 1 fulfills the goal
If the world model predicts state transitions correctly and the episode-aspiration E(s0) is in the initial state's feasibility interval, E(s0)∈V(s0), then decision algorithm 1 fulfills the goal Eτ=E(s0).
Proof.
First, let us observe that algorithm 1 preserves feasibility: if we start from state s0 with state-aspiration E(s0)∈V(s0), then for all states s and actions a visited, we will have E(s)∈V(s) and E(s,a)∈Q(s,a).
This statement is easily seen to be true for action-aspirations, as they are required to be feasible by definition in step 1, and correctness for state-aspirations follows from the definition of E(s′) in step 6.
Let us now denote by Vπ1(s,e) the expected Total obtained by algorithm 1 starting from state s with state-aspiration E(s)=e, and likewise by Qπ1(s,a,e) the expected Total obtained by starting at step 5 in algorithm 1 with action-aspiration E(s,a)=e.
Since the environment is assumed to be acyclic and finite[6], we can straightforwardly prove the following claims by backwards induction:
We start with claim 1. The core reason why this is true is that, for non-terminal states s, we chose the right p in step 3 of the algorithm:Vπ1(s,E(s))=(1−p)⋅Qπ1(s,a−,E(s,a−))+p⋅Qπ1(s,a+,E(s,a+))=(1−p)⋅E(s,a−)+p⋅E(s,a+)by induction hypothesis=E(s,a−):p:E(s,a+)=E(s)because p=E(s,a−)∖E(s)∖E(s,a+).
Claim 1 also serves as the base case for our induction: if s is a terminal state, then Q(s) is an interval made up of a single point, and in this case claim 1 is trivially true.
Claim 2 requires that the translation between action-aspirations, chosen before the world's reaction is observed, and subsequent state-aspirations, preserves expected Total. The core reason why this works is the linearity of the rescaling operations in step 6:
Qπ1(s,a,E(s,a))=∑s′∈SPM(s′∣s,a)(Eδ(s,a,s′)+Vπ1(s′,E(s′)))assuming correct worldmodel=∑s′PM(s′∣s,a)⋅(Eδ(s,a,s′)+E(s′))by induction hypothesis=∑s′PM(s′∣s,a)⋅(Eδ(s,a,s′)+V––(s′):λ:¯¯¯¯V(s′))by definition in step 6=∑s′PM(s′∣s,a)⋅((Eδ(s,a,s′)+V––(s′)):λ:(Eδ(s,a,s′)¯¯¯¯V(s′)))=(∑s′PM(s′∣s,a)⋅(Eδ(s,a,s′)+V––(s′))):λ:(∑s′PM(s′∣s,a)⋅(Eδ(s,a,s′)¯¯¯¯V(s′)))=Q––(s,a):λ:¯¯¯¯Q(s,a)using the Bellman equation for Q=E(s,a)because λ=Q––(s,a)∖E(s,a)∖¯¯¯¯Q(s,a) by definition
This concludes the correctness proof of algorithm 1. □
Notes
Outlook
The interesting question now is what criteria the agent should use to pick the two candidate actions a−,a+ in step 2. We might use this freedom to choose actions in a way that increases safety, e.g., by choosing randomly as a form a "soft optimization" or by incorporating safety criteria like limiting impact. We'll explore these ideas in the next post in this sequence.
When we extend our approach later to incorporate performance and safety criteria, we might also have to assume further functions, such as the expected squared Delta (to be able to estimate the variance of Total), or some transition-internal world trajectory entropy (to be able to estimate total world trajectory entropy), etc.
Note that this type of goal is also implicitly assumed in the alternative Decision Transformer approach, where a transformer network is asked to predict an action leading to a prompted expected Total.
If the agent is a part of a more complex system of collaborating agents (e.g., a hierarchy of subsystems), the "action" might consist in specifying a subtask for another agent, that other agent would be modeled as part of the "environment" here, and the observation returned by step might be what that other agent reports back at the end of its performing that subtask.
It is important to note that st might be an incomplete description of the true environment state, which we denote xt but rarely refer to here.
Note that we're in a model-based planning context, so it directly computes the values recursively using the world model, rather than trying to learn it from acting in the real or simulated environment using some form of reinforcement learning.
These assumptions may of course be generalized, at the cost of some hassle in verifying that expectations are well-defined, to allow cycles or infinite time horizons, but the general idea of the proof remains the same.
E.g., assume the agent can buy zero or one apple per day, has two days time, and the aspiration is one apple. On the first day, the state's feasibility interval is [0,2], the action of buying zero apples has feasibility interval [0,1] and would thus get action-aspiration 0.5, and the action of buying one apple has feasibility interval [1,2] and would thus get action-aspiration 1.5. To get 1 on average, the agent would thus toss a coin. So far, this is fine, but on the next day the aspiration would not be 0 or 1 in order to land at 1 apple exactly, but would be 0.5. So on the second day, the agent would again toss a coin. Altogether, it would get 0 or 1 or 2 apples, with probabilities 1/4, 1/2, and 1/4, rather than getting 1 apple for sure.