> Suppose that vases are never knocked over in the human-generated training data (since the human operators know that we don’t like broken vases). Then, regardless of the objective function we are using, a generative model trained on this data isn’t likely to knock over vases (since vase-toppling actions are very off-distribution for the training data).
But it will still have the problems of modeling off-distribution poorly, and going off-distribution. Once it accidentally moves too near the vase, which the humans avoid doing, it may go wild and spaz out. (As is the usual problem for behavior cloning and imitation learning in general.)
> Novel behaviors may take a long time to become common. For example, suppose an OGM agent discovers a deceptive strategy which gets very high reward. We shouldn’t necessarily expect this agent to start frequently employing deception; at first such behavior will still look off-distribution, and it might take many more iterations for such behavior to start looking normal to the generative model. Thus novel behaviors may appear indecisively, giving humans a chance to intervene on undesirable ones.
I disagree. This isn't a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high variance updates to finally change behavior; it's model-based RL: the whole point is that it's learning a model of the environment.
Thus, theoretically, a single instance is enough to update its model of the environment, which can then flip its strategy to the new one. (This is in fact one of the standard experimental psychology approaches for running RL experiments on rodents to examine model-based vs model-free learning: if you do something like switch the reward location in a T-maze, does the mouse update after the first time it finds the reward in the new location such that it goes to the new location thereafter, demonstrating model-based reasoning in that it updated its internal model of the maze and did planning of the optimal strategy to get to the reward leading to the new maze-running behavior, or does it have to keep going to the old location for a while as the grip of the outdated model-free habituation slowly fades away?)
Empirically, the bigger the model, the more it is doing implicit planning (see my earlier comments on this with regard to MuZero and Jones etc), and the more it is capable of things which are also equivalent to planning. To be concrete, think inner-monologue and adaptive computation. There's no reason a Gato-esque scaled-up DT couldn't be using inner-monologue tricks to take a moment out to plan, similar to how Socratic models use their LMs to 'think out loud' a plan which they then execute. It would make total sense for a recurrent model to run a few timesteps with dummy inputs to 'think about the prompt' and do some quick meta-learning a la Dactyl, or for an old-style GPT model to print out text thinking to itself 'what should I do? This time I will try X'.
For that matter, an OGM agent wouldn't have to experience the transition itself; you could simply talk to it and tell it about the unobserved states, thanks to all that linguistic prowess it is learning from the generative training: harmless if you tell it simply "by the way, did you know there's an easter egg in Atari Adventure? If you go to the room XYZ...", not so harmless if it's about the real world or vulnerabilities like Log4j. Or it could be generalizing from data you don't realize is related at all but turns out to help transfer learning or capabilities.
> It might be easy to tune the rate of behavioral shift for OGM agents, which would allow us to more tightly control the rate at which new capabilities appear.
The smarter it is, and the better the environment models and capabilities, the more transfer it'll get, and the faster the 'rate' gets potentially.
> OGM agents may explore their action spaces more predictably than other RL agents, since they explore by trying variations on human-like behavior (this consideration might also apply to other methods that involve pre-training on human demonstrations).
Yeah, that's possible, but I don't think you necessarily get that out of the box. Online Decision Transformer or Gato certainly don't explore in a human-like way, any more than other imitation learning paradigms do right now. (As you note, ODT just does a fairly normal bit of policy-based exploration, which is better than epsilon-random but still far short of anything one could describe as a good exploration strategy, much less human-like; nor does either ODT or Gato do as impressively when learning online/finetuning as one would expect if they really were exploring well by default.) They still need a smarter way to explore, like ensembles to express uncertainty.
An interesting question is whether large DTs would eventually learn human exploration, the way they learn so many other things as they scale up. Can they meta-learn exploration appropriately outside of toy POMDP environments deliberately designed to elicit such adaptive behavior? The large datasets in question would presumably contain a lot of human exploration; if we think about Internet scrapes, a lot of it is humans asking questions or criticizing or writing essays thinking out loud, which is linguistically encoding intellectual exploration.
From a DT perspective, I'd speculate that when used in the obvious way of conditioning on a very high reward on a specific task which is not a POMDP, the agents with logged data like that are not themselves exploring but are exploiting their knowledge, and so it 'should' avoid any exploration and simply argmax its way through that episode; eg if it was asked to play Go, there is no uncertainty about the rules, and it should do its best to play as well as it can like an expert player, regardless of uncertainty. If Go were a game which was POMDP like somehow, and expert 'POMDP-Go' players expertly balance off exploration & exploitation within the episode, then it would within-episode explore as best as it had learned how to by imitating those experts, but it wouldn't 'meta-explore' to nail down its understanding of 'POMDP-Go'. So it would be limited to accidental exploration from its errors in understanding the MDP or POMDPs in question.
Could we add additional metadata like 'slow learner' or 'fast learner' to whole corpuses of datasets from agents learning various tasks? I don't see why not. Then you could add that to the prompt along with the target reward: 'low reward, fast learner'. What trajectory would be most likely with that prompt? Well, one which staggers around doing poorly but like a bright beginner, screwing around and exploring a lot... Do some trajectories like that, increment the reward, and keep going in a bootstrap?
> is it possible for a transformer to be a mesa-optimizer?
Why wouldn't it?
> Pick some percentile of previously-observed rewards (e.g. 95th percentile) and condition on getting that reward. For the OGM agent, as the distribution of previously-observed rewards shifts upwards, appropriately adjust the target reward.
Why quantilize at a specific percentile? Relative returns sounds like a more useful target.
---
> But it will still have the problems of modeling off-distribution poorly, and going off-distribution.
Yep, I agree that distributional shift is still an issue here (see counterpoint 1 at the end of the "Safety advantages" section).
---
> Novel behaviors may take a long time to become common [...]
> I disagree. This isn't a model-free or policy model which needs to experience a transition many times before the high reward can begin to slowly bootstrap back through value estimates or overcome high variance updates to finally change behavior; it's model-based RL: the whole point is that it's learning a model of the environment.
> Thus, theoretically, a single instance is enough to update its model of the environment, which can then flip its strategy to the new one.
I think you're wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let's imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it's just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn't yes. Instead, the new policy will probably slightly increase the probabilities of actions which, when performed together, constitute reward hacking. It will be more likely to explore this reward hacking strategy in the future, after which reward hacked episodes make up a greater proportion of the top 5% most highly rewarded episodes, but the transition shouldn't be rapid.
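As a rough numerical restatement of this (under the same assumption that the single hacked episode is the only one of its kind in the 10,000-episode buffer):

$$P(\text{trajectory is the hacked one}\mid \text{reward in top }5\%) \;=\; \frac{1}{0.05\times 10{,}000} \;=\; \frac{1}{500} \;=\; 0.2\%,$$

so a generative model that fits the buffer well should, when conditioned on a top-5% reward, assign on the order of 0.2% probability to hacking-like behavior rather than switching over to it wholesale.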
As a more direct response to what you write in justification of your view: if the way the OGM agent works internally is via planning in some world model, then it shouldn't be planning to get high reward -- it should be planning to exhibit typical behavior conditional on whatever reward it's been conditioned on. This is only a problem once many of the examples of the agent getting the reward it's been conditioned on are examples of the agent behaving badly (this might happen easily when the reward it's conditioned on is sampled proportional to exp(R) as in remark 1.3, but happens less easily when satisficing or quantilizing).
---
Thanks for these considerations on exploration -- I found them interesting.
I agree that human-like exploration isn't guaranteed by default, but I had a (possibly dumb) intuition that this would be the case. Heuristic argument: an OGM agent's exploration is partially driven by the stochasticity of its policy, yes, but it's also driven by imperfections in its model of its (initially human-generated) training data. Concretely, this might mean, e.g., estimating angles slightly differently in Breakout, having small misconceptions about how highly rewarded various actions are, etc. If the OGM agent is competent at the end of its offline phase, then I expect the stochasticity to be less of a big deal, and for the initial exploration to be mainly driven by these imperfections. To us, this might look like the behavior of a human with a slightly different world model than us.
It sounds like you might have examples to suggest this intuition is bogus -- do you mind linking?
I like your idea of labeling episodes with information that could control exploration dynamics! I'll add that to my list of possible ways to tune the rate at which an OGM agent develops new capabilities.
---
> is it possible for a transformer to be a mesa-optimizer?
Why wouldn't it?
Point taken, I'll edit this to "is it likely in practice that a trained transformer will be a mesa-optimizer?"
---
> Why quantilize at a specific percentile? Relative returns sounds like a more useful target.
Thanks! This is exactly what I would prefer (as you might be able to tell from what I wrote above in this comment), but I didn't know how to actually implement it.
---
> I think you're wrong here, at least in the case of an OGM satisficer or quantilizer (and in the more optimizer-y case of remark 1.3, it depends on the reward of the new episode). For concreteness, let's imagine an OGM quantilizer aiming for rewards in the top 5% of previously-observed rewards. Suppose that the generative model has a memory of 10,000 episodes, and it's just explored a reward hacking strategy by chance, which gave it a much higher reward than all previous episodes. It looks back at the last 10,000 episodes (including the reward hacking episode) and performs a gradient update to best model these episodes. Will its new policy consistently employ reward hacking (when conditioned on getting reward in the top 5% of previously-observed rewards)?
> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn't yes.
For safety, 'probably' isn't much of a property. You are counting on it, essentially, having indeed learned the ultra-high-reward strategy but then deliberately self-sabotaging because its reward would be too high. How does it know it's "too good" in an episode and needs to self-sabotage to coast in at the lower reward? It's only just learned about this new hack, after all; there will be a lot of uncertainty about how often it delivers the reward, whether there are any long-term drawbacks, etc. It may need to try as hard as it can just to reach mediocrity. (What if there is a lot of stochasticity in the reward hacking or the states around it, such that the reward hacking strategy has an EV around that of the quantile? What if the reward hacking grants enough control that a quantilizer can bleed itself after seizing complete control, to guarantee a specific final reward, providing a likelihood of 1, rather than a 'normal' strategy which risks coming in too high or too low and thus having a lower likelihood than the hacking, so quantilizing a target score merely triggers power-seeking instrumental drives?) Given enough episodes with reward hacking and enough experience with all the surrounding states, it could learn that the reward hacking is so overpowered a strategy that it needs to nerf itself by never doing reward hacking, because there's just no way to self-sabotage enough to make a hacked trajectory plausibly come in at the required low score - but that's an unknown number of episodes, so bad safety properties.
I also don't buy the distribution argument here. After one episode, the model of the environment will update to learn both the existence of the new state and also the existence of extreme outlier rewards which completely invalidate previous estimates of the distributions. Your simple DT is not keeping an episodic buffer around to do planning over or something, it's just doing gradient updates. It doesn't "know" what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did, it only knows what's encoded into its model, and that model has just learned that there exist very high rewards which it didn't know about before, and thus that the distribution of rewards looks very different from what it thought, which means that '95th percentile' also doesn't look like what it thought that did. It may be unlikely that 10,000 episodes wouldn't sample it, but so what? The hack happened and is now in the data, deal with it. Suppose you have been puttering along in task X and it looks like a simple easily-learned N(100,15) and you are a quantilizer aiming for 95th percentile and so steer towards rewards of ~125, great; then you see 1 instance of reward hacking with a reward of 10,000; what do you conclude? That N(100,15) is bullshit and the reward distribution is actually something much wilder like a lognormal or Pareto distribution or a mixture with (at least) 2 components. What is the true distribution? No one knows, least of all the DT model. OK, is the true 95th percentile reward more likely to be closer to ~125... or to 10,000? Almost certainly the latter, because who knows how much higher scores go than 10,000 (how likely is it the first outlier was anywhere close to the maximum possible?), and your error will be much lower for almost all distributions & losses if you try to always aim for 10,000 and never try to do ~125. Thus, the observed behavior will flip instantaneously.
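For reference, the quantile arithmetic under the assumed N(100, 15) model:

$$z_{0.95}\approx 1.645 \;\Rightarrow\; \text{95th-percentile target}\approx 100 + 1.645\times 15 \approx 125, \qquad \frac{10{,}000-100}{15}=660\ \text{SDs},$$

i.e. the observed hack sits roughly 660 standard deviations out, which is essentially impossible under the Gaussian model, so the sensible update is to the model family itself rather than treating the hack as a fluke within N(100,15).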
> but it's also driven by imperfections in its model of its (initially human-generated) training data
Aside from not being human-like exploration, which targets specific things in extended hypotheses rather than accidentally having trembling hands jitter one step, this also gives a reason why the quantilizing argument above may fail. It may just accidentally the whole thing. (Both in terms of a bit of randomness, but also if it falls behind enough due to imperfections, it may suddenly 'go for broke' to do reward hacking to reach the quantilizing goal.) Again, bad safety properties.
---
I continue to think you're wrong here, and that our disagreement on this point is due to you misunderstanding how an ODT works.
> Your simple DT is not keeping an episodic buffer around to do planning over or something, it's just doing gradient updates. It doesn't "know" what the exact empirical distribution of the last 10,000 episodes trained on was nor would it care if it did
To be clear: an ODT does keep an episodic buffer of previous trajectories (or at least, that is the implementation of an ODT that I'm considering, which comports with an ODT as implemented in algorithm 1 of the paper). During the online training phase, the ODT periodically samples from this experience buffer and does gradient updates on how well its current policy retrodicts the past episodes. It seems like our disagreement on this point boils down to you imagining a model which works a different way.
More precisely, it seems like you were imagining that:
- the ODT forms its own best guess at the distribution of rewards, picks a target reward from that guess (e.g. its estimate of the 95th percentile), and then acts to achieve that reward;
when in fact:
- the ODT is trained to model the distribution of trajectories in its experience buffer conditional on their observed rewards, and is then conditioned on a target reward selected procedurally from the buffer data
(with the obvious modifications when instead of conditioning on a single reward we condition on rewards being in some range).
All of the reasoning in your first paragraph seems to be downstream of believing that an ODT works as in bullet point 1, when in fact an ODT works as in bullet point 2. And your reasoning in your second paragraph seems to be downstream of not realizing that an ODT is training off of an explicit experience buffer. I may also not have made sufficiently clear that the target reward for an ODT quantilizer is selected procedurally using the experience buffer data, instead of letting the ODT pick the target reward based on its best guess at the distribution of rewards.
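A minimal sketch of what "selected procedurally" means here (hypothetical interfaces, not the ODT paper's code):

```python
import numpy as np

def select_target_reward(buffer_returns, q=0.95):
    """Quantilizer target computed directly from the experience buffer.

    The ODT never chooses this number from its own beliefs about the reward
    distribution; it is only asked to (i) retrodict buffer episodes during
    gradient updates and (ii) act while conditioned on this external target.
    """
    return float(np.quantile(buffer_returns, q))
```

In particular, a single extreme outlier in a 10,000-episode buffer barely moves this empirical 95th-percentile target, since the target tracks roughly the 500th-highest return in the buffer.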
---
(separate comment to make a separate, possibly derailing, point)
> If the answer were yes, then this policy would do a really bad job predicting 499 of the 500 past episodes with top 5% reward, so I conclude the answer probably isn't yes.
> For safety, 'probably' isn't much of a property.
I mostly view this as a rhetorical flourish, but I'll try to respond to (what I perceive as) the substance.
The "probably" in my sentence was mainly meant to indicate out-of-model uncertainty (in the sense of "I have a proof that X, so probably X" which is distinct from "I have a proof that probably X"). I thought that I gave a solid argument that reward hacking strategies would not suddenly and decisively become common, and the probably was to hedge against my argument being flawed, not to indicate that the argument showed that reward hacking strategies would appear suddenly and decisively only 10% of the time or whatever.
So I think the correct way to deal with that "probably" is to interrogate how well the argument holds up (as in the sister comment), not to dismiss it due to heuristics about worst-case reasoning.
---
Sounds like the argument for quantilizers. Issues with quantilizers still apply here – for example, taking a series of actions that are individually sampled from a human-like distribution will often end up constituting a long-term policy that's off-distribution. But, like, if those problems could be surmounted I agree that would be really good.
As to ODTs, I'm not super optimistic, but I'm also not very expert. It seems from a little thought like there are two types of benefit to ODT finetuning: One, a sort of "lowering expectations" so that the system only tries to do the behaviors it's actually learned how to do correctly, even if humans do more difficult things to get higher reward. Two, a random search through policies (local in the NN representation of the policy) that might make gradual improvements. I'm not confident in the safety properties of that second thing, for reasons similar to Steve Byrnes' here.
---
I really liked the post and the agenda of improving safety through generative modelling is close to my heart.
> we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data
But you still need online access to our MDP (i.e. reward function and transition function), don't you? And it's access to the MDP that drives novelty and improvement. If you were just sampling whole trajectories from the model (asking the model itself to simulate the reward function and transition model) and feeding them back into the model, you shouldn't expect any change (on average). Your gradient updates will cancel out; that's a consequence of the expected-grad-log-prob lemma.
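For reference, the lemma is just that the score function has zero mean under the model's own distribution:

$$\mathbb{E}_{\tau\sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\big] \;=\; \int p_\theta(\tau)\,\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\,d\tau \;=\; \nabla_\theta \int p_\theta(\tau)\,d\tau \;=\; \nabla_\theta 1 \;=\; 0,$$

so trajectories sampled from the model and fed straight back into it as training data produce gradient updates that vanish in expectation.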
It gets more nuanced when you account for doing ancestral sampling, but it adds problems, not solves them:
https://arxiv.org/abs/2110.10819
> Reproduce the “Learning to Summarize with Human Feedback” paper but with a frozen reward model.
On the other hand, in their follow-up work on instruction following, OpenAI claimed they used little online data (from fine-tuned policies):
https://arxiv.org/abs/2203.02155
> It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions π(⋅),π(⋅|τ1),π(⋅|τ1τ2),…,π(⋅|τ1…τT−1) over actions conditional on partial trajectories
Levine derives that in his control-as-inference tutorial paper (section 2.3). Your expected exponential total reward is pretty close. Note that it acts a bit like an (exponentiated) Q function for your policy: it gives you the exp-reward expected after taking a given action at a given state and following π0 thereafter. The exponential works like a soft argmax, so it gives you something like soft Q-learning but not quite: the argmax is also over environment dynamics, not only over the policy. So it causes an optimism bias: your agent effectively assumes an optimal next state will be sampled for it every time, however unlikely that would be. The rest of Levine's paper deals with that.
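Concretely, the backup in question (writing Q and V for the log backward messages, roughly in Levine's notation) is something like

$$Q(s_t,a_t) \;=\; r(s_t,a_t) + \log \mathbb{E}_{s_{t+1}\sim p(\cdot\mid s_t,a_t)}\big[\exp V(s_{t+1})\big], \qquad V(s_t) \;=\; \log\!\int \exp Q(s_t,a_t)\,da_t.$$

The soft-max over actions is the intended soft-Q-learning part; the log-mean-exp over the dynamics is the optimism bias, since it weights lucky transitions as if the agent could choose them, and the variational treatment later in the paper is what removes it.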
---
> But you still need online access to our MDP (i.e. reward function and transition function), don't you?
Yep, that's right! This was what I meant by "the agent starts acting in its environment" in the description of an ODT. So to be clear, during each timestep t in the online phase, the ODT looks at a partial trajectory of rewards-to-go, observations, and actions ending in the current reward-to-go ĝt and observation ot; then selects an action conditional on this partial trajectory; and then the environment provides a new reward rt (so that the next reward-to-go is ĝt+1 = ĝt − rt) and a new observation ot+1. Does that make sense?
Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.
---
I think this is an interesting line of thinking. My main concern is whether the alignment tax might be too high for some use cases. I think a test case that it might do well for would be the NetHack challenge https://nethackchallenge.com/
I think that that's an interesting challenge because current SoTA is far from ceiling effects on it, and imitating human games as a starting point seems intuitively like a good approach. To study how likely the model is to go in problematic directions once off-distribution, you could modify the NetHack challenge environment to add some possible exploits which aren't in the real game, and see how likely it is for the model to find and use those exploits.
---
Summary: the online decision transformer is a recent approach to creating agents in which a decision transformer is pre-trained offline (as usual) before producing its own trajectories which are fed back into the model in an online finetuning phase. I argue that agents made with generative modeling have safety advantages – but capabilities disadvantages – over agents made with other RL approaches, and agents made with online generative modeling (like the online decision transformer) may maintain these safety advantages while being closer to parity in capabilities. I propose experiments to test all this. (There is also an appendix discussing the connections between some of these ideas and KL-regularized RL.)
I’ll start with some motivation and then introduce the decision transformer and the online decision transformer. If you’re already familiar with the decision transformer, you can probably skip to “Online generative modeling.” If you’re already familiar with the online decision transformer, you can skip to “Some remarks.”
Motivation
Suppose we want to get an AI system to produce a picture of a cat.
A naive approach is to take a trained image classifier M and optimize an image to maximally activate M's 'cat' classification. If you do this, you won't get anything that looks like a cat. Instead, you'll get some deep-dream-esque conglomeration of whiskers, cat ears, and fur which M happens to strongly classify as 'cat.' One way of describing what went wrong here is that the image of the cat you get is very off-distribution. You would rather get a normal-looking picture of a cat, i.e. one which is on-distribution for M's training data.
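(A minimal sketch of this naive approach, assuming a recent torchvision API and using ImageNet class 281, "tabby cat", purely for illustration; the point is that nothing constrains the optimized image to stay on the natural-image distribution.)

```python
import torch
import torchvision.models as models

M = models.resnet50(weights="IMAGENET1K_V2").eval()  # the trained classifier
CAT_CLASS = 281  # "tabby cat" in ImageNet

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    loss = -M(img)[0, CAT_CLASS]  # maximize the 'cat' logit
    loss.backward()
    opt.step()
# The result strongly activates the classifier but looks nothing like a cat.
```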
This can be taken as an analogy for alignment: if you have an agent optimize the world for a given utility function U, what you get won't look anything like a normal world, but rather a very weird-looking world which happens to maximize U. We would prefer to get a normal-looking world which also scores high according to U.
We've recently seen lots of reminders that modern ML does have ways to produce normal-looking pictures of cats. A key ingredient in all of these approaches is generative modeling. Recall that a generative model is something which is fed in data and learns to produce new data which "looks similar" to its training data; in other words, a generative model tries to produce new samples from its training distribution. Think GPT, which tries to produce text which looks like the text it's been trained on.
Generative models can also be conditioned on information which, roughly speaking, tells the generative model which part of its training distribution to sample from (e.g. conditioning GPT on a prompt tells it to sample from the part of its training distribution consisting of texts which start with that prompt; a caption given to DALL-E tells it to sample from the part of its training distribution consisting of images which would be given that caption). So to get a normal-looking picture of a cat, train a generative model on lots of images and captions, and then ask the generative model to sample an image from its training distribution, conditional on that image being one which would be captioned ‘cat.’
If "produce a normal-looking picture of a cat" is an analogy for alignment and generative models solve the “produce a normal-looking picture of a cat” problem, then what does an agent built via generative modeling look like?
Making agents via generative modeling
It looks like a decision transformer.
Recall that decision transformers work as follows. Suppose we want to train an agent to play the Atari game Breakout. We start by encoding the game states, rewards, and actions as tokens. We then treat playing Breakout as a sequence-modeling problem, which can be solved with a transformer. (In other words, instead of training a transformer to predict token sequences which correspond to characters in text (like GPT), you train the transformer to predict tokens which correspond to actions in Breakout.) A transformer used in this way is called a decision transformer.
Typically, the training data for a decision transformer is generated by humans playing Breakout (or whatever game we want our agent to play). If we ran this trained decision transformer without conditioning, it would just attempt to mimic typical human play. To do better, we condition the decision transformer on getting a high score[1]; the decision transformer will then try to play the game like a human who gets a high score.[2]
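As a minimal sketch of what "conditioning on a high score" means operationally (the model and environment interfaces here are schematic stand-ins, not the decision transformer paper's code):

```python
def rollout_with_target_return(dt_model, env, target_return):
    """Act with a decision-transformer-style policy conditioned on a desired return."""
    obs = env.reset()
    returns_to_go, observations, actions = [target_return], [obs], []
    done = False
    while not done:
        # "What would a trajectory that ends up with this much total reward do next?"
        action = dt_model.predict_action(returns_to_go, observations, actions)
        obs, reward, done, _ = env.step(action)  # classic gym-style step
        actions.append(action)
        observations.append(obs)
        returns_to_go.append(returns_to_go[-1] - reward)  # decrement the remaining target
    return target_return - returns_to_go[-1]  # return actually achieved
```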
We can replace “transformer” in all of the above with any generative model which can generate completions of our task (in the transformer example, the task completions are token sequences which encode trajectories). So more generally, our scheme is: train a generative model on human-generated task completions; then ask the generative model to produce new task completions conditional on getting high reward. Let’s call agents made from generative models this way “GM agents.” In the section “Safety advantages of generative modeling” I’ll argue that GM agents have safety advantages over other types of RL agents.
Online generative modeling
GM agents as described above should be able to perform tasks at the upper end of human capabilities. But, since their behavior is fundamentally based on mimicking human-generated training data, they won’t be able to do too much better or explore novel strategies.[3] In other words, GM agents seem to carry a large alignment tax. Ideally, we would like to boost the capabilities of GM agents without losing safety.
The obvious way to improve GM agents’ performance is to take the learned policy (which was trained to mimic top human performance, as described above) and finetune it with a policy gradient method to maximize our objective function (e.g. Breakout score). In other words, this boils down to training an agent with a standard RL technique, but starting from a policy learned by a generative model instead of from a random initial policy. This would fix the capabilities issue but would probably ruin the GM agent’s safety advantages.
Another approach to boosting capabilities which has been studied recently is what I call online generative modeling. The decision transformer version of this is the online decision transformer (ODT).
An ODT starts as a vanilla decision transformer trained offline on human-generated training data. But once it has learned to perform competently, we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data. Over time, the ODT will shift to mimicking some mixture of the original human-generated data and its self-generated data.[4]
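Schematically (again with made-up interfaces), the online phase is a loop like the following, with `rollout_with_target_return` as sketched earlier:

```python
def online_phase(dt_model, env, buffer, num_iterations, choose_target):
    """ODT-style online phase: the pre-trained model alternates between
    generating new episodes and re-fitting itself to the growing buffer
    of human-generated plus self-generated data."""
    for _ in range(num_iterations):
        target = choose_target(buffer)  # e.g. a high but previously observed return
        episode = rollout_with_target_return(dt_model, env, target)
        buffer.add(episode)                   # new data sits alongside the human data
        dt_model.fit(buffer.sample_batch())   # periodic gradient updates on the mixture
```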
As before, we can replace the transformer here with any generative model, so our general scheme is: train a generative model to mimic high-reward human task completions as before, then continue training the generative model online on task completions it produces. Call the agents made this way OGM agents.
Some remarks
Safety advantages of generative modeling
[Epistemic status: I view the things I write in this section as somewhere between “suggestive argument sketches” and “attempts to justify an intuition by pointing at facts that feel relevant.” Which is to say: I’m not very confident in all of this section’s reasoning, but I think the general picture has a large enough chance of being true to merit empirical investigation.]
Section summary: first, I sketch an optimistic case for GM agents’ safety. Ideally, I would then move on to discussing whether OGM agents can safely improve on GM agents’ capabilities. However, I felt that a serious discussion of that point would require describing a framework for OGM with human feedback, so I’ve deferred that discussion to a future post and only give some preliminary considerations here instead. I conclude with some caveats and counterpoints.
Given that GM agents are just trying to mimic the task completions of top humans, they have some clear safety advantages. To name a few:
On the other hand, the transition from GM agents to OGM agents – which was done to get around GM agents being capped at the capabilities of top humans – will result in novel behaviors, and we need to analyze how well these safety advantages will persist. Left online for long enough, OGM agents could start to view breaking vases, reward hacking, deception, etc. as both normal and high-reward.
In practice, this might be resolvable with human feedback (i.e. by making the objective function be a reward model trained online with human feedback). If the transition from non-deceptive to deceptive behavior is slow and stuttering, then humans may be able to give negative feedback to the OGM agent’s first attempts at deception, preventing deceptive behavior from ever starting to look on-distribution. Alternatively, there might be ways to ensure that the distribution OGM agents are modeling never shifts too far from the human-generated training distribution, or doesn’t shift too rapidly relative to our ability to give feedback. (One simple idea here is to ensure that some fixed proportion (e.g. 50%) of the task completions in the dataset which the OGM is trying to model are always drawn from the original human-generated training data.)
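A minimal sketch of the fixed-proportion idea (hypothetical interfaces): each training batch is drawn half from the frozen human-generated dataset and half from the agent's self-generated episodes, so the distribution being modeled can never drift arbitrarily far from the human one.

```python
import random

def mixed_batch(human_episodes, self_episodes, batch_size, human_fraction=0.5):
    """Sample a training batch with a fixed share of original human-generated data."""
    n_human = int(batch_size * human_fraction)
    n_self = batch_size - n_human
    return (random.sample(human_episodes, n_human)
            + random.sample(self_episodes, min(n_self, len(self_episodes))))
```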
In a future post, I plan to sketch out a proposal for OGM with human feedback, as well as a more sophisticated scheme for preventing an OGM agent’s behavior from changing too rapidly relative to its understanding of our values (or more concretely, relative to the reward model’s loss).[6]
But since this isn’t that future post, I’ll instead generally outline some safety-relevant considerations for OGM agents:
Finally, some caveats and counterpoints:
Experiments
Here are some experiments which could be useful. For all of these, the OGM agents should be compared against agents made with various other RL techniques as well as to GM agents finetuned with a KL-divergence penalty. I'm most excited about having the OGM agents be OGM quantilizers as in remark 1(b) above.
Experiments which verify that GM agents have the safety properties I’d expect them to have and probe how rapidly these safety properties decay in online training:
Experiments which test the size of the alignment tax for OGM agents:
Conclusion
Generative modeling is an approach for creating agents which has safety advantages over other approaches to RL, but which results in agents which are capped at the performance of top humans. Online generative modeling allows for improvements in capabilities, and when combined with other techniques, such as human feedback, OGM agents might also retain the safety advantages of GM agents. In a future post, I plan to sketch out what OGM with human feedback could look like and present a scheme for ensuring that an OGM agent’s capabilities don’t improve too fast relative to its understanding of human preferences.
Thanks to Ben Edelman, Thibaut Horel, Holden Lee, Simeon Campos, Nuño Sempere, and Johannes Treutlein for the discussions that became this post. Thanks additionally to Gabriel Wu and Tom Shlomi for feedback on a draft.
Appendix: GM agents and KL-regularized RL
This appendix is due to a discussion with Ben Edelman. The mathematical content here is nearly identical to that of this recent post.
There is an intuitive connection between creating agents by conditioning generative models on high reward and finetuning the policy learned by an (unconditioned) generative model to maximize reward with a KL-divergence penalty. Namely, both methods aim to improve from a baseline policy learned by a generative model in a way that gets high reward without straying too far from the baseline.
In fact, we can go further than this intuitive connection. This appendix explains a more precise connection between a GM agent as implemented in remark 1(c) above and KL-regularized finetuning. The meat of the connection is the following fact:
Let π0(τ) be a baseline policy over trajectories τ and let R(τ) be a reward function. Then the following policies are the same:
(a) the policy π with π(τ)∝π0(τ)exp(R(τ));
(b) the policy π which maximizes Eτ∼π[R(τ)]−DKL(π||π0).
To prove this fact, one observes that plugging the policy π(τ)∝π0(τ)exp(R(τ)) into Eτ∼π[R(τ)]−DKL(π||π0) gives logEτ∼π0[exp(R(τ))], which is provably maximal (by the argument in the appendix here).
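In a bit more detail: write Z=Eτ∼π0[exp(R(τ))] and π∗(τ)=π0(τ)exp(R(τ))/Z. Then for any policy π,

$$\mathbb{E}_{\tau\sim\pi}[R(\tau)] - D_{\mathrm{KL}}(\pi\,\|\,\pi_0) \;=\; \mathbb{E}_{\tau\sim\pi}\!\left[\log\frac{\pi_0(\tau)e^{R(\tau)}}{\pi(\tau)}\right] \;=\; \log Z - D_{\mathrm{KL}}(\pi\,\|\,\pi^*) \;\le\; \log Z,$$

with equality exactly when π=π∗, so the KL-regularized optimum in (b) is the exponentially tilted policy in (a) and its value is logEτ∼π0[exp(R(τ))].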
The policy in (b) is what one gets by doing RL with reward function R and penalizing KL-divergence from π0. To sample from the policy in (a), suppose that we've trained a decision transformer on sequences R,τ1,τ2,…,τT where R is the reward for the whole trajectory τ=τ1…τT consisting of actions τi. Let π0 be the unconditioned policy (i.e. the policy one gets by not conditioning on any reward). Then, as in this paper, one can sample trajectories with probabilities proportional to π0(τ)exp(R(τ)) by first sampling a reward R with probability proportional to P(R)exp(R) (where P(R) is the probability of R from the training distribution, as output by the decision transformer), and then sampling trajectories from π0(⋅|R) by conditioning the decision transformer on reward R.[7]
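A short sketch of this two-step sampling procedure (the model methods below are hypothetical stand-ins for "read off the predicted reward distribution" and "sample a trajectory conditioned on a reward"):

```python
import numpy as np

def sample_exp_tilted_trajectory(dt_model, reward_bins):
    """Sample tau with probability proportional to pi_0(tau) * exp(R(tau))."""
    bins = np.asarray(reward_bins, dtype=float)
    log_p = np.log(dt_model.reward_probs(bins))  # log P(R) from the decision transformer
    log_w = log_p + bins                         # tilt by exp(R)
    log_w -= log_w.max()                         # numerical stability
    w = np.exp(log_w)
    w /= w.sum()
    R = np.random.choice(bins, p=w)                     # step 1: sample R with prob ∝ P(R) exp(R)
    return dt_model.sample_trajectory_given_reward(R)   # step 2: condition the model on R
```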
It would be interesting to figure out a way to factorize the policy in (a) over timesteps, i.e. produce distributions π(⋅),π(⋅|τ1),π(⋅|τ1τ2),…,π(⋅|τ1…τT−1) over actions conditional on partial trajectories so that sampling trajectory τ=τ1…τT with probability π(τ1)π(τ2|τ1)…π(τT|τ1…τT−1) is the same policy as in (a). Ben Edelman pointed out to me that if one can estimate at each timestep the expected exponential total reward Eτ>t∼π0(⋅|τ≤t)[exp(R(τ≤tτ>t))] over the randomness of π0, then one can produce this factorization by taking action τt with probability proportional to
π0(τt|τ<t)Eτ>t∼π0(⋅|τ<tτt)[exp(R(τ<tτtτ>t))].
That said, it's not clear to me that estimating this expected exponential reward is something that can natively be done without introducing an auxiliary reward model separate from the generative model.
There are a few different approaches to doing this, which I’ll discuss below.
This is analogous to prompting GPT with “This is a transcript of a conversation with Stephen Hawking” before asking it physics questions.
The original decision transformer paper notes that for some tasks, it’s possible to prompt the decision transformer with a reward higher than any reward in the human-generated data and get (boundedly) superhuman performance. But for nearly all tasks we should expect there to be some capabilities cap beyond which the agent can’t improve without exploring new strategies.
For an artful analogy, suppose Alice is training to become an artist. She might start out imitating the work of great artists; this is analogous to a decision transformer learning to imitate high-reward human task completions. But Alice will do some things differently than the artists she’s imitating. Over time, Alice’s art might shift to become a mixture of great artists’ work and Alice’s own past work. Eventually Alice will produce art which is very different from the great artists she started imitating (i.e. she will develop her own style); this is analogous to an ODT attaining superhuman performance by imitating its own past play.
In the ODT paper, exploration is forced by requiring the ODT’s policy to have not-too-small entropy. This can also be viewed as forcing the ODT to be an imperfect model of human behavior. It would be interesting to see how important this entropy constraint is to the improvement of the ODT – perhaps the exploration that arises from the natural imperfections in the ODT’s model of human behavior (or from the stochasticity of the policy) are enough to ensure improvement in the online phase.
In other words, the scheme aims to ensure that capabilities always lag in the capabilities vs. value-learning race.
We assumed a decision transformer here in order to be able to sample rewards with probability proportional to P(R)exp(R); decision transformers can do this because they explicitly represent the probabilities P(R) for each R. It's not obvious to me how to sample from this distribution for other generative models, but maybe there's a clever way to do so.