Summary

I’ve been working on a project aimed at finding goal representations in a small RL agent. I designed a setup in which I train an agent while rapidly alternating between two very similar objectives.

I was able to consistently (across alternations and across different random seeds) get high average returns on both objectives. I think this suggests the agent contains modular representations that can flexibly support a competent policy for each of the two objectives.

I plan to look for these representations and edit them to change what objective the agent pursues - potentially enabling it to complete a novel objective which it was never trained on.

Motivation

I am interested in Retargeting the Search as an alignment approach. I expect agents that behave competently in complex, diverse environments will need to find solutions to a broad range of challenging and novel problems, and that, as a result, they will internally implement general-purpose search.

Presuming this is true, you should in theory be able to change the “thing that parametrizes the search,” or “thing that determines what the search is currently directed towards.” Loosely, I would refer to this as the agent’s “goal.”

In order to use this for alignment, you’d need to solve at least two problems:

1. Figure out which parts of the agent parametrize the search process (i.e. where is the goal located?). This may require finding and mechanistically understanding the internal search process as a whole, but I don’t think it necessarily will.

2. Change the content in those parts, such that the agent now pursues goals that you’d like it to pursue.

The second problem is pretty complicated. You need to know what goal you want the agent to pursue (outer alignment), and you also need to know what encoding inside the agent corresponds to that goal (ontology identification). 

Both of these seem like pretty difficult problems. You also probably need “human values” (either their explicit representation or some pointer to them) to be easily representable in terms of the agent’s concepts.

This post won’t really explore these issues in detail, nor will it discuss other high-level considerations related to Retargeting the Search. Instead, I focus on problem 1: locating the “thing that parametrizes the search”/goal.

I decided to do this project so that I could work on finding goals inside agents in a toy setting. In the future I would like to locate and edit goals in larger models, such as sophisticated RL agents and LLMs (potentially ones finetuned to be more agentic), but I thought it would be good to start small and gain some knowledge/experience.

I think a nice demonstration would be if I trained a small RL agent in an environment with keys of different colors to e.g. collect red keys and collect blue keys, and then after analyzing the checkpoints was able to edit it without retraining such that it collected purple keys. This does involve figuring out the encoding for purple, but in a toy setting this might not be too hard. 

Failing that, I would like to at least be able to splice the “goal” part of a blue-key-pursuing checkpoint into the red-key-pursuing checkpoint, such that the latter model would then collect blue keys. I might also be able to do some interpretability to better understand how the goal works and how it connects to the rest of the network.     

Setup

I train two dense neural networks using PPO: an actor and a critic. The actor has four layers and the critic has three; both have a hidden dimension of 128. 
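Concretely, the architecture is roughly the following (a simplified sketch; the activation function and weight initialization here are placeholders, so see my code for the exact details):

```python
import torch.nn as nn

HIDDEN = 128

def make_actor(obs_dim: int, n_actions: int) -> nn.Module:
    # Four linear layers; the final layer outputs logits over the discrete actions.
    return nn.Sequential(
        nn.Linear(obs_dim, HIDDEN), nn.Tanh(),
        nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
        nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
        nn.Linear(HIDDEN, n_actions),
    )

def make_critic(obs_dim: int) -> nn.Module:
    # Three linear layers; the final layer outputs a scalar value estimate.
    return nn.Sequential(
        nn.Linear(obs_dim, HIDDEN), nn.Tanh(),
        nn.Linear(HIDDEN, HIDDEN), nn.Tanh(),
        nn.Linear(HIDDEN, 1),
    )
```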

I use a custom gridworld environment – a hybrid of Dynamic Obstacles and Fetch from MiniGrid.  

Each instantiation has a 4x4 grid, which contains the agent (red triangle), three keys, and an obstacle (grey ball). The locations of the keys, agent, and obstacle are randomized each time the environment is reset. 

At each timestep, the agent receives a 5x5 observation (the highlighted/light grey region). While this is larger than the grid itself, the view extends in front of the agent, so it still can’t see behind itself. 5x5 was the default observation size, and I haven’t experimented with changing it. In my implementation, the observation is one-hot encoded.
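As an illustration, the encoding step looks roughly like this; I’m assuming the standard MiniGrid cell representation (an object/color/state index triple per cell) and its default index ranges, which may not exactly match my code:

```python
import numpy as np

# Assumed MiniGrid-style index ranges (object, color, state) per grid cell.
N_OBJECTS, N_COLORS, N_STATES = 11, 6, 3

def one_hot_obs(image: np.ndarray) -> np.ndarray:
    """image: (5, 5, 3) array of integer indices -> flat one-hot vector."""
    view_h, view_w, _ = image.shape
    channels = (N_OBJECTS, N_COLORS, N_STATES)
    encoded = np.zeros((view_h, view_w, sum(channels)), dtype=np.float32)
    offset = 0
    for c, size in enumerate(channels):
        # Set a 1 at each cell's index within this channel group.
        encoded[np.arange(view_h)[:, None],
                np.arange(view_w)[None, :],
                offset + image[..., c]] = 1.0
        offset += size
    return encoded.flatten()  # this flat vector is what the dense networks consume
```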

The obstacle moves in a random direction at each timestep. The agent then outputs an action, such as “move forward,” “turn left,” or “pick up object.”

I have two different reward schemes, corresponding to the two color objectives I train on. In both, the episode ends and the agent gets -1 reward if it runs into the obstacle, and 0 reward if it times out (reaches max steps before the episode otherwise terminates).

In the “purple” reward scheme, the agent gets a reward close to one (slightly lower if it takes longer) for picking up the purple key, and 0.1 reward if it picks up a differently colored key. Regardless, the episode terminates after any key is picked up. 

The opposite is the case for the “blue” reward scheme; the agent gets high reward for picking up the blue key and low reward for picking up other keys. 

There is always exactly one blue and one purple key (and one key of a random color that isn’t purple or blue).
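Putting the two schemes together, the reward logic looks roughly like this (the decaying-reward formula for the target key is the standard MiniGrid one and is an assumption about my exact implementation; the other values are as stated above):

```python
def compute_reward(event: str, key_color: str, target_color: str,
                   step_count: int, max_steps: int) -> float:
    # target_color is "purple" or "blue" depending on the current reward scheme.
    if event == "hit_obstacle":
        return -1.0                                      # episode also terminates
    if event == "timeout":
        return 0.0
    if event == "picked_up_key":                         # episode terminates either way
        if key_color == target_color:
            return 1.0 - 0.9 * (step_count / max_steps)  # close to 1, lower if slower
        return 0.1
    return 0.0                                           # ordinary step: no reward
```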

I train for a small number of rollouts with the purple reward scheme, then some number with blue, then some with purple, and so on. A rollout is a single phase of letting the agent interact with the environment (or environments, if we’re running multiple in parallel) and collecting experiences (tuples of observation, action, reward, etc.) in the replay buffer. In between rollouts, we sample those experiences to do some number of gradient updates.

I collect checkpoints for the actor network at the end of each training period, e.g. on the last rollout for the purple scheme before switching to the blue scheme. 
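Schematically, the outer training loop looks like this (a pseudocode-level sketch: collect_rollout, ppo_update, and the surrounding variables are placeholders for my actual code, and the specific numbers are the ones I settled on, discussed in the next section):

```python
import torch

rollouts_per_phase = 15     # rollouts between objective switches
update_epochs = 8           # gradient-update epochs after each rollout
num_minibatches = 4

schemes = ["purple", "blue"]
for phase in range(num_phases):                  # num_phases: total number of alternations
    scheme = schemes[phase % 2]
    for rollout in range(rollouts_per_phase):
        # Collect num_envs * num_steps experiences under the current reward scheme.
        buffer = collect_rollout(envs, actor, critic, reward_scheme=scheme)
        for _ in range(update_epochs):
            for minibatch in buffer.sample_minibatches(num_minibatches):
                ppo_update(actor, critic, minibatch)   # clipped surrogate + value loss
    # Checkpoint the actor at the end of each training phase.
    torch.save(actor.state_dict(), f"checkpoints/{scheme}_phase_{phase}.pt")
```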

Progress

I was able to use a small number of rollouts (15) between objective switches, where each learning period had 8 update epochs and 4 minibatches. I needed to do some hyperparameter tuning (notably, I needed to use weight decay), but I was able to train the agent to achieve high average reward on both objectives at the end of their respective training periods.

The results below are for a particular random seed (seed 99), but I consistently got similar results across all ten random seeds I tested:

The returns for the blue and purple objectives are plotted against the same “step” x-axis; however, only one of the two objectives is being trained on at any given step. Each dot is a datapoint.

We see a cyclic pattern. Taking the blue plotline: over the course of each blue training period (past an initial learning phase), the average returns tend to steadily increase. Visually, this corresponds to an upward slope from a valley to a peak.

The region between a peak and the next valley, on the other hand, corresponds to the period of time where we switch over to training with the purple reward scheme. When we finally switch back to training on blue again, the average return starts out low - the next blue datapoint is in a valley. 

There are some exceptions to this pattern, but it mostly holds. In general, the peaks for both blue and purple tend to be between 0.9 and 0.95, meaning that at the end of the blue training phases the agent mostly succeeds at picking up the blue key, and at the end of the purple training phases it mostly succeeds at picking up the purple key. This happens quite consistently across training phases. 

This is in spite of the fact that, at the beginning of training, the agent takes a while (in particular, longer than 15 rollouts, which is the length of a single training phase) before it achieves ~0.9 average return for either the blue or the purple objective. Hence, it cannot learn to perform well from scratch within the space of a single training phase. 

The fact that the agent quickly switches between performing well on the two objectives suggests that there is some internal structure (in the actor network, the critic network, or both) that is preserved between the blue and purple phases, and which is helpful for both objectives.

I would guess that something that one would intuitively think of as “modularity” is present in the network(s). I’d further hypothesize that the actor network has a modular goal representation – the environment is drawn from the same distribution during the blue and purple training phases, so the only difference is the reward scheme.

I should note that in some random seeds, the agent starts doing poorly after having already learned to achieve high reward on both objectives; however, it recovers within a few training phases. In one seed the performance drop happens at the very end of training (so there is no time to recover), but this isn’t a major issue so long as you just look at checkpoints from before performance degrades.

That caveat aside, I think it’s interesting that the agent was able to perform well on both objectives. Even with weight decay, it wasn’t obvious to me that SGD would so consistently find sufficiently flexible/modular solutions during a blue training phase that it would be able to quickly find a solution in the next purple training phase. 

Perhaps this is due to lack of knowledge/understanding on my part – maybe this result is obvious to others – but I’d guess that some readers will find this interesting. 

Lastly, some notes/takeaways from my hyperparameter tuning:

As I already mentioned, I used weight decay. It was also important to lower the clip coefficient, which is used for the clipped surrogate objective. My understanding is that this helps reduce catastrophic forgetting, which is a common problem with PPO. I used a clip coefficient of 0.2.
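For reference, the clipped surrogate objective is

$$L^{\mathrm{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right], \qquad r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

where $\epsilon$ is the clip coefficient and $\hat{A}_t$ is the advantage estimate; a smaller $\epsilon$ limits how far each update can move the new policy’s action probabilities away from the old policy’s.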

It was also very helpful to increase the number of environments and number of steps per environment, which together increase the number of experiences saved to the replay buffer per rollout. Holding the number of minibatches constant, this increases the size of each minibatch, which I think increases the quality of the gradient (because there’s a higher likelihood and variety of informative experiences per gradient update). 

This is basically what I did instead of model scaling; for reasons I don’t understand, just making the model bigger often didn’t improve performance in this RL setup. A brief search online seems to suggest that others have also run into this problem when conducting RL experiments. I set num_envs to 32 and num_steps to 512. 
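Concretely, assuming the buffer holds num_envs × num_steps experiences, that is 32 × 512 = 16,384 experiences per rollout, or 4,096 experiences per minibatch with 4 minibatches.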

Overall, some hyperparameters didn’t seem particularly important (the value function coefficient, the number of update epochs per update period, and the number of minibatches were all left at their default values because adjusting them independently didn’t improve performance in early tuning), but others did matter: the weight decay needed to be a particular value, and num_envs and num_steps had to be sufficiently large. However, I have not conducted any rigorous experiments testing how sensitive the results are to each hyperparameter.

For a full list of hyperparameters, see my code. It is very messy, but my results should be reproducible if you follow the instructions in the README. If not, I’d be happy to help you diagnose the issue; I’m also happy to share more of my data upon request (though if you’re reading this long after the time of posting, that’s less likely to still be true).

Next Steps

I plan to look at the model checkpoints and try to find goal representations in the near future. As mentioned in the Motivation section, I’d ideally like to edit the weights such that the model pursues an objective it was never trained on (e.g. picking up the yellow key). If this proves intractable, I’ll try to validate that I actually located the goals via some other means. 
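For the splicing fallback mentioned in the Motivation section, a minimal version would just copy a suspected “goal” parameter group from one actor checkpoint into another; the layer names and file paths below are purely illustrative, since identifying the right parameters is the actual project:

```python
import torch

# Load two actor checkpoints (state dicts) saved at the ends of their training phases.
# The file names here are placeholders.
purple_actor_sd = torch.load("checkpoints/purple_phase_N.pt")
blue_actor_sd = torch.load("checkpoints/blue_phase_N.pt")

# Copy the hypothesized "goal" parameters (placeholder names) from blue into purple.
spliced_sd = dict(purple_actor_sd)
for name in ["2.weight", "2.bias"]:
    spliced_sd[name] = blue_actor_sd[name].clone()

# actor: an actor network instance, as sketched in the Setup section.
actor.load_state_dict(spliced_sd)  # if the hypothesis is right, this actor now collects blue keys
```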

If anyone reading this would like to collaborate, please let me know! I’d especially appreciate help from people who have prior experience with interpretability. 
