Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]
I think these arguments have merit, and the Shard Theory's model definitely seems to correspond to a real stage in agents' value formation.
But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.
Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.
We can say that this process optimizes the agent for good performance according to some reward function R, or that it chisels "effective cognition" into that agent according to some rule.
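(To make the setup concrete, here's a minimal toy sketch of the kind of greedy update loop I mean. Everything in it, from the parameter vector to the toy R, is invented for illustration; the only point is that as the agent improves by R's lights, the corrections it receives shrink.)

```python
import numpy as np

# Toy greedy optimization loop (illustrative only).
# The "agent" is a parameter vector; R scores it; each step applies a local
# correction, and the corrections shrink as performance under R improves.

rng = np.random.default_rng(0)
theta = rng.normal(size=8)       # agent's parameters
target = rng.normal(size=8)      # stand-in for "the behavior R rewards"

def R(params):
    # Toy reward: higher when the agent's behavior matches the target.
    return -np.sum((params - target) ** 2)

lr = 0.1
for step in range(50):
    correction = lr * (-2.0) * (theta - target)   # local gradient step on R
    theta += correction
    if step % 10 == 0:
        print(step, round(float(R(theta)), 4), round(float(np.linalg.norm(correction)), 4))
```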
The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:
- Have an explicit representation of R inside itself, which it would explicitly pursue.
- Pursue only R, at the expense of everything else in the universe.
I'll defend them separately.
Point 1. It's true that explicit R-optimization is suboptimal for many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved" — an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of R.
But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at R in all such cases[2]. How can we do so?
Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task — whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from R.
It seems almost tautologically true, here, that the only way to keep an agent pointed at R given this setup is to explicitly point it at R. Nothing else would do!
Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).
Point 2. But why would that agent be shaped to pursue only R, and so strongly that it'll destroy everything else?
This, more or less, also has to do with environment diversity, plus some instrumental convergence.
As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at R (= a high probability of receiving a strong update/correction after this episode ends).
Without knowing when such a circumstance would arise, how can we prepare our agent for this?
We can make it optimize for R strongly, as strongly as it can, in fact. Acquire as many resources as possible, spend them on nothing but R-pursuit, minimize the uncertainty of scoring well at R, and so on.
Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away, with update-strength proportional to how distracting a goal is.
Every missed opportunity to grab resources that could be used for R-pursuit, or a failure to properly optimize a plan for R-pursuit, would eventually lead to scoring badly at R. And so our optimization algorithm would instill a drive to take all such opportunities.
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue R, but to maximize R-pursuit — at the expense of everything else.
What should we take away from this? What should we not take away from this?
- I should probably clarify that I'm not arguing that inner alignment isn't a problem here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
- I'm not saying that the shard theory is incorrect — as I'd said, I think shard systems are very much a real developmental milestone of agents.
But I do think that we should very strongly expect the SGD to move its agents in the direction of R-optimizing wrapper-minds. Said "movement" would be very complex, a nuanced path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!
And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)
We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.
- ^
Plus some more fundamental objections to utility-maximization as a framework, which I haven't properly updated on yet, but which (I strongly expect) do not contradict the point I want to make in this post.
- ^
That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.
Oh, huh. Yes, the thing you're calling the "reward circuitry", I would call the "reward function and value function". When I talk about the outer optimization criterion or R, in an RL setting I am talking about the reward function, because that is the part of the "reward circuitry" whose contents we actually specify when we set up the optimization loop.
The reward function is usually some fixed function (though it could also be learned, as in RLHF) that does not read from the agent's/policy's full mental state. Aside from some prespecified channels (the equivalent of hormone levels, hardwired detectors, etc.), that full mental state consists of hundreds/thousands/millions/billions of signals produced from learned weights. When we write the reward function, we have no way of knowing in advance what the different activation patterns in the state will actually mean, because they're learned representations and they may change over time. The reward function is one of the contributors to the TD error calculation.
The value function is some learned function that looks at the agent's mental state and computes outputs that contribute to the TD error calculation. TD errors are what determine the direction and strength with which circuitry gets updated from moment to moment. There needs to be a learned component to the updating process in order to do immediate/data-efficient/learned credit assignment over the mental state. (Would take a bit of space to explain this more satisfyingly. Steve has some good writing on the subject.)
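(Here's a minimal tabular TD(0) sketch of that division of labor. The setup and numbers are purely illustrative, not anyone's actual architecture: a fixed reward function and a learned value function both feed into the TD error, and the TD error drives the updates.)

```python
import numpy as np

# Illustrative tabular TD(0): a fixed reward function plus a learned value
# function jointly produce the TD error, which determines the direction and
# size of each value update.

n_states = 5
V = np.zeros(n_states)          # learned value function
gamma, alpha = 0.9, 0.1
rng = np.random.default_rng(0)

def reward_fn(next_state):
    # Fixed, designer-specified function of a prespecified channel
    # (here: "did we reach the terminal state?"), not of the agent's
    # full learned mental state.
    return 1.0 if next_state == n_states - 1 else 0.0

for episode in range(200):
    s = 0
    while s != n_states - 1:
        s_next = min(s + int(rng.integers(1, 3)), n_states - 1)
        r = reward_fn(s_next)
        td_error = r + gamma * V[s_next] - V[s]   # reward + learned value
        V[s] += alpha * td_error                  # update driven by TD error
        s = s_next

print(np.round(V, 2))   # values rise toward 1.0 for states closer to the reward
```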
That's roughly my model of how RL works in animals, and how it will work in autonomous artificial agents. Even in an autonomous learning setup that only has prediction losses over observations and no reward, I would still expect the agent to develop something like intentions and something like updating pretty early on. The former as representations that assist it in predicting its future observations from its own computations/decisions, and the latter as a process to correct for divergences between its intentions and what actually happens[1].
By itself, this behavior-level reinforcement does not necessarily lead to parameter updates. If the only time when parameters get updated is when reward is received (this would exclude bootstrapping methods like TD for instance), and the only reward is at the end of the race, then yeah I agree, there's no preferential updating.
But behavior-level reinforcement definitely changes the distribution of experiences that the agent collects, and in autonomous learning, the parameter updates that the outer optimizer makes depend on the experiences that the agent collects[2]. So depending on the setup, I expect that this sort of extreme positive feedback loop may either effectively freeze the parameters around their current values, or else skew them based on the skewed distribution of experiences collected, which may even lead to more behavior-level reinforcement and so on.
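(Here's a toy version of that feedback loop; all the specifics are invented. An agent that already strongly prefers one activity almost never collects experience of the alternative, so the alternative's preference barely gets updated even though it pays more, and the skew compounds.)

```python
import numpy as np

# Toy on-policy feedback loop: parameter updates only come from experiences
# the current policy actually collects, so a behavioral lock-in skews (or
# effectively freezes) which parameters get updated.

rng = np.random.default_rng(0)
prefs = np.array([2.0, 0.0])    # 0 = doing donuts (already preferred), 1 = finishing the race
rewards = {0: 0.2, 1: 1.0}      # the race actually pays more

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

counts = np.zeros(2)
for step in range(1000):
    action = int(rng.choice(2, p=softmax(prefs)))
    counts[action] += 1
    # Crude bandit-style update: only the activity actually experienced
    # gets its preference reinforced.
    prefs[action] += 0.01 * rewards[action]

print(counts, np.round(softmax(prefs), 3))
# The race is sampled so rarely that its preference barely moves; the skewed
# experience distribution keeps skewing the updates.
```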
Not sure off the top of my head. Let's see.
If the agent "wants" to make artful donuts, that entails there being circuits in the agent that bid for actions on the basis of some "donut artfulness"-related representations it has. Those circuits push the policy to make decisions on the basis of donut artfulness, which causes the policy to try to preferentially perform more-artful donut movements when considered, and maybe also suppress less-artful donut movements.
If the policy network is recurrent, or if it uses attention across time steps, or if it has some other form of memory, then it is possible for it to "practice" its donuts within an episode. This would entail some form of learning that uses activations rather than weight changes, which has been observed to happen in these memoryful architectures, sometimes without any specific losses or other additions to support it (like in-context learning). By the end, the agent has done a bunch of marginally-more-artful donuts, or its final few donuts are marginally more artful (if actions temporally closer to the reward are more heavily reinforced), or its donut artfulness is more consistent.
Now, if the agent is always doing donuts (like, it never ever breaks out of that feedback loop), and we're in the setting where the only way to get parameter updates is upon receiving a reward, then no, the agent will never get better across episodes. But if it is not always doing donuts, then it can head to the end of the race after it completes this "practice". That should differentially reinforce the "practiced" more-artful donuts over less-artful donuts, right?
(To be clear, I don't think that the real CoastRunners boat agent was nearly sophisticated enough to do this. But neither was it sophisticated enough to "want" to do artful donuts, so I feel like it's fair to consider.)
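(For concreteness, here's the credit-assignment arithmetic I have in mind, in a toy Monte-Carlo setting with a single terminal reward and discounting; the numbers are made up. Actions closer to the end-of-race reward receive larger discounted returns, which is what would let the later, "practiced" donuts get reinforced harder than the earlier ones.)

```python
# Toy discounted-return calculation: with one reward at the very end of the
# episode, the return credited to an action shrinks geometrically with its
# distance from that reward.

gamma = 0.99
terminal_reward = 10.0
episode_length = 50      # actions at steps 0..48, race finish (and reward) at step 49

for t in [0, 10, 25, 48]:
    steps_to_reward = (episode_length - 1) - t
    discounted_return = (gamma ** steps_to_reward) * terminal_reward
    print(f"action at t={t}: return credited to it = {discounted_return:.2f}")
```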
Is there something specific you wanted to probe with this example? Again, I don't quite know how I should be relating this example to the rest of what we've been talking about.
- ^
The outer optimizer has no clear way to tell what those representations mean or what even constitutes a divergence from the agent's perspective.
- ^
I think many online learning, active learning, RL, and retrieval/memory-augmented setups fall into this category. 🤔