Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]
I think these arguments have merit, and the shard-theory model definitely seems to correspond to a real stage in agents' value formation.
But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.
Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.
We can say that this process optimizes the agent for good performance according to some reward function R, or that it chisels "effective cognition" into that agent according to some rule.
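To make the "shrinking corrections" picture concrete, here's a minimal toy sketch (my own, purely illustrative; the quadratic R and the learning rate are arbitrary assumptions) of a greedy update process whose corrections shrink as the agent's performance at R improves:

```python
# Toy "agent": a single parameter theta, scored by a reward function R.
# The greedy optimizer nudges theta toward higher R; as performance
# improves, the size of each correction shrinks.

def R(theta, target=3.0):
    return -(theta - target) ** 2   # reward peaks when theta == target

def grad_R(theta, target=3.0):
    return -2.0 * (theta - target)  # gradient of R w.r.t. theta

theta = 0.0
lr = 0.1
for step in range(10):
    update = lr * grad_R(theta)     # the "correction" applied this step
    theta += update
    print(f"step {step}: R = {R(theta):.4f}, |update| = {abs(update):.4f}")
# |update| decays geometrically here: the better the agent already does
# at R, the less the optimizer needs to change it.
```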
The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:
- Have an explicit representation of R inside itself, which it would explicitly pursue.
- Pursue only R, at the expense of everything else in the universe.
I'll defend them separately.
Point 1. It's true that explicit R-optimization is suboptimal for many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved" — an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of R.
But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at R in all such cases[2]. How can we do so?
Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task — whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from .
It seems almost tautologically true, here, that the only way to keep an agent pointed at R given this setup is to explicitly point it at R. Nothing else would do!
Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).
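As a toy illustration of this Point-1 argument (entirely my own construction; the environments and the "shiny" feature are made up), here's a sketch in which a fixed proxy heuristic matches an explicit R-evaluator across the training environments but comes apart from R in a novel one. The explicit_R_policy that reads R directly is a stand-in for an agent carrying an internal representation of R:

```python
import random

# Each environment offers candidate actions; each action has a true reward R
# and a cheap observable feature ("shiny") that the proxy heuristic keys on.
# In the training environments shiny tracks R; in the novel environment the
# correlation flips.

def make_env(decoupled=False):
    actions = []
    for _ in range(5):
        r = random.random()
        shiny = (1.0 - r) if decoupled else r + random.gauss(0, 0.05)
        actions.append({"R": r, "shiny": shiny})
    return actions

def proxy_policy(env):          # heuristic: grab the shiniest thing
    return max(env, key=lambda a: a["shiny"])

def explicit_R_policy(env):     # explicitly evaluates and pursues R
    return max(env, key=lambda a: a["R"])

random.seed(0)
train_envs = [make_env() for _ in range(100)]
novel_env = make_env(decoupled=True)

for name, policy in [("proxy", proxy_policy), ("explicit-R", explicit_R_policy)]:
    train_avg = sum(policy(e)["R"] for e in train_envs) / len(train_envs)
    novel = policy(novel_env)["R"]
    print(f"{name}: train avg R = {train_avg:.2f}, novel env R = {novel:.2f}")
```

The numbers don't matter; the point is only that a fixed proxy has some environment where it decouples from R, while the explicit R-pursuer keeps scoring everywhere.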
Point 2. But why would that agent be shaped to pursue only R, and so strongly that it'll destroy everything else?
This, more or less, also has to do with environment diversity, plus some instrumental convergence.
As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at R (= a high probability of receiving a strong update/correction after this episode ends).
Without knowing when such a circumstance would arise, how can we prepare our agent for this?
We can make it optimize for R strongly, as strongly as it can, in fact. Acquire as many resources as possible, spend them on nothing but R-pursuit, minimize the uncertainty of scoring well at R, and so on.
Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away; with update-strength proportional to how distracting a goal is.
Every missed opportunity to grab resources that can be used for R-pursuit, or a failure to properly optimize a plan for R-pursuit, would eventually lead to scoring badly at R. And so our optimization algorithm would instill a drive to take all such opportunities.
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue R, but to maximize for R's pursuit — at the expense of everything else.
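Here's one way to put toy numbers on this Point-2 intuition (my own sketch; the reserve levels and the lean-episode probability are arbitrary assumptions): episodes occasionally turn resource-starved, and the policy that spent all its slack stockpiling for R-pursuit is the one that eats the smallest expected correction, so greedy updates keep pushing the agent in that direction.

```python
import random

# Toy episodes: with probability p_lean, the episode is resource-starved.
# A "maximizer" policy spends every spare step acquiring resources for
# R-pursuit; a "satisficer" spends some of its slack on other goals.
# The optimizer's correction after an episode is proportional to the
# shortfall in R, so we compare expected correction magnitudes.

def episode_R(reserve, lean):
    need = 0.8 if lean else 0.2          # resources needed to score full R
    return min(1.0, reserve / need)

def expected_correction(reserve, p_lean=0.3, n=100_000):
    total = 0.0
    for _ in range(n):
        lean = random.random() < p_lean
        total += 1.0 - episode_R(reserve, lean)   # shortfall => update strength
    return total / n

random.seed(0)
print("maximizer  (reserve 0.8):", round(expected_correction(0.8), 3))
print("satisficer (reserve 0.4):", round(expected_correction(0.4), 3))
# The satisficer eats a larger expected correction, so the greedy updates
# keep nudging the agent toward the all-in, everything-for-R policy.
```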
What should we take away from this? What should we not take away from this?
- I should probably clarify that I'm not arguing that inner alignment isn't a problem, here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
- I'm not saying that the shard theory is incorrect — as I'd said, I think shard systems are very much a real developmental milestone of agents.
But I do think that we should very strongly expect the SGD to move its agents in the direction of R-optimizing wrapper-minds. Said "movement" would be very complex, a nuanced path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!
And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)
We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.
1. ^ Plus some more fundamental objections to utility-maximization as a framework, which I haven't properly updated on yet, but which (I strongly expect) do not contradict the point I want to make in this post.
2. ^ That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.
I don't think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside the fixed-goal wrapper are responsible for the agent's behavioral capabilities (the actual business logic that carries out stuff like "recall the win conditions from relevantly-similar environments" and "do deductive reasoning" and "don't die") can exist in an agent with a profoundly different highest-level control structure and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO; you need something else in addition, like inductive bias.
Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent's influence. But notice that empirical correlation with R can be increased by the agent from two different directions: the agent can change what it cares about so that that correlates better with what would produce rewards, or the agent can change the way it produces rewards so that that correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)
Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it's already really good at fooling the discriminator on. This is something that happens all the time, under the label of "mode collapse".
The pattern is pretty generalizable. The agent narrows its interaction with the environment in such a way that pushes up the correlation between what the agent "wants" and what it doesn't get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent "wants" and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it'll always happen in autonomous learning setups.
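A toy simulation of that motif (my sketch, not the commenter's; the task counts and skill levels are arbitrary assumptions): a policy that gets to choose which inputs it's graded on can push its measured success up just by retreating to the inputs it already handles well, without improving at all on the full intended distribution.

```python
import random

# The "world" has 10 kinds of task; the grader only sees the tasks the agent
# chooses to attempt. The agent is genuinely good at 3 of them.

random.seed(0)
skill = {k: (0.9 if k < 3 else 0.1) for k in range(10)}

def measured_success(attempted_kinds, n=10_000):
    wins = sum(random.random() < skill[random.choice(attempted_kinds)]
               for _ in range(n))
    return wins / n

def coverage(attempted_kinds):
    # how much of the *intended* distribution the agent still engages with
    return len(attempted_kinds) / len(skill)

broad = list(range(10))    # engages the whole distribution
narrow = [0, 1, 2]         # "mode collapse": only the tasks it already aces

for name, kinds in [("broad", broad), ("narrow", narrow)]:
    print(f"{name}: measured success = {measured_success(kinds):.2f}, "
          f"coverage of intent = {coverage(kinds):.1f}")
# Measured success jumps from ~0.34 to ~0.9 while coverage of what we
# actually wanted drops: no new capability was needed.
```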
AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what "winning" means is different in each environment. The "goal generator" function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent's world model and produces contextually-relevant action recommendations (like "take such-and-such immediate action", or "set such-and-such as the current goal-image"), with this mapping having been learned from past reward events and self-supervised learning.
Not hard-coded heuristics. Heuristics learned through experience. I don't understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously-established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of "win the game" out of distribution, where the rules of winning a game that appears at first glance to be an FPS may in fact be "stand still for 30 seconds", or "gather all the guns into a pile and light it on fire"? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?
In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective independent of (i.e. fixed as a function of) the environment. In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. Especially because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it's not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.
An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.
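For concreteness, here's a minimal sketch (the class names, drives, and payoffs are purely illustrative assumptions) of how those two motivational architectures would react differently to the same out-of-distribution news:

```python
# Two toy motivational architectures reacting to the same OOD event:
# learning that "cloning yourself is now possible and cheap".

class IGFMaximizer:
    """Context-independent goal: maximize inclusive genetic fitness."""
    def act(self, state):
        # Scores every known option by its IGF payoff, so the newly learned
        # cloning option immediately dominates.
        return max(state["options"], key=lambda o: o["igf_payoff"])

class CorrelatePursuer:
    """A bundle of contextual drives (hunger, thirst, parenting, ...)."""
    def act(self, state):
        drives = {"hungry": "eat", "thirsty": "drink", "offspring_nearby": "tend_offspring"}
        for cue, action in drives.items():
            if state.get(cue):
                return {"name": action}
        return {"name": "rest"}           # no drive refers to IGF itself

state = {
    "hungry": True,
    "options": [
        {"name": "eat", "igf_payoff": 1},
        {"name": "clone_self", "igf_payoff": 1000},   # the newly learned fact
    ],
}

print("IGF-maximizer chooses:", IGFMaximizer().act(state)["name"])          # clone_self
print("Correlate-pursuer chooses:", CorrelatePursuer().act(state)["name"])  # eat
```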
Depending on what you mean by OOD, I'm actually not sure if the sort of goal-generator you're describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we're choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.
When I say "decision-relevant factors in the environment" I mean something like seeing that you're in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other "team". Not sure what "context-independent correlate of G" is. Was that my phrase or yours? 🤔
Nah that's pretty similar to what I had in mind.
Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:
I think it's the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.