Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]
I think these arguments have merit, and shard theory's model definitely seems to correspond to a real stage in agents' value formation.
But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.
Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.
We can say that this process optimizes the agent for good performance according to some reward function G, or that it chisels "effective cognition" into that agent according to some rule.
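The "shrinking corrections" dynamic can be sketched with a toy model. This is a minimal illustration, assuming the simplest possible setup (plain gradient descent on a smooth convex loss), not a claim about SGD on real networks: each update moves the parameters closer to what G rewards, so the correction needed at the next step shrinks.

```python
# Toy sketch (assumed setup): gradient descent on the loss (theta - 3)^2.
# Each update leaves the agent needing less correction, so the magnitude
# of every subsequent update decreases -- the dynamic described above.

def grad(theta):
    # Derivative of the toy loss (theta - 3)^2 with respect to theta.
    return 2 * (theta - 3.0)

theta, lr = 0.0, 0.1
update_sizes = []
for _ in range(20):
    g = grad(theta)
    theta -= lr * g
    update_sizes.append(abs(lr * g))

# Update magnitudes shrink monotonically as theta approaches the optimum.
assert all(a > b for a, b in zip(update_sizes, update_sizes[1:]))
```

In this convex toy the decrease is monotonic; for real training runs the claim in the post is only an on-average tendency.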
The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:
- Have an explicit representation of G inside itself, which it would explicitly pursue.
- Pursue only G, at the expense of everything else in the universe.
I'll defend them separately.
Point 1. It's true that explicit G-optimization is suboptimal for many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved" — an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of G.
But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at G in all such cases[2]. How can we do so?
Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task — whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from G.
It seems almost tautologically true, here, that the only way to keep an agent pointed at G given this setup is to explicitly point it at G. Nothing else would do!
Thus, our optimization algorithm would necessarily find a G-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).
Point 2. But why would that agent be shaped to pursue only G, and so strongly that it'll destroy everything else?
This, more or less, also has to do with environment diversity, plus some instrumental convergence.
As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at G (= high probability of receiving a strong update/correction after this episode ends).
Without knowing when such a circumstance would arise, how can we prepare our agent for this?
We can make it optimize for G strongly — as strongly as it can, in fact. Acquire as many resources as possible, spend them on nothing but G-pursuit, minimize the uncertainty of scoring well at G, and so on.
Every goal that isn't G would distract from G-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away; with update-strength proportional to how distracting a goal is.
Every missed opportunity to grab resources that can be used for G-pursuit, or a failure to properly optimize a plan for G-pursuit, would eventually lead to scoring poorly at G. And so our optimization algorithm would instill a drive to take all such opportunities.
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue G, but to maximize G's pursuit — at the expense of everything else.
What should we take away from this? What should we not take away from this?
- I should probably clarify that I'm not arguing that inner alignment isn't a problem, here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
- I'm not saying that the shard theory is incorrect — as I'd said, I think shard systems are very much a real developmental milestone of agents.
But I do think that we should very strongly expect the SGD to move its agents in the direction of G-optimizing wrapper-minds. Said "movement" would be very complex, a nuanced path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!
And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)
We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.
- ^
Plus some more fundamental objections to utility-maximization as a framework, which I haven't properly updated on yet, but which (I strongly expect) do not contradict the point I want to make in this post.
- ^
That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.
Why can't you? The activations from observations coming in from the environment and from the agent's internal state will activate some contextual decision-influences in the agent's mind. Situational unfamiliarity does not mean its mind goes blank, any more than an OOD prompt makes GPT's mind go blank. The agent is gonna think something when it wakes up in an environment, and that something will determine how and when the agent will call upon the heuristic-generator. Maybe it first queries it with a subgoal of "acquire information about my action space" or something, I dunno.
The agent that has a context-independent goal of "win the race" is in a similar predicament: it has no way of knowing a priori what "winning the race" requires or consists of in this unfamiliar environment (neither does its heuristic-generator), no way to ground this floating motivational pointer concretely. It's gotta try stuff out and see what this environment actually rewards, just like everybody else. The agent could have a preexisting desire to pursue whatever "winning the race" looked like in past experiences. But I thought the whole point of this randomization/diversity business was to force the agent to latch onto "win the race" as an exclusive aim and not onto its common correlates, by thrusting the agent into an unfamiliar context each time around. If so, then previous correlates shouldn't be reliable correlates anymore in this new context, right? Or else it can just learn to care about those rather than the goal you intended.
So I don't see how the agent with a context-independent goal has an advantage in this setup when plopped down into an unfamiliar environment.
I agree with this.
Why? I was imagining that the agent may prompt the heuristic-generator at multiple points within a single episode, inputting whatever subgoal it currently needs to generate heuristics for. If the agent is being put in super diverse environments, then these subgoals will be everything under the sun, so the heuristic-generator will have been prompted for lots of things. And if the agent is only being put in a narrow distribution of environments, then how is the heuristic-generator supposed to learn general-purpose heuristic-generation?
Can there be additional layers of "command structure" on top of that? Like, can the agent have arrived at the "reasoning from what will help it win the race" thought by reasoning from something else? (Or is this a fixed part of the architecture?) If not, then won't this have the problem that for a long time, the agent will be terrible at reasoning about what will help it win the race (especially in new environments), which means that starting with that will be a worse-performing strategy than starting with something else (like random exploration etc.)? And then that will disincentivize making this the first/outermost/unconditional function call? So then the agent learns not to unconditionally start with reasoning from that point, and instead to only sometimes reason from that point, conditional on context?
Hmm. I am skeptical of that claim, though maybe less so depending on what exactly you mean[1].
Consider a different claim that seems mechanistically analogous to me:
Yes it is true that [differential reinforcement | relative fitness] is a selection pressure acting on the makeup of [things cared about | traits] across the [circuits | individuals] within an [agent | population], but AFAICT it is not true that the [agent | population] increases in [reward performance | absolute fitness] over the course of continual selection pressure.
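The population half of that bracketed claim has a standard toy illustration (my own sketch, not from the comment): a discrete replicator dynamic on a prisoner's-dilemma-like game. Defectors always have higher *relative* fitness, so selection drives their share up — yet the population's *absolute* mean payoff falls the whole way. Selection pressure on relative fitness need not improve the absolute criterion.

```python
# Toy replicator dynamic: selection on relative fitness drives defection to
# fixation while the population's mean absolute payoff declines.

def step(p_defect):
    # Payoffs: C vs C -> 3, C vs D -> 0, D vs C -> 5, D vs D -> 1.
    f_coop = 3 * (1 - p_defect)
    f_def = 5 * (1 - p_defect) + 1 * p_defect
    mean = (1 - p_defect) * f_coop + p_defect * f_def
    # Replicator update: each strategy's share grows with its relative fitness.
    return p_defect * f_def / mean, mean

p, payoffs = 0.1, []
for _ in range(30):
    p, mean = step(p)
    payoffs.append(mean)

print(round(p, 3))                # 1.0 -- defectors take over...
print(payoffs[0] > payoffs[-1])   # True -- ...while mean absolute payoff fell
```

Whether the analogous statement holds for circuits within an agent under differential reinforcement is exactly what's at issue in this thread; the toy only shows the mechanism is coherent on the population side.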
Yeah that may be a part of where our mental models differ. I don't expect the balance of how much power the agent has over training vs. how close its goals are to the outer criterion to go in lockstep. I see "deceptive alignment" as part of a smooth continuum of agent-induced selection that can decouple the agent's concerns from the optimization process' criteria, with "the agent's exploration is broken" as a label for the cognitively less sophisticated end of that continuum, and "deceptive alignment" as a label for the cognitively more sophisticated end. And I think that even the not-explicitly-intended pressures at the unsophisticated end of that continuum are quite strong, enough to make the "the agent tends to be shaped to care about increasingly closer correlates of G" abstraction leak hard.
EDIT: Moved some stuff into a footnote.
Like, for a given training run, as the training run progresses, the agent will be shaped to care about closer and closer correlates of G? (Just closer on average? Monotonically closer? What about converging at some non-G correlate?) Or like, among a bunch of training runs, as the training runs progress, the closeness of the [[maximally close to G] correlate that any agent cares about] to G keeps increasing?