Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]
I think these arguments have merit, and shard theory's model definitely seems to correspond to a real stage in agents' value formation.
But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.
Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.
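A minimal sketch of that dynamic, using plain gradient descent on a toy quadratic loss (the loss, learning rate, and parameter names are illustrative assumptions, not anything specific from this post):

```python
import numpy as np

# Toy "greedy" update process: gradient descent on an illustrative
# quadratic loss. The point: as training proceeds, the corrections the
# outer process needs to make tend to shrink.
rng = np.random.default_rng(0)
theta = rng.normal(size=8)      # the agent's parameters
lr = 0.3

for step in range(10):
    grad = 2 * theta            # gradient of the toy loss ||theta||^2
    update = -lr * grad
    theta = theta + update
    print(f"step {step}: update magnitude = {np.linalg.norm(update):.4f}")

# Each printed magnitude is smaller than the last: every update leaves
# less for subsequent updates to correct.
```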
We can say that this process optimizes the agent for good performance according to some reward function R, or that it chisels "effective cognition" into that agent according to some rule.
The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:
- Have an explicit representation of R inside itself, which it would explicitly pursue.
- Pursue only R, at the expense of everything else in the universe.
I'll defend them separately.
Point 1. It's true that explicit R-optimization is suboptimal for many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved" — an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of R.
But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at R in all such cases[2]. How can we do so?
Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task — whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from .
It seems almost tautologically true, here, that the only way to keep an agent pointed at R given this setup is to explicitly point it at R. Nothing else would do!
Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).
Point 2. But why would that agent be shaped to pursue only R, and so strongly that it'll destroy everything else?
This, more or less, also has to do with environment diversity, plus some instrumental convergence.
As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at R (= a high probability of receiving a strong update/correction after this episode ends).
Without knowing when such a circumstance would arise, how can we prepare our agent for this?
We can make it optimize for R strongly, as strongly as it can, in fact: acquire as many resources as possible, spend them on nothing but R-pursuit, minimize the uncertainty of scoring well at R, and so on.
Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away; with update-strength proportional to how distracting a goal is.
Every missed opportunity to grab resources that can be used for R-pursuit, or a failure to properly optimize a plan for R-pursuit, would eventually lead to scoring badly at R. And so our optimization algorithm would instill a drive to take all such opportunities.
Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue R, but to maximize R's pursuit — at the expense of everything else.
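A toy sketch of the "update-strength proportional to distraction" claim above (the budget split and the linear correction rule are my own illustrative assumptions):

```python
# Toy sketch: an agent splits a fixed resource budget between R-pursuit
# and a side goal. A greedy outer process corrects in proportion to the
# shortfall in R-performance, so the more a goal distracts from R, the
# stronger the update against it. Numbers and the linear rule are illustrative.
budget = 1.0

for side_goal_fraction in (0.0, 0.1, 0.3, 0.6):
    r_score = budget * (1.0 - side_goal_fraction)  # resources actually spent on R
    shortfall = budget - r_score                   # distance from optimal R-pursuit
    update_strength = shortfall                    # update strength proportional to distraction
    print(f"diverted {side_goal_fraction:.1f} of budget -> "
          f"R-score {r_score:.2f}, update strength {update_strength:.2f}")
```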
What should we take away from this? What should we not take away from this?
- I should probably clarify that I'm not arguing that inner alignment isn't a problem here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
- I'm not saying that shard theory is incorrect — as I said, I think shard systems are very much a real developmental milestone of agents.
But I do think that we should very strongly expect the SGD to move its agents in the direction of R-optimizing wrapper-minds. Said "movement" would be a very complex, nuanced, path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!
And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)
We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.
- ^
Plus some more fundamental objections to utility-maximization as a framework, which I haven't properly updated on yet, but which (I strongly expect) do not contradict the point I want to make in this post.
- ^
That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.
Doesn't this sound weird to you? I don't think of the chisel itself as being "shaped around" the intended form; rather, the chisel is a tool used to shape the statue so that the statue reflects that form. The chisel does not need to be shaped like the intended form for this to work! Recall that the reinforcement schedule is not a pure function of the reward/loss calculator; it is a function of both that and the way the policy behaves over training (the thing I was describing earlier as "The agent and its choices/computations"), which means that if we only specify the outer objective R, there may be no fact of the matter about which goal is "implied" as its natural/coherent extrapolation. It's a 2-place function and we've only provided 1 argument so far.
I get your point on some vibe-level. Like, humans and other animal agents can often infer what goal another agent is trying to communicate. For instance, when I'm training a dog to sit and I keep rewarding it whenever it sits but not when it lies down or stands, we can talk about how it is contextually "implied" that the dog should sit. But most of what makes this work is not that I used a reward criterion that sharply approximates some idealized sitting recognition function (it does need to bear some nonzero relation to sitting); most of the work is done by the close fit between the dog's current behavioral repertoire and the behavior I want to train, by the fact that the dog itself is already motivated to test out different behaviors because it likes my doggie treats, and by the way in which I use rewards as a tool to create a behavior-shaping positive feedback loop.
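A minimal sketch of the 2-place-function point, using the dog example (the function names and toy "sit" reward are mine, not the commenter's):

```python
# Sketch of the "2-place function" point: the reinforcement schedule the
# agent actually experiences depends on BOTH the reward calculator and
# the behavior the policy happens to produce during training.
# Function names and the toy "sit" reward are illustrative.

def reinforcement_schedule(reward_fn, trajectory):
    """Reward events actually experienced along one training trajectory."""
    return [reward_fn(behavior) for behavior in trajectory]

def reward_fn(behavior):
    return 1.0 if behavior == "sit" else 0.0

# Same reward function, two different dogs-in-training:
dog_that_tries_sitting = ["stand", "sit", "sit", "lie down", "sit"]
dog_that_never_sits    = ["stand", "lie down", "stand", "lie down", "lie down"]

print(reinforcement_schedule(reward_fn, dog_that_tries_sitting))  # [0.0, 1.0, 1.0, 0.0, 1.0]
print(reinforcement_schedule(reward_fn, dog_that_never_sits))     # [0.0, 0.0, 0.0, 0.0, 0.0]

# Fixing reward_fn supplies only one of the two arguments; what actually
# gets reinforced is underdetermined until the policy's own behavior
# supplies the other.
```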
In practice I agree (I think; not quite sure if I get the disjunction bit). That is one reason I expect agents not to want to reconfigure themselves into wrapper-minds: the agent has settled on many different overlapping goals, all of which it endorses, and those goals don't form a total preorder over outcomes that it could become a wrapper-mind pursuing.
I agree with this. For modern humans, I would say that this is provided by our evolutionary history + our many years of individual cognitive development + our schooling.
This is where I step off the train. It is not true that the only (or even the most likely) way for creativity to arise is for that creativity to be directed towards the selection criterion or to point towards the intended goal. It is not true that the only way for useful creativity to be recognized is by us. Creativity can be recognized by the agent as useful for its own goals, because the agent is an active participant in shaping the course of training. For anything that the agent might currently want, learning creativity is instrumentally valuable, and the benefits of creative heuristic-generation should transfer well between doing well according to its own aims and doing well by the outer optimization process' criteria. Just like the benefits of creative heuristic-generation transfer well between problem solving in the savannah, problem solving in the elementary classroom, and problem solving in the workplace, because there is common structure shared between them (i.e. the world is lawful). I expect that just like humans, the agent will be improving its heuristic-generator across all sorts of (sub)goals for all sorts of reasons, leading to very generalized machinery for problem-solving in the world.
No, I think this is wrong as I understand it (as is the similar content in the closing paragraphs). The form of this argument looks like "X produces more Y, and more Y produces more Z, therefore AGI will have X", with X = heuristic-generators pointed terminally at the outer criterion G, Y = reinforcement of the heuristic-generator, and Z = general capability.
You need to claim something like "Y is required in order to produce sufficient Z for AGI", not just that it produces additional Z. And I don't buy that that's the case. But also, I actually disagree with the premise that agents whose heuristic-generators are pointed merely instrumentally at G will have less reinforced/worse heuristic-generators than ones whose heuristic-generators are pointed terminally at G. IMO, learning the strategies that enable flexibly navigating the world is convergently useful and discoverable, in a way that is mostly orthogonal to whether or not the agent is pursuing the outer selection criterion.
Reinforcement and selection of behaviors/cognition do not just come from the outer optimizer; they also come from the agent itself. That's what I was hoping to communicate with the 3 examples.
EDIT: I should add that I agree that, all else equal, the factors you listed in the section below are relevant to joint alignment + capability success: