Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]
I think these arguments have merit, and shard theory's model definitely seems to correspond to a real stage in agents' value formation.
But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.
Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.
We can say that this process optimizes the agent for good performance according to some reward function R, or that it chisels "effective cognition" into that agent according to some rule.
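Here's a minimal toy sketch of that dynamic (everything in it is an invented stand-in: a single parameter vector for the agent, and a quadratic `toy_reward` for R):

```python
# Toy illustration: a greedy optimizer "chisels" an agent toward whatever R reinforces.
# All names (target, agent, toy_reward) and the 0.2 step size are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=8)   # the behavior R reinforces; unknown to the agent itself
agent = np.zeros(8)           # the agent's current parameters

def toy_reward(params):
    """Higher when the agent's behavior matches what gets reinforced."""
    return -np.sum((params - target) ** 2)

for episode in range(10):
    # Greedy correction: nudge the agent in whatever direction increases R right now.
    update = 0.2 * (-2 * (agent - target))
    agent += update
    # The better the agent already scores, the smaller the correction it receives.
    print(f"episode {episode}: reward = {toy_reward(agent):.3f}, "
          f"update magnitude = {np.linalg.norm(update):.3f}")
```

The point is just the shrinking update magnitudes: the agent ends up shaped such that the optimizer has almost nothing left to correct.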
The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:
- Have an explicit representation of R inside itself, which it would explicitly pursue.
- Pursue only R, at the expense of everything else in the universe.
I'll defend them separately.
Point 1. It's true that explicit R-optimization is suboptimal in many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved" — an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of R.
But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at R in all such cases[2]. How can we do so?
Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task — whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from R.
It seems almost tautologically true, here, that the only way to keep an agent pointed at R given this setup is to explicitly point it at R. Nothing else would do!
Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).
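A toy sketch of that decoupling claim, with invented environments (the "shiny object" proxy and the reward function below are mine, not anything from a real benchmark): a shallow heuristic tracks R perfectly across the training environments, but breaks in a novel one, while an explicit R-pursuer keeps scoring well.

```python
# Illustrative only: a proxy heuristic vs. an explicit R-pursuer across environments.
def R(state, env):
    """The true reward: distance to this environment's own win state."""
    return -abs(state - env["win_state"])

train_envs = [{"win_state": 10, "shiny_object": 10},  # "go to the shiny object"
              {"win_state": 7,  "shiny_object": 7}]   # matches R in training...
novel_env  =  {"win_state": 3,  "shiny_object": 12}   # ...but not here

def heuristic_agent(env):
    # Fast, shallow, environment-optimized rule learned during training.
    return env["shiny_object"]

def r_pursuer(env):
    # Explicitly searches for the state that maximizes R in the current environment.
    return max(range(15), key=lambda s: R(s, env))

for env in train_envs + [novel_env]:
    print(f"heuristic scores {R(heuristic_agent(env), env):3d}, "
          f"R-pursuer scores {R(r_pursuer(env), env):3d}")
```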
Point 2. But why would that agent be shaped to pursue only R, and so strongly that it'll destroy everything else?
This, more or less, also has to do with environment diversity, plus some instrumental convergence.
As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at R (= a high probability of receiving a strong update/correction after this episode ends).
Without knowing when such a circumstance would arise, how can we prepare our agent for this?
We can make it optimize for R strongly — as strongly as it can, in fact. Acquire as many resources as possible, spend them on nothing but R-pursuit, minimize the uncertainty of scoring well at R, and so on.
Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away, with update-strength proportional to how distracting a goal is.
Every missed opportunity to grab resources that could be used for R-pursuit, or a failure to properly optimize a plan for R-pursuit, would eventually lead to scoring badly at R. And so our optimization algorithm would instill a drive to take all such opportunities.
Thus, any greedy optimization algorithm would convergently shape its agent not only to pursue R, but to maximize R-pursuit — at the expense of everything else.
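As a cartoon of that dynamic (the numbers, the "side goal", and the resource-splitting setup below are all invented for illustration): give the agent a fixed budget split between R-pursuit and anything else, and let the greedy optimizer correct it in proportion to the reward the "anything else" cost this episode. The non-R goal gets updated away, at a rate set by exactly how distracting it is.

```python
# Illustrative only: goals other than R get updated away in proportion to their cost.
side_goal_weight = 0.5   # fraction of resources currently spent on non-R pursuits
learning_rate = 0.5

for episode in range(8):
    r_score = 1.0 - side_goal_weight     # resources spent elsewhere are lost to R
    shortfall = 1.0 - r_score            # how much correction this episode merits
    # Update strength is proportional to how distracting the side goal was.
    side_goal_weight = max(side_goal_weight - learning_rate * shortfall, 0.0)
    print(f"episode {episode}: R-score = {r_score:.3f}, "
          f"side-goal weight -> {side_goal_weight:.3f}")
```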
What should we take away from this? What should we not take away from this?
- I should probably clarify that I'm not arguing that inner alignment isn't a problem, here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
- I'm not saying that the shard theory is incorrect — as I'd said, I think shard systems are very much a real developmental milestone of agents.
But I do think that we should very strongly expect the SGD to move its agents in the direction of R-optimizing wrapper-minds. Said "movement" would be very complex, a nuanced path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!
And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)
We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.
[1] Plus some more fundamental objections to utility-maximization as a framework, which I haven't properly updated on yet, but which (I strongly expect) do not contradict the point I want to make in this post.
[2] That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.
Thanks!
Okay, suppose we have a "chisel" that's more-or-less correctly shaped around some goal G that's easy to describe in terms of natural abstractions. In CoastRunners, it would be "win the race"[1]; with MuZero, "win the game"; with GPT-N, something like "infer the current scenario and simulate it" or "pretend to be this person". I'd like to clarify that this is what I meant by R — I didn't mean that in the limit of perfect training, agents would become wireheads; I meant they'd be correctly aligned to the natural goal G implied by the reinforcement schedule.
The "easiness of description" of G in terms of natural abstractions is an important variable. Some reinforcement schedules can be very incoherent, e.g. rewarding winning the race in some scenarios and punishing it in others, purely based on the presence/absence of some random features in each scenario. In this case, the shortest description of the reinforcement schedule is just "the reinforcement function itself" — that would be the implied G.
It's not completely unrealistic, either — the human reward circuitry is varied enough that hedonism is a not-too-terrible description of the implied goal. But it's not a central example in my mind. Inasmuch as there's some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of distinct natural goals G1 ∧ G2 ∧ ... ∧ Gn implicit in the reinforcement schedule.
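To make the "easiness of description" point concrete, here's a toy comparison (both reward functions below are invented, and not a claim about any real training setup): a coherent schedule compresses down to a short natural goal G, while an incoherent one, which flips the reward's sign based on an arbitrary feature, has no shorter faithful description than the function itself.

```python
# Illustrative only: a coherent vs. an incoherent reinforcement schedule.
import random

random.seed(0)
scenarios = [{"id": i, "arbitrary_feature": random.random() < 0.5} for i in range(6)]

def coherent_R(scenario, won):
    # Compressible: "reward iff you won".  The implied G is just "win the race".
    return 1.0 if won else 0.0

def incoherent_R(scenario, won):
    # Rewards winning in some scenarios and punishes it in others, based on an
    # arbitrary feature.  The only faithful summary is the function itself.
    return (1.0 if won else 0.0) * (1.0 if scenario["arbitrary_feature"] else -1.0)

for s in scenarios:
    print(f'scenario {s["id"]}: coherent R = {coherent_R(s, True):+.1f}, '
          f'incoherent R = {incoherent_R(s, True):+.1f}')
```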
Now, to get to AGI, we need autonomy. We need a training setup which will build a heuristics generator into the AGI, and then improve that heuristics generator until it has a lot of flexible capability. That means, essentially, introducing the AGI to scenarios it's never encountered before[2], and somehow shaping it to pass them on the first try (= for it to do something that will get reinforced).
As a CoastRunners example, consider scenarios where the race is suddenly in 3D, or in space and the "ship" is a spaceship, or the AGI is exposed to the realistic controls of the ship instead of WASD, or it needs to "win the race" by designing the fastest ship instead of actually racing, or it's not the pilot but it wins by training the most competent pilot, or there are a lot of weird rules to the race now, or the win condition is weird, et cetera.
Inasmuch as the heuristics generator is aligned with the implicit goal G, we'll get an agent that looks at the context, infers what it means to "win the race" here and what it needs to do to win the race, then starts directly optimizing for that. This is what we "want" our training to result in.
In this, we can be more or less successful along various dimensions:
Thus, there's a correlated cluster of training parameters that increases our chances of getting an AGI: we have to put it in varied, highly adversarial scenarios to make creativity/autonomy necessary; we have to ramp up its "curiosity" to ensure it can invent creative solutions and be autonomous; and, to properly reinforce all of this (and not just random behavior), we have to have a highly coherent credit assignment system that's able to somehow recognize the instrumental value of weird creativity and reinforce it more than random loitering around.
To get to AGI, we need a training process that's focused on improving the heuristics-generating machinery.
And since creativity is, by its nature, weird, we can't just have a "reinforce creativity" function. We'd need some way of recognizing useful creativity, which means identifying it as useful to something; and as far as I can tell, that something can only be G. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule's coherency — inasmuch as R is well-described as shaped around G, it should reinforce (and not fail to reinforce) weird creativity that promotes G! Thus, we get a credit assignment system that effectively cultivates the features that'd lead to AGI (an increasingly advanced heuristics generator), but at the "cost" of making those features accurately pointed at G[3].
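A tiny sketch of that contrast (the episode data and both scoring rules are invented for illustration): a bare "reinforce creativity" rule can't tell useful weirdness apart from random loitering, whereas credit assignment shaped around G reinforces exactly the creative behavior that promotes G.

```python
# Illustrative only: novelty-based vs. G-based credit assignment.
episodes = [
    {"behavior": "loiter randomly",         "novelty": 0.9, "progress_on_G": 0.0},
    {"behavior": "replay old WASD habits",  "novelty": 0.1, "progress_on_G": 0.3},
    {"behavior": "weird creative shortcut", "novelty": 0.9, "progress_on_G": 0.9},
]

def novelty_credit(episode):
    # Naive "reinforce creativity" rule: can't distinguish useful weirdness from noise.
    return episode["novelty"]

def g_shaped_credit(episode):
    # Coherent rule: reinforcement tracks progress on G, so the weird-but-useful
    # behavior gets reinforced more than loitering, and gets cultivated over time.
    return episode["progress_on_G"]

for ep in episodes:
    print(f'{ep["behavior"]:<24} novelty-based = {novelty_credit(ep):.1f}, '
          f'G-based = {g_shaped_credit(ep):.1f}')
```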
And these, incidentally, are exactly the parameters needed to make the training setup more "idealized": strictly specify G, build it into the agent, try to update away mesa-objectives that aren't G, make it optimize for G strongly, etc.
In practice, we'll fall short of this ideal: we'll fail to introduce enough variance to uniquely specify winning, we'll reinforce upstream correlates of winning and end up with an AGI that values lots of things upstream of winning, we'll fail to have enough adversity to counterbalance this and update its other goals away, and we won't get a perfect exploratory policy that always converges towards the actions R would reinforce the most.
But a training process' ability to result in an AGI is anti-correlated with its distance from the aforementioned ideal.
Thus, inasmuch as we're successful in setting up a training process that results in an AGI, we'll end up with an agent that's some approximation of a G-maximizing wrapper-mind.
[1] Actually, no, apparently it's "smash into specific objects". How did they expect anything else to happen? Okay, but let's pretend I'm talking about some more clearly set up version of CoastRunners, in which the simplest description of the reinforcement schedule is "when you win the race".
[2] More specifically, to scenarios it doesn't have a ready-made suite of shallow heuristics for solving. It may be because the scenario is completely novel, or because the AGI did encounter it before, but it was long ago, and it got pushed out of its limited memory by more recent scenarios.
[3] To rephrase a bit: the heuristics generator will be reinforced more if it's pointed at G, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at G, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can't robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn't reinforce the heuristics generator a lot, and doesn't preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn't cultivate the capabilities needed for AGI, and therefore doesn't result in AGI.