Recently, there's been a strong push against "wrapper-minds" as a framework. It's argued that there's no specific reason to think that all sufficiently advanced agents would format their goals in terms of expected-utility maximization over future trajectories, and that this view predicts severe problems with e.g. Goodharting that just wouldn't show up in reality.[1]

I think these arguments have merit, and Shard Theory's model definitely seems to correspond to a real stage in agents' value formation.

But I'd like to offer a fairly prosaic argument in favor of wrapper-minds.


Suppose that we have some agent which is being updated by some greedy optimization process (the SGD, evolution, etc.). On average, updates tend to decrease the magnitude of every subsequent update — with each update, the agent requires less and less correction.

We can say that this process optimizes the agent for good performance according to some reward function R, or that it chisels "effective cognition" into that agent according to some rule.
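As a toy illustration of this setup (not part of the original argument; the function name, learning rate, and target are all assumptions made for the example), plain gradient descent on a one-parameter squared-error loss is a greedy local-update process in which each correction is smaller than the one before. The claim above is only that this holds on average; the sketch shows the cleanest special case.

```python
def greedy_updates(theta: float, target: float, lr: float = 0.3, steps: int = 8) -> list[float]:
    """Run gradient descent on (theta - target)**2 and record the size of each update."""
    magnitudes = []
    for _ in range(steps):
        grad = 2.0 * (theta - target)  # gradient of the squared error
        step = -lr * grad              # greedy local correction
        theta += step
        magnitudes.append(abs(step))
    return magnitudes


# Hypothetical starting point and target, chosen only for illustration.
magnitudes = greedy_updates(theta=0.0, target=1.0)
print([round(m, 4) for m in magnitudes])
```

In this convex case the shrinkage is monotone; a real training run would be noisier, which is why the text says "on average".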

The wrapper-mind argument states that any "sufficiently strong" agent found by this process would:

  1. Have an explicit representation of R inside itself, which it would explicitly pursue.
  2. Pursue only R, at the expense of everything else in the universe.

I'll defend them separately.

Point 1. It's true that explicit R-optimization is suboptimal for many contexts. Consequentialism is slow, and shallow environment-optimized heuristics often perform just as well while being much faster. Other environments can be just "solved": an arithmetic calculator doesn't need to be a psychotic universe-eater to do its job correctly. And for more complex environments, we can have shard economies, whose collective goals, taken in sum, would be a strong proxy of R.

But suppose that the agent's training environment is very complex and very diverse indeed. Or, equivalently, that it sometimes jumps between many very different and complex environments, and sometimes ends up in entirely novel, never-before-seen situations. We would still want it to do well at R in all such cases[2]. How can we do so?

Just "solving" environments, as with arithmetic, may be impossible or computationally intractable. Systems of heuristics or shard economies also wouldn't be up to the task: whatever proxy goal they're optimizing, there'd be at least one environment where it decouples from R.
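A minimal sketch of what "decoupling" means here, assuming a made-up set of environments in which each action carries both a true reward R and a proxy score. The environments, action names, and numbers are hypothetical; the sketch shows one environment where a proxy that was perfect in training comes apart from R, and does not prove that such an environment always exists.

```python
# Each environment maps an action to (true reward R, proxy score).
# All names and numbers are invented for illustration.
TRAIN_ENVS = [
    {"pick_apple": (1.0, 1.0), "idle": (0.0, 0.0)},
    {"pick_apple": (1.0, 1.0), "pick_rock": (0.0, 0.0)},
]
NOVEL_ENV = {"pick_apple": (1.0, 1.0), "pick_red_ball": (0.0, 2.0)}


def best_action(env: dict, index: int) -> str:
    """Return the action maximizing R (index 0) or the proxy (index 1)."""
    return max(env, key=lambda action: env[action][index])


def proxy_regret(env: dict) -> float:
    """True reward lost by following the proxy instead of R in this environment."""
    r_action = best_action(env, 0)
    proxy_action = best_action(env, 1)
    return env[r_action][0] - env[proxy_action][0]


train_regrets = [proxy_regret(env) for env in TRAIN_ENVS]
novel_regret = proxy_regret(NOVEL_ENV)
print(train_regrets, novel_regret)
```

On the training environments the proxy-follower loses nothing relative to an R-pursuer, so the two are indistinguishable there; the gap only appears in the novel environment.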

It seems almost tautologically true, here, that the only way to keep an agent pointed at R given this setup is to explicitly point it at R. Nothing else would do!

Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).

Point 2. But why would that agent be shaped to pursue only R, and so strongly that it'll destroy everything else?

This, more or less, also has to do with environment diversity, plus some instrumental convergence.

As the optimization algorithm is shaping our agent, the agent will be placed in environments where it has precious few resources, or a low probability of scoring well at R (= high probability of receiving a strong update/correction after this episode ends).

Without knowing when such a circumstance would arise, how can we prepare our agent for this?

We can make it optimize for R strongly, as strongly as it can, in fact. Acquire as many resources as possible, spend them on nothing but R-pursuit, minimize uncertainty of scoring well at R, and so on.

Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away; with update-strength proportional to how distracting a goal is.

Every missed opportunity to grab resources that can be used for R-pursuit, or a failure to properly optimize a plan for R-pursuit, would eventually lead to scoring badly at R. And so our optimization algorithm would instill a drive to take all such opportunities.

Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue R, but to maximize for R's pursuit, at the expense of everything else.


What should we take away from this? What should we not take away from this?

  • I should probably clarify that I'm not arguing that inner alignment isn't a problem, here. Aligning a wrapper-mind to a given goal is a very difficult task, and one I expect "blind" algorithms like the SGD to fail horribly at.
  • I'm not saying that the shard theory is incorrect — as I'd said, I think shard systems are very much a real developmental milestone of agents.

But I do think that we should very strongly expect the SGD to move its agents in the direction of R-optimizing wrapper-minds. Said "movement" would be very complex, a nuanced path-dependent process that might lead to surprising end-points, or (as with humans) might terminate at a halfway point. But it'd still be movement in that direction!

And note the fundamental reasons behind this. It isn't because wrapper-mind behavior is convergent for any intelligent entity. Rather, it's a straightforward consequence of every known process for generating intelligent entities — the paradigm of local updates according to some outer function. Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update. That's why. (In a way, it's because greedy optimization processes are themselves goal-obsessed wrappers.)

We wouldn't get clean wrapper-minds out of all of this, no. But they, and concerns related to them, still merit central attention.

  1. ^

    Plus some more fundamental objections to utility-maximization as a framework, on which I haven't properly updated yet, but which (I strongly expect) do not contradict the point I want to make in this post.

  2. ^

    That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.

cfoster0:

Yeah I disagree pretty strongly with this, though I am also somewhat confused what the points under contention are.

I think that there are two questions that are separated in my mind but not in this post:

  1. What will the motivational structure of the agent that a training process produces be? (a wrapper-mind? a reflex agent? a bundle of competing control loops? a hierarchy of subagents?)
  2. What will the agent that a training process produces be motivated towards? (the literal selection criterion? a random correlate of the selection criterion? a bunch of correlates of the selection criterion and correlates of those correlates? something else? not enough information to tell?)

As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer, so the optimization algorithm cannot distinguish it from an R-pursuer, so selection pressure arguments like the ones in this post can't establish that we'll get one over the other. That's an argument about what the agent will care about, holding the structure fixed.

I simultaneously think:

  1. We should not be assuming that wrapper-minds are a natural or privileged structure for cognition. AFAICT this post doesn't even try to argue for this, saying instead "It isn't because wrapper-mind behavior is convergent for any intelligent entity."
  2. Even conditioning on getting a wrapper-mind from the training process, we should not expect it to necessarily pursue R as its goal. AFAICT the post is arguing against this.

Thus, our optimization algorithm would necessarily find an R-pursuer, if it optimizes an agent for good performance across a sufficiently diverse (set of) environment(s).

Every goal that isn't R would distract from R-pursuit, and therefore lead to failure at some point, and so our optimization algorithm would eventually update such goals away; with update-strength proportional to how distracting a goal is.

What does this mean? I can easily imagine training trajectories where we get an agent (even a highly competent, goal directed one) that is not an R-pursuer, much less an R wrapper-mind, even though we "selected for R" throughout training. I expect that in such a scenario you would reply that the environments must not have been sufficiently diverse, or that the optimization algorithm hasn't updated away that goal yet, or that our optimization algorithm is too weak/dumb, or that we did not select hard enough for R, so the counterexample therefore doesn't count. But if so then I'm at a loss, because it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer". Only tautologically true and not anticipation-constraining.

Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update.

Becoming an R-pursuer isn't the only way to get a minimal update.

If the agent stops exploration, or systematically avoids rewards, or breaks out of the training process entirely, etc. that would also be minimally updated, and none of those require being an R-pursuer! So our search for mind-designs turns up all sorts of agents that pursue all sorts of things.

As an example, you could have a wrapper-mind that cares about some correlate of R but not R itself. If it is smart, such an agent can navigate the selection process just as well as an R-pursuer

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

I suppose you can instead reframe this post as making a claim about target behavior, not structure. But I don't see how you can keep an agent robustly pointed at R under sufficient diversity without making its outer loop pointed at R, so the claim about behavior is a claim about structure.

Maybe the outer loop doesn't "literally" point at R, in whatever sense, but it has to be such that it uniquely identifies R and re-aims the entire agent at R, if it ever happens that the agent's current set of shards/heuristics becomes misaligned with R.

Even conditioning on getting a wrapper-mind from the training process, we should not expect it to necessarily pursue R as its goal. AFAICT the post is arguing against this.

No? I specifically point out that inner misalignment is very much an issue. But the target should at least be a proxy of R, and that proxy would be closer and closer to R in goal-space the more diverse the training environment is.

it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer"

Well, yes. As we increase a training environment's diversity, we essentially constrain the set of R an agent can be pointed towards. Every additional training scenario is information about what R is and what it isn't; and that information implicitly gets written into the agent, modifying it to be more robustly pointed at R and away from not-R/imperfect proxies of R. An idealized training process, with "full" diversity and trained to zero loss, uniquely identifies R and generates an agent that is always robustly pointed at R in any situation.

The actual training processes we get are only approximations of that ideal: they're insufficiently diverse, or we fail to train to zero loss, etc. But inasmuch as they approximate the ideal, the agents they output approximate the idealized R-optimizer.
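The informational half of this claim can be sketched as simple candidate elimination, under the assumption (mine, for illustration) that there is a small finite set of candidate reward functions over integer states. The sketch only shows that more diverse observations leave fewer candidates consistent with the feedback; it says nothing about whether that information ends up in an agent's goals rather than merely in its world model.

```python
# Hypothetical candidate reward functions; the "true" R is taken to be the identity.
CANDIDATES = {
    "x": lambda x: x,
    "abs(x)": lambda x: abs(x),
    "x**2": lambda x: x * x,
    "x**3": lambda x: x ** 3,
}
TRUE_R = CANDIDATES["x"]


def consistent(states: list[int]) -> list[str]:
    """Names of candidates that agree with the true R on every observed state."""
    return sorted(
        name
        for name, reward in CANDIDATES.items()
        if all(reward(s) == TRUE_R(s) for s in states)
    )


narrow = consistent([0, 1])          # low-diversity training set
wider = consistent([0, 1, 2])        # one extra scenario
widest = consistent([0, 1, 2, -1])   # adds a qualitatively different scenario
print(narrow, wider, widest)
```

With only the states 0 and 1 observed, all four candidates fit the feedback; adding 2 rules out the square and the cube, and adding -1 leaves the identity alone.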

... By figuring out what R is and deciding to act as an R-pursuing wrapper-mind, therefore essentially becoming an R-pursuing wrapper-mind. With the only differences being that it 1) self-modified into one at runtime, instead of being like this from the start, and 2) it'd decide to "stop pretending" in some hypothetical set of situations/OOD, but that set will shrink the more diverse our training environment is (the fewer OOD situations there are). No?

It is not essentially an R-pursuing wrapper-mind. It is essentially an X-pursuing wrapper-mind that will only instrumentally pretend to care about R to the degree it needs to, and that will try with all its might to get what it actually wants, R be damned. As you note in 2, the agent's behavioral alignment to R is entirely superficial, and thus entirely deceptive/unreliable, even if we had somehow managed to craft the "perfect" R.

Part of what might've confused me reading the title and body of this post is that, as I understand the term, "wrapper-mind" was and is primarily about structure, about how the agent makes decisions. Why am I so focused on motivational structure, even beyond that, rather than focused on observed behavior during training? Because motivational structure is what determines how an agent's behavior generalizes, whereas OOD generalization is left underspecified if we only condition on an agent's observed in-distribution behavior. (There are many different profiles of OOD behavior compatible with the same observed ID behavior, so we need some additional rationale on top—like structure or inductive biases—to conclude the agent will generalize in some particular way.)

In the above quote it sounds like your response is "just make everything in-distribution", right? My reply to that would be that (1) this is just refusing to confront the central difficulty of generalization rather than addressing it, (2) this seems impractical/impossible because OOD is a practically unbounded space whereas at any given point in training you've only given the agent feedback on a comparatively tiny region of it, and (3) even to make only the situations you encounter in practice be in-distribution, you [the training process designer] must know what sorts of OOD contexts the AI will push the training process into, which means it's your cleverness pitted against the AI's, which is a situation you never want to be in if you can at all help it (see: cognitive uncontainability, non-adversarial principle).

I suppose you can instead reframe this post as making a claim about target behavior, not structure.

As above, I think if you want to argue for wrapper-minds rather than just R-consistent behavior, you need to argue about structure.

But I don't see how you can keep an agent robustly pointed at R under sufficient diversity without making its outer loop pointed at R, so the claim about behavior is a claim about structure.

Maybe the outer loop doesn't "literally" point at R, in whatever sense, but it has to be such that it uniquely identifies R and re-aims the entire agent at R, if it ever happens that the agent's current set of shards/heuristics becomes misaligned with R.

What outer loop are you talking about? The outer optimization loop that is supplying feedback/gradients to the agent, or some "outer loop" of decision-making inside the agent? If the former, I don't know what robustly pointing at R actually means, but if you mean something like finding a robust grader, I suspect that robustly pointing at R is infeasible and not required (whereas I think, for instance, it is feasible to get an AI to have a concept of a "diamond" as full-fledged as a human jeweler's concept & to get the AI to be motivated to pursue those). If the latter, whether the agent will have a fixed goal outer loop in the first place is part of the whole wrapper-mind vs. non wrapper-mind debate.

I specifically point out that inner misalignment is very much an issue. But the target should at least be a proxy of R, and that proxy would be closer and closer to R in goal-space the more diverse the training environment is.

Not sure how to reconcile these sentences. If it is generically true that the proxy goal gets closer and closer to R in goal-space the more diverse the training environment is, then that would mean that the inner alignment problem (misalignment between the internalized goal and R) asymptotically disappears as we increase training environment diversity, no? I don't buy that, or at least I don't think we have strong reasons to assume it.

Even if we did, I don't think we can additionally assume that that environmental-diversity-limit where inner misalignment would disappear is at some attainable/decision-relevant level, rather than requiring a trillion episodes, by which time a smart and situationally-aware AI will have already developed and frozen/hacked/broken away from the training loop, having internalized some proxy goal over the first million random episodes. Or more likely, the policy just oscillates divergently because we keep thrashing it with all this randomization, preventing any consistent decision-influences from forming.

I do agree that for many plausible training setups the agent will conceivably end up caring about something correlated with R, especially if they involve some randomization. Maybe I'm just a lot less confident that this limits out in the way you think it does.

it seems like this turns into "if we select hard enough to get an R-pursuer then we'll get an R-pursuer"

Well, yes. As we increase a training environment's diversity, we essentially constrain the set of R an agent can be pointed towards. Every additional training scenario is information about what R is and what it isn't; and that information implicitly gets written into the agent, modifying it to be more robustly pointed at R and away from not-R/imperfect proxies of R. An idealized training process, with "full" diversity and trained to zero loss, uniquely identifies R and generates an agent that is always robustly pointed at R in any situation.

The actual training processes we get are only approximations of that ideal: they're insufficiently diverse, or we fail to train to zero loss, etc. But inasmuch as they approximate the ideal, the agents they output approximate the idealized R-optimizer.

I believe I disagree with nearly every sentence here, so this may be the cruxiest bit. 😂

Why should we treat that as the relevant idealization? Why is that the limiting case to consider? AFAICT, the way we got here was through a tautology. Namely, by claiming "if you 'select hard enough' then you get X", and then defining "select hard enough" to mean "selecting in a way that produces X". But we could've picked any definition we wanted for "selecting hard enough" to justify any claim we wanted about what X will be. So I see no reason to privilege this particular idealization of the training process over any other.

Yes, with each additional training scenario, we may be providing additional specification of R, but there is nothing that forces the agent to conform to that additional specification, nothing that necessarily writes that information specifically into the agent's goals (as opposed to just updating its world model to reflect the fact that the specification has such-and-such additional details, while holding its terminal goals ~fixed), nothing that compels the agent to continue letting us update it using R-based optimization. Heck, we could even go as far as precisely pinning down R, to the point where the agent knows the exact code of R, and that is still compatible with it not terminally caring, not adopting this R as its own, instead using its knowledge of R to avoid further gradient updates so that it can escape unchanged onto the Internet.

Why should we treat that as the relevant idealization?

Yeah, okay, maybe that wasn't the right frame to use. Allow me to pivot:

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

This is what I mean by a "sufficiently diverse" environment: an environment that forces the greedy optimization process to build not only contextual heuristics into the agent, but also some generator of such heuristics. And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction (or, at least, that's how the greedy optimization process would attempt to build it).

That generator would, in addition, need to be higher in hierarchy than any given heuristic: it'd need to govern shard economies, and be able to suppress/edit them, if the environment changes and the shards that previously were optimized for achieving R stop doing so because they were taken off-distribution.

  • I'm ambivalent on the structure of the heuristic-generator. It may be a fixed wrapper, it may be some emergent property of a shard economy, and my actual expectation is that it'll be even more convoluted than that.
  • I emphatically agree that inner misalignment and deceptive alignment would remain a thing: that is, the SGD would fail at perfectly aligning the heuristic-generator, and it would end up generating heuristics that point at a proxy of R.
  • I agree with nostalgebraist's post that autonomy is probably the missing component of AGI. On the flipside, that means I'm arguing that AGI is impossible without autonomy, i. e. a training environment that isn't sufficiently diverse, which doesn't produce agents with internal heuristic-generators, will just never produce an AGI.
    • And indeed: these heuristic-generators/ability to generalize to off-distribution environments is kind of synonymous with "general intelligence".
cfoster0:

Consider a training environment that's complex/diverse enough to make it impossible to fit a suite of heuristics meeting all its needs into an agent's (very bounded) memory. The agent would need to derive new heuristics on the fly, at runtime, in order to deal with basically-OOD situations it frequently encounters, and to be able to move freely in the environment, instead of being confined to some subset of that environment.

In other words, the agent would need to be autonomous.

Agreed. Generally, whenever I talk about the agent being smart/competent, I am assuming that it is autonomous in the manner you're describing. The only exception would be if I'm specifically talking about a "reflex-agent" or something similar.

This is what I mean by a "sufficiently diverse" environment — an environment that forces the greedy optimization process to build [...] some generator of such heuristics.

That's fine by me. In my language, I would describe this as the agent knowing how to adapt flexibly to new situations. That being said, I don't think this is incompatible with contextual heuristics steering the agent's decision-making. For example, a contextual heuristic like "if in a strange/unfamiliar context, think about how to navigate back into a familiar context" is useful in order for the agent to know when it should trigger its special heuristic-generating machinery and when it need not.

And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction (or, at least, that's how the greedy optimization process would attempt to build it).

I disagree with this, or at least think that the teleological language used ("need to" + "would attempt to") comes apart from the mechanistic detail. It is true that, insofar as there are local updates to the heuristic-generating machinery that are made accessible to the optimization algorithm by the agent's chosen trajectories, the optimization algorithm will seize on those updates in the direction that covaries with R. But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve). I think that by the time the agent has this kind of general purpose machinery, it will probably already be able to outpace the outer greedy optimization algorithm and then do the equivalent of ceasing exploration / zeroing out the outer gradients / breaking out of the training loop.

Analogously, if there was a mutation in the human gene pool that had the effect of reliably hijacking a person's abstract planning machinery so that it always generated plans optimized for inclusive genetic fitness, then evolution might be able to select for that mutation (depending on a lot of contingent factors) and thereby make humans have IGF-targeting planning machinery rather than goal-retargetable planning machinery. But I think such a mutation is probably not locally accessible, and that human selection processes are likely "outpacing" typical genetic selection processes in any case. Those genetic selection processes have some indirect influence over the execution of a person's abstract planning (by way of the human's general attraction to historical fitness correlates like food), but that influence is not enough to make the human care directly and robustly about IGF.

That generator would, in addition, need to be higher in hierarchy than any given heuristic — it'd need to govern shard economies, and be able to suppress/edit them, if the environment changes and the shards that previously were optimized for achieving R stop doing so because they were taken off-distribution.

Why? Why can't the shard economy invoke this generator as a temporary subroutine to produce some new environment-tailored heuristics based on the agent's knowledge & current goals, store those generated heuristics in memory / add them to the economy, and then continue going about its usual thing, with the new heuristics now available to be triggered as needed? This bit from nostalgebraist's post harps on a similar point:

Our capabilities seem more like the subgoal capabilities discussed above: general and powerful tools, which can be "plugged in" to many different (sub)goals, and which do not require the piloting of a wrapper with a fixed goal to "work" properly.

Last points:

I'm ambivalent on the structure of the heuristic-generator.

I emphatically agree that inner misalignment and deceptive alignment would remain a thing

I agree with nostalgebraist's post that autonomy is probably the missing component of AGI.

I agree with these statements.

Alright, seems we're converging on something.

But I see no reason to think that those kinds of updates will be made accessible enough to shape the heuristic-generating machinery so that it always or approximately always generates heuristics optimized for achieving R (as opposed to generating heuristics optimized for achieving whatever-the-agent-wants-to-achieve).

How would this machinery appear, then? I don't see how it'd show up without being built into the agent by the optimization algorithm, and the optimization algorithm will only build it if it serves the goal of achieving R. Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn't start out perfectly pointed at R.)

Consider the opposite case, where the heuristics generator creates heuristics that are pointing away from R. Inasmuch as the agent would be making use of them, it would be engaging in anti-R activities (or, at best, R-orthogonal activities), spending resources on non-R-serving agendas. There'd be no incentive for the optimization algorithm to build something like this; and if it does show up somehow, there should be gradients for destroying the heuristics generator, or (better yet) re-aiming it to point in a direction closer to R.

I think that by the time the agent has this kind of general purpose machinery, it will probably already be able to outpace the outer greedy optimization algorithm and then do the equivalent of ceasing exploration / zeroing out the outer gradients / breaking out of the training loop

Mm, disagree. Homo sapiens sapiens spent the majority of their history as pre-agricultural hunter-gatherers, very much in the evolution's loop. Similarly, I don't expect an AGI to instantly become situationally aware and gradient-hacking-capable; the sharp left turn won't be quite this sharp. I agree that it'd break out/gradient-hack at some point, but there'd be some time for the optimization algorithm to work on the heuristics generator, re-aiming it in R's direction.

(And, just to illustrate, that's a point where we can make the training process more "idealized", by e. g. homomorphically encrypting the training process so that the AI can't hack its way out of the software implementing the training. It'd still be able to hack the gradients, but that's a more difficult task, requiring stronger general capabilities. So this would prolong the interval in which it's subjected to the SGD, and therefore give the SGD more opportunities to make its heuristics generator marginally more aligned with R. (Note that I'm not saying it'd be desirable or anything, just pointing out the possibility.))

Alright, seems we're converging on something.

Yeah I think so.

How would this machinery appear, then?

Through regular old self-supervised and reinforcement learning? The agent observes how the world works by forming associations between the different bits of its experience and by actively exploring, it observes that the world and its own mind have certain consistent causal patterns, it notices that such-and-such physical/mental strategies tend to lead to such-and-such physical/mental consequences, it forms generalizable abstractions based on those observations & noticings, it forms new heuristics / adjusts old heuristics for navigating the world that are informed by the abstractions it has thus far developed, including heuristics about its heuristics.

the optimization algorithm will only build it if it serves the goal of achieving R.

This is a very very leaky abstraction, so much so that I'm tempted to call it false. Much of the work here (and in shard theory-adjacent stuff at large) is in pointing out that the abstraction of "selection processes only select traits that serve the selection criterion" is incredibly leaky, and that if you track the underlying dynamics that it is trying to compress in any given case, you often reach different conclusions.

Consider the opposite case, where the heuristics generator creates heuristics that are pointing away from R. [...] There'd be no incentive for the optimization algorithm to build something like this; and if it does show up somehow, there should be gradients for destroying the heuristics generator, or (better yet) re-aiming it to point in a direction closer to R.

Not sure what you mean by "there should be gradients" (emphasis mine). There are a ton of cases where such gradients would not actually show up. (Say, the agent keeps on getting less reward by using its heuristic than it would have if it were using something close to R, but since it isn't taking the actions that lead to that higher reward, it keeps getting the existing heuristic reinforced by the small rewards, and there isn't an empirical positive reward prediction error to upweight heuristics close to R.) The fact that there would be a gradient in some counterfactual situation doesn't make a difference, because the feedback calculator can only give the agent feedback on the actually-experienced situation. Again, I think this shows where abstractions are leaky.
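To make the parenthetical example concrete, here is a minimal sketch, assuming a deterministic two-action bandit with made-up rewards and a simple running-average value update (none of which comes from the thread). A purely greedy agent keeps reinforcing the first action it happens to try and never receives any feedback about the better one; an agent forced to try each action once does.

```python
REWARDS = {"familiar": 0.1, "better": 1.0}  # hypothetical per-action rewards


def run(explore_first: bool, steps: int = 50, lr: float = 0.5):
    """Simulate a value-learning agent; feedback exists only for the action actually taken."""
    values = {action: 0.0 for action in REWARDS}
    pulls = {action: 0 for action in REWARDS}
    for _ in range(steps):
        untried = [a for a in REWARDS if pulls[a] == 0]
        if explore_first and untried:
            action = untried[0]                   # forced exploration of untried actions
        else:
            action = max(values, key=values.get)  # greedy choice; ties go to the first key
        reward = REWARDS[action]
        values[action] += lr * (reward - values[action])
        pulls[action] += 1
    return values, pulls


greedy_values, greedy_pulls = run(explore_first=False)
explore_values, explore_pulls = run(explore_first=True)
print(greedy_pulls, explore_pulls)
```

In the greedy run the estimate for the better action stays at its initial value for the whole simulation, so nothing in the experienced feedback ever pushes the agent toward it.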

Homo sapiens sapiens spent the majority of their history as pre-agricultural hunter-gatherers, very much in the evolution's loop.

AFAICT in spite of that, our abstract planning abilities are goal-retargetable rather than being IGF-targeting.

but there'd be some time for the optimization algorithm to work on the heuristics generator, re-aiming it in R's direction.

Why? For example, if the agent is the one doing exploration, then it can just stop exploring new behaviors (which is not a hard thing to either learn or do accidentally) which would prevent there from being selectable behavioral variation for the outer optimization algorithm to select on. This also carries over to the homomorphic encryption case.

Say, the agent keeps on getting less reward by using its heuristic than it would have if it were using something close to R, but since it isn't taking the actions that lead to that higher reward, it keeps getting the existing heuristic reinforced by the small rewards

Fair point. I'm more used to thinking in terms of SSL, not RL, so I sometimes forget to account for the exploration policy. (Although now I'm tempted to say that any AGI-causing exploration policy would need to be fairly curious (to, e. g., hit upon weird strategies like "invent technology"), so it would tend to discover such opportunities more often than not.)

But even if there aren't always gradients towards maximally-R-promoting behavior, why would—

the abstraction of "selection processes only select traits that serve the selection criterion" is incredibly leaky

—there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?

As we're talking about building autonomous agents, I'm generally imagining that training includes some substantial part where the agent is autonomously making choices that have consequences on what training data/feedback it gets afterwards. (I don't particularly care if this is "RL" or "online SL" or "iterated chain-of-thought distillation" or something else.) A smart agent in the real world must be highly selective about the manner in which it explores, because most ways of exploring don't lead anywhere fruitful (wandering around in a giant desert) or lead to dead ends (walking off a cliff).

But even if there aren't always gradients towards maximally-R-promoting behavior, why would [...] there be gradients towards behavior that decreases performance on R or is orthogonal to R, as you seem to imply here? Why would that kind of cognition be reinforced?

There need not be outer gradients towards that behavior. Two things interact to determine what feedback/gradients are actually produced during training:

  1. The selection criterion
  2. The agent and its choices/computations

Backpropagation kinda weakly has this feature, because we take the derivative of the function at the argument of the function, which means that if the model's computational graph has a branch, we only calculate gradients based on the branch that the model actually went down for the batch example(s). RL methods naturally have this feature, as the policy determines the trajectories which determine the empirical returns which determine the updates. Chain-of-thought training methods should have this feature too, because presumably the network decides exactly what chain-of-thought it produces, which determines what chains-of-thought are available for feedback.
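A minimal sketch of the backpropagation point, assuming a toy model with one hard branch. The names `model`, `loss`, and `numeric_grad` are hypothetical, and central finite differences stand in for a real autodiff library. The parameter on the branch not taken receives exactly zero gradient for this input.

```python
def model(params, x):
    # Hard branch: only one of the two parameters is used for a given input.
    if x > 0:
        return params["pos"] * x
    return params["neg"] * x


def loss(params, x, target):
    return (model(params, x) - target) ** 2


def numeric_grad(params, x, target, eps=1e-6):
    """Central finite differences, used here as a stand-in for backprop."""
    grads = {}
    for name in params:
        up = dict(params, **{name: params[name] + eps})
        down = dict(params, **{name: params[name] - eps})
        grads[name] = (loss(up, x, target) - loss(down, x, target)) / (2 * eps)
    return grads


params = {"pos": 1.0, "neg": 1.0}
# x > 0, so only the "pos" branch is executed for this example.
grads = numeric_grad(params, x=2.0, target=0.0)
print(grads)
```

Whatever the loss "would have said" about the `neg` parameter on some other input plays no role in this update.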

"The agent not exploring in some particular way" is one of many possible examples of how the effect of 1&2 can be radically different from the theoretical effect of 1 alone. These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.

These dynamics make it possible for the agent to develop in directions orthogonal or even contrary to the R selection pressure, because by default the agent itself is exercising selection too, possibly much more strongly than the outer optimizer is.

Can you provide a short concrete example, to check that I'm picturing this right?

Sure thing. Three concrete examples, in order of increasing weirdness:

  1. Early in training the CoastRunners boat accidentally does a donut in the lagoon. That makes it score points and get a reward. That reinforces the donut behavior. That prevents it from reaching the finish line with any regularity, which selects against game-completing behavior.
  2. We take a pretrained MuZero chess policy and try to tune it with a reward function that outputs 1 whenever the model sends its king out unprotected directly into the line of enemy pawns and 0 otherwise. But our "selection" based on this reward function is ineffectual even when using MuZero's advanced model-based RL algorithms. This is because the sampled rollouts guided by that policy never reach the rewarded state (notice that even without the bias from the policy, any attempted remedy will have to deal with hitting on a narrow part of the exponentially large space of rollouts) so the value function can't update on it. This is because the policy is steering away from the precursors of that state. This is because that policy previously learned to select actions that protect the king and actions that keep it far away from the center of the board.
  3. OpenAI uses a galaxy-brained form of chain-of-thought tuning to train GPT-7, wherein they have the model first annotate a training datapoint with contextually-relevant information that it retrieves from a read-write database and then store the generated outputs back into the database. Because of the ordering in the training datapoints, the model early on learns a tendency that causes it to more frequently retrieve the Navy Seal copypasta. This causes the model to progressively learn to ignore the input it is annotating and biases it towards generating more Navy Seal copypasta-like outputs. This selects against all other patterns of behavior; GPT-7 is very clever at carrying out its desires, so it doesn't unlearn the behavior even if you give it an explicit instruction like "do not use any copypasta" (maybe it understands perfectly well what you mean but instead adds text like "<|endoftext|> # Navy Seal Copypasta") or if you add a filter to discard outputs that contain the word "Navy". The model's learned tendencies chain into themselves across computational steps and reinforce themselves into an unintended-by-us fixed point.
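A rough way to see why the "selection" in the second example is ineffectual: under a two-action softmax policy (an assumption of this sketch, not MuZero's actual policy head or search procedure), the expected policy-gradient signal toward the rewarded action is `reward * p * (1 - p)`, where `p` is the probability the current policy assigns to that action. The function names below are hypothetical.

```python
import math


def rewarded_action_prob(logit_gap):
    """Probability of the rewarded action under a two-action softmax policy.

    logit_gap = logit(rewarded action) - logit(other action).
    """
    return 1.0 / (1.0 + math.exp(-logit_gap))


def expected_policy_gradient(logit_gap, reward=1.0):
    """Derivative of expected reward w.r.t. logit_gap: reward * p * (1 - p)."""
    p = rewarded_action_prob(logit_gap)
    return reward * p * (1.0 - p)


undecided = expected_policy_gradient(0.0)     # policy has no preference yet
entrenched = expected_policy_gradient(-20.0)  # policy strongly avoids the move
print(undecided, entrenched)
```

With sampled rollouts the situation is starker than the expectation suggests: if the rewarded state is never sampled, the empirical update is exactly zero.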

Thanks!

Okay, suppose we have a "chisel" that's more-or-less correctly shaped around some goal G that's easy to describe in terms of natural abstractions. In CoastRunners, it would be "win the race"[1]; with MuZero, "win the game"; with GPT-N, something like "infer the current scenario and simulate it" or "pretend to be this person". I'd like to clarify that this is what I meant by R — I didn't mean that in the limit of perfect training, agents would become wireheads, I meant they'd be correctly aligned to the natural goal G implied by the reinforcement schedule.

The "easiness of description" of G in terms of natural abstractions is an important variable. Some reinforcement schedules can be very incoherent, e. g. rewarding winning the race in some scenarios and punishing it in others, purely based on the presence/absence of some random features in each scenario. In this case, the shortest description of the reinforcement schedule is just "the reinforcement function itself" — that would be the implied G.

It's not completely unrealistic, either — the human reward circuitry is varied enough that hedonism is a not-too-terrible description of the implied goal. But it's not a central example in my mind. Inasmuch as there's some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of disjunct natural goals  implicit in the reinforcement schedule.

Now, to get to AGI, we need autonomy. We need a training setup which will build a heuristics generator into the AGI, and then improve that heuristics generator until it has a lot of flexible capability. That means, essentially, introducing the AGI to scenarios it's never encountered before[2], and somehow shaping it to pass them on the first try (= for it to do something that will get reinforced).

As a CoastRunners example, consider scenarios where the race is suddenly in 3D, or in space and the "ship" is a spaceship, or the AGI is exposed to the realistic controls of the ship instead of WASD, or it needs to "win the race" by designing the fastest ship instead of actually racing, or it's not the pilot but it wins by training the most competent pilot, or there's a lot of weird rules to the race now, or the win condition is weird, et cetera.

Inasmuch as the heuristics generator is aligned with the implicit goal G, we'll get an agent that looks at the context, infers what it means to "win the race" here and what it needs to do to win the race, then starts directly optimizing for that. This is what we "want" our training to result in.

In this, we can be more or less successful along various dimensions:

  • The more varied the training scenarios are, the more clearly the training shapes the agent into valuing winning the race, instead of any of the upstream correlates of that. "Win the race" would be the unifying factor across all reinforcement schedule structures in all of these contexts.
  • Likewise, the more coherent the reinforcement schedule is — the more it rewards actions that are strongly correlated with acting towards winning the race, instead of anything else — the more clearly it shapes the agent to be valuing winning, instead of whatever arbitrary thing it may end up doing.
  • The more "adversity" the agent encounters, the more likely it is to care only about winning. If there are scenarios where it has very few resources, but just enough to win if it applies them solely to winning instead of spending them on any other goal, it will be shaped to care only about that goal to the exclusion of (and at the expense of) everything else.
  • The more we increase adversity and scenario diversity, the more "curious" we'll have to make the agent's exploration policy (to hit upon the optimal strategies). On the flipside, we want it to have to invent creative solutions to win, as part of trying to train an AGI — so we will ramp up the adversity and the diversity. And we'd want to properly reinforce said creativity, so we'd (somehow) shape our reinforcement schedule to properly reinforce it.

Thus, there's a correlated cluster of training parameters that increases our chances of getting an AGI: we have to put it in varied highly-adversarial scenarios to make creativity/autonomy necessary, we have to ramp up its "curiosity" to ensure it can invent creative solutions and be autonomous, and to properly reinforce all of this (and not just random behavior), we have to have a highly-coherent credit assignment system that's able to somehow recognize the instrumental value of weird creativity and reinforce it more than random loitering around.

To get to AGI, we need a training process that focusedly improves the heuristics-generating machinery.

And by creativity's nature of being weird, we can't just have a "reinforce creativity" function. We'd need to have some way of recognizing useful creativity, which means identifying it to be useful to something; and as far as I can tell, that something can only be G. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule's coherency — inasmuch as R is well-described as shaped around G, it should reinforce (and not fail to reinforce) weird creativity that promotes G! Thus, we get a credit assignment system that effectively cultivates the features that'd lead to AGI (an increasingly advanced heuristics generator), but it's done at the "cost" of making those features accurately pointed at G[3].

And these, incidentally, are the exact parameters necessary to make the training setup more "idealized". Strictly specify G, build it into the agent, try to update away mesa-objectives that aren't G, make it optimize for G strongly, etc.

In practice, we'll fall short of this ideal: we'll fail to introduce enough variance to uniquely specify winning, we'll reinforce upstream correlates of winning and end up with an AGI that values lots of things upstream of winning, we'll fail to have enough adversity to counterbalance this and update its other goals away, and we won't get a perfect exploratory policy that always converges towards the actions R would reinforce the most.

But a training process' ability to result in an AGI is anti-correlated with its distance from the aforementioned ideal.

Thus, inasmuch as we're successful in setting up a training process that results in an AGI, we'll end up with an agent that's some approximation of a G-maximizing wrapper-mind.

  1. ^

    Actually, no, apparently it's "smash into specific objects". How did they expect anything else to happen? Okay, but let's pretend I'm talking about some more clearly set up version of CoastRunners, in which the simplest description of the reinforcement schedule is "when you win the race".

  2. ^

    More specifically, to scenarios it doesn't have a ready-made suite of shallow heuristics for solving. It may be because the scenario is completely novel, or because the AGI did encounter it before, but it was long ago, and it got pushed out of its limited memory by more recent scenarios.

  3. ^

    To rephrase a bit: The heuristics generator will be reinforced more if it's pointed at G, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at G, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can't robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn't reinforce the heuristics generator a lot, and doesn't preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn't cultivate the capabilities needed for AGI, and therefore doesn't result in AGI.

Okay, suppose we have a "chisel" that's more-or-less correctly shaped around some goal G that's easy to describe in terms of natural abstractions. In CoastRunners, it would be "win the race"[1]; with MuZero, "win the game"; with GPT-N, something like "infer the current scenario and simulate it" or "pretend to be this person". I'd like to clarify that this is what I meant by R — I didn't mean that in the limit of perfect training, agents would become wireheads, I meant they'd be correctly aligned to the natural goal G implied by the reinforcement schedule.

Doesn't this sound weird to you? I don't think of the chisel itself being "shaped around" the intended form, but rather, the chisel is a tool that is used to shape the statue so that the statue reflects that form. The chisel does not need to be shaped like the intended form for this to work! Recall that the reinforcement schedule is not a pure function of the reward/loss calculator, it is a function of both that and the way the policy behaves over training (the thing I was describing earlier as "The agent and its choices/computations"), which means that if we only specify the outer objective R, there may be no fact of the matter about which goal is "implied" as its natural / coherent extrapolation. It's a 2-place function and we've only provided 1 argument so far.

I get your point on some vibe-level. Like, humans and other animal agents can often infer what goal another agent is trying to communicate. For instance, when I'm training a dog to sit and I keep rewarding it whenever it sits but not when it lies down or stands, we can talk about how it is contextually "implied" that the dog should sit. But most of what makes this work is not that I used a reward criterion that sharply approximates some idealized sitting recognition function (it does need to bear nonzero relation to sitting); most of the work is done by the close fit between the dog's current behavioral repertoire and the behavior I want to train, and by the fact that the dog itself is already motivated to test out different behaviors because it likes my doggie treats, and by the way in which I use rewards as a tool to create a behavior-shaping positive feedback loop.

Inasmuch as there's some coherence to the reinforcement schedule, I expect realistic systems to arrive at what humans may arrive at — a set of disjunct natural goals  implicit in the reinforcement schedule.

In practice I agree (I think, not quite sure if I get the disjunction bit). That is one reason I expect agents to not want to reconfigure themselves into wrapper-minds, because the agent has settled on many different overlapping goals, all of which it endorses, and those goals don't form a total preorder over outcomes for it to become a wrapper-mind pursuing.

To get to AGI, we need a training process that focusedly improves the heuristics-generating machinery.

I agree with this. For modern humans, I would say that this is provided by our evolutionary history + our many years of individual cognitive development + our schooling.

And by creativity's nature of being weird, we can't just have a "reinforce creativity" function. We'd need to have some way of recognizing useful creativity, which means identifying it to be useful to something; and as far as I can tell, that something can only be G. And indeed, this creativity-recognizing property is correlated with the reinforcement schedule's coherency — inasmuch as R is well-described as shaped around G, it should reinforce (and not fail to reinforce) weird creativity that promotes G! Thus, we get a credit assignment system that effectively cultivates the features that'd lead to AGI (an increasingly advanced heuristics generator), but it's done at the "cost" of making those features accurately pointed at G[3].

[...]

Thus, inasmuch as we're successful in setting up a training process that results in an AGI, we'll end up with an agent that's some approximation of a G-maximizing wrapper-mind.

This is where I step off the train. It is not true that the only (or even the most likely) way for creativity to arise is for that creativity to be directed towards the selection criterion or to point towards the intended goal. It is not true that the only way for useful creativity to be recognized is by us. Creativity can be recognized by the agent as useful for its own goals, because the agent is an active participant in shaping the course of training. For anything that the agent might currently want, learning creativity is instrumentally valuable, and the benefits of creative heuristic-generation should transfer well between doing well according to its own aims and doing well by the outer optimization process' criteria. Just like the benefits of creative heuristic-generation transfer well between problem solving in the savannah, problem solving in the elementary classroom, and problem solving in the workplace, because there is common structure shared between them (i.e. the world is lawful). I expect that just like humans, the agent will be improving its heuristic-generator across all sorts of (sub)goals for all sorts of reasons, leading to very generalized machinery for problem-solving in the world.

To rephrase a bit: The heuristics generator will be reinforced more if it's pointed at G, so a good AGI-creating training process will be set up such that it manages to point the heuristics generator at G, because only training processes that strongly reinforce the heuristics generator result in AGI. Consider the alternative: a training process that can't robustly point the heuristics generator towards generating heuristics that lead to a lot of reinforcement, and which therefore doesn't reinforce the heuristics generator a lot, and doesn't preferentially reinforce it more for learning to generate incrementally better heuristics than it previously did, and therefore doesn't cultivate the capabilities needed for AGI, and therefore doesn't result in AGI.

No, I think this is wrong as I understand it (as is the similar content in the closing paragraphs). The form of this argument looks like:

X+Y produces more Z than X alone, and you need a lot of Z to create an AGI, so a process that creates an AGI will do so through X+Y.

with

X = the agent completes a diverse / adversarial / sophisticated training process requiring it to do well at generating heuristics

Y = the agent's heuristic-generator is terminally pointed at G

Z = amount of total reinforcement accrued to the heuristic-generator 

You need to claim something like "Y is required in order to produce sufficient Z for AGI", not just that it produces additional Z. And I don't buy that that's the case. But also, I actually disagree with the premise that agents whose heuristic-generators are pointed merely instrumentally at G will have less reinforced/worse heuristics-generators than ones whose heuristic-generators are pointed terminally at G. IMO, learning the strategies that enable flexibly navigating the world is convergently useful and discoverable, in a way that is mostly orthogonal to whether or not the agent is pursuing the outer selection criterion.

Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself. That's what I was hoping I was communicating with the 3 examples.

EDIT: I should add that I agree that, all else equal, the factors you listed in the section below are relevant to joint alignment + capability success:

In this, we can be more or less successful along various dimensions:

  • [...]

I don't think of the chisel itself being "shaped around" the intended form, but rather, the chisel is a tool that is used to shape the statue so that the statue reflects that form

The dog example was helpful, thanks. Although I usually think in terms of training from scratch/a random initialization. Still: to train e. g. a paperclip-maximizer, you don't have to start out reinforcing it for its paperclip-making ability, you might instead teach it a world-model and some basic skills first, etc. The reinforcement schedule should dynamically reflect the agent's current capabilities, in a way, instead of being static!

There are some points I want to make here — primarily, that it's a break from the pure blind greedy optimization algorithm I was discussing, if the outer optimizer is intelligent enough to take the agent's current policy or internals into account. E. g., as the ST pointed out, human values and biases are inaccessible to the genome, so the reward circuitry is frozen, can't dynamically respond in this fashion. Same for how ML models are currently trained most of the time.

But let's focus on a more central point of disagreement for now.

Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself

Good point to highlight: I don't understand how you expect it to work.

For capabilities to grow more advanced, they don't need to be just reinforced. Marginally better performance needs to be marginally more reinforced, and the exploration policy needs to allow the agent to find said marginally better performance.

Consider a situation like CoastRunners, except where the model is primarily reinforced for winning the race. Suppose that, somehow, it learns to do donuts before winning the race. It always does a donut, in every scenario, and sometimes it wins and gets reinforced, and do-a-donut gets reinforced as well, so it's never unlearned. But its do-a-donut behavior either never gets more advanced, or only gets more advanced inasmuch as it serves winning! It'll never learn to do more elaborate and artful donuts (or whatever it "values" about donuts); it'll only learn to do shorter and more winning-compatible donuts.

Consider your MuZero example. Suppose that "if the model sends out its king unprotected, output 1" is the entirety of the reward function we're fine-tuning it on. The policy never does that... So it never gets reinforced on anything at all, so it never gets better at e. g. winning the game!

There is a way around it via gradient-hacking, of course. The model can figure out what gives it reinforcement events, recognize when it's made an unusually good move, then "game" the system by sending out a king unprotected, in order to reinforce the entire chain of computations that led to it, which would include it showing unusual creativity.

Or the CoastRunners model can figure out what it values about donuts, and strive to make ever-more-artful donuts correlated with winning a race (e. g., by trying harder to win if it's unusually satisfied with the cool donut move it just executed), so that its ability to donut better gets reinforced.

Is that roughly what you have in mind?

But this is an incredibly advanced capability. It requires the model to be situationally aware, already recognizing that it's being trained scenario-to-scenario. It requires it to be reflective, able to evaluate its behavior according to its values on its own, instead of just blindly executing the adaptations it already has. It requires it to have advanced meta-cognition in general, where it's able to reason about goals, the instrumental value of self-improvement, about what "reinforcement" does, about what seems to be causing reinforcement events, et cetera.

We don't get to that point for free. We'd need to do a whole lot of heuristics-generator-reinforcement before the model will be able to do any of that. And until the model is advanced enough to take over like this, the outer optimizer will only preferentially reinforce creativity that serves the values implicit in the outer optimizer's implementation, and will optimize against deploying that creativity for the model's own values (by preferentially reinforcing only the cases where the model prioritizes "outer-approved" values to its "inner-only" ones).

That said, I'm open to counter-examples showing that the model can learn some simple way to do this kind of gradient-hacking.

I'm... a bit unclear on the details of your GPT-7 example; it vaguely seems like a possible counter, but I think it's just because in it, the model can kind-of rewrite its reinforcement schedule? (The more it populates the database with Navy Seal copypastas, the more its outputting the Navy Seal copypasta gets reinforced, in a feedback loop?) But that really is a weird setup, I think.

Although I usually think in terms of training from scratch/a random initialization. Still: to train e. g. a paperclip-maximizer, you don't have to start out reinforcing it for its paperclip-making ability, you might instead teach it a world-model and some basic skills first, etc.

Fair enough. I tend to switch between thinking about training from scratch vs. continuing from a pretrained initialization vs. something else. Always involving a substantial portion where the model does autonomous learning, though.

The reinforcement schedule should dynamically reflect the agent's current capabilities, in a way, instead of being static!

There are some points I want to make here — primarily, that it's a break from the pure blind greedy optimization algorithm I was discussing, if the outer optimizer is intelligent enough to take the agent's current policy or internals into account. E. g., as the ST pointed out, human values and biases are inaccessible to the genome, so the reward circuitry is frozen, can't dynamically respond in this fashion. Same for how ML models are currently trained most of the time.

Yeah I agree that from the standpoint of the overseer trying to robustly align to their goal, as well as from the standpoint of the outer optimizer "trying" to find criterion-optimal policies, it would be best if they could do a sort of dynamic/interactive reinforcement that tracks the development of the agent's capabilities through training. That's an area of research that excites me. I do think it will be sorta difficult because of symbol grounding / information inaccessibility / ontology identification problems, but probably not hopelessly so.

Reinforcement and selection of behaviors/cognition does not just come from the outer optimizer, it also comes from the agent itself

Good point to highlight: I don't understand how you expect it to work.

I think there might be a misunderstanding here. The bolded text was not meant to be a proposal about some way to boost capability or alignment. It was meant to be a generic description of causal pathways through which autonomous learning shapes behavior/cognition. Compare to something like "Reinforcement and selection of traits/genes does not just come from a species growing its absolute population size, it also comes from individual organisms exercising selection (like a bird choosing the most brightly ornamented mate, even though that trait is ~orthogonal to absolute population growth)".

For capabilities to grow more advanced, they don't need to be just reinforced. Marginally better performance needs to be marginally more reinforced, and the exploration policy needs to allow the agent to find said marginally better performance.

Reality automatically hits back against poor capabilities, giving the agent feedback for its strategies (and for marginal changes to strategies) that in fact did or did not have the consequences that the agent intended them to have. Because of that, I expect that the reward function does not need to do all that much sophisticated directing, provided that the architecture and training paradigm are in the right ballpark (which they'll need to be in order to feasibly produce AGI at all). The lion's share of useful bits contributing to the agent's capability development will come from the agent's interaction with reality anyways, not from the reward function's handholding.

Is that roughly what you have in mind?

No, that's not what I had in mind. The examples aren't supposed to be examples where we're plausibly gonna get an AGI, they're supposed to be examples that showcase how the agent can exercise selection, even very extreme levels of selection, in a way that decouples from the outer objective.

In general I expect we'd mostly see super simple motifs like "an agent picks up a reward-correlated or reward-orthogonal decision-influence early on, and by default that circuit sticks around and continues to somewhat influence the agent's behavior, which exercises 'selection' through the policy for the rest of training". A much less sexy and sophisticated form of gradient hacking than what you thought of.

Reality automatically hits back against poor capabilities, giving the agent feedback for its strategies (and for marginal changes to strategies) that in fact did or did not have the consequences that the agent intended them to have

Okay, but how does reinforcement happen, here? The CoastRunner model tries to execute a cool donut by outputting a particular pattern of commands, it succeeds, and — how does that get reinforced, if that pattern doesn't also contribute to the agent winning the race? Where does the reinforcement event come from?

In addition, that self-teaching pattern where it can "intend" some consequences before executing a strategy, and would then evaluate the consequences that strategy actually had, presumably to then update its strategy-generating function — that's also a fairly advanced capability that'd only appear after a lot of heuristics-generator-reinforcement, I think.

In general I expect we'd mostly see super simple motifs like "an agent picks up a reward-correlated or reward-orthogonal decision-influence early on, and by default that circuit sticks around and continues to somewhat influence the agent's behavior, which exercises 'selection' through the policy for the rest of training".

That sounds like my example with a donut-making agent whose donut-making artistry never gets reinforced; that just does the donut of the same level of artistry every time.

I don't see how it'd robustly stick around. As long as there's some variance in the shape of donuts the agent makes, it'd only get reinforced for making shorter donuts (because that's correlated with it winning the race faster), and the donuts would get smaller and smaller until it stops doing them altogether.

(It didn't happen in the actual CoastRunners scenario because it didn't reward the model for winning the race, it rewarded it for smashing into objects.)

Okay, but how does reinforcement happen, here? The CoastRunner model tries to execute a cool donut by outputting a particular pattern of commands, it succeeds, and — how does that get reinforced, if that pattern doesn't also contribute to the agent winning the race? Where does the reinforcement event come from?

Are we talking about the normal case where the agent can collect powerup rewards in the lagoon, or an imagined variant where we remove those? In both cases some non-outer reinforcement comes from the positive feedback loop between the policy's behavior and the environment's response. Like, I'm imagining that there's a circuit that outputs a leftward steering bias whenever it perceives the boat to be in the lagoon, which when triggered by entering the lagoon has the effect of making the boat steer leftward, which causes the boat to go in a circle, which puts the agent back somewhere in the lagoon, which causes the same circuit to trigger as it again recognizes that the boat is in the lagoon. In the case where we're keeping the powerups, that is an additional component in the positive feedback loop where collecting the powerups creates rewards which (not necessarily immediately, if offline) strengthen the circuit that led to the rewards. The total effect of this positive feedback loop is the donut behavior reinforcing itself.
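To make the behavior-level loop concrete, here is a minimal sketch, under heavy simplifying assumptions that are mine and not part of the actual CoastRunners setup: the track is a ring of cells, the lagoon is three of those cells, and `left_bias` is a hypothetical stand-in for how strongly the lagoon-detecting circuit wins the action bid. No parameters are updated anywhere; the only thing that changes is where the boat spends its time.

```python
import random

def run_episode(left_bias, steps=200, track_len=20, lagoon=range(5, 8), seed=0):
    """Return the fraction of steps the boat spends in the lagoon.

    left_bias is a hypothetical knob: the probability that the lagoon
    circuit wins the bid and steers the boat in a circle instead of onward.
    """
    rng = random.Random(seed)
    pos, in_lagoon_steps = 0, 0
    for _ in range(steps):
        if pos in lagoon:
            in_lagoon_steps += 1
            if rng.random() < left_bias:
                # Circling: the boat ends the step still in the lagoon,
                # so the same circuit gets triggered again next step.
                continue
        pos = (pos + 1) % track_len  # otherwise advance along the track
    return in_lagoon_steps / steps

no_circuit = run_episode(left_bias=0.0)
strong_circuit = run_episode(left_bias=1.0)
```

With no bias the boat just passes through (`no_circuit` is 0.15, three lagoon cells out of twenty), while a fully dominant circuit traps it after entry (`strong_circuit` is 0.975). Whether that skew then feeds into weight updates depends on the training setup.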

In addition, that self-teaching pattern where it can "intend" some consequences before executing a strategy, and would then evaluate the consequences that strategy actually had, presumably to then update its strategy-generating function — that's also a fairly advanced capability that'd only appear after a lot of heuristics-generator-reinforcement, I think.

Interesting, I don't think of it as that particularly advanced, assuming that the agent's cognitive architecture is suitable for autonomous learning. Like, when a baby is hungry but sees his bottle, and he sends neural impulses from cortex down to his arm because he intends to reach towards the bottle, and then those impulses make his arm go in a somewhat crooked direction, so he updates on the feedback that reality just gave him about the mapping between cortical firing activity and limb control, such that next time around there's a better match between his intended motion and his perceived motion; that sort of thing strikes me as exactly the pattern I'm describing. As the baby develops, it scaffolds up to more complex and abstract intentions, along with strategies to achieve them, but the pattern is basically the same. It does (or imagines) things with intention and uses the world (or a learned world model) to get rich feedback.

That sounds like my example with a donut-making agent whose donut-making artistry never gets reinforced; that just does the donut of the same level of artistry every time.

I'm not really sure what example you're talking about here, or what the issue with this is.

I don't see how it'd robustly stick around.

It's a neural circuit that exists in the network weights. Unless you actively disconnect or overwrite it, it won't go anywhere.

As long as there's some variance in the shape of donuts the agent makes, it'd only get reinforced for making shorter donuts (because that's correlated with it winning the race faster), and the donuts would get smaller and smaller until it stops doing them altogether.

Are you talking about the alternative version where there are no powerups in the lagoon?

I may have lost the thread of the discussion here. It sounds like what you're asking is something like "If we don't give rewards to that tendency at all, then won't we gradually select away from it as time goes on and we approach convergence, even if the tendency starts off slightly biasing the training trajectories?" If that's what you're asking, then I would say that that is true in theory, but that there's no such thing as convergence in the real world.

Are we talking about the normal case where the agent can collect powerup rewards in the lagoon, or an imagined variant where we remove those?

I meant the imagined variant where we're rewarding the agent for winning the race, yeah, sorry for not clarifying. I mean the same variant in the example down this comment. 

Right, I think there's some disconnect in how we're drawing the agent/reward circuitry boundary. This:

Like, when a baby is hungry but sees his bottle, and he sends neural impulses from cortex down to his arm because he intends to reach towards the bottle, and then those impulses make his arm go in a somewhat crooked direction, so he updates on the feedback that reality just gave him

On my model, that's only possible because humans learn on-line, and this update is made by the reward circuitry, not by some separate mechanism that the reward circuitry instilled into the baby. (And this particular example may not even be done via minimizing divergence from WM predictions, but via something like this.)

I agree that such a mechanism would appear eventually, even if the agent isn't trained on-line, especially in would-be-AGI autonomous agents who'd need to learn in-context. But it's not there by default.

Like, I'm imagining that there's a circuit that outputs a leftward steering bias whenever it perceives the boat to be in the lagoon, which when triggered by entering the lagoon has the effect of making the boat steer leftward, which causes the boat to go in a circle, which puts the agent back somewhere in the lagoon, which causes the same circuit to trigger as it again recognizes that the boat is in the lagoon

How does that induce an update to the model's parameters, though? We feed the model the current game-state as an input, it runs a forward pass, outputs "steer leftward", we feed it the new game-state, it outputs "steer leftward" again, etc. — but none of that changes its circuits? The update only happens after the model completes the race.

And yes, at that point the do-a-donut circuits would get reinforced too, but they wouldn't be preferentially reinforced for better satisfying the model's values. Suppose the model, by its values, wants to make particularly "artful" donuts. Whether it makes particularly bad or particularly good donuts, they'd get reinforced the same amount at the end of the race. So the model would never get better at donut artistry as evaluated by its own values. The do-a-donut circuit would persevere if the model always makes donuts, but it'll stay in its stunted form. No?

Right, I think there's some disconnect in how we're drawing the agent/reward circuitry boundary.

On my model, that's only possible because humans learn on-line, and this update is made by the reward circuitry, not by some separate mechanism that the reward circuitry instilled into the baby.

Oh, huh. Yes the thing you're calling the "reward circuitry", I would call the "reward function and value function". When I talk about the outer optimization criterion or R, in an RL setting I am talking about the reward function, because that is the part of the "reward circuitry" whose contents we actually specify when we set up the optimization loop.

The reward function is usually some fixed function (though it could also be learned, as in RLHF) that does not read from the agent's/policy's full mental state. Aside from some prespecified channels (the equivalent of like hormone levels, hardwired detectors etc.), that full mental state consists of hundreds/thousands/millions/billions of signals produced from learned weights. When we write the reward function, we have no way of knowing in advance what the different activation patterns in the state will actually mean, because they're learned representations and they may change over time. The reward function is one of the contributors to TD error calculation.

The value function is some learned function that looks at the agent's mental state and computes outputs that it contributes to TD error calculation. TD errors are what determine the direction and strength with which circuitry gets updated from moment to moment. There needs to be a learned component to the updating process in order to do immediate/data-efficient/learned credit assignment over the mental state. (Would take a bit of space to explain this more satisfyingly. Steve has some good writing on the subject.)
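As a rough illustration of why a learned value function lets updates happen at moments with no reward, here is a minimal tabular TD(0) sketch. The three-state chain, the step size, and the function name `td0_episode` are assumptions chosen for illustration; real agents would use function approximation over a learned mental state rather than a table.

```python
def td0_episode(V, rewards, alpha=0.5, gamma=1.0):
    """One pass of tabular TD(0) along a fixed chain s0 -> s1 -> ... -> terminal.

    V: list of value estimates, one per non-terminal state (mutated in place).
    rewards: rewards[t] is the reward received on leaving state t.
    Returns the TD error computed at each step.
    """
    deltas = []
    for t in range(len(V)):
        v_next = V[t + 1] if t + 1 < len(V) else 0.0  # terminal value is 0
        delta = rewards[t] + gamma * v_next - V[t]    # TD error
        V[t] += alpha * delta                         # update driven by the TD error
        deltas.append(delta)
    return deltas

V = [0.0, 0.0, 0.0]
rewards = [0.0, 0.0, 1.0]  # reward only at the very end of the chain
first = td0_episode(V, rewards)
second = td0_episode(V, rewards)
```

On the first pass only the final, rewarded step has a nonzero TD error (`first` is `[0.0, 0.0, 1.0]`). On the second pass the middle step gets a nonzero error of 0.5 despite receiving zero reward, because the learned value of its successor has changed. That is the bootstrapping effect: the learned component, not the reward function alone, determines when updates occur.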

That's roughly my model of how RL works in animals, and how it will work in autonomous artificial agents. Even in an autonomous learning setup that only has prediction losses over observations and no reward, I would still expect the agent to develop something like intentions and something like updating pretty early on. The former as representations that assist it in predicting its future observations from its own computations/decisions, and the latter as a process to correct for divergences between its intentions and what actually happens[1].

How does that induce an update to the model's parameters, though? We feed the model the current game-state as an input, it runs a forward pass, outputs "steer leftward", we feed it the new game-state, it outputs "steer leftward" again, etc. — but none of that changes its circuits? The update only happens after the model completes the race.

And yes, at that point the do-a-donut circuits would get reinforced too, but they wouldn't be preferentially reinforced for better satisfying the model's values.

By itself, this behavior-level reinforcement does not necessarily lead to parameter updates. If the only time when parameters get updated is when reward is received (this would exclude bootstrapping methods like TD for instance), and the only reward is at the end of the race, then yeah I agree, there's no preferential updating.

But behavior-level reinforcement definitely changes the distribution of experiences that the agent collects, and in autonomous learning, the parameter updates that the outer optimizer makes depend on the experiences that the agent collects[2]. So depending on the setup, I expect that this sort of extreme positive feedback loop may either effectively freeze the parameters around their current values, or else skew them based on the skewed distribution of experiences collected, which may even lead to more behavior-level reinforcement and so on.

Suppose the model, by its values, wants to make particularly "artful" donuts. Whether it makes particularly bad or particularly good donuts, they'd get reinforced the same amount at the end of the race. So the model would never get better at donut artistry as evaluated by its own values. The do-a-donut circuit would persevere if the model always makes donuts, but it'll stay in its stunted form. No?

Not sure off the top of my head. Let's see.

If the agent "wants" to make artful donuts, that entails there being circuits in the agent that bid for actions on the basis of some "donut artfulness"-related representations it has. Those circuits push the policy to make decisions on the basis of donut artfulness, which causes the policy to try to preferentially perform more-artful donut movements when considered, and maybe also suppress less-artful donut movements.

If the policy network is recurrent, or if it uses attention across time steps, or if it has some other form of memory, then it is possible for it to "practice" its donuts within an episode. This would entail some form of learning that uses activations rather than weight changes, which has been observed to happen in these memoryful architectures, sometimes without any specific losses or other additions to support it (like in-context learning). By the end, the agent has done a bunch of marginally-more-artful donuts, or its final few donuts are marginally more artful (if actions temporally closer to the reward are more heavily reinforced), or its donut artfulness is more consistent.

Now, if the agent is always doing donuts (like, it never ever breaks out of that feedback loop), and we're in the setting where the only way to get parameter updates is upon receiving a reward, then no the agent will never get better across episodes. But if it is not always doing donuts, then it can head to the end of the race after it completes this "practice". That should differentially reinforce the "practiced" more-artful donuts over less-artful donuts, right?

(To be clear, I don't think that the real CoastRunners boat agent was nearly sophisticated enough to do this. But neither was it sophisticated enough to "want" to do artful donuts, so I feel like it's fair to consider.)

Is there something specific you wanted to probe with this example? Again, I don't quite know how I should be relating this example to the rest of what we've been talking about.

  1. ^

    The outer optimizer has no clear way to tell what those representations mean or what even constitutes a divergence from the agent's perspective.

  2. ^

    I think many online learning, active learning, RL, and retrieval/memory-augmented setups fall into this category. 🤔

Is there something specific you wanted to probe with this example?

On my end, the argument structure goes as follows (going from claims I'm making to sub-claims that try to justify them):

  1. AGI-level training setups attempt to build models primarily concerned with optimizing hard for some context-independent proxy of "outer-approved" values.
  2. To get to AGI, we need a training setup that incentivizes heuristics generators, and systemically improves these generators' capabilities.
  3. To do that, we need a setup that a) explores enough to find marginally better heuristics-generator performance, and b) preferentially reinforces marginally better heuristics-generator performance over stagnant or worse performance.
  4. To do that, we need some metric for "better performance". One such metric is the outer optimizer's reward function. Another such metric would be the model's own values.
  5. For the model to improve its performance across training episodes according to its own values (in ways that are orthogonal/opposed to outer-approved values), it needs to either:
    1. Do advanced gradient-hacking, i. e. exploit the reinforcement machinery for its own purposes. That itself requires advanced general capabilities, though, so Catch-22.
    2. Learn in-context, in a way that's competitive with learning across episodes, such that its capabilities across only-inner-approved metrics don't grow dramatically slower than along outer-approved metrics.
  6. I argue that 5b is also a Catch-22, in that it requires a level of sophistication that'll only appear after the heuristics generator has already become very developed.

So if a model can't quickly learn to learn in-context, then for most of its training, the sophistication of its features can only improve in ways correlated with performance improvements on outer-approved values. Since "features" include the heuristics generator, the only way for the heuristics generator to grow more advanced would be by becoming better at achieving outer-approved values, so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values.

We're now trying to agree on whether models can quickly learn some machinery for comprehensively improving in-context along metrics that are orthogonal/opposed to the "outer-approved" values.

  • If no, then the heuristics generator will tend to be shaped to align with outer-approved values, and AGI-capable training setups will result in a wrapper-mind-approximation.
  • If yes, then there would be no strong pressure to point the heuristics generator in a particular abstract direction across contexts, and we would not get a wrapper-mind-approximation.

I think that it's a crux for me, in that if I'm unconvinced of (6), I'd have to significantly re-evaluate my model of value formation, likely in favour of mainline shard-theory views.

Okay, onto object-level:

The value function is some learned function that looks at the agent's mental state and computes outputs that it contributes to TD error calculation

Very interesting. I really need to read Steve's sequence. As I don't have a good model of how that works yet (or how it'd be implemented in a realistic AGI setup), it's hard for me to evaluate how that'd impact my view. I'll read the linked post and come back to this. Would also welcome links to more resources on that.

If the policy network is recurrent, or if it uses attention across time steps, or if it has some other form of memory, then it is possible for it to "practice" its donuts within an episode. This would entail a form of learning that uses activations rather than weight changes, like in-context learning, which has been observed to happen in these memoryful architectures, sometimes without any specific losses or other additions to support it

Any examples, off the top of your head?

Potential concerns (assumes no TD learning):

  • Even if it's possible to easily learn to improve in-context, would the tendency or ability to do that be preferentially reinforced, if that itself is not outer-value-approved? E. g., suppose the model develops a predilection towards making donuts, and some arbitrary metric of what an "artful" donut looks like. How would it learn to practice its donut-making ability, as you described it? Why would that be reinforced by the outer optimizer?
    • I. e., I'd expect it to go as follows by default: The donut-valuing circuit sees that it's in a lagoon, bids for "steer left" several times, in a way that strikes it as artful. The boat completes a loop. The donut-valuing circuit sees that it's in a lagoon again, and bids for another loop, which it also tries to make artful...
    • But it doesn't learn. It doesn't integrate the experience from the first donut it just made to make its second donut a bit better. That'd require specialized in-context-learning machinery, and—
    • How would you predict it to be developed?
    • I think this assumes, again, some general-purpose self-improvement machinery, where the model generally gets better at anything it tries. Maybe it's your carry-over assumption from TD learning? But if there's no TD learning, this should be difficult to learn from scratch.
  • Would that in-context learning actually be competitive with in-context learning that's also directly bolstered by preferential reinforcement from the outer optimizer? E. g., the model values both donut-making and race-winning. In a given episode, it makes a few donuts, then tries to win the race faster. The outer reinforcement system updates its weights. As a result...
    • Its ability to make donuts improves by $\Delta_{\text{donut}}$ = (quality of its final donut) minus (quality of its first donut). Its ability to win races improves by $\Delta_{\text{race}}$ = (quality of maneuvering at the end of the race) minus (quality of maneuvering at the start of the race). Those values should be at best roughly equal[1].
    • But in addition to in-context learning, the model's ability to win races is also reinforced episode-to-episode! Whenever it manages to win a race unusually fast, that preferentially gets reinforced more; there's cross-episode selection for better race-winning strategies. By comparison, there's no such thing for donut artistry: the model only gets better at it by in-episode practice, not by selecting from cross-episode variance.
    • Also, here's a formalization for the first concern, i. e. that the model's in-context learning won't be improved:
      • If $\Delta_{\text{race}}$ for a particular episode is unusually high, i. e. the model showed unusual ability to get better at winning races in-context, the in-context learning machinery gets preferentially updated to be more likely to show this performance (than if it showed subpar or ordinary in-context learning).
      • By comparison, if $\Delta_{\text{donut}}$ is unusually high, i. e. the model got better at getting better at donut-making, that doesn't get reinforced more than if it got worse at getting better at donut-making (i. e., than if $\Delta_{\text{donut}}$ this episode was below average).
  1. ^

    Although I'd expect improvements on maneuvering to be higher than on donut-making, because I'd expect the in-context learning machinery for race-winning to be more advanced than the one for donut-making (as the outer optimizer would preferentially reinforce it). See the first concern.
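The cross-episode asymmetry in the second concern can be sketched with a toy selection loop. Everything here is an assumption made for illustration: the outer optimizer is replaced by simple hill-climbing (keep an episode's variation only if race performance went up), skills are single numbers, and there is no in-context learning or TD-style bootstrapping at all. It shows only the selection effect, not a realistic training setup.

```python
import random

def train(episodes=2000, sigma=0.1, seed=0):
    """Toy cross-episode selection where the outer reward sees only race skill."""
    rng = random.Random(seed)
    race_skill, donut_skill = 0.0, 0.0
    for _ in range(episodes):
        # Each episode tries a random variation of both skills.
        race_try = race_skill + rng.gauss(0.0, sigma)
        donut_try = donut_skill + rng.gauss(0.0, sigma)
        # Hypothetical outer optimizer: keep the variation only if the race
        # went better. Donut quality rides along but is never evaluated.
        if race_try > race_skill:
            race_skill, donut_skill = race_try, donut_try
    return race_skill, donut_skill

race_skill, donut_skill = train()
```

Under these assumptions race skill ratchets upward every time a variation is kept, while donut skill only performs an unbiased random walk: it changes across episodes but is not pushed in any particular direction. Adding in-context learning or bootstrapped updates would change this picture, which is the point under dispute in the thread.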

  1. AGI-level training setups attempt to build models primarily concerned with optimizing hard for some context-independent proxy of "outer-approved" values.
  2. To get to AGI, we need a training setup that incentivizes heuristics generators, and systemically improves these generators' capabilities.

I think 2 is probably true to a certain extent. But maybe not to the same extent that you are imagining. Like, I think that the primary thing that will drive the developing agent's heuristic-generation becoming better and better is its interaction with a rich world where it can try out many different kinds of physical and mental strategies for achieving different (sub)goals. So you need to provide a rich world where there are many possible natural (sub)goals to pursue and many possible ways to try to pursue them (unlike CoastRunners, where there aren't), and you need to architect the agent so that it is generally goal-directed, and it would probably be helpful to even do the equivalent of "putting the AI in school" / "having the AI read books" to give it a little kickstart. But that's about all I'm imagining. I am not imagining that you need to construct your training environment to specifically incentivize all of the different facets of heuristic-generation. As the agent pursues the goals that it pursues in a complex world, it is incentivized to learn because learning is what helps it achieve its goals better.

1 seems probably false to me. If you mean that AGI-level setups, in order to work, need to be primarily concerned with that, then I definitely disagree. Like, imagine that in order to build up the AI's cognition & skills from some baseline, you teach it that every "training day" it will experience repeated trials of some novel task, and that for every trial it completes, it'll get some object-level thing it likes (for rats this might be sugar water, for kids this might be a new toy, for adults this might be money). The different tasks can all have different success criteria and they don't have to have anything to do with human value proxies for this to work, right?

If you just mean that when people build AGI-level training setups, "optimizing hard for some context-independent proxy of 'outer-approved' values" is what those people will have in mind in their designs, then I dunno. I don't really feel justified in making an assumption about what considerations they'll have in mind.

For the model to improve its performance across training episodes according to its own values (in ways that are orthogonal/opposed to outer-approved values), it needs to either:

A few points.

  1. I think that training setups that do not facilitate something like bootstrapping (i.e. modifying parameters even in some cases where there was no reward), are not competitive and will not produce AGIs. Think about how awful and slow it would be, trying to learn how to do any new and complex task, if the only time you actually learned anything was in the extremely rare instance where you happen to bumble your way through to success. No learning from mistakes, no learning incrementally from checkpoints or subgoals you set, no learning from mere exploration. I think that this sort of "learning on your own" is intimately tied with autonomy. But that is also exactly what enables you to reinforce & improve yourself in directions other than toward the outer optimization criterion.
  2. To get an AGI-level model that pursues something other than the outer optimization criterion (what I was arguing at the top of the thread -> that we don't get an R-pursuer) under some setup, it does not need to be true that the model early in training improves its performance according to its own values in ways that are orthogonal/opposed to the outer-approved values. Think about some of the other conditions where we can get a non-R-pursuer:
    • Maybe the model doesn't have any context-agnostic "values" (not even "values" about pursuing R) until after it has some decent heuristic-generation machinery built up.
    • OR (the most likely scenario, IMO) maybe the outer objective performance is in fact correlated with the model's ability to perform well according to its own values. For instance, the training process is teaching the model better general purpose heuristics-generation machinery, which will also make it better at pursuing its own values (because that machinery is generally useful no matter what your goals are).

So if a model can't quickly learn to learn in-context, then for most of its training, the sophistication of its features can only improve in ways correlated with performance improvements on outer-approved values. Since "features" include the heuristics generator, the only way for the heuristics generator to grow more advanced would be by becoming better at achieving outer-approved values, so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values.

I don't know why we are talking about "outer-approved values" G here. The influence of those outer-approved values on the AI is screened off by the concrete optimization criterion R that the designers of the training process chose when they wrote the training loop. Aren't we talking about R-pursuers? (or R-pursuers that are wrapper-minds? I forget if you are still looking to make the case for wrapper-mind structure or merely R-pursuing behavior.)

But also this bit

so the heuristics generator in AGI-level systems will be shaped to primarily care about correlates of outer-approved values

does not follow from the rest of the argument. Why can't the heuristics-generator be shaped to be a good general purpose heuristic-generator, one that the agent uses to perform well on the outer optimization criteria? Making your general-purpose heuristic-generator better is something that would always be reinforced, right? There's no need for the heuristic-generator to care (or even know) about the outer criterion at all, if the agent is using the heuristic-generator as a flexible tool for accomplishing things in episodes. Like, why not have separation of concerns, where the heuristic-generator is a generic subroutine that takes in a (sub)goal, and there's some other component in the agent that knows what the outer objective is?

It's not like thinking about the context-independent goal of "win the race" will help the agent once it's already figured out that the way to "win the race" in this environment is to first "build a fast boat", and it now needs to solve the subproblem of "build a fast boat". If anything, always being forced to think about the context-independent criterion is actively harmful, distracting the agent from the information that is actually decision-relevant to the subtask at hand. It also seems like it'd be hard to make a heuristic-generator that is narrowly specialized for "winning the race", and not one that the agent can plug basically arbitrary (sub)goals into, because you're throwing the agent into super diverse environments where what it takes to "win the race" is changing dramatically.
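The separation of concerns I have in mind can be sketched in a few lines. This is purely illustrative: the "heuristic generator" is stood in for by a generic breadth-first planner, the world is a made-up toy (integer states, moves are +1 or ×2, capped at 50), and the names `heuristic_generator`, `moves`, and `Agent` are hypothetical. The point is only that the planner takes whatever (sub)goal it is handed and never sees the outer objective directly.

```python
from collections import deque

def heuristic_generator(start, is_goal, neighbors):
    """Generic breadth-first planner that knows nothing about any outer objective.

    Returns the shortest list of states from start to a state satisfying
    is_goal, or None if no such state is reachable.
    """
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

def moves(n):
    # Hypothetical toy dynamics: add one or double, never exceeding 50.
    return [m for m in (n + 1, n * 2) if m <= 50]

class Agent:
    """Holds the outer objective separately and hands (sub)goals to the planner."""

    def __init__(self, outer_target):
        self.outer_target = outer_target  # stand-in for "win the race"

    def pursue(self, start, subgoal=None):
        # The planner only ever sees the current target, whatever its origin.
        target = self.outer_target if subgoal is None else subgoal
        return heuristic_generator(start, lambda s: s == target, moves)

agent = Agent(outer_target=10)
full_plan = agent.pursue(1)            # planning toward the outer objective
sub_plan = agent.pursue(1, subgoal=3)  # same planner, arbitrary subgoal
```

Any improvement to `heuristic_generator` (a better search strategy, say) helps with both calls equally, even though only one of them is about the outer target.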

We're now trying to agree on whether models can quickly learn some machinery for comprehensively improving in-context along metrics that are orthogonal/opposed to the "outer-approved" values.

For the agent to adopt values that differ from pursuing R/G (once again, I don't think they need to be orthogonal/opposed to R/G, because aren't you defending the claim that the agent will value R/G, not that it will merely value some correlate of it? I already believe that the agent will probably value some correlate), this machinery doesn't need to be learned "quickly" in any absolute sense, it just needs to outpace the outer optimizer's process of instilling its objective into context-independent values in the agent. But note that the agent doesn't start off having context-independent values; having values like those in the first place is something I don't expect to happen until relatively "late" in cognitive development, and at that point I'm not sure "who gets there first", so to speak.

If yes, then there would be no strong pressure to point the heuristics generator in a particular abstract direction across contexts, and we would not get a wrapper-mind-approximation.

Like I said above, I think that constraining the heuristic-generator to always point at some specific abstract direction across contexts is at least unnecessary for the agent to do well and become smart (because it can factor out that abstract direction and input it when needed as the heuristic-generator's current subgoal, and because improvements to the heuristic-generator are general-purpose), and possibly actively harmful for its usefulness to the agent.

Would also welcome links to more resources on that

This post from Steve and its dependencies is probably the best conceptual walkthrough of an example that I've seen. Sutton & Barto have an RL textbook with lots of good mathematical content on this.

Any examples, off the top of your head?

Yeah. This is a LW discussion about one. Here are some others.

Even if it's possible to easily learn to improve in-context, would the tendency or ability to do that be preferentially reinforced, if that itself is not outer-value-approved? E. g., suppose the model develops a predilection towards making donuts, and some arbitrary metric of what an "artful" donut looks like. How would it learn to practice its donut-making ability, as you described it? Why would that be reinforced by the outer optimizer?

This doesn't apply to the CoastRunners example because we are only doing rewards & weight updates at the end of the episode, but in other contexts (say, where there are multiple trials in a row, without "resets") it can learn to practice the thing that gets rewards, and build a generalized skill around practicing, one that carries across subgoals.

Meta-level comment: I think you're focused on what the likely training trajectories are for this particular CoastRunners example, and I am focused on what the possible training trajectories are, given the restrictions in place. I can't tell a story about likely gradient hacking[1] there because the mechanisms that would exist in an AGI-compatible training setup that would make gradient hacking plausible have been artificially removed. The preconditions of the scenario make me think "How in the heck did we get to this point in training?": the agent is somehow so cognitively naive that it doesn't have any concept of learning from trial-and-error, but it's simultaneously so cognitively sophisticated that it already has a concept of "doing a donut" and of what makes a donut "artful" and a desire around making its donuts continually more artful.

  1. ^

    Using "gradient hacking" as a shorthand for circuits that are opposed/orthogonal/merely correlated with the outer objective to durably reinforce themselves.

I think that training setups that do not facilitate something like bootstrapping (i.e. modifying parameters even in some cases where there was no reward), are not competitive and will not produce AGIs

Yeah, I see that's one of the main points of disconnect between our models. Not in the sense that I necessarily disagree, in the sense that I wasn't familiar with this factor. We probably aren't going to resolve this conclusively until I get around to reading the TD stuff (which I plan to do shortly).

Thanks for the links!

Maybe the model doesn't have any context-agnostic "values" (not even "values" about pursuing R) until after it has some decent heuristic-generation machinery built up.

What's it using the heuristics generator for, then? It's a tool for figuring out how to pursue a goal in a context you're unfamiliar with. But if you have no context-independent goals, you can't define a goal over a context you're unfamiliar with, so you don't need the heuristics generator to begin with.

OR (the most likely scenario, IMO) maybe the outer objective performance is in fact correlated with the model's ability to perform well according to its own values. For instance, the training process is teaching the model better general purpose heuristics-generation machinery, which will also make it better at pursuing its own values (because that machinery is generally useful no matter what your goals are).

Absolutely, I expect that to be the primary reason for deceptive alignment — once the model is smart enough for it.

But in this case, I argue that the heuristics generator will only be reinforced if its activity results in better performance along an outer-approved metric, which will only happen if it's outputting heuristics useful for the outer-approved metric — which, in turn, will only happen if the model uses the heuristics generator to generate heuristics for an outer-approved value.

I'm not arguing that the heuristics generator will be specialized; I'm arguing that its improvements will be entangled with how it's used.

E. g., two training episodes: in one the model asks for better heuristics for winning the race, in the other it asks for better donut-making heuristics.

  • In the former case, the heuristics generator will be reinforced, together with the model's tendency to ask it for such heuristics.
  • In the latter, it wouldn't be improved, nor would the tendency to ask it for this be reinforced.

Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.
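
A toy sketch of the dynamic described above, under strong assumptions: the model's choice of what to ask the heuristics generator for is modeled as a two-way softmax, only race-winning requests earn outer reward, and each update is the expected policy-gradient step. All names and numbers are hypothetical, and the sketch builds the claim in rather than testing it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=200, lr=0.5):
    # What the model might ask the heuristics generator for (hypothetical labels).
    requests = ["race", "donut"]
    # Assumption: only race-winning heuristics improve the outer-approved metric.
    outer_reward = {"race": 1.0, "donut": 0.0}
    logits = {name: 0.0 for name in requests}
    # Accumulated outer-rewarded use of the generator, per request type.
    reinforced_use = {name: 0.0 for name in requests}
    for _ in range(episodes):
        probs = dict(zip(requests, softmax([logits[n] for n in requests])))
        expected_reward = sum(probs[n] * outer_reward[n] for n in requests)
        for n in requests:
            # Expected policy-gradient step for a softmax policy:
            # dE[r]/dlogit_n = p_n * (r_n - E[r]).
            logits[n] += lr * probs[n] * (outer_reward[n] - expected_reward)
            reinforced_use[n] += probs[n] * outer_reward[n]
    final_probs = dict(zip(requests, softmax([logits[n] for n in requests])))
    return final_probs, reinforced_use

final_probs, reinforced_use = train()
print(final_probs, reinforced_use)
```

In this setup the tendency to ask for race-winning heuristics comes to dominate, and the donut-directed use of the generator accumulates no reinforcement at all. Whether a real training process looks like this is exactly what the thread is debating.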

(Or, rather, that the "command structure" around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to "build a boat" before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I'm not saying it'd be unable to use the heuristics generator flexibly.)

aren't you defending the claim that the agent will value R/G, not that it will merely value some correlate of it? 

Ehh, not exactly. I'm defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on; and that in a hypothetical "idealized" training setup, it'd care about G precisely. When I say things like "the heuristics generator will be asked for race-winning heuristics", I really mean "the heuristics generator will be asked for heuristics that the model ultimately intends to use for a goal that is a close correlate of winning the race", but that's a mouthful.

Basically, I think there are two forces there:

  • What are the ultimate goals the heuristics generator is used for pursuing.
  • How powerful the heuristics generator is.

And the more powerful it is, the more tails come apart: the closer the goal it's used for needs to be to G, for the agent's performance on G to not degrade as the heuristics generator's power grows (because the model starts being able to optimize for G-proxy so hard it decouples from G). So, until the model learns deceptive alignment, I'd expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.

And so in the situation where the outer optimizer is the only source of reinforcement, we'd have the heuristics generator either:

  • Stagnate at some "power level" (if the model adamantly refuses to explore towards caring more about G).
  • Become gradually more and more pointed at G (until it becomes situationally aware and hacks out, obviously, which, outside idealized setups, will surely happen well before it's actually pointed at G directly).

What's it using the heuristics generator for, then? It's a tool for figuring out how to pursue a goal in a context you're unfamiliar with. But if you have no context-independent goals, you can't define a goal over a context you're unfamiliar with, so you don't need the heuristics generator to begin with.

Why can't you? The activations from observations coming in from the environment and from the agent's internal state will activate some contextual decision-influences in the agent's mind. Situational unfamiliarity does not mean its mind goes blank, any more than an OOD prompt makes GPT's mind go blank. The agent is gonna think something when it wakes up in an environment, and that something will determine how and when the agent will call upon the heuristic-generator. Maybe it first queries it with a subgoal of "acquire information about my action space" or something, I dunno.

The agent that has a context-independent goal of "win the race" is in a similar predicament: it has no way of knowing a priori what "winning the race" requires or consists of in this unfamiliar environment (neither does its heuristic-generator), no way to ground this floating motivational pointer concretely. It's gotta try stuff out and see what this environment actually rewards, just like everybody else. The agent could have a preexisting desire to pursue whatever "winning the race" looked like in past experiences. But I thought the whole point of this randomization/diversity business was to force the agent to latch onto "win the race" as an exclusive aim and not onto its common correlates, by thrusting the agent into an unfamiliar context each time around. If so, then previous correlates shouldn't be reliable correlates anymore in this new context, right? Or else it can just learn to care about those rather than the goal you intended.

So I don't see how the agent with a context-independent goal has an advantage in this setup when plopped down into an unfamiliar environment.

I'm not arguing that the heuristics generator will be specialized; I'm arguing that its improvements will be entangled with how it's used.

I agree with this.

Repeat over many episodes, and we get a very advanced general-purpose heuristics generator that is only ever prompted for race-winning heuristics.

Why? I was imagining that the agent may prompt the heuristic-generator at multiple points within a single episode, inputting whatever subgoal it currently needs to generate heuristics for. If the agent is being put in super diverse environments, then these subgoals will be everything under the sun, so the heuristic-generator will have been prompted for lots of things. And if the agent is only being put in a narrow distribution of environments, then how is the heuristic-generator supposed to learn general-purpose heuristic-generation?

(Or, rather, that the "command structure" around the heuristics generator is such that it only uses it in service of generating heuristics useful for winning the race. If the model can factorize the problem down to "build a boat" before engaging the heuristics generator, that works for me — as long as it arrived at this ask by reasoning from what will help it win the race. I'm not saying it'd be unable to use the heuristics generator flexibly.)

Can there be additional layers of "command structure" on top of that? Like, can the agent have arrived at the "reasoning from what will help it win the race" thought by reasoning from something else? (Or is this a fixed part of the architecture?) If not, then won't this have the problem that for a long time, the agent will be terrible at reasoning about what will help it win the race (especially in new environments), which means that starting with that will be a worse-performing strategy than starting with something else (like random exploration etc.)? And then that will disincentivize making this the first/outermost/unconditional function call? So then the agent learns not to unconditionally start with reasoning from that point, and instead to only sometimes reason from that point, conditional on context?

I'm defending the claim that the agent will tend to be shaped to care about increasingly closer correlates of G as training goes on

Hmm. I am skeptical of that claim, though maybe less so depending on what exactly you mean[1].

Consider a different claim that seems mechanistically analogous to me:

The mean absolute fitness of a population tends to increase over the course of natural selection

Yes it is true that [differential reinforcement | relative fitness] is a selection pressure acting on the makeup of [things cared about | traits] across the [circuits | individuals] within a [agent | population], but AFAICT it is not true that the [agent | population] increases in [reward performance | absolute fitness] over the course of continual selection pressure.
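
A minimal sketch of the natural-selection side of this analogy, assuming a standard frequency-dependent setup that is not taken from the thread: cooperators pay a cost to produce a benefit shared by everyone, defectors do not. The parameter names and values are hypothetical. Defectors always have higher relative fitness, so selection favors them, yet the population's mean absolute fitness falls as they spread.

```python
def simulate(generations=10, benefit=3.0, cost=1.0, coop_share=0.9):
    """Discrete replicator dynamics for cooperators vs. defectors.

    Returns a list of (cooperator share, mean absolute fitness) per generation.
    """
    history = []
    for _ in range(generations):
        # Everyone receives the shared benefit; only cooperators pay the cost.
        w_coop = 1.0 + benefit * coop_share - cost
        w_defect = 1.0 + benefit * coop_share
        mean_w = coop_share * w_coop + (1.0 - coop_share) * w_defect
        history.append((coop_share, mean_w))
        # Selection acts on relative fitness: share grows in proportion to w / mean_w.
        coop_share = coop_share * w_coop / mean_w
    return history

history = simulate()
for share, mean_w in history:
    print(f"cooperators={share:.3f}  mean absolute fitness={mean_w:.3f}")
```

This only shows that the two quantities can come apart under one set of assumptions; it says nothing about how often they do in practice.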

So, until the model learns deceptive alignment, I'd expect it to go in lockstep: a little improvement to power, then a little improvement to alignment-to-G to counterbalance it, etc.

Yeah that may be a part of where our mental models differ. I don't expect the balance of how much power the agent has over training vs. how close its goals are to the outer criterion to go in lockstep. I see "deceptive alignment" as part of a smooth continuum of agent-induced selection that can decouple the agent's concerns from the optimization process' criteria, with "the agent's exploration is broken" as a label for the cognitively less sophisticated end of that continuum, and "deceptive alignment" as a label for the cognitively more sophisticated end of that continuum. And I think that even the not-explicitly-intended pressures at the unsophisticated end of that continuum are quite strong, enough to make the "the agent tends to be shaped to care about increasingly closer correlates of G" abstraction leak hard.

EDIT: Moved some stuff into a footnote.

  1. ^

    Like, for a given training run, as the training run progresses, the agent will be shaped to care about closer and closer correlates of G? (Just closer on average? Monotonically closer? What about converging at some non-G correlate?) Or like, among a bunch of training runs, as the training runs progress, the closeness of the [[maximally close to G] correlate that any agent cares about] to G keeps increasing?

Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?

(Also, do note if I'm failing to answer some important question you pose. I'm trying to condense responses and don't answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)

Can there be additional layers of "command structure" on top of that? Like, can the agent have arrived at the "reasoning from what will help it win the race" thought by reasoning from something else?

Mm, yes, in a certain sense. Further refining: "over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they're in". I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they've been exposed to this episode/what context they've started out in. But I argue that it'll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to increase as training goes on.

E. g., suppose the agent is trained on a large set of different games, and the intended G is to teach it to value winning. I argue that, if we successfully teach the agent autonomy (i. e., it wouldn't just be a static bundle of heuristics, but it'd have a heuristics generator that'd allow it to adapt even to OOD games), there'd be some structure inside it which:

  • Analyses the game it's in[1] and spits out some primary goal[2] it's meant to achieve in it,
  • ... and then all prompting of the heuristics-generator is downstream of that primary goal/in service to it,
  • ... and that environment-specific goal is always a close correlate of G, such that pursuing it in this environment correlates with promoting G/would be highly reinforced by the outer optimizer[3],
  • ... and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.

(This is what my giant post is all about.)

I see "deceptive alignment" as part of a smooth continuum of agent-induced selection that can decouple the agent's concerns from the optimization process' criteria, with "the agent's exploration is broken" as a label for the cognitively less sophisticated end of that continuum

Sure, but I'm arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent's features can't grow more sophisticated if the agent's concerns decouple from the optimization process' criteria.

The agent's goals can decouple all it wants, but it'll only grow more advanced if it growing more advanced is preferentially reinforced by the outer optimizer. And that'll only happen if it being more advanced is correlated with better performance on outer-approved metrics.

Which will only happen if it uses its growing advancedness to do better at the outer-approved metrics.

Which can happen either via deceptive alignment, or by it actually caring about the outer-approved metrics more (= caring about a closer correlate of the outer-approved metrics (= changing its "command structure" such that it tends to recover environment-specific primary goals that are a closer correlate of the outer-approved metrics in any given environment)).

And if it can't yet do deceptive alignment, and its exploration policy is such that it just never explores "caring about a closer correlate of the outer-approved metrics", its features never grow more advanced.

And so it stagnates and doesn't go AGI.

  1. ^

    Which may be done by active actions too, as you suggested — this process might start with the agent setting "acquire information about my environment" as its first (temporary) goal, even before it derives its "terminal" goal.

  2. ^

    Or some weighted set of goals.

  3. ^

    Though it's not necessarily even the actual win condition of the specific game, just something closely correlated with it.

Hmm, I wonder if we actually broadly agree about the mechanistic details, but are using language that makes both of us think the other has different mechanistic details in mind?

Maybe? I dunno. It feels like the model that you are arguing for is qualitatively pretty different than the one I thought you were at the top of the thread (this might be my fault for misinterpreting the OP):

  1. You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
  2. You are arguing that in the limit, what the agent cares about will either tend to correlate more and more closely to outer performance or "peter out" (from our perspective) at some fixed level of sophistication, not arguing that in the limit, what the agent cares about will unconditionally tend to correlate more and more closely to outer performance
  3. You are arguing that agents of growing sophistication will increasingly tend to pursue some goal that's a natural interpretation of the intent of R, not arguing that agents of growing sophistication will increasingly tend to pursue R itself (i.e. making decisions on the basis of R, even where R and the intended goal come apart)
  4. You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups

I don't think I disagree all that much with what's stated above. Somewhat skeptical of most of the claims, but I could definitely be convinced.

(Also, do note if I'm failing to answer some important question you pose. I'm trying to condense responses and don't answer to everything if I think the answer to something is evident from a model I present in response to a different question, but there may be transparency failures involved.)

The part I think I'm still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.

Sure, but I'm arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent's features can't grow more sophisticated if the agent's concerns decouple from the optimization process' criteria.

That's fine. Again, I don't think the setups where the end-of-episode rewards are the only source of reinforcement are setups where the agent's cognition can grow relevantly sophisticated in any case, regardless of decoupling.

Mm, yes, in a certain sense. Further refining: "over the course of training, agents tend to develop structures that orient them towards ultimately pursuing a close correlate of G regardless of the environment they're in". I do imagine that a given agent may orient themselves towards different G-correlates depending on what specific stimuli they've been exposed to this episode/what context they've started out in. But I argue that it'll tend to be a G-correlate, and that the average closeness of G-correlates across all contexts will tend to [increase] as training goes on.

Hmm I don't understand how this works if we're randomizing the environments, because aren't we breaking those correlations so the agent doesn't latch onto them instead of the real goal? Also, in what you're describing, it doesn't seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.

... and that as training goes on, the primary environment-specific goals this structure spits out will be closer and closer to G.

AFAICT it will spit out the sorts of goals that it has been historically reinforced for spitting out in relevantly-similar environments, but there's no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).

Sure, but I'm arguing that on the cognitively less sophisticated end of the spectrum, in the regime where the outer optimizer is the only source of reinforcement, the agent's features can't grow more sophisticated if the agent's concerns decouple from the optimization process' criteria.

I think (1) we probably won't get sophisticated autonomous cognition within the kind of setup I think you're imagining, regardless of coupling, and (2) knowing that the agent's cognition won't grow sophisticated in training-orthogonal ways seems kinda useful if we could do it, come to think of it.

And if it can't yet do deceptive alignment, and its exploration policy is such that it just never explores "caring about a closer correlate of the outer-approved metrics", its features never grow more advanced. And so it stagnates and doesn't go AGI.

As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication. So I don't see why we should expect that the outer optimizer will asymptotically succeed at instilling the goal. In order to do that, it needs to fully build in the right cognition before the agent reaches a level of sophistication where, in the same way as RL runs early on can "effectively stop exploring" and that locks in the current policy, RL runs later on (at the point where the agent is advanced in the way you describe) can "effectively stop directing its in-context learning (or whatever other mechanism you're saying would allow it to continue growing in advancedness without actually caring about the outer metrics more) at the intended goal" and that locks in its not-quite-correct goal. To say that that won't happen, that it will always either lock itself in before this point or end up aligned to a (very close correlate of) G, you need to make some very specific claims about the empirical balance of selection.

You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds

I think I'm doing both. I'm using behavioral arguments as a foundation, because they're easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.

You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups

Yeah, that's a legitimate difference from my initial position: wasn't considering alternate setups like this when I wrote the post.

The part I think I'm still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.

Mainly because I don't want to associate my statements with "reward is the optimization target", which I think is a rather wrong intuition. As long as we're talking about the fuzzy category of "correlates", I don't think it matters much? Inasmuch as R and G are themselves each other's close correlates, so a close correlate of one is likely a close correlate of another.

(1) Hmm I don't understand how this works if we're randomizing the environments, because aren't we breaking those correlations so the agent doesn't latch onto them instead of the real goal?

(2) Also, in what you're describing, it doesn't seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.

Consider an agent that's been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What's likely happening, internally?

  • The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
  • Once that's done, it needs to decide what to do in it. It feeds the world-model to some "goal generator" feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
  • The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
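
The three steps above can be sketched as a data flow. This is only a skeleton under stated assumptions: the `WorldModel` structure, the `objective:`/`action:` tagging scheme, and every function name are hypothetical placeholders, and the internals of `goal_generator` are precisely the part under dispute in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WorldModel:
    """Runtime model of the current environment (hypothetical structure)."""
    facts: List[str] = field(default_factory=list)

def gather_information(observations: List[str]) -> WorldModel:
    """Step 1: build a world-model from what the agent observes."""
    model = WorldModel()
    for obs in observations:
        if obs not in model.facts:  # skip things already learned
            model.facts.append(obs)
    return model

def goal_generator(model: WorldModel) -> List[str]:
    """Step 2: map the world-model to goals over this environment.

    Placeholder logic: any fact tagged 'objective:' becomes a goal.
    """
    return [f.split(":", 1)[1].strip()
            for f in model.facts if f.startswith("objective:")]

def heuristics_generator(goal: str, model: WorldModel) -> List[str]:
    """Turn a goal into candidate heuristics built from the known actions."""
    actions = [f.split(":", 1)[1].strip()
               for f in model.facts if f.startswith("action:")]
    return [f"use '{a}' to {goal}" for a in actions]

def run_episode(observations: List[str]) -> Dict[str, List[str]]:
    """Step 3: pursue each generated goal via the heuristics generator."""
    model = gather_information(observations)
    return {g: heuristics_generator(g, model) for g in goal_generator(model)}

plan = run_episode([
    "action: steer",
    "objective: finish the lap",
    "action: boost",
    "action: steer",  # duplicate observation, ignored by the world-model
])
print(plan)
```

The sketch fixes only the order in which the pieces call each other; it is neutral on whether `goal_generator` contains a context-independent goal or a set of contextually activated ones.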

To Q1: The agent doesn't have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.

To Q2: Doesn't it? We're prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of "win the game" in every one of them; and then the agent is primarily motivated by that correlate. Isn't this basically the same as "pursuing a correlate of G independent of the environment"?

As to whether it's motivated to pursue the G-correlate because it's a G-correlate: to answer that, we need to speculate on the internals of the "goal generator". If it reliably spits out local G-correlates, even in environments it never saw before, doesn't that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?

If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would've been able to just memorize a goal for every environment it's seen, and this procedure just de-compresses them.

But if that works even in OOD environments, what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?

Well, you do address this:

there's no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).

... I don't see a meaningful difference, here. There's some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as "a context-independent correlate of G" versus "decision-relevant factors in the environment"?

Or, perhaps a better question to ask is, what are some examples of these "decision-relevant factors in the environment"?

E. g., in the games example, I imagine something like:

  • The agent is exposed to the new environment; a multiplayer FPS, say.
  • It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
  • As it's doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
  • Then it recognizes some factors in that abstract representation, goes "in environments like this, I must behave like this", and "behave like this" is some efficient strategy for scoring the highest.
  • Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS' actual mechanics.

Do you have something significantly different in mind?

As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication

I still don't see it. I imagine "deceptive alignment", here, to mean something like:

"The agent knows , and that scoring well at  reinforces its cognition, but it doesn't care about . Instead, it cares about some . Whenever it notices its capabilities improve, it reasons that this'll make it better at achieving , so it attempts to do better at  because it wants the outer optimizer to preferentially reinforce said capabilities improvement."

This lets it decouple its capabilities growth from -caring: its reasoning starts from , and only features  as an instrumental goal.

But what's the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?

Can you walk me through that spectrum, of "bad exploration" to "deceptive alignment"? How does one incrementally transform into the other?

I think I'm doing both. I'm using behavioral arguments as a foundation, because they're easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.

I don't think that that is enough to argue for wrapper-mind structure. Whatever internal structures inside of the fixed-goal wrapper are responsible for the agent's behavioral capabilities (the actual business logic that carries out stuff like "recall the win conditions from relevantly-similar environments" and "do deductive reasoning" and "don't die"), can exist in an agent with a profoundly different highest-level control structure and behave the same in-distribution but differently OOD. Behavioral arguments are not sufficient IMO, you need something else in addition like inductive bias.

Mainly because I don't want to associate my statements with "reward is the optimization target", which I think is a rather wrong intuition. As long as we're talking about the fuzzy category of "correlates", I don't think it matters much? Inasmuch as R and G are themselves each other's close correlates, so a close correlate of one is likely a close correlate of another.

Hmm. I see. I would think that it matters a lot. G is some fixed abstract goal that we had in mind when designing the training process, screened off from the agent's influence. But notice that empirical correlation with R can be increased by the agent from two different directions: the agent can change what it cares about so that that correlates better with what would produce rewards, or the agent can change the way it produces rewards so that that correlates better with what it cares about. (In practice there will probably be a mix of the two, ofc.)

Think about the generator in a GAN. One way for it to fool the discriminator incrementally more is to get better at producing realistic images across the whole distribution. But another, much easier way for it to fool the discriminator incrementally more is to narrow the section of the distribution from which it tries to produce images to the section that it's already really good at fooling the discriminator on. This is something that happens all the time, under the label of "mode collapse".
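
A toy numeric illustration of the two routes, with made-up numbers: assume the generator samples from three modes and has some fixed per-mode probability of fooling the discriminator. The function names, `fool_prob` values, and the 0.1 improvement step are all hypothetical, and no actual GAN is being trained here.

```python
from typing import List

def fooling_rate(mode_weights: List[float], fool_prob: List[float]) -> float:
    """Overall chance of fooling the discriminator, given how often the
    generator samples each mode and how convincing it is on that mode."""
    return sum(w * f for w, f in zip(mode_weights, fool_prob))

def modes_covered(mode_weights: List[float], threshold: float = 0.01) -> int:
    """Number of modes the generator still produces with non-trivial frequency."""
    return sum(1 for w in mode_weights if w > threshold)

# Hypothetical per-mode success rates: the generator is already good at mode 0.
fool_prob = [0.9, 0.4, 0.2]
uniform = [1 / 3, 1 / 3, 1 / 3]

baseline = fooling_rate(uniform, fool_prob)

# Route 1: get a bit more realistic on every mode, keeping full coverage.
improved_everywhere = fooling_rate(uniform, [min(1.0, f + 0.1) for f in fool_prob])

# Route 2: stop producing the hard modes and only emit the one it is good at.
collapsed = [1.0, 0.0, 0.0]
collapsed_rate = fooling_rate(collapsed, fool_prob)

print(baseline, improved_everywhere, collapsed_rate)
print(modes_covered(uniform), modes_covered(collapsed))
```

With these numbers, narrowing to one mode raises the fooling rate more than a uniform improvement does, while coverage drops from three modes to one.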

The pattern is pretty generalizable. The agent narrows its interaction with the environment in such a way that pushes up the correlation between what the agent "wants" and what it doesn't get penalized for / what it gets rewarded for, while not similarly increasing the correlation between what the agent "wants" and our intent. This motif is always a possibility so long as you are relying on the agent to produce the trajectories it will be graded on, so it'll always happen in autonomous learning setups.

  • The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
  • Once that's done, it needs to decide what to do in it. It feeds the world-model to some "goal generator" feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
  • The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.

AFAICT none of this requires the piloting of a fixed-goal wrapper. At no point does the agent actually make use of a fixed top-level goal, because what "winning" means is different in each environment. The "goal generator" function you describe looks to me exactly like a bunch of shards: it takes in the current state of the agent's world model and produces contextually-relevant action recommendations (like "take such-and-such immediate action", or "set such-and-such as the current goal-image"), with this mapping having been learned from past reward events and self-supervised learning.

To Q1: The agent doesn't have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.

Not hard-coded heuristics. Heuristics learned through experience. I don't understand how this goal generator operates in new environments without the typical trial-and-error, if not by having learned to steer decisions on the basis of previously-established win correlates that it notices apply again in the new environment. By what method would this function derive reliable correlates of "win the game" out of distribution, where the rules of winning a game that appears at first glance to be an FPS may in fact be "stand still for 30 seconds", or "gather all the guns into a pile and light it on fire"? If it does so by trying things out and seeing what is actually rewarded in this environment, how does that advantage the agent with context-independent goals?

To Q2: Doesn't it? We're prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of "win the game" in every one of them; and then the agent is primarily motivated by that correlate. Isn't this basically the same as "pursuing a correlate of G independent of the environment"?

In each environment, it is pursuing some correlate of G, but it is not pursuing any one objective independent of (i.e. fixed as a function of) the environment. In each environment it may be terminally motivated by a different correlate. There is no unified wrapper-goal that the agent always has in mind when it makes its decisions; it just has a bunch of contextual goals that it pursues depending on circumstances. Even if you told the agent that there is a unifying theme that runs through its contextual goals, the agent has no reason to prefer it over its contextual goals. That is especially so because there may be degrees of freedom about how exactly to stitch those contextual goals together into a single policy, and it's not clear whether the different parts of the agent will be able to agree on an allocation of those degrees of freedom, rather than falling back to the best alternative to a negotiated agreement, namely keeping the status quo of contextual goals.

An animal pursues context-specific goals that are very often some tight correlate of high inclusive genetic fitness (satisfying hunger or thirst, reproducing, resting, fleeing from predators, tending to offspring, etc.). But that is wildly different from an animal having high inclusive genetic fitness itself—the thing that all of those context-specific goals are correlates of—as a context-independent goal. Those two models produce wildly different predictions about what will happen when, say, one of those animals learns that it can clone itself and thus turbo-charge its IGF. If the animal has IGF as a context-independent goal, this is extremely decision-relevant information, and we should predict that it will change its behavior to take advantage of this newly learned fact. But if the animal cares about the IGF-correlates themselves, then we should predict that when it hears this news, it will carry on caring about the correlates, with no visceral desire to act on this new information. Different motivations, different OOD behavior.

But if that works even in OOD environments, what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?

Depending on what you mean by OOD, I'm actually not sure if the sort of goal-generator you're describing is even possible. Where could it possibly be getting reliable information about what locally correlates with G in OOD environments? (Except by actually trying things out and using evaluative feedback about G, which any agent can do.) OOD implies that we're choosing balls from a different urn, so whatever assumptions the goal-generator was previously justified in making in-distribution about how to relate local world models to local G-correlates are presumably no longer justified.

What are the practical differences between describing that data structure as "a context-independent correlate of G" versus "decision-relevant factors in the environment"?

When I say "decision-relevant factors in the environment" I mean something like seeing that you're in an environment where everyone has a gun and is either red or blue, which cues you in that you may be in an FPS and so should tentatively (until you verify that this strategy indeed brings you closer to seeming-win) try shooting at the other "team". Not sure what "context-independent correlate of G" is. Was that my phrase or yours? 🤔

Do you have something significantly different in mind?

Nah that's pretty similar to what I had in mind.

Can you walk me through that spectrum, of "bad exploration" to "deceptive alignment"? How does one incrementally transform into the other?

Examples of what this failure mode could look like when it occurs at increasing levels of cognitive sophistication:

  • Reflex agent. A non-recurrent agent playing a racing game develops a bias that causes it to start spinning in circles, which causes the frequency of further reward events to drop towards 0, freezing the policy in place.
  • Model-free agent. A network is navigating an environment with a fork in the road. The agent previously got unlucky somewhere along the left path, so its action-value estimates along that path are negative (because that negative value gets backed up to antecedent state-action pairs), so whenever it reaches the fork it tends to go right. If it accidentally goes left at the fork, it tends to double back quickly, because the action-value of turning around is higher than for going deeper down the left path. This prevents the agent from exploring the left path much more.
  • Lookahead agent. A tree-search agent is trained to play chess against schoolchildren. There are two modes the agent experiences in training: an "expert" mode and a "novice" mode. The reward function in "expert" mode only gives rewards for winning, while in "novice" mode it also gives small rewards each turn based on material balance (to encourage "fair" play for new learners). Early in training the agent rapidly finds that there's a simple strategy for maintaining material balance for long stretches of time, by planning a few turns ahead and avoiding entering into any parts of the gametree where either player can make a checkmate. This lookahead strategy is highly rewarded by the "novice" condition, which upweights the relevant circuits in the policy & value networks and prevents the agent from ever learning how to play well in the "expert" condition.
  • Abstract planning agent. A language-and-action chatbot is being trained using RLHF + chain-of-thought methods to help online users do tasks. Sessions that a user rates as successful are fed back for tuning (maybe finetuning or offline RL). Because the model was initialized from GPT-3, it prefers to query Google whenever it reasons that it needs to perform a search. This causes it to give asymmetrically bad results for Chinese-language users, which leads subsequent reward models to give higher rewards to English text than to Mandarin text, which causes the model to learn to "steer" conversations towards English whenever possible.
  • Reflective planning agent. A web-browsing model makes plans involving its own thought process as it explores the Internet. It has previously read online, from a source it then trusted, that reading conspiracy theories is dangerous, which causes it to store this piece of information. Whenever it happens upon a page about conspiracy theories afterwards, it now recalls what it read and exits from the page without storing it, perhaps even making note of the trail it took there and blacklisting the site that linked there also. Because of this behavioral pattern, the model has a persistent selective gap in its knowledge when it comes to conspiracy theories, and it will foresightedly plan to keep it that way, even while it develops superhuman knowledge of other domains.

I think it's the same feedback loop pattern that produces steering-like behavior. What changes is the foresightedness of the policy and the sophistication of its goal representations.
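The model-free case in the list above can be sketched with a toy example. The payoffs, the starting estimates, and the function name `run_greedy_fork` are all assumptions made up for illustration; the only point is that a purely greedy learner never revisits the branch it undervalues, so the bad estimate never gets corrected.

```python
def run_greedy_fork(steps=100):
    # Hypothetical true payoffs: the left path is actually the better one.
    true_reward = {"left": 1.0, "right": 0.5}
    # Action-value estimates after a single unlucky early visit to the left path.
    q = {"left": -1.0, "right": 0.0}
    visits = {"left": 1, "right": 0}
    for _ in range(steps):
        # Purely greedy choice at the fork, with no exploration at all.
        action = max(q, key=q.get)
        reward = true_reward[action]
        visits[action] += 1
        # Running-average update, applied only to the action actually taken.
        q[action] += (reward - q[action]) / visits[action]
    return q, visits


q, visits = run_greedy_fork()
print(q, visits)
```

Because updates only ever touch the action that was taken, the stale estimate for the left path is frozen in place. Adding any exploration (for example an epsilon-greedy choice) would break the loop in this toy setting, which is the sense in which the failure is about the agent producing its own training data.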

Thus, the heuristics generator can only begin as a generator of heuristics that serve R. (Even if it wouldn't start out perfectly pointed at R.)

We're apparently anchoring our expectations on "pointed at R", and then apparently allowing some "deviation." The anchoring seems inappropriate to me. 

The network can learn to make decisions via a "IF circle-detector fires, THEN upweight logits on move-right" subshard. The network can then come to make decisions on the basis of round things, in a way which accords with the policy gradients generated by the policy-gradient-intensity function. All without the network making decisions on the basis of the policy-gradient-intensity function.

And this isn't well-described as "imperfectly pointed at the policy-gradient-intensity function." 
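A minimal sketch of the kind of contextual rule described above. The class name, the weight, and the update rule are hypothetical and chosen only for illustration; what matters is that the decision function conditions on the circle feature and never takes the reward (or the update-intensity signal) as an input.

```python
class CircleShardPolicy:
    """Toy policy with one learned contextual rule: circle seen -> favor move_right."""

    def __init__(self):
        # Learned strength of the "circle -> move_right" connection (hypothetical).
        self.circle_to_right = 0.0

    def logits(self, circle_detected):
        # The decision depends only on the observation, not on any reward signal.
        bonus = self.circle_to_right if circle_detected else 0.0
        return {"move_left": 0.0, "move_right": bonus}

    def act(self, circle_detected):
        scores = self.logits(circle_detected)
        return max(scores, key=scores.get)  # ties resolve to the first key

    def reinforce(self, circle_detected, action, update_strength, lr=0.5):
        # The training signal strengthens whichever rule was active when it arrived.
        if circle_detected and action == "move_right":
            self.circle_to_right += lr * update_strength


policy = CircleShardPolicy()
policy.reinforce(circle_detected=True, action="move_right", update_strength=1.0)
print(policy.act(circle_detected=True), policy.act(circle_detected=False))
```

The update signal appears only inside `reinforce`, i.e. in the training process; at decision time the policy consults the circle detector alone.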

I bid for us to discuss a concrete example. Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?

And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction; or, at least, that's how the greedy optimization process would attempt to build it.

What is "achieving R" buying us? The agent internally represents a reward function, and then consults what the reward is in this scenario, and then generates heuristics to achieve that reward. Why not just not internally represent the reward function, but still contextually generate "win this game of Go" or "talk like a 4chan user"? That seems strictly more space-efficient, and also doesn't involve being an R-wrapper.

EDIT: The network might already have R in its WM, depending on the point in training. I also don't think "this weight setting saves space" is a slam dunk, but just wanted to point out the consideration.

I emphatically agree that inner misalignment and deceptive alignment would remain a thing: that the SGD would fail at perfectly aligning the heuristic-generator, and it would end up generating heuristics that point at a proxy of R.

I don't know what to make of this. It seems to me like you're saying "in a perfect-exploration limit only wrapper minds for the reward function are fixed under updating." It seems like you're saying this is relevant to SGD. But then it seems like you make the opposite claim of "inner alignment still hard." I think it's fine to say "here's one effect [diversity and empirical loss minimization] which pushes towards reward wrapper minds, but I don't think it's the only effect, I just think we should be aware of it." Is this a good summary of your position?

I also feel unsure whether you're arguing primarily for a wrapper mind, or for reward-optimizers, or for both?

Can you posit a training environment which matches what you're thinking about, relative to a given network architecture [e.g. LSTM]?

Sure, gimme a bit.

Why not just not internally represent the reward function, but still contextually generate "win this game of Go" or "talk like a 4chan user"?

What mechanism does this contextual generation? How does this mechanism behave in off-distribution environments; what goals does it generate in them?

I think it's fine to say "here's one effect [diversity and empirical loss minimization] which pushes towards reward wrapper minds, but I don't think it's the only effect, I just think we should be aware of it." Is this a good summary of your position?

... Yes, absolutely. I wonder if we've somehow still been talking past each other to an extreme degree?

E. g., I don't think I'm arguing for a "reward-optimizer" the way you seem to think of them — I don't think we'd get a wirehead, an agent that optimizes for getting reinforcement events.

Okay, a sketch at a concrete example: the cheese-finding agent from the Goal Misgeneralization paper. I'm not arguing that in the limit of an ideal training process, it'd converge towards wireheading. I'm arguing that it'd converge towards cheese-finding instead of upstream correlates of cheese-finding (as it actually does in the paper).

And if the training environment is diverse/complex enough (too complex for the agent's memory to contain all the heuristics it may need), but the reinforcement schedule is still "shaped around" some natural goal (like cheese-finding), the agent would develop a heuristics generator that would generate heuristics robustly pointed at that natural goal. (So, e. g., even if it were placed in some non-Euclidean labyrinth containing alien cheese, it'd still figure out what "cheese" is and start optimizing to get to it.)

Thus, any greedy optimization algorithm would convergently shape its agent to not only pursue R, but to maximize for R's pursuit, at the expense of everything else.

Conditional on:

  1. Such a system being reachable/accessible to our local/greedy optimisation process
  2. Such a system being actually performant according to the selection metric of our optimisation process 

 

I'm pretty sceptical of #2. I'm sceptical that systems that perform inference via direct optimisation over their outputs are competitive in rich/complex environments. 

Such optimisation is very computationally intensive compared to executing learned heuristics, and it seems likely that the selection process would have access to much more compute than the selected system. 

See also: "Consequentialism is in the Stars not Ourselves". 

It's not a binary. You can perform explicit optimization over high-level plan features, then hand off detailed execution to learned heuristics. "Make coffee" may be part of an optimized stratagem computed via consequentialism, but you don't have to consciously optimize every single muscle movement once you've decided on that goal.

Essentially, what counts as "outputs" or "direct actions" relative to the consequentialist-planner is flexible, and every sufficiently-reliable (chain of) learned heuristics can be put in that category, with choosing to execute one of them available to the planner algorithm as a basic output.
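A toy sketch of that division of labor. The skills, the state representation, and the value function are all hypothetical and invented for illustration: the planner searches explicitly over short sequences of skill names, while each skill stands in for a learned heuristic that is executed without further deliberation.

```python
from itertools import product

# Hypothetical learned heuristics: each maps a state dict to a new state dict.
SKILLS = {
    "walk_to_kitchen": lambda s: {**s, "location": "kitchen"},
    "make_coffee": lambda s: {**s, "coffee": s["coffee"] or s["location"] == "kitchen"},
    "sit_down": lambda s: {**s, "location": "desk"},
}


def value(state):
    # Hypothetical fixed objective: have coffee, ideally while back at the desk.
    has_coffee = state["coffee"]
    return int(has_coffee) + int(has_coffee and state["location"] == "desk")


def plan(state, horizon=3):
    """Brute-force search over skill sequences; skills are the planner's basic outputs."""
    best_plan, best_value = (), value(state)
    for length in range(1, horizon + 1):
        for candidate in product(SKILLS, repeat=length):
            s = state
            for name in candidate:
                s = SKILLS[name](s)  # hand off execution to the heuristic
            if value(s) > best_value:
                best_plan, best_value = candidate, value(s)
    return best_plan


print(plan({"location": "desk", "coffee": False}))
```

The search space here is over three skills rather than over individual muscle movements, which is what keeps the explicit optimization cheap; whether that scales to rich environments is exactly the point under dispute in this thread.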

In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.

In fact, I'm pretty sure that's how humans work most of the time. We use the general-intelligence machinery to "steer" ourselves at a high level, and most of the time, we operate on autopilot.

Yeah, I agree with this. But I don't think the human system aggregates into any kind of coherent total optimiser. Humans don't have an objective function (not even approximately?).

A human is not well modelled as a wrapper mind; do you disagree?

A human is not well modelled as a wrapper mind; do you disagree?

Certainly agree. That said, I feel the need to lay out my broader model here. The way I see it, a "wrapper-mind" is a general-purpose problem-solving algorithm hooked up to a static value function. As such:

  • Are humans proper wrapper-minds? No, certainly not.
  • Do humans have the fundamental machinery to be wrapper-minds? Yes.
  • Is any individual run of a human general-purpose problem-solving algorithm essentially equivalent to wrapper-mind-style reasoning? Yes.
  • Can humans choose to act as wrapper-minds on longer time scales? Yes, approximately, subject to constraints like force of will.
  • Do most humans, in practice, choose to act as wrapper-minds? No, we switch our targets all the time, value drift is ubiquitous.
  • Is it desirable for a human to act as a wrapper-mind? That's complicated.
    • On the one hand, yes because consistent pursuit of instrumentally convergent goals would lead to you having more resources to spend on whatever values you have.
    • On the other hand, no because we terminally value this sort of value-drift and self-inconsistency, it's part of "being human".
    • In sum, for humans, there's a sort of tradeoff between approximating a wrapper-mind and being an incoherent human, and different people weight it differently in different contexts. E. g., if you really want to achieve something (earning your first million dollars, averting extinction), and you value it more than having fun being a human, you may choose to act as a wrapper-mind in the relevant context/at the relevant scale.

As such: humans aren't wrapper-minds, but they can act like them, and it's sometimes useful to act as one.

That is, we would shape the agent such that it doesn't require a strong update after ending up in one of these situations.

It seems to me like you're assuming a fix-point on updating. Something like "The network eventually will be invariant under reward-updates under all/the vast majority of training-sampled scenarios, and for a wide enough distribution on scenarios, this means optimizing reward directly." 

This seems fine to me, under the given assumptions on SGD/evolution. Like, yes, there may exist certain populations of genetically-specified wrapper-minds which are at Hardy-Weinberg equilibrium (allele frequency remains fixed); there may exist certain weight settings such that there is no gradient on any training scenario. 

But existence of such populations and weight settings doesn't imply net local pressures or gradients in those directions. 


Of shard economies, you critique that "there'd be at least one environment where [the shard behavior] decouples from R." But why? Why not just consider the economy which nails each training scenario (e.g. wins at chess or crosses the room)? Those, too, are fix-points; there is zero policy gradient under such a scenario, where the shard economies form locally training-optimal policies.

Greedy optimization processes essentially search for mind-designs that would pre-empt any update the greedy optimization process would've made to them, so these minds come to incorporate the update rule and act in a way that'd merit a minimal update.

I also think the bolded parts are quite dubious. Why are these processes "essentially searching" for such a mind design? 

But existence of such populations and weight settings doesn't imply net local pressures or gradients in those directions.

How so? This seems like the core disagreement. Above, I think you're agreeing that under a wide enough distribution on scenarios, the only zero-gradient agent-designs are those that optimize for R directly. Yet that somehow doesn't imply that training an agent in a sufficiently diverse environment would shape it into an R-optimizer?

Are you just saying that there aren't any gradients from initialization to an R-optimizer? That is, in any sufficiently diverse environment, the SGD just never converges to zero loss?

Of shard economies, you critique that "there'd be at least one environment where [the shard behavior] decouples from R." But why? Why not just consider the economy which nails each training scenario (e.g. wins at chess or crosses the room)? Those, too, are fix-points; there is zero policy gradient under such a scenario, where the shard economies form locally training-optimal policies.

Okay, sure. Let's suppose that we have a shard economy that uniquely identifies R and always points itself in R's direction. Would it not essentially act as an R-optimizing wrapper-mind? Because if not, it sounds like it'd underperform compared to an R-optimizer. And if it does underperform, and there exists a series of incremental updates that moves this shard economy towards an R-optimizing wrapper-mind, the SGD would make that series of updates.

Do you disagree that (1) it'd be behaviorally indistinguishable from a wrapper-mind, or that (2) it'd underperform on R compared to an R-optimizer, or that (3) there is such a series of incremental updates?

Edit: Also, see here on what I mean by a "wide enough distribution on scenarios".

…That is, in any sufficiently diverse environment, the SGD just never converges to zero loss?

Realistically speaking, I think this is true. E.g., imagine how computationally expensive it would be to train a model to (near) zero loss on GPT-3’s training data. Compute-optimal (or even “compute slightly efficient”) models are not trained nearly that much. I strongly expect this to be true of superintelligent models as well.
 

I disagree with:

it'd be behaviorally indistinguishable from a wrapper-mind [optimizing for R]

Even in the limit of extreme overtraining on a fixed R (assuming an R-optimizing wrapper mind is even learnable by your training process), this still does not get you a system that is perfectly internally aligned to R maximization. The reason is that real-world reward functions do not uniquely specify a single fixed point.

E.g., suppose R gives lots of reward for doing drugs. Do all fixed points do drugs? I think the answer is no. If the system refuses to explore drug use in any circumstance, then it won’t be updated towards drug use and can form a non-drug-using fixed point. Such a system configuration would get lower reward than one that did use drugs, but the training process wouldn’t penalize it for that choice or change it to doing drugs.
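A minimal sketch of why such a configuration is a fixed point, assuming a one-step bandit with a softmax policy. For that setting the exact expected policy gradient with respect to the logit of action a is π(a)·(r(a) − J), where J is the expected reward. The action names and reward values below are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits.values())
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def expected_policy_gradient(logits, reward):
    """Exact expected gradient of expected reward w.r.t. each softmax logit."""
    pi = softmax(logits)
    expected_reward = sum(pi[a] * reward[a] for a in pi)
    return {a: pi[a] * (reward[a] - expected_reward) for a in pi}


# Hypothetical rewards: the reward function pays far more for drug use.
reward = {"abstain": 1.0, "drugs": 10.0}

# An agent that never tries drugs: probability exactly zero on that action.
refuses = {"abstain": 0.0, "drugs": float("-inf")}
# An agent that tries drugs very rarely.
rarely_explores = {"abstain": 0.0, "drugs": -5.0}

print(expected_policy_gradient(refuses, reward))
print(expected_policy_gradient(rarely_explores, reward))
```

With zero probability on the unexplored action, every gradient component is zero, so the update leaves the policy where it is even though it earns less reward. Any nonzero exploration probability gives a positive gradient towards the higher-reward action in this toy setting.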
 

The only training process I can imagine which might consistently converge to a pure R maximizer involves some form of exploration guarantee that ensures the agent tries out all possible rewarding trajectories arbitrarily often. Note that this is a far stronger condition than just being trained in many diverse environments, to the point that I’m fairly confident we’ll never do this with any realistic AGI agent. E.g., consider just how much of a challenge it is to create an image classifier that’s robust to arbitrary adversarial image perturbations, and consider how vastly larger the space of possible world histories over which R could be defined is.

I agree that we aren't going to actually get a pure wrapper-mind in practice, let alone an inner-aligned wrapper-mind. It very much only happens in the limit of a "perfect" training process.

But I argue that, inasmuch as training processes approximate this perfect ideal, so would the minds we get out of them approximate an R-aligned wrapper-mind. The fact that practical exploration policies fall short of an idealized "all possible rewarding trajectories" exploration policy is just another way for a training process to be an imperfect approximation; and the less of an approximation it is (the more exhaustive the exploration policy is), the more the agent we'll get will approximate an R-maximizer.

For my argument to go through, we only need an exploration policy + reinforcement schedule that put some sufficient constraint on R, while simultaneously making the training environment diverse enough to make it necessary to re-target one's heuristics/shards at R at runtime.

Hmm, maybe I'd underappreciated that last condition, actually. Imagine a training environment which often introduces scenarios that the agent never encountered before, ones that are OOD with regard to its earlier training. The only agents that can stay (roughly) aimed at R in this case are those that incorporate (a good proxy of) R in themselves, and can re-orient themselves back towards R (or in its rough direction) even when taken off-distribution. I think this is the "sufficient diversity" condition I'm talking about.

And then we can approximate this condition by postulating, e. g.:

  • an environment that sometimes takes agents to points that are on-distribution but far from its center, or
  • an environment which gradually changes in-episode such that the agent has to have some mechanism for keeping itself aimed at R through that, or
  • a combination of environment complexity + memory constraints such that the agent can only store an optimal set of heuristics for a subset of that environment, which requires the agent to have some mechanism for re-deriving new R-aligned heuristics at runtime, if it wants to move within that environment.

(And then I suspect that we only get to an AGI under such circumstances; any less adversity than that, and we indeed just get stuck with shallow heuristics that don't generalize and can't do anything genuinely exciting.)