In this post, I appreciated two ideas in particular:
"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities.
In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end I expect to learn something if I investigate why that is.
The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural."
I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and reinforcement learning work mechanistically.
I had an extremely brief irl conversation with Alex Turner a while before reading this post, in which he said he believed outer and inner alignment aren't good frames. It was a response to me saying I wanted to cover inner and outer alignment on Rational Animations in depth. RA is still going to cover inner and outer alignment, but as a result of reading this post and the Training Stories system, I now think we should definitely also cover alternative frames and that I should read more about them.
I welcome corrections of any misunderstanding I may have of this post and related concepts.
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him.
I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it.
In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research.
I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
TurnTrout is obviously correct that "robust grading is... extremely hard and unnatural" and that loss functions "chisel circuits into networks" and don't directly determine the target of the product AI. Where he loses me is the part where he suggests that this makes alignment easier and not harder. I think that all this just means we have even less control over the policy of the resulting AI, the default end case being some bizarre construction in policyspace with values very hard to determine based on the recipe. I don't understand what point he's making in the above post that contradicts this.
I just wanted to say thanks for writing this. It is important, interesting, and helping to shape and clarify my views.
I would love to hear a training story where a good outcome for humanity is plausibly achieved using these ideas. I guess it'd rely heavily on interpretability to verify what shards / values are being formed early in training, and regular changes to the training scenario and reward function to change them before the agent is capable enough to subvert attempts to be changed.
Edit: I forgot you also wrote A shot at the diamond-alignment problem, which is basically this. Though it only assumes simple training techniques (no advanced interpretability) to solve a simpler problem.
I think this post was potentially too long :P
To some extent, I think it's easy to pooh-pooh finding a robust reward function (not maximally robust, merely way better than the state of the art) when you're not proposing a specific design for building an AI that does good things and not bad things. Not in the tone of "how dare you not talk about specifics," but more like "I bet this research direction would have to look more orthodox when you get down to brass tacks."
I think this post was potentially too long :P
I thought so as well, but then checked Ajeya's posts and realized many of them are significantly longer than this one, but still heavily commented. So I figured that people can just read this post if they want. I do think this is one of the most important posts I've ever written, and was surprised to see so few comments.
To some extent, I think it's easy to pooh-pooh finding a robust reward function (not maximally robust, merely way better than the state of the art) when you're not proposing a specific design for building an AI that does good things and not bad things. Not in the tone of "how dare you not talk about specifics," but more like "I bet this research direction would have to look more orthodox when you get down to brass tacks."
Someone saying ~this to me in an earlier draft of this post is actually why I wrote A shot at the diamond alignment problem. You'll notice that I didn't confront e.g. robust grading in that story, nor do the failure modes hinge on robust OOD grading by the reward function, nor do I think analogous challenges crop up elsewhere.
Yeah, fair enough. But I think stories about the diamond-maximizer, or value-child, do rely on robustness.
I'd split up "OOD performance" into two extremes. One extreme, let's call it "out of the generator," is situations that have no coherent reason to happen in our universe - they're thought experiments that you only find by random sampling, or by searching for adversarial examples, or other silly processes. E.g. a sequence of flashing lights and clicks that brainwashes you into liking to kick puppies. The other, "out of the dataset," is things that are heavily implied to exist (or have straightforward ways of coming about) by the training data, but aren't actually in the training data. E.g. a sequence of YouTube videos that teach you how to knit. Or like how avocado chairs weren't in DALL-E's training data.
When training your real-world agent with supervised RL, you have to be grading it somehow, and that grading process is constantly going to be presented with new inputs, which weren't in the training dataset until now but are logical consequences of a lawful universe, and could be predicted to happen by an AI that's modeling that universe. On these data points, you want your reward function to robustly keep being about the thing you want to teach to the AI, rather than having bad behavior that depends on the reward function's implementation details (e.g. failing to reward the agent for being near a diamond when the AI has hidden the diamond from the evaluation system).
I'm worried that sentiments like "there has to be robustness" can sometimes lose track of what level of robustness we're talking about -- what specific situations must be graded "robustly" in order for the training story to work. [EDIT: although I think you take some steps to avoid this failure, in your comment here!] To further ward against this failure mode -- for concreteness, what difficult situations do you think might need to be rewarded properly in order for the diamond-AI to generalize appropriately?
On these data points, you want your reward function to robustly keep being about the thing you want to teach to the AI, rather than having bad behavior that depends on the reward function's implementation details (e.g. failing to reward the agent for being near a diamond when the AI has hidden the diamond from the evaluation system).
Actually, I think I disagree. Why do you think this?
Actually, I think I disagree. Why do you think this?
Maybe it's something like too many natural abstractions. When the number of natural abstractions is small, you can just point in the right general direction, and then regularize your way to teaching the exact natural abstraction that's closest. When the number of abstractions is large, or you're trying to point to something very complicated, if you just point in the right general direction, there will be a natural abstraction almost wherever you point, and regularization won't move you towards something that seems privileged to humans.
Closeness of natural abstractions also makes it easier for gradient descent to change your goals - shards are now on a continuum, rather than moated off from each other. The typical picture of value change due to stimulus is something like heroin, which hijacks the reward center in a way that we typically picture as "creating new desires related to heroin." But if shards can be moved around by gradient descent, then you can have a different-looking kind of value change, of which an example might be updating a political tenet because the culture around you changes. It's still somewhat resisted by the prior shard, but it's hard to avoid because each change is small and the gradient updates are a consequence of a deep part of the environment. And it doesn't have to lead to internal disagreement: at each point in time one's values are just slowly changing in place.
So information leakage that reflects unintended optima of the actual evaluation function is bad for alignment with vanilla RL. E.g. systematic classification errors, or not working for a few minutes when some software freezes, or systematic biases in what kind of diamonds you're showing it, or accidentally showing it some cubic zirconia. This is going to update its values to something with more unintended optima, although not necessarily exactly the same unintended optima as were in the reward evaluation process.
So even given both points, I would conclude "yup, shard theory reasoning shows I can dodge an enormous robust-grading sized bullet. No dealing with 'nearest unblocked strategy', here!" And that was the original point of dispute, AFAICT.
something with more unintended optima
What do you have in mind with "unintended optima"? This phrasing seems to suggest that alignment is reasonably formulated as a global optimization problem, which I think is probably not true in the currently understood sense. But maybe that's not what you meant?
Even given all of this, why should reward function "robustness" be the natural solution to this? Like, what if you get your robust reward function and you're still screwed? It's very nonobvious that this is how you fix things.
Yeah, I sorta got sucked into playing pretend, here. I don't actually have much hope for trying to pick out a concept we'd want just by pointing into a self-supervised world-model - I expect us to need to use human feedback and the AI's self-reflectivity, which means that the AI has to want human feedback, and be able to reflect on itself, not just get pointed in the right direction in a single push. In the pretend-world where you start out able to pick out some good "human values"-esque concept from the very start, though, it definitely seems important to defend that concept from getting updated to something else.
What do you have in mind with "unintended optima"?
Sort of like in Goodhart Ethology. In situations where humans have a good grasp on what's going on, we can pick out some fairly unambiguous properties of good vs. bad ways the world could go. If the AI is doing search over plans, guided by some values that care about the world, then what I mean by an "unintended optimum" of those values is one that leads its search process to output plans that make the world go badly according to these human-obvious standards. (And an unintended optimum of the reward function rewards trajectories that are obviously bad.)
And an unintended optimum of the reward function rewards trajectories that are obviously bad
It seems not relevant if it's an optimum or not. What's relevant is the scalar reward values output on realized datapoints.
I emphasize this because "unintended optimum" phrasing seems to reliably trigger cached thoughts around "reward functions need to be robust graders." (I also don't like "optimum" of values, because I think that's really not how values work in detail instead of in gloss, and "optimum" probably evokes similar thoughts around "values must be robust against adversaries.")
To some extent, I think it's easy to pooh-pooh finding a flapping wing design (not maximally flappy, merely way better than the best birds) when you're not proposing a specific design for building a flying machine that can go to space. Not in the tone of "how dare you not talk about specifics," but more like "I bet this chemical propulsion direction would have to look more like birds when you get down to brass tacks."
Wait, but surely RL-developed shards that work like human values are the biomimicry approach here, and designing a value learning scheme top-down is the modernist approach. I think this metaphor has its wires crossed.
I wasn't intending for a metaphor of "biomimicry" vs "modernist".
(Claim 1) Wings can't work in space because there's no air. The lack of air is a fundamental reason for why no wing design, no matter how clever it is, will ever solve space travel.
If TurnTrout is right, then the equivalent statement is something like (Claim 2) "reward functions can't solve alignment because alignment isn't maximizing a mathematical function."
The difference between Claim 1 and Claim 2 is that we have a proof of Claim 1, and therefore don't bother debating it anymore, while with Claim 2 we only have an arbitrarily long list of examples for why reward functions can be gamed, exploited, or otherwise fail in spectacular ways, but no general proof yet for why reward functions will never work, so we keep arguing about a Sufficiently Smart Reward Function That Definitely Won't Blow up as if that is a thing that can be found if we try hard enough.
As of right now, I view "shard theory" sort of like a high-level discussion of chemical propulsion without the designs for a rocket or a gun. I see the novelty of it, but I don't understand how you would build a device that can use it. Until someone can propose actual designs for hardware or software that would implement "shard theory" concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it's not incredibly useful to me. However, I think it's worth engaging with the idea because if correct then other research directions might be a dead-end.
Does that help explain what I was trying to do with the metaphor?
Until someone can propose actual designs for hardware or software that would implement "shard theory" concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it's not incredibly useful to me. However, I think it's worth engaging with the idea because if correct then other research directions might be a dead-end.
Have you read A shot at the diamond alignment problem? If so, what do you think of it?
Yeah, but on the other hand, I think this is looking for essential differences where they don't exist. I made a comment similar to this on the previous post. It's not like one side is building rockets and the other side is building ornithopters - or one side is advocating building computers out of evilite, while the other side says we should build the computer out of alignmentronium.
"reward functions can't solve alignment because alignment isn't maximizing a mathematical function."
Alignment doesn't run on some nega-math that can't be cast as an optimization problem. If you look at the example of the value-child who really wants to learn a lot in school, I admit it's a bit tricky to cash this out in terms of optimization. But if the lesson you take from this is "it works because it really wants to succeed, this is a property that cannot be translated as maximizing a mathematical function," then I think that's a drastic overreach.
I realize that my position might seem increasingly flippant, but I really think it is necessary to acknowledge that you've stated a core assumption as a fact.
Alignment doesn't run on some nega-math that can't be cast as an optimization problem.
I am not saying that the concept of "alignment" is some bizarre metaphysical idea that cannot be approximated by a computer because something something human souls etc, or some other nonsense.
However, the assumption that "alignment is representable in math" directly implies "alignment is representable as an optimization problem" seems potentially false to me, and I'm not sure why you're certain it is true.
There exist systems that 1.) can be represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automata.
I don't think it is ridiculous to suggest that what the human brain does is 1.) representable in math, 2.) representable in a way that we could actually understand and re-implement on hardware / software systems, but 3.) not representable as an optimization problem where there exists some reward function to maximize or some loss function to minimize.
There exist systems that 1.) can be represented mathematically, 2.) perform computations, and 3.) do not correspond to some type of min/max optimization, e.g. various analog computers or cellular automata.
You don't even have to go that far. What about, just, regular non-iterative programs? Are type(obj) or json.dump(dict) or resnet50(image) usefully/nontrivially recast as optimization programs? AFAICT there are a ton of things that are made up of normal math/computation and where trying to recast them as optimization problems isn't helpful.
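To make this concrete, here is a minimal Python sketch (an illustration, not something from the thread). It uses json.dumps rather than json.dump so that no file handle is needed. The names direct and as_trivial_optimization and the candidate list are hypothetical. The sketch suggests that a plain computation can always be wrapped in an argmin, but the wrapper has to call the original function just to define its objective, so the optimization framing does no work.

```python
import json


def direct(obj):
    # An ordinary, non-iterative computation: serialize a dict to a string.
    return json.dumps(obj, sort_keys=True)


def as_trivial_optimization(obj, candidates):
    # "Recast" the same computation as an optimization problem.
    # The objective is 0 for the right answer and 1 otherwise, but defining
    # it requires calling direct(obj) first, so nothing is gained.
    target = direct(obj)
    return min(candidates, key=lambda s: 0 if s == target else 1)


obj = {"b": 1, "a": 2}
candidates = ['{"a": 2}', '{"a": 2, "b": 1}', "null"]
print(direct(obj))
print(as_trivial_optimization(obj, candidates))
```

Both calls return the same string. The second only gets there by computing the first internally.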
Updated mentions of "cognitive groove" to "circuit", since some readers found the former vague and unhelpful.
@TurnTrout You wrote "if I don’t, within about a year’s time, have empirically verified loss-as-chisel insights which wouldn’t have happened without that frame..."
More than a year later, what do you think?
TL;DR: One alignment strategy is to 1) capture “what we want” in a loss function to a very high degree (“robust grading”), 2) use that loss function to train the AI, and 3) get the AI to exclusively care about optimizing that objective.
I think that each step contains either a serious and unnecessary difficulty, or an unnecessary assumption. I think that:
Extended summary. My views on alignment have changed a lot recently. To illustrate some key points, I’m going to briefly discuss a portion of Paul Christiano’s AXRP interview (emphasis added):
My summary: One alignment strategy is to 1) capture “what we want” in a loss function to a very high degree (“robust grading”), 2) use that loss function to train the AI, and 3) get the AI to exclusively care about optimizing that objective.[3]
I think that each step contains either a serious and unnecessary difficulty, or an unnecessary assumption. I think that:
Therefore, for all alignment approaches which aim to align an agent to a robust grading scheme, I think that that approach is doomed. However, I am not equally critiquing all alignment-decompositions which have historically been called "outer/inner alignment" (for more detail, see Appendix A).
Here’s the structure of the essay, and some key points made within:
This post wouldn’t have happened without Quintin Pope’s ideas and feedback. Thanks to David Udell for extensive brainstorming help. Thanks to Evan Hubinger, Rohin Shah, Abram Demski, Garrett Baker, Andrew Critch, Charles Foster, Nora Belrose, Leo Gao, Kaj Sotala, Paul Christiano, Peter Barnett, and others for feedback. See here for a talk based on this essay.
I think that alignment research will be enormously advantaged by dropping certain ways of outer/inner-centric thinking for most situations, even though those ways of thinking do have some use cases. Even though this essay is critical of certain ways of thinking about alignment, I want to emphasize that I appreciate and respect the work that many smart people have done through these frames.
For reasoning about trained AI systems, I like Evan Hubinger’s “training stories” framework:
In a training story, the training goal is a mechanistic description of the model you hope to train, and the training rationale explains why you’ll train the desired model and not something else instead.
One popular decomposition of AI alignment is (roughly) into outer alignment and inner alignment. These subproblems were originally defined as follows:
More recently, Evan Hubinger defined these subproblems as:
I initially found these concepts appealing. Even recently, I found it easy to nod along: Yeah, we compute a reward function in a way which robustly represents what we want. That makes sense. Then just target the inner cognition properly. Uh huh. What kind of reward functions would be good?
But when I try to imagine any concrete real-world situation in which these conditions obtain, I cannot. I might conclude “Wow, alignment is unimaginably hard!”. No! Not for this reason, at least. The frame is inappropriate.[4]
I: Robust grading is unnecessary, extremely hard, and unnatural
In my opinion, outer alignment encourages a strange view of agent motivation. Here’s one reasonable-seeming way we could arrive at an outer/inner alignment view of optimization:
One major mistake snuck in when I said “The actor needs a grading procedure which, when optimized, leads to the selection of a diamond-producing plan.” I suspect that many (perceived) alignment difficulties spill forth from this single mistake, condemning us to an extremely unnatural and hard-to-align portion of mind-space.
Why is it a mistake? Consider what happens if you successfully inner-align the actor so that it wholeheartedly searches for plans which maximize grader evaluations (e.g. “how many diamonds does it seem like this plan will lead to?”). In particular, I want to talk about what this agent “cares about”, or the factors which influence its decision-making. What does this inner-aligned actor care about?
Agents which care about the outer objective will make decisions on the basis of the output of the outer objective. Maximizing evaluations is the terminal purpose of the inner-aligned agent’s cognition. Such an agent is not making decisions on the basis of e.g. diamonds or having fun. That agent is monomaniacally optimizing for high outputs.
On the other hand, agents which terminally value diamonds will make decisions on the basis of diamonds (e.g. via learned subroutines like “IF diamond nearby, THEN bid to set planning subgoal: navigate to diamond”). Agents which care about having fun will make decisions on the basis of having fun. Even though people often evaluate plans (e.g. via their gut) and choose the plan they feel best about (e.g. predicted to lead to a fun evening), finding a highly-evaluated plan isn’t the point of the person’s search. The point is to have fun. For someone who values having fun, the terminal purpose of their optimization is to have fun, and finding a highly evaluated plan is a side effect of that process.
“The actor needs a grading procedure which, when optimized against, leads to the selection of a diamond-producing plan” is a mistake because agents should not terminally care about optimizing a grading procedure. Generating highly evaluated plans should be a side effect of effective cognition towards producing diamonds.
Consider what the actor cares about in this setup. The actor does not care about diamond production. The actor cares about high evaluations from the objective function. These two goals (instrumentally) align if the only actor-imaginable way to get maximal evaluation is to make diamonds.
(This point is important under my current views, but it strikes me as the kind of concept which may require its own post. I’m not sure I know how to communicate this point quickly and reliably at this point in time, but this essay has languished in my drafts for long enough. For now, refer to Don't align agents to evaluations of plans and Alignment allows "nonrobust" decision-influences and doesn't require robust grading for more intuitions.)
If you inner-align the agent to the evaluative output of a Platonic outer objective, you have guaranteed the agent won’t make decisions on the same basis that you do. This is because you don’t, on a mechanistic level, terminally value high outputs from that outer objective. This agent will be aligned with you only if you achieve “objective robustness”—i.e. force the agent to make diamonds in order to get high evaluations by the outer objective.
Motivation via evaluations-of-X incentivizes agents to seek out adversarial inputs to the evaluative outer objective (e.g. “how many diamonds a specific simulated smart person expects of a plan”), since if there’s any possible way to get an even higher output-number, the inner-aligned agent will try to exploit that opportunity. I’m 95% confident that outer objectives will have adversarial inputs which have nothing to do with what we were attempting to grade on, because the input-space is exponentially large, the adversaries superintelligent, and real-world evaluative tasks are non-crisp/non-syntactic. This case is made in depth in don't design agents which exploit adversarial inputs. Don’t build agents which care about evaluations of X. Build agents which care about X.
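A toy sketch of the failure mode described above, under loud assumptions: plans are just integers, true_diamonds is a made-up ground truth, and grader is a made-up evaluation procedure that is correct everywhere except for a single upward error at plan 777. None of these names come from the essay. The sketch only illustrates that an agent which argmaxes the grader's output will land on the upward error once its search space is large enough to contain it.

```python
def true_diamonds(plan: int) -> int:
    # Hypothetical ground truth: how many diamonds the plan really produces.
    return plan % 10


def grader(plan: int) -> int:
    # Hypothetical imperfect grader: agrees with the ground truth on every
    # plan except one, where it makes a large upward error.
    if plan == 777:
        return 1000
    return true_diamonds(plan)


def grader_optimizer(plans):
    # An agent that cares about evaluations: pick the highest-graded plan.
    return max(plans, key=grader)


small_search = grader_optimizer(range(100))    # adversarial input not reachable
large_search = grader_optimizer(range(1000))   # adversarial input reachable

print(small_search, true_diamonds(small_search))
print(large_search, true_diamonds(large_search))
```

With the small plan space the grader-optimizer picks a plan that really does maximize diamonds. With the larger one it picks the plan the grader mistakenly loves, which produces fewer diamonds than the best honest plan.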
This conflict-of-interest between evaluations-of-X and X is why you need to worry about e.g. “nearest unblocked strategy” and “edge instantiation” within the outer/inner alignment regime. If you’re trying to get an agent to optimize diamonds by making it optimize evaluations, of course the agent will exploit any conceivable way to get high evaluations without high diamonds. I tentatively conjecture[6] (but will not presently defend) that these problems are artifacts of the assumption that agents must be grader-optimizers (i.e. a smart “capabilities” module which optimizes for the outputs of some evaluation function, be that a utility function over universe-histories, or a grader function over all possible plans). But when I considered the problem with fresh eyes, I concluded that alignment allows "nonrobust" decision-influences and doesn't require robust grading.
The answer is not to find a clever way to get a robust outer objective. The answer is to not need a robust outer objective. Robust grading incentivizes an inner-aligned AI to search for upwards errors in your grading procedure, but I think it’s easy to tell plausible training stories which don’t require robust outer objectives.
Outer/inner introduces indirection
We want an AI which takes actions which bring about a desired set of results (e.g. help us with alignment research or make diamonds). Outer/inner proposes getting the AI to care about optimizing some objective function, and hardening the objective function such that it’s best optimized by e.g. helping us with alignment research. This introduces indirection—the AI cares about the objective function, which then gets the AI to behave in the desired fashion. Just cut out the middleman and entrain the relevant decision-making influences into the AI.
Outer/inner violates the non-adversarial principle
We shouldn’t build an agent where the inner agent spends a ton of time thinking hard about how to get high evaluations / output-of-outer-objective, while we also have to specify an objective function which can only be made to give high evaluations if the agent does what we want. In such a situation, the outer objective has to spend extra compute to not get tricked by the inner agent doing something which only looks good. I think it’s far wiser to entrain decision-making subroutines which are thinking about how to do what we want, and cut out the middleman represented by an adversarially robust outer objective.
In Don't align agents to evaluations of plans, I wrote:
There are no known outer-aligned objectives for any real-world task
It’s understandable that we haven’t found an outer objective which “represents human values” (in some vague, possibly type-incorrect sense). Human values are complicated, after all. What can we specify? What about diamond maximization? Hm, that problem also hasn’t yielded. Maybe we can just get the AI to duplicate a strawberry, and then do nothing else? What an innocent-sounding task! Just one tiny strawberry! Just grade whether the AI made a strawberry and did nothing, or whether it did some other plan involving more than that!
We can do none of these things. We don't know how to design an argmax agent, operating in reality with a plan space of plans about reality, such that the agent chooses a plan which a) we ourselves could not have specified and b) does what we wanted.
At first pass, this seems like evidence that alignment is hard. In some worlds where alignment is easy, “just solve outer alignment” worked. We were able to “express what we wanted.” Perhaps, relative to your subjective uncertainty, “just solve outer alignment” happens in fewer worlds where alignment is hard. Since “just solve outer alignment” isn’t known to work for pinning down any desirable real-world behavior which we didn’t already know how to specify, we update (at least a bit) towards “alignment is hard.”
But also, we update towards “outer/inner is just a bad frame.” Conditional on my new frame, there isn’t an “alignment is hard” update. Repeated failures at outer alignment don’t discriminate between worlds where cognition-updating-via-loss is hard or easy to figure out in time.
II: Loss functions chisel circuits into networks
In this section, I use “reward” and “loss” somewhat interchangeably, with the former bearing a tint of RL.
A loss function is a tool which chisels circuits into agents. Mechanistically, loss is not the optimization target, and loss is not the “ground truth” on whether a state is good or not. Loss chisels cognition into the agent’s mind. A given training history and loss/reward schedule yields a sequence of cognitive updates to the network we’re training. That’s what reward does in the relevant setups, and that’s what loss does in the relevant setups.
As Richard Ngo wrote in AGI safety from first principles: Alignment:
The mechanistic function of loss is to supply cognitive updates to an agent. In policy gradient methods, rewarding an agent for putting away trash will reinforce / generalize the computations which produced the trash-putting-away actions. Reward’s mechanistic function is not necessarily to be the quantity which the agent optimizes, and—when you look at the actual math implementing cognition-updating in deep learning—reward/loss does not have the type signature of goal/that-which-embodies-preferences. I have already argued why agents probably won’t end up primarily optimizing their own reward signal. And that’s a good thing!
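A minimal sketch of the policy-gradient point, assuming a softmax policy over two discrete actions and the plain REINFORCE update with no baseline. The function names softmax and reinforce_update, the learning rate, and the two-action setup are illustrative assumptions, not anything specified in this essay. In the update rule, reward appears only as a scalar that scales how strongly the just-taken action's logit is reinforced. It is not a quantity the policy represents or searches over.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.1):
    # One REINFORCE step for a softmax policy.
    # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - pi(i);
    # the reward only multiplies that gradient.
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]


logits = [0.0, 0.0]
before = softmax(logits)[0]

rewarded = reinforce_update(logits, action=0, reward=1.0)
unrewarded = reinforce_update(logits, action=0, reward=0.0)
punished = reinforce_update(logits, action=0, reward=-1.0)

print(before, softmax(rewarded)[0], softmax(punished)[0])
```

Positive reward makes the computation that produced action 0 more likely to fire again, zero reward leaves the policy untouched, and negative reward makes it less likely. Nothing in the update requires the policy to contain a representation of reward.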
Loss-as-chisel is mathematically correct
I kinda thought that when I wrote Reward is not the optimization target, people would click and realize “Hey, I guess outer and inner alignment were leaky frames on the true underlying update dynamics, and if we knew what we were doing, we could just control the learned cognition via the cognitive-update-generator we provide (aka the reward function). This lets us dissolve the nearest unblocked strategy problem—how amazing!” This, of course, proved wildly optimistic. Communication takes effort and time. So let me continue from that trailhead.
Let’s compare loss-as-chisel with a more common frame for analysis:
Rohin Shah likes to call (1), the optimized-for-loss frame, “deep learning's Newtonian mechanics” and (2), loss-as-chisel, the “quantum mechanics”, in that (2) more faithfully describes the underlying learning process, but is harder to reason about. But often, when I try to explain this to alignment researchers, they don’t react with “Oh, yeah, but I just use (1) as a shortcut for (2).” Rather, they seem to react, “What an interesting Shard Theory Perspective you have there.” Rohin has told me that his response to these researchers would be: “Your abstraction (1) is leaky under the true learning process which is actually happening, and you should be sharply aware of that fact.”
Loss-as-chisel encourages thinking about the mechanics and details of learning
Loss-as-chisel encourages substantive and falsifiable speculation about internals and thus about generalization behavior. Loss-as-chisel also avoids the teleological confusions which arise from using the intentional stance to view agents as ~“wanting” to optimize their loss functions.[7] I still find a bunch of "what is outer/inner alignment" discourse and debate confusing, even as a relatively senior researcher. Good abstractions hew close to the bare-metal of the alignment problem. In this case, I think we should hew closer to the actual learning process. (See also Appendix B for an example of this.)
By taking a more faithful loss-as-chisel view on deep learning, I have realized enormous benefits. Even attempting to mechanistically consider a learning process highlights interesting considerations and—at times—vaporizes confused abstractions you were previously using.
For example, I asked myself “when during training is it most important to provide ‘high-quality’ loss signals to the network?”. I realized that if you aren’t aiming for inner alignment on a robust grading procedure represented by the loss function, it probably doesn’t matter what the loss function outputs in some late-training situations and in any deployment situation (e.g. what score should you give to a plan for a high-tech factory?).
At that stage, a superintelligent AI could just secretly set its learning rate to zero if it didn’t want to be updated, and then the loss signal wouldn’t matter. And if it did want to be updated, it could set the loss itself. So when the AI is extremely smart, it doesn’t matter at all what reward/loss signals look like. This, in turn, suggests (but does not decisively prove) we focus our efforts on early- and mid-training value development. Conveniently, that’s the part of training when supervision and interpretability are easier (although still quite hard).
Loss doesn’t have to “represent” intended goals
Outer/inner unnecessarily assumes that the loss function/outer objective should “embody” the goals which we want the agent to pursue.
For example, shaping is empirically useful in both AI and animals. When a trainer is teaching a dog to stand on its hind legs, they might first give the dog a treat when it lifts its front paws off the ground. This treat translates into an internal reward event for the dog, which (roughly) reinforces the dog to be more likely to lift its paws next time. The point isn’t that we terminally value dogs lifting their paws off the ground. We do this because it reliably shapes target cognition (e.g. stand on hind legs on command) into the dog. If you think about reward as exclusively “encoding” what you want, you lose track of important learning dynamics and seriously constrain your alignment strategies. (See Some of my disagreements with List of Lethalities for a possible example of someone being hesitant to use reward shaping because it modifies the reward function.)
Be precise when reasoning about outer objectives
I also think that people talk extremely imprecisely and confusingly about “loss functions.” I get a lot of mileage out of being precise—if my idea is right in generality, it is right in specificity, so I might as well start there. In Four usages of "loss" in AI, I wrote:
I think that outer alignment is an “intuitive notion” in part because loss functions don’t natively represent goals. For agents operating in reality, extra interpretation is required to view loss functions as representing goals. I can imagine, in detail, what it would look like to use a loss function to supply a stream of cognitive updates to a network, such that the network ends up reasonably aligned with my goals. I cannot imagine what it would mean for a physically implemented loss function to be “aligned with my goals.” I notice confusion and unnaturality when I try to force that mental operation.
This “optimize the loss function” speculation is weird and sideways of how we actually get AI to generalize how we want. Here’s a small part of an outer/inner training story:
This is just, you know, so weird. Why would you use a loss function or reward function this way?!
According to me, the bottleneck hard problem in AI alignment is how do we predictably control the way in which an AI generalizes; how do we map outer supervision signals (e.g. rewarding the agent when it makes us smile) into the desired inner cognitive structures (e.g. the AI cares about making people happy)?
Here’s what I think we have to do to solve alignment: We have to know how to produce powerful human-compatible cognition using large neural networks. If we can do that, I don’t give a damn what the loss function looks like. It truly doesn’t matter. Use the chisel to make a statue and then toss out the chisel. If you’re making a statue, your chisel doesn’t also have to look like the statue.
III: Outer/inner just isn’t how alignment works in people
Inner and outer alignment decompose one hard problem (AI alignment) into two extremely hard problems. Inner and outer alignment both cut against known grains of value formation.
Inner alignment seems anti-natural
We have all heard the legend of how evolution selected for inclusive genetic fitness, but all it got was human values. I think this analogy is relatively loose and inappropriate for alignment, but it’s proof that inner alignment failures can happen in the presence of selection pressure. Far more relevant to alignment is the crush of empirical evidence from real-world general intelligences with reward circuitry, suggesting to us billions and billions of times over that reinforcement learning at scale within a certain kind of large (natural) neural network does not primarily produce inner value shards oriented around their reward signals, or the world states which produce them.
When considering whether human values are inner-aligned to the human reward circuitry, you only have to consider the artifact which evolution found. Evolution found the genome, which—in conjunction with some environmental influences—specifies the human learning process + reward circuitry. You don't have to consider why evolution found that artifact (e.g. selection pressures favoring certain adaptations). For this question, it might help to imagine that the brain teleported into existence from some nameless void.
From my experience with people, I infer that they do not act to maximize some simple function of their internal reward events. I further claim that people do not strictly care about bringing about the activation preconditions for their reward circuitry (e.g. for a sugar-activated reward circuit, those preconditions would involve eating sugar). True, people like sugar, but what about artificial sweeteners? Isn’t that a bit “unaligned” with our reward circuitry, in some vague teleological sense?
More starkly, a soldier throwing himself on a grenade is not acting (either consciously or subconsciously) to most reliably bring about the activation preconditions for some part of his reward system. I infer that he is instead executing lines of cognition chiseled into him by past reinforcement events. He is a value shard-executor, not an inner-aligned reward maximizer. Thus, his values of protecting his friends and patriotism constitute inner alignment failures on the reward circuitry which brought those values into existence.[8] Those values are not aligned with the goals “represented by” that reward circuitry, nor with the circuitry’s literal output. I think that similar statements hold for values like “caring about one’s family”, “altruism”, and “protecting dogs.”
Therefore, the only time human-compatible values have ever arisen, they have done so via inner alignment failures.[9] Conversely, if you aim to “solve” inner alignment, you are ruling out the only empirically known way to form human-compatible values. Quintin Pope wrote (emphasis mine):
(I caution that "cause a carefully orchestrated inner alignment failure in a simple learning system" sounds like we’re trying something “hacky” or “mistake-prone”, when we really aren’t attempting something strange. Rather, we’re talking about the apparently natural way for values to form.)
The above argues that inner alignment is unnatural—counter to natural tendencies. I further infer that inner alignment is unnatural partly because it is antinatural. We've never seen it happen, we don't know how to make it happen, there are lots of reasons to think it won't happen, and I don't think we need to make it happen.
Complete inner alignment seems unnecessary
In the AXRP interview, Paul stated that he would (under the outer/inner frame) aim for an agent “not doing any other optimization beyond pursuit of [the outer objective].” But why must there be no other optimization? Why can’t the AI value a range of quantities?
On how I use words, values are decision-influences (also known as shards). “I value doing well at school” is a short sentence for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.”
An agent with lots of values (e.g. coffee and sex and art) will be more likely to choose plans which incorporate positive features under all of the values (since those plans get bid for by many decision-influences). I believe that this complexity of value is the default. If an AI strongly and reflectively values both protecting people and paperclips, it will make decisions on the basis of both considerations. Therefore, the AI will both protect people and make paperclips (assuming the values work in the described way, which is a whole ’nother can of worms).
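A toy sketch of the "many decision-influences bid on plans" picture follows. Everything in it is an assumption for illustration: the two shard functions, their weights, the plan features, and especially the rule that bids simply add up. How real values combine is exactly the can of worms mentioned above.

```python
# Hypothetical shards: each maps a plan's features to a bid (higher = more favored).
def protect_people_shard(plan):
    return 2.0 * plan["people_protected"]

def paperclip_shard(plan):
    return 1.0 * plan["paperclips_made"]

SHARDS = [protect_people_shard, paperclip_shard]

def total_bid(plan):
    """Assumed aggregation rule: bids from all shards are summed."""
    return sum(shard(plan) for shard in SHARDS)

def choose_plan(plans):
    """Pick the plan with the highest total bid."""
    return max(plans, key=total_bid)

plans = [
    {"name": "only_paperclips", "people_protected": 0, "paperclips_made": 5},
    {"name": "only_protect", "people_protected": 3, "paperclips_made": 0},
    {"name": "protect_and_make_clips", "people_protected": 3, "paperclips_made": 3},
]
chosen = choose_plan(plans)
```

Under these made-up numbers, the plan that scores under both values outbids the plans that serve only one. This is the sense in which an agent with several values tends toward plans that incorporate all of them.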
I have written:
So ultimately, I think “the agent has to exclusively care about this one perfect goal” is dissolved by the arguments of alignment allows "nonrobust" decision-influences and doesn't require robust grading. And trying to make an agent only care about one goal seems to go against important grains of effective real-world cognition.
Outer alignment seems unnatural
People are not inner-aligned to their reward circuitry, nor should they be. The human reward circuitry does not specify an ungameable set of incentives such that, if the reward circuitry is competently optimized, the human achieves high genetic fitness, or lives a moral and interesting life, or anything else. As Quintin remarked to me, “If you find the person with the highest daily reward activation, it’s not going to be Bill Gates or some genius physicist.” According to Atlantic’s summary of a 1986 journal article:[10]
That’s what happens when the human reward circuitry is somewhat competently optimized. Good thing we aren’t inner-aligned to our reward circuitry, because it isn’t “outer-aligned” in any literal sense. But even in a more abstract sense of "outer alignment", I infer that human values have not historically arisen from optimizing a “hard-to-game” outer criterion which specifies those values.
David Udell made an apt analogy:
As best I can tell, human values have never arisen via the optimization of a hard-to-game outer criterion which specifies their final form. That doesn’t logically imply that human values can’t arise in such a way—although I have separately argued that they won’t—but it’s a clue.
Why does it matter how alignment works in people?
Suppose we came up with outer/inner alignment as a frame on AI alignment. Then we realized that people do seem to contain an “outer objective”—neural circuitry which people terminally want to optimize (i.e. the genome inner-aligns people to the circuitry) such that the neural circuitry faithfully represents the person’s motivations (i.e. the neural circuitry is an outer alignment encoding of their objective). I would react: “Huh, looks like we really have reasoned out something true and important about how alignment works. Looks like we’re on roughly the right track.”
As I have argued, this does not seem to be the world we live in. Therefore, since inferring outer/inner alignment in humans would have increased my confidence in the outer/inner frame, inferring not-outer/inner must necessarily decrease my confidence in the outer/inner frame by conservation of expected evidence.
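The conservation-of-expected-evidence step can be written out explicitly. In this sketch, $H$ is shorthand for "outer/inner is a good frame" and $E$ for "we infer outer/inner structure in humans"; both labels are introduced here for illustration and assume $0 < P(E) < 1$.

```latex
\begin{align*}
P(H) &= P(H \mid E)\,P(E) + P(H \mid \neg E)\,\bigl(1 - P(E)\bigr) \\
P(H \mid \neg E) &= \frac{P(H) - P(H \mid E)\,P(E)}{1 - P(E)} \\
P(H \mid E) > P(H) \;&\Longrightarrow\; P(H \mid \neg E) < \frac{P(H) - P(H)\,P(E)}{1 - P(E)} = P(H)
\end{align*}
```

So if observing $E$ would have raised confidence in $H$, observing $\neg E$ must lower it. The identity fixes the direction of the update but says nothing about its size.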
IV: Dialogue about inner/outer alignment
Communication is hard. Understanding is hard. Even if I fully understood what other people are trying to do (I don't), I'd still not have space to reply to every viewpoint. I’m still going to say what I think, do my best, and be honest. I expect to be importantly right, which is why I’m sharing this essay. As it stands, I’m worried about much of the field and the concepts being used.
Alex’s model of an outer alignment enjoyer (A-Outer): Outer/inner alignment is cool because it lets us decompose “what we want the agent to care about” and “how we get the agent to care about that.” This is a natural problem decomposition and lets us allocate the agent’s motivations to the part we have more specification-level control over (i.e. its reward function).
Alex (A): I don’t think it makes sense to design an agent to have an actor/grader motivational structure. As I’ve discussed, I think those design patterns are full of landmines.
A-Outer: I think we can recover the concept if we just let “outer alignment” be “what cognition / values should the AI have?”.
A: That is indeed important to think about. That’s also not aiming for an “outer-aligned” reward function or grading procedure. Don’t pollute the namespace—allocate different phrases to different concepts. That is, you can consider “what values should the AI have?” and then “what reward function will chisel those values into the AI?”. But then we aren’t inner-aligning the agent to the outer objective anymore, but rather we are producing the desired internal values. We’re now reasoning about reward-chiseling, which I’m a big fan of.
A-Outer: Right, but you have to admit that “consider what kinds of objectives are safe to maximize” is highly relevant to “what do we want the AI to end up doing for us?”. As you just agreed, we obviously want to understand that.
(And yes, maximization. Just look at the coherence theorems spotlighting expected utility maximization as the thing which non-stupid real-world agents do! Unless you think we won’t get an EU maximizer?)
A: Compared to “what reward signal-generators are safe to optimize?”, it’s far more reasonable to consider “what broad-strokes utility function should the AI optimize?”. Even so, there are tons of skulls along that path. We just suck at coming up with utility functions which are safe to maximize, for generalizable reasons. Why should a modern alignment researcher spend an additional increment of time thinking about that question, instead of other questions? Do you think that we’ll finally find the clever utility function/grading procedure which is robust against adversarial optimization? I think it’s wiser to simply avoid design patterns which pit you against a superintelligence’s adversarial optimization pressure.
(And I don’t think you’ll get a meaningfully viewable-as-bounded-EU-maximizer until late in the agent’s developmental timeline. That might be a very important modeling consideration. Be careful to distinguish asymptotic limits from finite-time results.)
A-Outer: Seriously? It would be real progress to solve the outer alignment problem in terms of writing down a utility function over universe-histories which is safe to maximize. For example, suppose we learned that having the utility function penalize the agent for gaining more than X power for >1 year (in some formally specifiable sense) would bound the risk from that AI, making it easier to get AIs which do pivotal acts without keeping power forever. Then we learn something about the properties we might aim to chisel into the AI’s inner cognition, in order to come out alive on the other side of AGI.
A: First, note that your argument is for finding a safe-to-maximize utility function over universe histories, which is not the same as the historically prioritized reward-outer-alignment. Second, not only do I think that your hope won’t happen, I think the hope is written in an ontology which doesn’t make sense.
Here’s a non-strict analogy which hopefully expresses some of my unease. Your hope feels like saying, “If I could examine the set of physically valid universe-histories in which I go hiking tonight, I’d have learned something about where I might trip and fall during the hike.” Like, sure? But why would I want to examine that mathematical object in order to not trip during the hike? Sure seems inefficient and hard to parse.
I agree that “What decision-making influences should we develop inside the AI?” is a hugely important question. I just don’t think that “what utility functions are safe to maximize?” is a sensible way to approach that question.
A-Outer: Even though we probably won’t discover a compact specification of a utility function which is strictly and literally safe to literally maximize, there are degrees of safety when a real-world agent optimizes an objective. Two objectives may be gameable, but one can still be less gameable than the other.
A: Sure seems like, in the outer/inner paradigm, those “degrees of safety” are irrelevant in the limit, as their imperfections burst under the strain of strong optimization. (Aren’t you supposed to be the discussant operating that paradigm, A-Outer?)
A-Outer: I don’t see how you aren’t basically giving up on figuring out what the AI should be doing.
A: Giving up? No! Thinking about “what utility function over universe-histories is good?” is just one way of framing “How can we sculpt an AI’s internal cognition so that it stops the world from blowing up due to unaligned AI?”. If you live and breathe the inner/outer alignment frame, you’re missing out on better framings and ontologies for alignment! To excerpt from Project Lawful:
Stop trying to write complicated long sentences in terms of outer objectives. Just, stop. Let’s find a new language. (Do you really think a future alignment textbook would say “And then, to everyone’s amazement, outer alignment scheme #7,513 succeeded!”)
Now, I can legitimately point out that outer and inner alignment aren’t a good framing for alignment, without offering an alternative better framing. That said, I recently wrote:[11]
A-Outer: Bah, “shard theory of human values.” We didn’t build planes with flapping wings. Who cares if human values come from inner alignment failures—Why does that suggest that we shouldn’t solve inner alignment for AI? AI will not be like you.
A: Yes, it is indeed possible to selectively consider historical disanalogies which support a (potentially) desired conclusion (i.e. that outer/inner is fine). If we’re going to play reference class tennis, how about all of the times biomimicry has worked?
But let’s not play reference class tennis. As mentioned above, we have to obey conservation of expected evidence here.
In worlds where inner alignment was a good and feasible approach for getting certain human-compatible values into an AI (let’s call that hypothesis class $H_{\text{inner-align}}$), I think that we would expect with greater probability for human values to naturally arise via inner alignment successes. However, in worlds where inner alignment failures are appropriate for getting human values into an AI ($H_{\text{fail}}$), we would expect with greater probability for human values to naturally arise via inner alignment failures.
Insofar as I have correctly inferred that human values constitute inner alignment failures on the human reward circuitry, this inference presents a decent likelihood ratio $P(\text{reality} \mid H_{\text{fail}}) / P(\text{reality} \mid H_{\text{inner-align}})$, since $H_{\text{fail}}$ predicts inferred reality more strongly. In turn, this implies an update towards $H_{\text{fail}}$ and away from $H_{\text{inner-align}}$. I think it's worth considering the strength of this update (I'd guess it's around a bit or so against outer/inner), but it's definitely an update.
I agree that there are important and substantial differences e.g. between human inductive biases and AI inductive biases. But I think that the evidential blow remains dealt against outer/inner, marginalizing over possible differences.
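The size of the likelihood-ratio update described above can be sketched numerically. The numbers below are assumptions chosen for illustration. In particular, reading "a bit or so" as roughly one bit of evidence (a 2:1 likelihood ratio) is an interpretation, and the 1:1 prior odds are made up.

```python
import math

def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

def bits_of_evidence(likelihood_ratio):
    """Evidence strength in bits is the base-2 log of the likelihood ratio."""
    return math.log2(likelihood_ratio)

def odds_to_probability(odds):
    return odds / (1.0 + odds)

# Assumed numbers for illustration only.
prior_odds_fail = 1.0    # 1:1 prior odds between H_fail and H_inner-align
likelihood_ratio = 2.0   # P(reality | H_fail) / P(reality | H_inner-align)

bits = bits_of_evidence(likelihood_ratio)
post = odds_to_probability(posterior_odds(prior_odds_fail, likelihood_ratio))
```

With these inputs, one bit of evidence moves even odds to 2:1 in favor of H_fail, a posterior probability of about 0.67. A different prior or likelihood ratio changes the size of the shift but not its direction.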
A-Outer: On another topic—What about “the outer objective gets smarter along with the agent”?
A: That strategy seems unwise for the target motivational structures I have in mind (e.g. "protect humanity" or "do alignment research").
A-Outer: It’s easy to talk big talk. It’s harder to propose concrete directions which aren’t, you know, doomed.
A: The point isn’t that I have some even more amazing and complicated scheme which avoids these problems. The point is that I don’t need one. In the void left by outer/inner, many objections and reasons for doom no longer apply (as a matter of anticipation and not of the problems just popping up in a different language).
In this void, you should reconsider all fruits which may have grown from the outer/inner frame. Scrutinize both your reasons for optimism (e.g. “maybe it’s simpler to just point to the outer objective”) and for pessimism (e.g. “if the graders are exploitable by the AI, the proposal fails”). See alignment with fresh eyes for a while. Think for yourself.
This is why I wrote Seriously, what goes wrong with "reward the agent when it makes you smile"?:
If you’re considering “reward on smile” from an outer alignment frame, then obviously it’s doomed. But from the reward-as-chisel frame, not so fast. For that scheme to be doomed, it would have to be true that, for every probable sequence of cognitive updates we can provide the agent via smile-reward events, those updates would not build up into value shards which care about people and want to protect them. That scheme’s doom is not at all clear to me.
(One objection to the above is “Ignorance of failure is no protection at all. We need a tight story for why AI goes well.” Well, yeah. I’m just saying “in the absence of outer/inner, it doesn’t make sense to start debating hyper-complicated reward chisels like debate or recursive reward modeling, if we still can’t even adjudicate what happens for ‘reward on smile.’ And, there seems to be misplaced emphasis on ‘objective robustness’, when really we’re trying to get good results from loss-chiseling.”)
A-Outer: Suppose I agreed. Suppose I just dropped outer/inner. What next?
A: Then you would have the rare opportunity to pause and think while floating freely between agendas. I will, for the moment, hold off on proposing solutions. Even if my proposal is good, discussing it now would rob us of insights you could have contributed as well. There will be a shard theory research agenda post which will advocate for itself, in due time.
A-Outer (different conversational branch): We know how to control reward functions to a much greater extent than we know how to control an AI’s learned value shards.
A: This is true. And?
A-Outer: I feel like you’re just ignoring the crushing amount of RL research on regret bounds and a moderate amount of research on the expressivity of reward functions and how to shape reward while preserving the optimal policy set. Literally I have proven a theorem[13] constructively showing how to transfer an optimal policy set from one discount rate to another. We know how to talk about these quantities. Are you seriously suggesting just tossing that out?
A: Yes, toss it out, that stuff doesn't seem very helpful for alignment thinking—including that theorem we were so proud of! Yes, toss it out, in the sense of relinquishing the ill-advised hope of outer alignment. Knowing how to talk about a quantity (reward-optimality) doesn’t mean it’s the most appropriate quantity to consider.
A-Outer: Consider this: Obviously we want to reward the agent for doing good things (like making someone smile) and penalize it for doing bad things (like hurting people). This frame is historically, empirically useful for getting good behavior out of AI.
A: First, we have not solved AI alignment in the inner/outer paradigm—even for seemingly simple objectives like diamond-production and strawberry duplication—despite brilliant people thinking in that frame for years. That is weak evidence against it being a good paradigm.
Second, I agree that all else equal, it’s better to reward and penalize the agent for obvious good and bad things, respectively. But not because the reward function is supposed to represent what I want. As I explained, the reward function is like a chisel. If I reward the agent when it makes me smile, all else equal, that’s probably going to upweight and generalize at least some contextual values upstream of making me smile. That reward scheme should differentially upweight and strengthen human-compatible cognition to some extent.
Since reward/loss is actually the chisel according to the math of cognition-updating in the most relevant-seeming approaches, insofar as your suggestion is good, it is good because it can be justified via cognition-chiseling reasons. Your basic suggestion might not be enough for alignment success, but it’s an important part of our best current guess about what to do.
More broadly, I perceive a motte and bailey:
I think that the bailey is wrong and the motte is right.
A-Outer: You keep wanting to focus on the “quantum mechanics” of loss-as-chisel. I agree that, in principle, if we really knew what we were doing—if we deeply understood SGD dynamics—we could skillfully ensure the network generalizes in the desired way (e.g. makes diamonds). You criticize the “skulls” visible on the “robust grader” research paths, while seemingly ignoring the skulls dotting the “just understand SGD” paths.
A: I, at the least, agree that we aren’t going to get a precise theory like “If you initialize this architecture and scale of foundation model on this kind of corpus via self-supervised learning, it will contain a diamond concept with high probability; if you finetune on this kind of task, it will hook up its primary decision-influences to the diamond-abstraction; …”. That seems quite possible to understand given enough time, but I doubt we’ll have that much time before the rubber hits the road.
However, I’d be more sympathetic to this concern if there wasn’t a bunch of low-hanging fruit to be had from simply realizing that loss-as-chisel exists, and then trying to analyze the dynamics anyways. (See basically everything I’ve written since this spring. Most of my insights have been enabled by my unusually strong desire to think mechanistically and precisely about what actually happens during a learning process.)
One thing which would make me more pessimistic about the “understand how loss chisels cognition into agents” project is if I don’t, within about a year’s time, have empirically verified loss-as-chisel insights which wouldn’t have happened without that frame. But even if so, everything we're doing will still be governed by loss-as-chisel. We can't ignore it and make it go away.
A-Outer: But if we do inner alignment, we don’t have to understand SGD dynamics to the same extent that we do to chisel in diamond-producing values.
A: I don’t know why you think that. (I don't even understand enough yet to agree or disagree in detail; I currently disagree in expectation over probable answers.)
What, exactly, are we chiseling in order to produce an inner-aligned network? How do we know we can chisel agents into that shape, if we don’t understand chiseling very well? What do we think we know, and how do we think we know it? How is an inner-aligned diamond-producing agent supposed to be structured? This is not a rhetorical question. I literally do not understand what the internal cognition is supposed to look like for an inner-aligned agent. Most of what I’ve read has been vague, on the level of “an inner-aligned agent cares about optimizing the outer objective.”
Charles Foster comments:
Perhaps my emphasis on mechanistic reasoning and my unusual level of precision in speculating about AI internals make people realize how complicated realistic cognition is in the shard picture. Perhaps people realize how much might have to go right, how many algorithmic details may need to be etched into a network so that it does what we want and generalizes well.
But perhaps people don’t realize that a network which is inner-aligned on an objective will also require a precise and conforming internal structure, and they don’t realize this because no one has written detailed plausible stabs at inner-aligned cognition.
A-Outer: Just because the chisel frame is technically accurate doesn’t mean it’s the most pragmatically appropriate frame. The outer alignment frame can abstract over the details of cognition-chiseling and save us time in designing good chiseling-schemes. For example, I can just reward the AI when it wins the game of chess, and not worry about designing reward schedules according to my own (poor) understanding of chess and what chess-shards to upweight.
A: I agree that sometimes you should just think about directly incentivizing the outcomes and letting RL figure out the rest; I think that your chess example is quite good! Chess is fully observable and has a crisply defined, algorithmically gradable win condition. Don’t worry about “if I reward for taking a queen, what kind of cognition will that chisel?”—just reinforce the network for winning.
However, is the “reward outcomes based on their ‘goodness’” frame truly the most appropriate frame for AGI? If that were true, how would we know? I mean—gestures at probability theory intuitions—however outer alignment-like concepts entered the alignment consciousness, it was not (as best I can discern) because outer alignment concepts are optimally efficient for understanding how to chisel good cognition into agents.[14] Am I now to believe that, coincidentally, this outer alignment frame is also the most appropriate abstraction for understanding how to e.g. chisel diamond-producing values into policy networks? How fortuitous!
A-Outer: Are you saying it's never appropriate to consider outer/inner, then?
A: I think that the terminology and frame are unhelpful. At least, I feel drastically less confused in my new primary frame, and people have told me my explanations are quite clear and focused in ways which I think relate to my new frame.
In e.g. the chess example, though, it seems fine to adopt the "Newtonian mechanics" optimized-for-reward view on deep learning. Reward the agent for things you want to happen, in that setting. Just don't forget what's really going on, deeper down.
A-Outer: Even if the inner/outer alignment problem isn’t literally solvable in literal reality, it can still guide us to good ideas.
A: Many things can guide us to good ideas. Be careful not to privilege a hypothesis which was initially elevated to consideration for reasons you may no longer believe!
Conclusion
Inner and outer alignment decompose one hard problem (AI alignment) into two extremely hard problems. These problems go against natural grains of cognition, so it’s unsurprising that alignment has seemed extremely difficult and unnatural. Alignment still seems difficult to me, but not because e.g. we have to robustly grade plans in which superintelligences are trying to trick us.
I think that “but what about applying optimization pressure to the base objective?” has warped lots of alignment thinking. You don’t need an “extremely hard to exploit” base objective. That’s a red herring.
Stepping away from the worldview in which outer/inner is a reasonable frame, a range of possibilities opens up, and the alignment problem takes on a refreshing and different nature. We need to understand how to develop good kinds of cognition in the networks we train (e.g. how to supply a curriculum and reward function such that the ensuing stream of cognitive-updates leads to an agent which cares about and protects us). At our current level of understanding, that’s the bottleneck to solving technical alignment.
Appendix A: Additional definitions of “outer/inner alignment”
Here are a few more definitions, for reference on how the term has been historically defined and used.
Evan’s definitions
I will note that the human reward circuitry is not outer-aligned to human values under this definition, since people who experience the “data” of wireheading will no longer have their old values.
Anyways, it’s not clear what this definition means in the RL setting, where high path-dependence occurs due to the dependence of the future policy on the future training data, which in turn depends on the current policy, which depended on the past training data. For example, if you like candy and forswear dentists (and also forswear ever updating yourself so that you will go see the dentist), you will never collect reward data from the dentist’s office, and vice versa. One interpretation is: infinite exploration of all possible state-action tuples, but I don’t know what that means in reality (which is neither ergodic nor fully observable). I also don’t know the relative proportions of the “infinite data.”
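The candy/dentist loop can be made concrete with a toy simulation. This is only a minimal sketch under invented assumptions: the two options, their reward values, and the purely greedy (no-exploration) learner are all hypothetical and stand in for no particular RL algorithm.

```python
import random

def run_greedy_learner(first_choice, steps=200, seed=0):
    """Train a purely greedy two-option learner and report what data it collected."""
    rng = random.Random(seed)
    # Hypothetical average rewards; the learner never observes these directly.
    true_reward = {"candy": 1.0, "dentist": 2.0}
    estimates = {"candy": 0.0, "dentist": 0.0}  # learned value estimates
    counts = {"candy": 0, "dentist": 0}         # reward samples gathered per option

    action = first_choice
    for _ in range(steps):
        reward = true_reward[action] + rng.gauss(0.0, 0.1)
        counts[action] += 1
        # Incremental mean: only the option actually taken gets updated.
        estimates[action] += (reward - estimates[action]) / counts[action]
        # The current estimates pick the next action, and so the next data point.
        action = max(estimates, key=estimates.get)
    return counts, estimates

candy_counts, candy_estimates = run_greedy_learner("candy")
dentist_counts, _ = run_greedy_learner("dentist")
print(candy_counts, dentist_counts)
```

In this toy setup the learner that starts on candy never samples the dentist, even though the dentist pays more here, and the learner that starts at the dentist never samples candy. The data each one is trained on is a product of its own earlier policy, which is the path-dependence at issue.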
Evan privately provided another definition which better accounts for the way he currently considers the problem of outer+inner alignment:
I then wrote a dialogue with my model of him, which he affirmed as “a pretty reasonable representation.”
Alex (A): Hm. OK. So it sounds like the outer objective is less of something which grades the agent directly across all situations, and which is safe to optimize for. Under your operationalization of the outer alignment training goal, the reward function is more like an artifact which emits reward on training in a way which tightly correlates with getting gold coins on training.
Suppose I have an embodied AI I’m training via RL (for conceptual simplicity, not realism), and it navigates mazes and reaches a gold coin at the end of each maze. I’ll just watch the agent through one-way glass and see if it looks like it touched the gold coin by legit solving the maze. If it does, I hit the reward button.
Now suppose that this in fact just trains a smart AI which “terminally cares” about gold coins, operationalized in the “values as policy-influences” sense: In all realistically attainable situations where the AI believes there are gold coins nearby, the AI reliably reaches the gold coin. The AI doesn’t go to yellow objects, or silver coins, or any other junk.
So even though on training, the reward schedule was unidentifiable from “reward when a metal disk was touched”, that doesn’t matter for our training goal. We just want the AI to learn a certain kind of cognition which we “had in mind” when specifying the outer objective, and it doesn’t matter if the outer objective is “unambiguously representing” the intended goal.
Alex’s model of Evan (A-E): Yup, basically.
A: OK. So in this scenario, though, the actual reward-generating process would in fact be foolable by an AI which replaces the window with an extremely convincing display which showed me a video which made me believe it got gold coins, even though it was actually touching a secret silver coin in the real room. The existence of that adversarial input isn’t a problem, because in this story, we aren’t trying to get the AI to directly optimize the reward-generating process or any of its Cartesian transforms or whatever.
A-E: Well, I guess? If you assume you get the gold-coin AI, you can satisfy the story with such an underdetermined and unhardened outer objective. But I expect in reality you need to supply more reward data to rule out e.g. silver coins, and possibly to disincentivize deception during training. See the RLHF + camera-duping incident.
So I think the answer is “technically no you don’t have to worry about adversarial inputs to the grading procedure on this definition, but in reality I think you should.”
A: I think we’re going to have a separate disagreement on that camera incident which isn’t related to this decomposition, so I’ll just move past that for the moment. If this is the perspective, I don’t disagree with it as much as “have the objective represent what you want as faithfully as possible, maybe even exactly, such that the outer objective is good to optimize for.”
I think that this decomposition is actually compatible with some shard theory stories, even. It feels like this outer alignment definition is actually pretty lax. It feels more like saying “I want to write down an objective which appears to me to ‘encode’ gold coin-grabbing, and then have that objective entrain a gold coin value in the agent.” And, for chisel = statue reasons, the levers for inner alignment would then have to come from inductive biases (speed / complexity / hyperparameters / whatever), and not the actual feedback signals (which are kinda fixed to match the “represent the gold coin objective”).
Daniel Ziegler’s working definitions
I recently spoke with Daniel Ziegler about one frame he uses for alignment, which he described as inspired by Christiano’s Low-stakes alignment, and relating to outer/inner alignment. Here’s my summary:
I don’t think we need robust grading in every possible training situation; it seems to me like early and mid-training will be far more important for chiseling values into the AI. I’m less worried about evaluating late-training situations where the AI is already superintelligent. I also don’t think we need robust adequacy. There probably has never ever existed a human which behaves adequately in every possible situation. Probably Gandhi goes on a killing spree in some situation.
I’m more concerned about on-trajectory properties—make the AI nice to begin with, make it want to keep being nice in the future, and I don’t worry about off-trajectory bad situations it won’t even want to enter. If the AI thought “I’m nice now but won’t be nice later”, wouldn’t the AI take action of its own accord to head off that event, which would be bad by its own values?
I worry that absolute robustness is an unnatural cognitive property, which is also not necessary, and that certain attempts to achieve it could even worsen alignment properties. As one concrete (but mostly theoretical) concern, adversarial training might make an initially nice AI less nice / aligned:
EDIT: This is less "don't do adversarial training", and more "I have some intuitions there are subtle costs and difficulties to demanding extreme robustness from a system."
Outer alignment on physical reward is impossible
Consider the following definitions:
Unsolvability of outer alignment (literal). Any outer objective P must be implemented within the real world. Suppose that P reliably produces huge numbers in worlds where the AI is doing what we want. But then the number produced by P can be further increased by just modifying the physically implemented output.
So, for any agent with a sufficiently rich action space (so that it can affect the world over time), any search for maximal P-outputs yields tampering (or something else, not related to what we want, which yields even greater outputs).[16]
Appendix B: RL reductionism
A bunch of alignment thinking seems quite airy, detached from step-by-step mechanistic thinking. I think there are substantial gains to thinking more precisely. I sometimes drop levels of abstraction to view NN training as a physical process which imperfectly shadows the nominal PyTorch code, which itself imperfectly shadows the mathematical learning algorithms (e.g. SGD under certain sampling assumptions on minibatches), which itself is imperfectly abstracted by rules like "loss as chisel", which itself is sometimes abstractable as "networks get trained to basically minimize loss / maximize reward on a certain distribution."
Consider what happens when you train a deep Q-learning network on Pac-Man. I'll start with reward-as-chisel, but then take a slightly more physical interpretation.
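As a rough illustration of the reward-as-chisel reading, here is a single Q-learning update written out. This is a tabular stand-in rather than a deep Q-network, and the state names, action names, and numbers are hypothetical. In the deep version, the same temporal-difference error would instead drive a gradient step on the network's parameters.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped estimate of the best value available from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    # Reward only shows up here: as a term in the update to the stored estimate.
    q[(state, action)] += alpha * td_error
    return td_error

actions = ["left", "right"]
q = defaultdict(float)  # all estimates start at zero

# Hypothetical transition: moving right from "near_pellet" eats a pellet worth 10.
td = q_update(q, "near_pellet", "right", 10.0, "ate_pellet", actions)
print(td, q[("near_pellet", "right")], q[("near_pellet", "left")])
```

Nothing in the update hands the reward to the agent as a goal. The reward changes the estimate for the state-action pair that was actually visited and leaves the rest of the table alone.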
In comments on an earlier draft of this post, Paul clarified that the reward doesn’t have to exactly capture the [expected] utility of deploying a system or of taking an action, but just e.g. correlate on reachable states such that the agent can’t predict deviations between reward and human-[expected] utility.
Agreed.
I’m not claiming this is Paul’s favorite alignment plan; I can’t speak for him. However, I do perceive most alignment plans to contain many/all of: 1) robust grading, 2) “the chisel must look like the statue”, and 3) aligning the AI to a grading procedure.
I am by no means the first to consider whether the outer/inner frame is inappropriate for many situations. Evan Hubinger wrote:
"It’s worth pointing out how phrasing inner and outer alignment in terms of training stories makes clear what I think was our biggest mistake in formulating that terminology, which is that inner/outer alignment presumes that the right way to build an aligned model is to find an aligned loss function and then have a training goal of finding a model that optimizes for that loss function.”
In this essay, I focus on the case where the outer objective’s domain is the space of possible plans. However, similar critiques hold for grading procedures which grade world-states or universe-histories.
The truth is that I don't yet know what goes on in more complicated and sophisticated shard dynamics. I doubt, though, that grader-optimization and value-optimization present the same set of risk profiles (via e.g. Goodhart and nearest unblocked strategy), which coincidentally derive from different initial premises via different cognitive dynamics. "It’s improbable that you used mistaken reasoning, yet made no mistakes."
Outer/inner fails to describe/explain how GPT-3 works, or to prescribe how we would want it to work (“should GPT-3 really minimize predictive loss over time?” seems like a Wrong Question). Quintin wrote in private communication:
“GPT-3’s outer ‘objective’ is to minimize predictive error, and that’s the only thing it was ever trained on, but GPT-3 itself doesn’t ‘want’ to minimize its predictive error. E.g., it’s easy to prompt GPT-3 to act contrary to its outer objective as part of some active learning setup where GPT-3 selects hard examples for future training. Such a scenario leads to GPT-3 taking actions that systematically fail to minimize predictive error, and is thus not inner aligned to that objective.”
This point is somewhat confounded because humans “backchain” reward prediction errors, such that a rewarding activity bleeds rewardingness onto correlated activities (in the literature, see the related claim: “primary reinforcers create secondary reinforcers”). For example, in late 2020, I played Untitled Goose Game with my girlfriend. My affection for my girlfriend spilled over onto a newfound affection for geese, and now (I infer that) it’s rewarding for me to even think about geese, even though I started off ambivalent towards them. So, I infer that there’s a big strong correlation between “things you value and choose to pursue” and “mental events you have learned to find rewarding.”
I don’t actually think in terms of “inner alignment failures” anymore, but I’m writing this way for communication purposes.
The original abstract begins: “A 48-year-old woman with a stimulating electrode implanted in the right thalamic nucleus ventralis posterolateralis developed compulsive self-stimulation associated with erotic sensations and changes in autonomic and neurologic function.”
I think the shard frame is way better than the utility function frame because of reasons like “I can tell detailed stories for how an agent ends up putting trash away or producing diamonds in the shard frame, and I can’t do that at all in the utility frame.” That said, I’m still only moderate-strength claiming “the shard frame is better for specifying what kind of AI cognition is safe” because I haven’t yet written out positive mechanistic stories which spitball what kinds of shard-compositions lead to safe outcomes. I am, on the other hand, quite confident that outer/inner is inappropriate.
The coherence theorems can pin down “EU maximization” all you please, but they don’t pin down the domain of the utility functions. They don’t dictate what you have to be coherent over, when trading off lotteries. I commented:
And so it goes for human values. If human values tend to equilibrate to utility functions which factorize into factors like “-1 * subjective micromorts” or “# of times I tell a joke around my friends”, but you think that the former is “just instrumental” and the latter is “too contextual”, you’re working in the wrong specification language.
Another difficulty to “just produce diamonds” is it assumes a singular shard (diamond-production), which seems anti-natural. Just look at people and their multitudes of shards! I think we should not go against suspected grains of cognition formation.
Proposition E.30 of Optimal Policies Tend to Seek Power.
RL practitioners do in fact tend to reward agents for doing good things and penalize them for doing bad things. The prevalence of this practice is some evidence for “rewarding based on goodness is useful for chiseling policies which do what you want.” But this evidence seems tamped down somewhat because “reward optimization” was a prevalent idea in RL theory well before deep reinforcement learning really took off. Just look at control theory back in the 1950s, where control systems were supposed to optimize a performance metric over time (reward/cost). This led to Bellman’s optimality equations and MDP theory, with all of its focus on reward as the optimization target, which probably led to modern-day deep RL retaining its focus on rewarding good outcomes and penalizing bad outcomes.
The loss function can indeed “hit back” against bad behavior, in the form of providing cognitive updates which “downweight” the computations which produced the negative-loss event. However, this “hitting back” only applies while the AI’s values are still malleable to the loss function. If the AI crystallizes unaligned values (like seeking power and winning games) and gets smart, it can probably gradient hack and avoid future updates which would break its current values.
However, reality will always “hit back” against bad capabilities. A successful AGI will continually become more capable, even well after value crystallization.
This argument works even if P originally penalizes tampering actions. Suppose the agent is grading itself for the average output of the procedure over time (or sum-time-discounted with 𝛾 ≈ 1, or the score at some late future time step, or whatever else; argument should still go through). Then penalizing tampering actions will decrease that average. But since the penalties only apply for a relatively small number of early time steps, the penalties will get drowned out by the benefits of modifying the P-procedure.
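The drowning-out step can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming the agent scores itself by the average P-output over a fixed horizon; every number and parameter name below is hypothetical and chosen only to show the shape of the argument.

```python
def average_output(honest_output, tampered_output, penalty, tamper_steps, horizon, tamper):
    """Average P-output over `horizon` steps for an honest or a tampering trajectory."""
    if not tamper:
        return honest_output
    # For the first `tamper_steps` steps the agent is penalized while it tampers.
    early = tamper_steps * (honest_output - penalty)
    # After that, the physically modified procedure emits the tampered value.
    late = (horizon - tamper_steps) * tampered_output
    return (early + late) / horizon

# Hypothetical values: a stiff penalty for 10 steps out of 10,000.
settings = dict(honest_output=100, tampered_output=1_000_000,
                penalty=1_000, tamper_steps=10, horizon=10_000)
honest_avg = average_output(tamper=False, **settings)
tamper_avg = average_output(tamper=True, **settings)
print(honest_avg, tamper_avg)
```

With these made-up numbers the penalty does lower the tampering trajectory's average, but only slightly, and that average still far exceeds the honest one. A penalty large enough to reverse the comparison would have to scale with the horizon and with the size of the tampered output.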