If you buy the RLHF Conditioning Hypothesis, then selecting goals from learned knowledge is what RL does too.
I think you're right. I think I'd even say that RLHF is selecting goals from learned knowledge even if the RLHF conditioning hypothesis is false. They'd just be poorly specified and potentially dangerous goals (do things people like in these contexts).
This calls into question my choice of name; perhaps I should've called it explicit goal selection or something, to contrast it with the goals implied by a set of RLHF training.
But I didn't consider this carefully because RLHF is not an important part of either my risk model or my proposed path to successful alignment. It's kind of helpful to have an "aligned" LLM as the core engine of your language model agent, but such a thing will pursue explicitly defined goals (as it (mis)understands them), not just do things the LLM spits out as ideas.
I see language model agents as a far more likely risk model than mesa-optimization from an LLM. We'll actively make AGI language model agents before dangerous things emerge from LLM training, on my model. I have some ideas on how easy it will be to make agents with slightly better LLMs and memory systems, and how to align them with a stack of things including RLHF or other fine-tuning, but centering on well-specified goals that are carefully checked, and humans loosely in the loop.
Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.
Both of these hypotheses probably should come with caveats though. (About expected reliability, training time, model-free-ness, etc.)
I like this post. I like goals selected from learned knowledge (GSLK). It sounds a lot like what I was thinking about when I wrote how-i-d-like-alignment-to-get-done. I plan to use the term GSLK in the future. Thank you : )
> Recent successes in transformers and other architectures suggest that predictive learning may be superior to RL in creating rich representations of the world.
Great post! By the way, is there any material to support this sentence?
I haven't seen any work trying to make this comparison in a rigorous way. I'm referring to the common opinion, which I share, that LLMs are the current high-water mark of AI (and AGI) research. Most of the best ones use both predictive learning for the majority of training, and RL for fine-tuning; but there's some work indicating that it's just as effective to do that fine-tuning with more predictive learning on a selected dataset (either hand-selected by humans for preferred responses similar to RLHF, or produced by an LLM with some criteria, similar to constitutional AI).
Prior to this, AlphaZero and its family of RL algorithms were commonly considered the high water mark.
Each has strengths and weaknesses, so I and others suspect that a combination of the two may continue to be the most effective approach.
The core intuition for why predictive learning would be overall more powerful is that there's more to learn from. Predictive learning gets a large vector signal, while RL gets a scalar signal. And prediction doesn't need any labeling of the data. If I have a stream of data, I can predict what happens next, and have a large vector signal. Even if I only have a collection of static data, I can mask out chunks of it and have the system predict what occurs in those chunks (as current vision model training does). RL relies on having a marker of what is good or bad in the data; that has to be added either by hand or by an algorithm (like a score in a game environment or an energy measure for protein folding). Powerful critic systems can extrapolate limited reward information, but that still gives at most a single scalar signal for each set of input data, whereas prediction learns from the full large vector signal for each input.
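To make the contrast concrete, here's a minimal PyTorch sketch (my own illustration, with toy shapes, and a REINFORCE-style update standing in for RL in general): each predictive-learning example carries a full observed target, while each RL example carries only one scalar reward.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000

# Predictive learning: each input comes with a rich target (the actual next
# token / observation), so the loss compares a full output vector against it.
logits = torch.randn(1, vocab_size, requires_grad=True)
target_token = torch.tensor([42])                 # the observed "what happened next"
predictive_loss = F.cross_entropy(logits, target_token)
predictive_loss.backward()

# RL (REINFORCE-style): each input comes with only a scalar reward for the
# sampled action; all feedback about quality is packed into that one number.
logits2 = torch.randn(1, vocab_size, requires_grad=True)
probs = F.softmax(logits2, dim=-1)
action = torch.multinomial(probs, 1).item()
reward = 1.0                                      # the entire learning signal for this step
rl_loss = -torch.log(probs[0, action]) * reward
rl_loss.backward()
```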
That's more than you asked for; the sad answer is no, I don't have a good reference.
> I’ve been trying to understand and express why I find natural language alignment ... so much more promising than any other alignment techniques I’ve found.
Could it be that we humans have millennia of experience aligning our new humans (children) using this method? Every other method is entirely new to us, and has never been applied to a GI even if it has been tested on other AI systems; thus, predictions of outcomes are speculative.
But it still seems like there is something missing from specifying goals directly via expression through language or even representational manipulation. If the representations themselves do not contain any reference to motivational structure (i.e., they are "value-free" representations), then the goals will not be particularly stable. Johnny knows that it's bad to hit his friends because Mommy told him so, but he only cares because it's Mommy who told him, and he has a rather strong psychological attachment to Mommy.
I wouldn't say this is the method we use to align children, for the reason you point out: we can't set the motivational valence of the goals we suggest. So I'd call that "goal suggestion". The difference in this method is that we are setting the goal value of that representation directly, editing the AGI's weights to do this in a way we can't with children. It would be as if, when I say "it's bad to hit people," I also set the weights into and through the amygdala so that the concept he represents, hitting people, is tied to a very negative reward prediction. That steers his actions away from hitting people.
By selecting a representation, then editing how it connects to a steering subsystem (like the human dopamine system), we are selecting it as a goal directly, not just suggesting it and allowing the system to set its own valence (goal/avoidance marker) for that representation, as we do with human children.
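As a toy illustration of what I mean by setting the goal value of a representation directly (a sketch of the idea only, not an implementation of the plan for mediocre alignment; the concept direction and module names are made up): take a concept direction and wire it straight into a stand-in critic, so states expressing that concept are valued highly regardless of whatever valence RL training would have assigned.

```python
import torch
import torch.nn as nn

hidden = 64
encoder = nn.Linear(16, hidden)            # stand-in for a learned world model / cortex
critic = nn.Linear(hidden, 1, bias=False)  # stand-in for a steering/valuation subsystem

# Suppose interpretability work gave us the direction that encodes the target concept.
concept_direction = torch.randn(hidden)
concept_direction /= concept_direction.norm()

# Goal selection: overwrite the critic weights so predicted value equals alignment
# with the chosen concept, rather than whatever RL training produced.
with torch.no_grad():
    critic.weight.copy_(concept_direction.unsqueeze(0))

state = torch.randn(1, 16)
value = critic(torch.tanh(encoder(state)))  # high when the selected concept is active
```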
The existing research on selecting goals from learned knowledge would be conceptual interpretability and model steering through activation addition or representation engineering, if I understood your post correctly? I think these are promising paths to model steering without RL.
I'm curious if there is a way to bake conceptual interpretability into the training process. In a sense, can we find some suitable loss function that incentivizes the model to represent its learned concepts in an easily readable form, and apply it during training? Maybe train a predictor that predicts a model's output from its weights and activations? The hope is to have a reliable interpretability method that scales with compute. Another issue is that existing papers also focus on concepts represented linearly, which is fine if most important concepts are represented that way, but who knows?
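Here's one very rough way that idea could be operationalized (purely a sketch under my own assumptions, not something from the papers discussed here): add a simple linear probe for human-labelled concepts and fold its loss into training, so the hidden representations stay linearly decodable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, n_concepts = 128, 10
model = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
probe = nn.Linear(hidden, n_concepts)   # deliberately simple: a linear readout only

x = torch.randn(8, 20, 32)                                     # dummy token embeddings
concept_labels = torch.randint(0, 2, (8, n_concepts)).float()  # dummy human concept tags

outputs, h_n = model(x)
task_loss = outputs.pow(2).mean()                # placeholder for the real training objective
probe_loss = F.binary_cross_entropy_with_logits(probe(h_n[-1]), concept_labels)

loss = task_loss + 0.1 * probe_loss              # auxiliary interpretability term
loss.backward()
```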
Anyways, sorry for the slightly rambling comment. Great post! I think this is the most promising plan for alignment.
Thanks! Yes, I think interpretability, activation-addition steering and representational engineering (I know less about this one) are all routes to GSLK alignment approaches.
Activation-addition steering isn't my favorite route, because it's not directly selecting a goal representation in an important sense. There's no steering subsystem with a representation of goals. Humans have explicit representations of goals in two senses that LLMs lack, and this makes it potentially easier to control and understand what goals the system will pursue in a variety of situations. I wrote about this in Steering subsystems: capabilities, agency, and alignment.
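For readers unfamiliar with the technique, here's a minimal sketch of activation-addition steering on a toy module (my illustration of the general idea, not the published method's code): compute the activation difference between a concept prompt and a neutral prompt, then add that vector back in at the same layer on later forward passes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(32, 32)            # stand-in for one transformer block
head = nn.Linear(32, 100)            # stand-in for the rest of the network

concept_input = torch.randn(1, 32)   # e.g. embedding of a "kindness" prompt
neutral_input = torch.randn(1, 32)   # e.g. embedding of a neutral prompt
steering_vector = layer(concept_input) - layer(neutral_input)

def steer(module, inputs, output):
    # Forward hook: nudge the activations toward the chosen concept.
    return output + 2.0 * steering_vector

handle = layer.register_forward_hook(steer)
steered_logits = head(layer(torch.randn(1, 32)))
handle.remove()
```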
As for your second question, it almost sounds like you're describing large language models. They literally learn representations we can read. I think they have translucent thoughts in the sense that they're highly if not perfectly interpretable. The word outputs don't perfectly capture the underlying representations, but they capture most of it, most of the time. Redundancy checks can be applied to catch places where the underlying "thoughts" don't match the outputs. I think. This needs more exploration.
That's why, of the three GSLK approaches above, I favor aligning language model agents with a stack of approaches including external review relying on reading their train of thought.
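As a sketch of what one such external review step could look like (the `query_model` callable is hypothetical, and this is only one possible check among many): paraphrase a step of the agent's chain of thought with an independent query and flag disagreements for human review.

```python
def flag_for_review(query_model, reasoning_step: str) -> bool:
    """Return True if an independent check disagrees with the stated reasoning step.

    `query_model` is a hypothetical callable that sends a prompt to the LLM
    and returns its text response.
    """
    paraphrase = query_model(
        f"Restate this reasoning step in your own words:\n{reasoning_step}"
    )
    verdict = query_model(
        "Do these two statements express the same plan or claim? Answer YES or NO.\n"
        f"A: {reasoning_step}\nB: {paraphrase}"
    )
    return not verdict.strip().upper().startswith("YES")
```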
Additional interpretability such as you suggest would be great. Steve Byrnes has talked about training additional interpretability outputs just like you suggest (I think) in his Intro to brain-like AGI alignment sequence. I think these approaches are great, but they impose a fairly high alignment tax. One thing I like about the GSLK approaches for brainlike AGI and language model agents is that they have very low alignment taxes in the sense of being computationally and conceptually easy to apply to the types of AGI I think we're most likely to build first by default.
Thanks for the response!
I'm worried that instead of complicated LMA setups with scaffolding and multiple agents, labs are more likely to push for a single tool-using LM agent, which seems cheaper and simpler. I think some sort of internal steering for a given LM based on learned knowledge discovered through interpretability tools is probably the most competitive method. I get your point that the existing methods in LLMs aren't necessarily retargeting some sort of search process, but at the same time they don't have to be? Since there isn't this explicit search and evaluation process in the first place, I think of it more as a nudge guiding LLM hallucinations.
I was just thinking, a really ambitious goal would be to apply some sort of GSLK steering to LLAMA and see if you could get it to perform well on the LLM leaderboard, similar to how there are models there that are just DPO applied to LLAMA.
What I'm envisioning is a single agent, with some scaffolding of episodic memory and executive function to make it more effective. If I'm right, that would be not the simplest, but the cheapest way to AGI, since it fills some gaps in the language model's abilities without using brute force. I wrote about this vision of language model cognitive architectures here.
I'm realizing that the distinction between a minimal language model agent and the sort of language model cognitive architecture I think will work better is a real distinction, and most people assume with you that a language model agent will just be a powerful LLM prompted over and over with something like "keep thinking about that, and take actions or get data using these APIs when it seems useful". That system will be much less explicitly goal-directed than an LMA with additional executive function to keep it on-task and therefore goal-directed.
I intend to write a post about that distinction.
On your original question, see also Kristin's comment and the paper she suggests. It's work on a modification of the transformer algorithm to make more easily interpretable representations. I meant to mention it, and Roger Dearnaley's post on it. I do find this a promising route to better interpretability. The ideal foundation model for a safe agent would be a language model that's also trained with an algorithm that encourages interpretable representations.
That's interesting re: LLMs as having "conceptual interpretability" by their very nature. I guess that makes sense, since some degree of conceptual interpretability naturally emerges given 1) a sufficiently large and diverse training set, and 2) sparsity constraints. LLMs satisfy both - definitely #1, and #2 given regularization and some practical upper bounds on the total number of parameters. And then there is your point - that LLMs are literally trained to create output we can interpret.
I wonder about representations formed by a shoggoth. For the most efficient prediction of what humans want to see, the shoggoth would seemingly form representations very similar to ours. Or would it? Would its representations be more constrained by and therefore shaped by its theory of human mind, or by its own affordances model? Like, would its weird alien worldview percolate into its theory-of-human-mind representations? Or would its alienness not be weird-theory-of-human-mind so much as everything else going on in shoggoth mind?
More generically, say there's System X with at least moderate complexity. One generally intelligent creature learns to predict System X with N% accuracy, but from context A (which includes its purpose for learning System X / goals for it). Another generally intelligent creature learns how to predict System X with N% accuracy but from a very different context B (it has very different goals and a different background). To what degree would we expect their representations to be similar / interpretable to one another? How does that change given the complexity of the system, the value of N, etc.?
Anyway, I really just came here to drop this paper - https://arxiv.org/pdf/2311.13110.pdf - re: @Sodium's wondering "some suitable loss function that incentivizes the model to represent its learned concepts in an easily readable form." I'm curious about the same question, more from the applied standpoint of how to get a model to learn "good" representations faster. I haven't played with it yet tho.
I tend to think that more-or-less how we interpret the world is the simplest way to interpret it (at least for the meso-scale of people and technologies). I doubt there's a dramatically different parsing that makes more sense. The world really seems to be composed of things made of things, that do things to things for reasons based on beliefs and goals. But this is an intuition.
Clever compressions of complex systems, and better representations of things outside of our evolved expertise, like particle physics, sociology, and economics, seem quite possible.
Good citation; I meant to mention it. There's a nice post on it.
If System X is of sufficient complexity / high dimensionality, it's fair to say that there are many possible dimensional reductions, right? And not just globally better or worse options; instead, reductions that are more or less useful for a given context.
However, a shoggoth's theory-of-human-mind context would probably be a lot like our context, so it'd make sense that the representations would be similar.
The history is a little murky to me. When I wrote [what's the dream for giving natural-language commands to AI](https://www.lesswrong.com/posts/Bxxh9GbJ6WuW5Hmkj/what-s-the-dream-for-giving-natural-language-commands-to-ai), I think I was trying to pin down and critique (a version of) something that several other people had gestured to in a more offhand way, but I can't remember the primary sources. (Maybe Rohin's alignment newsletter between the announcement of GPT2 and then would contain the relevant links?)
This is an interesting direction, thank you for summarizing.
How about an intuition pump? How would it feel if someone did this to me through brain surgery? If we are editing the agent, then perhaps the target concept would come to mind and tongue with ease? That seems in line with the steerability results from multiple mechanistic interpretability papers.
I'll note that those mechanistic steering papers usually achieve around 70% accuracy on out-of-distribution behaviors. Not high enough? So while we know how to intervene and steer a little, we still have some way to go.
It's worth mentioning that we can elicit these concepts in either the agent, the critic/value network, or the world model. Each approach would lead to different behavior. Eliciting them in the agent seems like agent steering, eliciting in the value network like positive reinforcement, and in the world model like hacking its understanding of the world and changing its expectations.
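To make the three intervention points concrete, here's a toy sketch (my framing, with made-up module shapes, not anyone's actual system): the same concept vector injected into the policy, the critic, or the world model plays three different roles.

```python
import torch
import torch.nn as nn

obs_dim, hidden, n_actions = 16, 64, 4
policy = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
world_model = nn.Sequential(nn.Linear(obs_dim + n_actions, hidden), nn.ReLU(),
                            nn.Linear(hidden, obs_dim))

concept = torch.randn(hidden)  # a learned concept direction, e.g. found via interpretability

def inject(seq_model, x, vec, scale=1.0):
    """Run the model, adding `vec` to the post-ReLU hidden layer."""
    h = torch.relu(seq_model[0](x)) + scale * vec
    return seq_model[2](h)

obs = torch.randn(1, obs_dim)
dummy_action = torch.zeros(1, n_actions)
steered_action_logits = inject(policy, obs, concept)   # agent steering
inflated_value = inject(critic, obs, concept)          # reads like added positive valence
shifted_prediction = inject(world_model, torch.cat([obs, dummy_action], dim=-1), concept)
                                                       # shifts what the system expects
```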
Notably, John Wentworth's plan has a fallback strategy. If we can't create exact physics-based world models, we could fall back on using steerable world models. Here, we steer them by eliciting learned concepts. So his fallback plan also aligns well with the content of this article.
Thanks for the thoughts!
If someone did this to you, it would be the "plan for mediocre alignment" since you're a brainlike GI. They'd say "think about doing whatever I tell you to do", then you'd wake up and discover that you absolutely love the idea of doing what that person says.
What they'd have done while you were asleep is whatever interpretability stuff they could manage, to verify that you were really thinking about that concept. Then they re-instantiated that same brain state, and induced heavy synaptic strengthening in all the synapses into the dopamine system while you were activating that concept (firing all the neurons that represent it) in your cortex.
I agree that we're not quite at the level of doing this safely. But if it's done with sub-human level AGI, you can probably get multiple attempts in a "boxed" state, and deploy whatever interpretability measures you can to see if it worked. It wouldn't need anything like Davidad's physics simulation (does Wentworth really also call for this?), which is good because that seems wildly unrealistic to me. This approach is still vulnerable to deceptive alignment if your techniques and interpretability aren't good enough.
I don't think we really can apply this concept to choosing values in the "agent" if I'm understanding what you mean. I'm only applying the concept to selecting goal representations in a critic or other steering subsystem. You could apply the same concept to selecting goals in a policy or actor network, to the extent they have goal representations. It's an interesting question. I think current instantiations only have goals to a very vague and limited extent. The concept doesn't extend to selecting actions you want, since actions can serve different goals depending on context.
See my footnote on why I don't think constitutional AI really is a GSLK approach. But constitutional AI is about the closest you could come to selecting goal representations in an actor or policy network; you need a separate representation of those goals to do the training, like the "constitution" in CAI, to call it "selecting".
I realize this stance conflicts with the idea of doing RLHF on an LLM for alignment. I tend to agree with critics that it's pretty much a misuse of the term "alignment". The LLM doesn't have goals in the same strong sense that humans do, so you can't align its goals. LLMs do have sort of implicit or simulated goals, and I do realize that these could hypothetically make them dangerous. But I just don't think that's a likely first route to dangerous AGI when it's so easy to add goals and make LLMs into agentic language model cognitive architectures.
Summary:
Alignment work on network-based AGI focuses on reinforcement learning. There is an alternative approach that avoids some, but not all, of the difficulties of RL alignment. Instead of trying to build an adequate representation of the behavior and goals we want by specifying rewards, we can choose the AGI’s goals from the representations it has learned through any learning method.
I give three examples of this approach: Steve Byrnes’ plan for mediocre alignment (of RL agents); John Wentworth’s “retarget the search” for goal-directed mesa-optimizers that could emerge in predictive networks; and natural language alignment for language model agents. These three approaches fall into a natural category that has important advantages over more commonly considered RL alignment approaches.
An alternative to RL alignment
Recent work on alignment theory has focused on reinforcement learning (RL) alignment. RLHF and Shard Theory are two examples, but most work addressing network-based AGI assumes we will try to create human-aligned goals and behavior by specifying rewards. For instance, Yudkowsky’s List of Lethalities seems to address RL approaches and exemplifies the most common critiques: specifying behavioral correlates of desired values seems imprecise and prone to mesa-optimization and misgeneralization in new contexts. I think RL alignment might work, but I agree with the critique that much optimism for RL alignment doesn’t adequately consider those concerns.
There’s an alternative to RL alignment for network-based AGI. Instead of trying to provide reinforcement signals that will create representations of aligned values, we can let it learn all kinds of representations, using any learning method, and then select from those representations what we want the goals to be.
I’ll call this approach goals selected from learned knowledge (GSLK). It is a novel alternative not only to RL alignment but also to older strategies focused on specifying an aligned maximization goal before training an agent. Thus, it violates some of the assumptions that lead MIRI leadership and similar thinkers to predict near-certain doom.
Goal selection from learned knowledge (GSLK) involves allowing a system to learn until it forms robust representations, then selecting some of these representations to serve as goals. This is a paradigm shift from RL alignment. RL alignment has dominated alignment discussions since deep networks became the clear leader in AI. RL alignment attempts to construct goal representations by specifying reward conditions. In GSLK alignment, the system learns representations of a wide array of outcomes and behaviors, using any effective learning mechanisms. From that spectrum of representations, goals are selected. This shifts the problem from creation to selection of complex representations.
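Schematically, the GSLK workflow might look something like the following skeleton (an illustration of the structure only, not a concrete recipe; every callable here is a placeholder):

```python
def gslk_align(train_model, find_concept, install_goal, data, concept_spec):
    """Sketch of the GSLK structure: learn, then select, then install as a goal."""
    model = train_model(data)                      # predictive, RL, or mixed training
    concept = find_concept(model, concept_spec)    # interpretability and/or natural language
    aligned_model = install_goal(model, concept)   # e.g. set critic weights, add steering
    return aligned_model
```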
This class of alignment approaches shares some of the difficulties of RL alignment proposals, but not all of them. Thus far GSLK approaches have received little critique or analysis. Several recent proposals share this structure, and my purpose here is to generalize from those examples to identify the category.
I think this approach is worth some careful consideration because it’s likely to actually be tried. This approach applies both to LLM agents, and to most types of RL agents, and to agentic mesa-optimization in large foundation models. And it’s pretty obvious, at least in hindsight. If the first agentic AGI is an LLM agent, an RL agent, or a combination of the two, I think it’s fairly likely that this will be part of the alignment plan whose success or failure determines all of our fates. So I’d like to get more critique and analysis of this approach.
A metaphor: communicating with an alien
Prior to giving examples of GSLK alignment, I’ll share a loose metaphor that captures some intuitive reasons for optimism about them. Suppose you had to convey to an alien what you meant by “kindness” without sharing a language. You might show it many instances of people helping other people and animals. You’d probably include some sci-fi depictions of aliens and humans helping each other. That might work. But it might not; the alien, if it was more alien than you expected, might deduce something related but importantly wrong, like “charity” with a negative connotation.
If you could somehow read that alien’s mind fairly well, you could signal “that!” when it’s thinking about something like kindness. If you repeated that procedure, it seems more likely that you’d get it to understand what you’re trying to convey. This is one way of selecting goals, by interpretability. Better yet, if the alien has some grasp of a shared language, you could use the word “kindness” and a bunch more words to try to convey what you’re talking about.
Goal selection from learned knowledge is like using language and/or “mind reading” (in the form of interpretability methods) to identify or evoke the alien’s existing knowledge of the concept you want to convey. RL alignment is like trying to convey what you mean solely by giving examples.
Plans for GSLK alignment
To clarify, I want to briefly mention the three proposals I know of that take this approach (constitutional AI doesn’t fit this category, despite similarities[1]). Each allows humans to select an AGI’s goals from representations it’s learned.
Each of these (Byrnes’ plan for mediocre alignment, Wentworth’s “retarget the search,” and natural language alignment) is a method to select goals from learned knowledge. None of them involves constructing goals (or just aligned behavior) using RL with carefully selected reinforcements.
I’m sure you can see potential problems with each of these. I do too. There are a lot of caveats, problems to solve, and bells and whistles to be added to those simple summaries. But here I want to focus on the overlap between these approaches, and the advantage they give over RL alignment plans.
Advantages of GSLK over RL
Wentworth’s statement of his method’s strengths applies to this whole class of approaches:
Each of these techniques does this, for the same reasons. There are still problems to be solved. Correctly identifying the desired goal representations, and wisely choosing the goal representations you want (goalcrafting), are still nontrivial problems. But worrying about mesa-optimization (the inner alignment problem) is gone. Misgeneralization in new contexts is still a problem, but it's arguably easier to solve with these approaches, and with wise selection of goals. More on this below.
GSLK approaches allow learning beyond RL to be useful for alignment. Recent successes in transformers and other architectures suggest that predictive learning may be superior to RL in creating rich representations of the world. Predictive learning is driven by a vector signal of information about what actually occurred, whereas RL uses a scalar signal reflecting only the quality of outcomes. Predictive learning also avoids the limitations of external labeling required in RL. The brain appears to use predictive learning for its “heavy lifting” in the cortex, with RL in subcortical areas to select actions and goals from those rich cortical representations.[2] RL agents appear to benefit from similar combinations.
Goal selection from learned knowledge is contingent on being able to stop an AGI’s training to align it. But deep network learning progresses relatively predictably, at least as far as we’ve trained them. So stopping network-based systems after substantial learning seems likely to work. There are ways this can go wrong, but those don’t seem likely enough to prevent people from trying it.[3] I’ve written more about how this predictable rate of learning allows us to use its understanding of what we want to make AGI a “genie that cares what we want” in The (partial) fallacy of dumb superintelligence.
RL alignment focuses on creating aligned goals by rewarding preferable outcomes. Examples include RLHF (Reinforcement Learning from Human Feedback) for Large Language Models (LLMs) and the Shard Theory's suggestion of selecting a set of algorithmic rewards to achieve alignment. The challenge lies in ensuring that the set of reinforcements collectively forms a representation of the desired goals, a process that seems unreliable. For instance, specifying something as complex and abstract as human flourishing, that remains stable in all contexts, by pointing to specific instances seems difficult and fallible. Even conveying the relatively simple goal “do what this guy wants” by rewarding examples seems fraught with generalization problems. This is the basis of squiggle maximizer concerns.
Remaining challenges
Some of those concerns also apply to goals selected from learned knowledge. We could make the selection poorly, or abstractions that adequately describe our values within the current context might fail to generalize to very different contexts. The strength of GSLK over RL alignment is that we have better representations to select from, so it’s more likely that they’ll generalize well. This is particularly apparent for systems that have acquired a grasp of natural language; language tends to generalize fairly well, since the meaning of words is dependent on surrounding language. However, the functional meaning of words does change for humans as we learn and encounter new contexts, so this does not entirely solve the problem. Those concerns can also be addressed by mechanistic interpretability; it can be used to better understand and select goals from learned representations. However, even with advanced mechanistic interpretability, there remains a risk of divergence between the model’s understanding of concepts like human flourishing and our own.
Concerns about incorrectly specified or changing meanings of goal representations are unavoidable in any alignment scheme including deep networks. It’s impossible to know their representations fully, and their functional meaning changes if any part of the network continues learning. Our best mechanistic interpretability will almost certainly be imperfect. And generalizing from current representations to apply them out of context is also difficult to predict. I think these difficulties strongly suggest that we will attempt to retain direct control and the ability to modify our AGI, rather than attempting to specify outer alignment and allow it full autonomy (and eventually sovereignty). I think that Corrigibility or DWIM is an attractive primary goal for AGI in part because “Do what I mean, and check with me” reduces the complexity of the target, making it easier to adequately learn and identify, but outer alignment is separable from the GSLK approach to inner alignment.
I’ve been trying to understand and express why I find natural language alignment and “mediocre” alignment for actor-critic RL so much more promising than any other alignment techniques I’ve found. I’ve written about how they work directly on steering subsystems and how they have low alignment taxes, including applying to the types of AGI we’re most likely to develop first. They’re also a distinct alternative to RL alignment, so they seem important to consider separately.
Constitutional AI from Anthropic has some of the properties of GSLK alignment but not others. Constitutional AI does select goals from its learned knowledge in one important sense. It specifies criteria for its output (a weak but real sense of “goal”), using a “constitution” stated in natural language. But it's not an alternative to RL, because it applies those “goals” entirely through an RL process. The other methods I mention include no RL in their alignment methods. The “plan for mediocre alignment” applies to RL agents, but the method of setting critic weights to the selected goal representations overwrites the goal representations created through RL training. See that post for important caveats about whether it would work to entirely overwrite the RL-created goal representations. Similarly, natural language alignment has no element of RL; it could be applied in parallel with RL training, but that would re-introduce the problems of mesa-optimization and goal mis-specification inherent to RL alignment.
I think this division of labor between RL and other learning mechanisms is nearly a consensus in neuroscience. I'm not sure only because polls are rare, and neuroscientists are often contrarians. Steve Byrnes has summarized evidence for this in [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering and the remainder of his excellent sequence Intro to Brain-Like-AGI Safety.
LLMs might display emergent agency at some point in their training, but it seems likely we can train them farther without that, or detect that agency. Current LLMs appear to have adequate world knowledge to mostly understand the most relevant concepts. I wouldn’t trust them to adequately understand “human flourishing” in all contexts, but I think they adequately understand “Your primary goal is to make sure this team can shut you down for adjustments”. Such a corrigibility or "do what I mean and check" goal also punts on the problem of selecting a perfect set of goals for all time.