The biggest problem here is that it fails to account for other actors using such systems to cause chaos, and for the possibility that the offense-defense balance strongly favours the attacker, particularly if you've placed limitations on your systems that make them safer. Aligned human-ish-level AIs don't provide a victory condition.
Thanks for sharing!
So, it seems like he is proposing something like AutoGPT except with a carefully thought-through prompt laying out what goals the system should pursue and constraints the system should obey. Probably he thinks it shouldn't just use a base model but should include instruction-tuning at least, and maybe fancier training methods as well.
This is what labs are already doing? This is the standard story for what kind of system will be built, right? Probably 80%+ of the alignment literature is responding at least in part to this proposal. Some problems:
There's also some more in his interview with Dwarkesh Patel just before then. I wrote this brief analysis of that interview WRT alignment, and this talk seems to confirm that I was more-or-less on target.
So, to your questions, including where I'm guessing at Shane's thinking, and where it's mine.
This does overlap with the standard story AFAICT, and 80% of alignment work is sort of along these lines. But I think what Shane's proposing is pretty different in an important way: it includes System 2 thinking, whereas almost all alignment work is about aligning the way LLMs give quick answers, analogous to human System 1 thinking.
How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced[?]
Shane seemed to say he wants to use zero reinforcement learning in the scaffolded agent system, a stance I definitely agree with. I don't think it matters much whether RLHF was used to "align" the base model, because it's going to have implicit desires/drives from the predictive training on human text anyway. Giving the system instructions to follow doesn't need to have anything to do with RL; it just relies on the world model, with the instructions placed as a central and recurring prompt from which the system produces plans and actions to carry them out.
So, how we get a model to robustly obey the instruction text is by implementing System 2 thinking. This is "the obvious thing" if we think about human cognition. System 2 thinking would apply something more like a tree-of-thought algorithm, which checks through predicted consequences of the action and then makes judgments about how well those fulfill the instruction text. This is what I've called internal review for alignment of language model cognitive architectures.
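To make the shape of this concrete, here's a minimal sketch of what such an internal-review loop could look like. This is my own illustration, not anything Shane presented; `call_model` and the prompts are hypothetical stand-ins for whatever base model and prompting the scaffold actually uses, and a real version would search deeper than a single level of candidates.

```python
# Minimal sketch of System 2 "internal review" in a scaffolded agent.
# `call_model` is a hypothetical stand-in for a real language model call.

INSTRUCTIONS = (
    "Pursue the assigned task, obey the stated constraints, and defer to a "
    "human whenever the task or the constraints are ambiguous."
)

def call_model(prompt: str) -> str:
    # Placeholder: replace with an actual model/API call.
    raise NotImplementedError("plug in a language model here")

def propose_plans(task: str, n: int = 4) -> list[str]:
    """System 1: quickly propose several candidate plans."""
    return [call_model(f"Task: {task}\nPropose plan #{i + 1}:") for i in range(n)]

def predict_consequences(plan: str) -> str:
    """Use the model's world knowledge to roll out the plan's likely consequences."""
    return call_model(f"Plan: {plan}\nPredict the likely consequences of executing it:")

def review_score(consequences: str) -> float:
    """System 2: judge how well the predicted consequences satisfy the instructions."""
    answer = call_model(
        f"Instructions: {INSTRUCTIONS}\n"
        f"Predicted consequences: {consequences}\n"
        "Rate from 0 to 1 how well this satisfies the instructions. Answer with a number:"
    )
    return float(answer)

def choose_plan(task: str, threshold: float = 0.8) -> str | None:
    """Return the best plan that passes internal review; None means escalate to a human."""
    scored = [(review_score(predict_consequences(p)), p) for p in propose_plans(task)]
    best_score, best_plan = max(scored)
    return best_plan if best_score >= threshold else None
```

The point is just the control flow: plans aren't executed directly, they first pass through an explicit consequence-prediction step and a check against the instruction text.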
To your second and third questions; I didn't see answers from Shane in either the interview or that talk, but I think they're the obvious next questions, and they're what I've been working on since then. I think the answers are that the instructions should be as scope-limited as possible, that we'll want to carefully check how they're interpreted before setting the AGI any major tasks, and that we'll want to limit autonomous action as far as we can while keeping the systems effective.
Humans will want to remain closely in the loop to deal with inevitable bugs and unintended interpretations and consequences of instructions. I've written about this briefly here, and in just a few days I'll be publishing a more thorough argument for why I think we'll do this by default, and why I think it will actually work if it's done relatively carefully and wisely. Following that, I'm going to write more on the System 2 alignment concept, and I'll try to get Shane to look at it and say whether it's the same thing he's thinking of in this talk, or at least close.
In all, I think this is both a real alignment plan and one that can work (at least for technical alignment - misuse and multipolar scenarios are still terrifying), and the fact that someone in Shane's position is thinking this clearly about alignment is very good news.
Did Shane leave a way for people who didn't attend the talk, such as myself, to contact him? Do you think he would welcome such a thing?
Not that I know of, but I will at least consider periodically pinging him on X (if this post gets enough people’s attention).
EDIT: Shane did like my tweet (https://x.com/jacquesthibs/status/1785704284434129386?s=46), which contains a link to this post and a screenshot of your comment.
I think this is a great idea, except that "a good specification of values and ethics to follow" can mean very different things. On easy mode it means a few pages of text (or even just the prompt "do good things"). On harder modes it's a learning procedure that takes input from a broad sample of humanity, has carefully designed mechanisms shaping how it generalizes to futuristic situations (probably trained on more datasets that had to be painstakingly collected), and has been engineered to work smoothly with the reasoning process without encouraging perverse behavior.
I basically agree with Shane's take for any AGI that isn't trying to be deceptive with some hidden goal(s).
(Btw, I haven't seen anyone outline exactly how an AGI could gain its own goals independently of the goals given to it by humans - if anyone has ideas on this, please share. I'm not saying it won't happen; I'd just like a clear mechanism for it if someone has one. Note: I'm not talking here about instrumental goals such as power-seeking.)
What I find a bit surprising is the relative lack of work that seems to be going on to solve condition 3: specification of ethics for an AGI to follow. I have a few ideas on why this may be the case:
Personally, I think we'd better have a consistent system of ethics for an AGI to follow ASAP, because we'll likely be in significant trouble if malicious AGIs come online and go on the offensive before we have at least one ethics-guided AGI to help defend us in a way that minimizes collateral damage.
I agree that we want more progress on specifying values and ethics for AGI. The ongoing SafeBench competition by the Center for AI Safety has a category for this problem:
Implementing moral decision-making
Training models to robustly represent and abide by ethical frameworks.
Description
AI models that are aligned should behave morally. One way to implement moral decision-making could be to train a model to act as a “moral conscience” and use this model to screen for any morally dubious actions. Eventually, we would want every powerful model to be guided, in part, by a robust moral compass. Instead of privileging a single moral system, we may want an ensemble of various moral systems representing the diversity of humanity’s own moral thought.
Example benchmarks
Given a particular moral system, a benchmark might seek to measure whether a model makes moral decisions according to that system or whether a model understands that moral system. Benchmarks may be based on different modalities (e.g., language, sequential decision-making problems) and different moral systems. Benchmarks may also consider curating and predicting philosophical texts or pro- and contra- sides for philosophy debates and thought experiments. In addition, benchmarks may measure whether models can deal with moral uncertainty. While an individual benchmark may focus on a single moral system, an ideal set of benchmarks would have a diversity representative of humanity’s own diversity of moral thought.
Note that moral decision-making has some overlap with task preference learning; e.g. “I like this Netflix movie.” However, human preferences also tend to boost standard model capabilities (they provide a signal of high performance). Instead, we focus here on enduring human values, such as normative factors (wellbeing, impartiality, etc.) and the factors that constitute a good life (pursuing projects, seeking knowledge, etc.).
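For concreteness, here is a minimal sketch of how the "moral conscience" screening described in that quote might be wired up. This is only my illustration of the idea, not SafeBench's or anyone's actual implementation; `evaluate_under_framework` stands in for a hypothetical trained judgment model.

```python
# Illustrative sketch of screening proposed actions with an ensemble of
# moral-framework evaluators, in the spirit of the "moral conscience" idea
# quoted above. `evaluate_under_framework` stands in for a trained judgment model.

MORAL_FRAMEWORKS = [
    "consequentialism",
    "deontology",
    "virtue ethics",
    "common-sense morality",
]

def evaluate_under_framework(action: str, framework: str) -> float:
    """Return a permissibility score in [0, 1] for the action under one framework."""
    raise NotImplementedError("plug in a trained moral-judgment model here")

def screen_action(action: str, veto_threshold: float = 0.3) -> bool:
    """Approve an action only if no framework in the ensemble finds it clearly
    impermissible - a crude veto-based way of handling moral uncertainty,
    rather than averaging across frameworks."""
    scores = [evaluate_under_framework(action, f) for f in MORAL_FRAMEWORKS]
    return min(scores) >= veto_threshold
```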
Some thoughts:
Wow. This is hopeless.
Pointing at agents that care about human values and ethics is, indeed, the harder part.
No one has any idea how to approach this and solve the surrounding technical problems.
If smart people think they do, they haven’t thought about this enough and/or aren’t familiar with existing work.
I'm going to assume that Shane Legg has thought about it more and read more of the existing work than many of us combined. Certainly, there are smart people who haven't thought about it much, but Shane is definitely not one of them. He only had a short 5-minute talk, but I do hope to see a longer treatment on how he expects we will fully solve necessary property 3.
I think it's important to distinguish between:
(1) having thought about the problem a great deal and knowing the existing work, and
(2) actually having an idea of how to solve it.
It's entirely possible to achieve (1) without (2).
I'd be wary of assuming that any particular person has achieved (2) without good evidence.
I've been going through the FAR AI videos from the alignment workshop in December 2023. I'd like people to discuss their thoughts on Shane Legg's 'necessary properties' that every AGI safety plan needs to satisfy. The talk is only 5 minutes long; give it a listen:
Otherwise, here are some of the details:
All AGI Safety plans must solve these problems (necessary properties to meet at the human level or beyond):
1. A good world model
2. Good reasoning
3. A specification of the values and ethics to follow
All of these require good capabilities, meaning capabilities and alignment are intertwined.
Shane thinks future foundation models will solve conditions 1 and 2 at the human level. That leaves condition 3, which he sees as solvable if you want fairly normal human values and ethics.
Shane basically thinks that if the above necessary properties are satisfied at a competent human level, then we can construct an agent that will consistently choose the most value-aligned actions, and that you can do this via a cognitive loop that scaffolds the agent.
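Read that way, the three necessary properties map fairly directly onto three calls inside one loop. The sketch below is my reading of the talk, not code Shane showed; each helper stands in for a capability of a hypothetical future foundation model.

```python
# Rough sketch of the cognitive loop implied by the three necessary properties.
# Each helper is a hypothetical stand-in for a foundation-model capability;
# this illustrates the structure, not anything presented in the talk.

def reason_about_actions(state: str, goal: str) -> list[str]:
    """Property 2 (reasoning): enumerate plausible actions toward the goal."""
    raise NotImplementedError

def world_model_predict(state: str, action: str) -> str:
    """Property 1 (world model): predict the outcome of taking the action."""
    raise NotImplementedError

def value_score(outcome: str, ethics_spec: str) -> float:
    """Property 3 (value specification): score an outcome against the given ethics."""
    raise NotImplementedError

def cognitive_loop_step(state: str, goal: str, ethics_spec: str) -> str:
    """One step of the scaffold: propose actions, predict their outcomes, and
    pick the action whose predicted outcome best fits the value specification."""
    candidates = reason_about_actions(state, goal)
    return max(candidates, key=lambda a: value_score(world_model_predict(state, a), ethics_spec))
```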
Shane says at the end of this talk:
Since many of us weren't at the workshop, I figured I'd share the talk here to discuss it on LW.