You don't seem to have mentioned the alignment target "follow the common sense interpretation of a system prompt," which seems like the most sensible definition of alignment to me (it's alignment to a message, not to a person, etc.). Then you can say whatever the heck you want in that prompt, including how you would like the AI to be corrigible (or incorrigible if you are worried about misuse).
It means something closer to "very subtly bad in a way that is difficult to distinguish from quality work". Where the second part is the important part.
I think my arguments still hold in this case though, right?
i.e. we are training models so that they try to improve their work and identify these subtle issues -- and so if they actually behave this way, they will find these issues insofar as humans identify the subtle mistakes they make.
My guess is that your core mistake is here
I agree there are lots of "messy in between places," but these are also alignment failures we see in humans.
And if humans had a really long time to do safety research, my guess is we'd be ok. Why? Like you said, there's a messy complicated system of humans with different goals, but these systems empirically often move in reasonable and socially-beneficial directions over time (governments get set up to deal with corrupt companies, new agencies get set up to deal with issues in governments, etc.).
And I expect we can make AI agents a lot more aligned than humans typically are. E.g. most humans don't actually care about the law, but Claude sure as hell seems to. If we have agents that sure as hell seem to care about the law and are not just pretending (they really will, in most cases, act like they care about the law), then that seems to be a good state to be in.
I definitely agree that the AI agents at the start will need to be roughly aligned for the proposal above to work. What is it you think we disagree about?
Yes, thanks! Fixed.
> So to summarize your short, simple answer to Eliezer's question: you want to "train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end". And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
This one is roughly right. Here's a clarified version:
- If we train an AI that (1) is not faking alignment in an egregious way, (2) looks very competent and safe to us, and (3) seems likely to be able to maintain its alignment / safety properties as well as it would if humans were in the loop (see section 8), then I think we can trust this AI to "put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome" (at least as well as humans would have been able to do so if they were given a lot more time to attempt this task), "despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad."
I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we'll do a step at the end where we fine-tune AI agents for hard-to-verify tasks (similar to how we do RLHF after pre-training in the chatbot paradigm).
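To make that concrete, here's a toy sketch of the two-stage shape I have in mind (every function and number below is a made-up placeholder for illustration, not an actual training stack): the bulk of the optimization comes from synthetic tasks with programmatically checkable answers, and only a short final phase uses a scalable-oversight-style reward.

```python
# Toy caricature of the two-stage pipeline above; everything here is a
# hypothetical placeholder, not a real training stack.
import random

def make_synthetic_task():
    """Stand-in for a task generated from internet data with a checkable answer."""
    a, b = random.randint(0, 99), random.randint(0, 99)
    return {"prompt": f"What is {a} + {b}?", "answer": a + b}

def ground_truth_reward(task, output):
    """Stage 1: reward is an exact programmatic check -- no human judgment needed."""
    return 1.0 if output == task["answer"] else 0.0

def overseer_reward(task, output):
    """Stage 2: reward for hard-to-verify work comes from a scalable-oversight judge
    (e.g. humans assisted by models). Stubbed out with noise here."""
    return random.random()

def train(policy, tasks, reward_fn, steps=1):
    """Placeholder update loop; a real pipeline would run RL / preference learning here."""
    for _ in range(steps):
        for task in tasks:
            reward = reward_fn(task, policy(task))
            # ... apply a policy-gradient or preference update using `reward` ...
    return policy

policy = lambda task: task["answer"]  # trivially "competent" stand-in for a model

# Most capability comes from the big pile of easy-to-verify synthetic tasks...
policy = train(policy, [make_synthetic_task() for _ in range(10_000)], ground_truth_reward)
# ...followed by "a bit of fine-tuning with scalable oversight at the end".
policy = train(policy, [make_synthetic_task() for _ in range(100)], overseer_reward)
```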
And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.
Here's where I think we agree: if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.
My claim is that we can train AI agents to imitate the process by which humans improve their takes over time, such that after the AI agents do work for a long time, they will produce outcomes similarly good to those humans would have produced if they had worked for a long time.
Very concretely: I imagine that if I'm training an AI agent successor, a key thing I do is try to understand how it updates based on new evidence. Does it (appear to) update its views as reasonably as the most reasonable humans would?
If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned), then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes approximately as good as the ones humans would produce.
Are we close to a crux?
Maybe a crux is whether training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to outcomes as good as if we let humans do way more work.
Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.
Like I said at the start of my comment: "I'll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement."
I don't actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.
What I actually expect to happen is something like:
- Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
- Early AI successors create smarter successors in basically the same way.
- At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.
But by this point, humans will have long been obsolete. The position I am defending in this post is that it's not very important for us humans to think about these scalable approaches.
Yeah, that's fair. Currently I merge "behavioral tests" into the alignment argument, but that's a bit clunky and I probably should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task
But my guess is we agree on the object level here and there's a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.
Because "nice" is a fuzzy word into which we've stuffed a bunch of different skills, even though having some of the skills doesn't mean you have all of the skills.
Developers separately need to justify that models are as skilled as top human experts.
I also would not say "reasoning about novel moral problems" is a skill (because of the is-ought distinction).
> An AI can be nicer than any human on the training distribution, and yet still do moral reasoning about some novel problems in a way that we dislike
The agents don't need to do reasoning about novel moral problems (at least not in high-stakes settings). We're training these things to respond to instructions.
We can tell them not to do things we would obviously dislike (e.g. takeover) and retain our optionality to direct them in ways that we are currently uncertain about.
I do not think that the humans at the start of the chain can "control" the Eliezers doing thousands of years of work in this manner (if you use "control" to mean "restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner").
That's because each step in the chain requires trust.
For the N-month Eliezer to scale to the 4N-month Eliezer, it first controls the 2N-month Eliezer while it does 2-month tasks, but it trusts the 2N-month Eliezer to create a 4N-month Eliezer.
So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work -- even though the Eliezers at the end are fully capable of doing something else instead.
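To restate the structure of that chain (a toy rendering with arbitrary numbers, purely for illustration): control only ever spans a single doubling, while trust is the thing that has to carry all the way from the humans at the start to the Eliezers at the end.

```python
# Toy illustration of the bootstrapping chain; the numbers are arbitrary.
def bootstrap(months, target):
    """Double the trusted time horizon until it reaches `target` months of work."""
    print(f"humans control the {months}-month Eliezer on short, checkable tasks")
    while months < target:
        nxt = 2 * months
        # Control applies only to short tasks the current model can actually check...
        print(f"{months}-month Eliezer controls the {nxt}-month Eliezer on short tasks")
        # ...but the long deferred task of building the next successor requires trust.
        print(f"{months}-month Eliezer trusts the {nxt}-month Eliezer to build the {2 * nxt}-month Eliezer")
        months = nxt

bootstrap(months=1, target=12_000)  # ~a thousand years of serial work by the end of the chain
```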