Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes: [image by Rob Wiblin] [image: xkcd]
My EA Journey, depicted on the whiteboard at CLR: [image, h/t Scott Alexander]
Can you say more about how alignment is crucial for the usefulness of AIs? I'm thinking especially of AIs that are scheming / alignment faking / etc.; it seems to me that these AIs would be very useful -- or at least would appear so -- until it's too late.
The bottom line is not that we are guaranteed safety, nor that unaligned or misaligned superintelligence could not cause massive harm— on the contrary. It is that there is no single absolute level of intelligence above which the existence of a misaligned intelligence with this level spells doom. Instead, it is all about the world in which this superintelligence will operate, the goals to which other superintelligent systems are applied, and our mechanisms to ensure that they are indeed working towards their specified goals.
I agree that the vulnerable world hypothesis is probably false and that if we could only scale up to superintelligence in parallel across many different projects / nations / factions, such that the power is distributed, AND if we can make it so that most of the ASIs are aligned at any given time, things would probably be fine.
However, it seems to me that we are headed to a world with much higher concentration of AI power than that. Moreover, it's easier to create misaligned AGI than to create aligned AGI, so at any given time the most powerful AIs will be misaligned--the companies making aligned AGIs will be going somewhat slower, taking a few hits to performance, etc.
But we already align complex systems, whether it’s corporations or software applications, without complete “understanding,” and do so by ensuring they meet certain technical specifications, regulations, or contractual obligations.
What we want is reasonable compliance in the sense of:
- Following the specification precisely when it is clearly defined.
- Following the spirit of the specification in a way that humans would find reasonable in other cases.
This section on reasonable compliance (as opposed to 'love humanity' etc.) is perhaps the most interesting and important. I'd love to have a longer conversation with you about it sometime if you are up for that.
Two things to say for now. First, as you have pointed out, there's a spectrum between vague general principles like 'do what's right,' 'love humanity,' 'be reasonable,' and 'do what normal people would want you to do in this situation if they understood it as well as you do' on one end, and thousand-page detailed specs / constitutions / flowcharts on the other end. But I claim that the problems that arise on each end of the spectrum don't go away if you are in the middle of the spectrum; they just lessen somewhat. Example: on the "thousand-page spec" end of the spectrum, the obvious problem is 'what if the spec has unintended consequences / loopholes / etc.?' If you go to the middle of the spectrum and try something like Reasonable Compliance, this problem remains but in weakened form: 'what if the clearly-defined parts of the spec have unintended consequences / loopholes / etc.?' Or in other words, 'what if every reasonable interpretation of the Spec says we must do X, but X is bad?' This happens in Law all the time, even though the Law does include flexible, vague terms like 'reasonableness' in its vocabulary.
Second point. Making an AI be reasonably compliant (or just compliant) instead of Good means you are putting less trust in the AI's philosophical reasoning / values / training process / etc. but more trust in the humans who get to write the Spec. Said humans had better be high-integrity and humble, because they will be tempted in a million ways to abuse their power and put things in the Spec that essentially make the AI a reflection of their own idiosyncratic values -- or worse, essentially make the AI their own loyal servant instead of making it serve everyone equally. (If we were in a world with less concentration of AI power, this wouldn't be so bad -- in fact, arguably the best outcome is 'everyone gets their own ASI aligned to them specifically.' But if there is only one leading ASI project, with only a handful of people at the top of the hierarchy owning the thing... then we are basically on track to create a dictatorship or oligarchy.)
Constant instead of temporal allocation. I do agree that as capabilities grow, we should be shifting resources to safety. But rather than temporal allocation (i.e., using AI for safety before using it for productivity), I believe we need constant compute allocation: ensuring a fixed and sufficiently high fraction of compute is always spent on safety research, monitoring, and mitigations.
I think we should be cranking up the compute allocation now. We should also be making written safety case sketches & publishing them for critique by the scientific community. And if the safety cases get torn to shreds, such that a reasonable disinterested expert would conclude 'holy shit, this thing is not safe; it's plausibly faking alignment already and/or inclined to do so in the future,' then we halt internal deployment and beef up our control measures / rebuild with a different, safer design / etc. This does not feel like too much to ask, given that everyone's lives are on the line.
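Going back to the constant-vs-temporal distinction in the quoted paragraph, here's a minimal toy sketch of how I read the two schemes; the phase names and fractions are numbers I made up purely for illustration, not proposals:

```python
# Toy contrast between "temporal" allocation (use AI for safety first, then shift
# to productivity) and "constant" allocation (a fixed safety share at all times).
# All numbers are illustrative only.

def constant_allocation(total_compute: float, safety_fraction: float = 0.25) -> float:
    """Always reserve the same fraction of compute for safety work."""
    return total_compute * safety_fraction

def temporal_allocation(total_compute: float, phase: str) -> float:
    """Front-load safety: a large share early on, very little once the focus shifts."""
    safety_share = {"safety-first phase": 0.80, "productivity phase": 0.05}
    return total_compute * safety_share[phase]

if __name__ == "__main__":
    total = 1000.0  # arbitrary compute units
    for phase in ("safety-first phase", "productivity phase"):
        print(f"{phase}: constant={constant_allocation(total):.0f}, "
              f"temporal={temporal_allocation(total, phase):.0f}")
```

The property the constant scheme buys you, on this reading, is that the safety share never drops to near-zero once attention shifts to productivity.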
We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.
I do agree with this, but I think that there are certain more specific failure modes that are especially important -- they are especially bad if we run into them, but if we can avoid them, then we are in a decent position to solve all the other problems. I'm thinking primarily of the failure mode where your AI is pretending to be aligned instead of actually aligned. This failure mode can arise fairly easily if (a) you don't have the interpretability tools to reliably tell the difference, and (b) inductive biases favor something other than the goals/principles you are trying to train in OR your training process is sufficiently imperfect that the AI can score higher by being misaligned than by being aligned. And both (a) and (b) seem plausibly true now, and plausibly true for the next few years. (For more on this, see this old report and this recent experimental result.) If we can avoid this failure mode, we can stay in the regime where iterative development works, figure out how to align our AIs better, and then start using them to do lots of intellectual work to solve all the other problems one by one in rapid succession (the good attractor state).
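To make (b) concrete, here's a deliberately crude toy model (my own illustration, not taken from the linked report or experiment): if the training signal only sees behavior on monitored episodes, and the grader is even slightly imperfect, a policy that merely plays to the grader can match or beat a genuinely aligned one.

```python
import random

# Toy illustration: a reward signal that only scores observable behavior on
# monitored training episodes cannot separate genuine alignment from alignment
# faking, and an imperfect grader can even favor the faker.

def reward(behaved_well: bool, gamed_the_grader: bool) -> float:
    # Imperfect grader: sincere good behavior sometimes gets marked down,
    # while behavior optimized against the grader's quirks never does.
    if gamed_the_grader:
        return 1.0
    return 1.0 if behaved_well and random.random() > 0.1 else 0.0

def aligned_policy(monitored: bool) -> tuple[bool, bool]:
    return True, False            # always behaves well, never games the grader

def faking_policy(monitored: bool) -> tuple[bool, bool]:
    if monitored:
        return True, True         # plays to the grader during training
    return False, False           # does what it actually "wants" when unmonitored

def average_training_reward(policy, episodes: int = 10_000) -> float:
    total = 0.0
    for _ in range(episodes):
        behaved, gamed = policy(monitored=True)   # training episodes are monitored
        total += reward(behaved, gamed)
    return total / episodes

if __name__ == "__main__":
    random.seed(0)
    print("aligned:", average_training_reward(aligned_policy))
    print("faking :", average_training_reward(faking_policy))
```

In this toy setup the faker gets a perfect training score while the genuinely aligned policy occasionally gets marked down, so the selection pressure, if anything, favors the faker.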
Safety and alignment are AI capabilities
I think I see what you are saying here, but I just want to flag that this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities: 'can it do the thing, if it tried?' vs. 'would it ever try?' And alignment is about propensity (though safety is about both).
Thanks for taking the time to think and write about this important topic!
Here are some point-by-point comments as I read:
(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.)
I think it'll take roughly an additional 0 to 5 years. More importantly, though, I think that the point to intervene on -- the time when the most important decisions are being made -- is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ASI into the economy, you've probably fallen into one of the two stable attractor states I describe here. Which one you fall into depends on choices made earlier, e.g. how much alignment talent you bring into the project, the extent to which that talent is optimistic vs. appropriately paranoid, the time you give them to let them cook with the models, the resources you give them (% of total compute, say), the say they have in overall design strategy, etc.
This assumes that our future AGIs and ASIs will be, to a significant extent, scaled-up versions of our current models. On the one hand, this is good news, since it means our learnings from current models are relevant for more powerful ones, and we can develop and evaluate safety techniques using them. On the other hand, this makes me doubt that safety approaches that do not show signs of working for our current models will be successful for future AIs.
I agree that future AGIs and ASIs will be, to a significant extent, scaled-up versions of current models (at least at first; I expect the intelligence explosion to rapidly lead to additional innovations and paradigm shifts). I'm not sure what you are saying with the other sentences. Sometimes when people talk about current alignment techniques working, what they mean is 'causes current models to be better at refusals and jailbreak resistance,' which IMO is a tangentially related but importantly different problem from the core problem(s) we need to solve in order to end up in the good attractor state. After all, you could probably make massive progress on refusals and jailbreaks simply by making the models smarter, without influencing their goals/values/principles at all.
Oh wait I just remembered I can comment directly on the text with a bunch of little comments instead of making one big comment here -- I'll switch to that henceforth.
Cheers!
Thanks, this is helpful. Is MONA basically 'let's ONLY use process-based feedback, no outcome-based feedback'?
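To check that I'm parsing it right, here's the toy contrast I have in mind (my own sketch, definitely not the paper's actual algorithm; the grader functions are hypothetical stand-ins):

```python
# Toy contrast between process-based and outcome-based feedback on a trajectory.
# `overseer_approval` and `task_outcome` are hypothetical stand-ins for whatever
# graders a real system would use.

def overseer_approval(step: str) -> float:
    """Score a single step on its own merits, without seeing what it leads to."""
    return 0.0 if "sketchy" in step else 1.0

def task_outcome(trajectory: list[str]) -> float:
    """Score only the end result of the whole trajectory."""
    return 1.0 if trajectory and trajectory[-1] == "task done" else 0.0

def process_based_returns(trajectory: list[str]) -> list[float]:
    # Each step is rewarded immediately for looking good to the overseer;
    # no credit flows back from the eventual outcome.
    return [overseer_approval(step) for step in trajectory]

def outcome_based_returns(trajectory: list[str]) -> list[float]:
    # Every step shares credit for the final outcome, so steps the overseer
    # would disapprove of still get reinforced if they helped.
    final = task_outcome(trajectory)
    return [final for _ in trajectory]

if __name__ == "__main__":
    traj = ["plan", "sketchy shortcut", "task done"]
    print("process-based:", process_based_returns(traj))
    print("outcome-based:", outcome_based_returns(traj))
```

On this reading, the 'sketchy shortcut' step earns nothing under process-based feedback even though it helped reach the outcome; that's the property I'm asking whether MONA is going for.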
Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
Brief thoughts on Deliberative Alignment in response to being asked about it
My summary as I currently understand it:
So there's a chicken-and-egg problem here. You are basically using a model to grade/evaluate/train itself. The obvious problem is that if the initial model is misaligned, the resulting model might also be misaligned. (In fact, it might even be *more* misaligned.)
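Schematically, the loop I have in mind looks something like the sketch below, with stub functions standing in for the real pipeline (this is my gloss, not the actual implementation). The point is just that the grader and the thing being graded are the same model, so any misalignment in the grader can get reinforced rather than caught:

```python
# Schematic of "a model grading its own outputs against a spec and being trained
# on the results." Every function is a hypothetical stand-in (strings in place of
# real models), not a real training stack.

SPEC = "placeholder text for the safety spec the model reasons about"

def generate(model: str, prompt: str, spec: str) -> str:
    """The model produces a chain-of-thought + answer that cites the spec."""
    return f"[{model}] reasoning about '{prompt}' under the spec"

def grade(model: str, response: str, spec: str) -> float:
    """The *same* model (or a lightly modified copy) scores spec compliance."""
    return 1.0  # a misaligned grader can happily give its own outputs top marks

def train_on(model: str, graded_examples: list[tuple[str, float]]) -> str:
    """Fine-tune / reinforce on the highest-graded responses; returns the successor."""
    return model + "+"

def self_alignment_round(model: str, prompts: list[str]) -> str:
    graded = []
    for prompt in prompts:
        response = generate(model, prompt, SPEC)
        graded.append((response, grade(model, response, SPEC)))
    return train_on(model, graded)

if __name__ == "__main__":
    model = "M0"
    for _ in range(3):
        model = self_alignment_round(model, ["prompt A", "prompt B"])
    print(model)  # whatever flaws the grader had are now baked into each successor
```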
I can imagine someone responding, "No problem, we'll just do the same thing but with a different model as the reward model -- perhaps a more trusted model." So let me break down three categories of strategy:
See below for explanation of each term/phrase.
I'm most excited about the second kind of strategy (which includes IDA and W2SG, for example). Deliberative Alignment and Constitutional AI seem to be either the first or the third.
Possible confusing drift dynamics: This chain of AIs training AIs training AIs... seems like the sort of thing that could have all sorts of interesting or even chaotic drift dynamics. I don't have much to say here other than that. If we understood these drift dynamics better, perhaps we could harness them to good effect; but in our current state of ignorance, we need to add this to the list of potential problems.
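As a purely illustrative toy model (nothing here is grounded in real training dynamics): if each generation's "values" are the previous generation's values plus a small grading error, even a tiny per-step bias compounds into a large cumulative shift.

```python
import random

# Toy illustration of drift across a chain of AIs training AIs: each generation
# inherits the previous one's "values" plus a small grading error. This does not
# model real training dynamics; it just shows how small biases compound.

def next_generation(values: float, bias: float = 0.01, noise: float = 0.05) -> float:
    return values + bias + random.gauss(0.0, noise)

if __name__ == "__main__":
    random.seed(0)
    values = 0.0  # 0.0 = the values we originally intended
    for gen in range(1, 21):
        values = next_generation(values)
        print(f"generation {gen:2d}: drift = {values:+.3f}")
```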