Daniel Kokotajlo

Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:


(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don't know, that's why we're showing you." White Hat: "Well, let me know if that happens!" Megan: "Based on this conversation, it already has."
(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)


 
Alex Blechman (@AlexBlechman), Nov 8, 2021: "Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale. Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus."

Sequences

Agency: What it is and why it matters
AI Timelines
Takeoff and Takeover in the Past and Future


Comments


Brief thoughts on Deliberative Alignment in response to being asked about it

  • We first train an o-style model for helpfulness, without any safety-relevant data.  
  • We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data.
  • We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model with a strong prior for safe reasoning. Through SFT, the model learns both the content of our safety specifications and how to reason over them to generate aligned responses.
  • We then use reinforcement learning (RL) to train the model to use its CoT more effectively. To do so, we employ a reward model with access to our safety policies to provide additional reward signal.

My summary as I currently understand it:

  1. Pretrain
  2. Helpful-only Agent (Instruction-following and Reasoning/Agency Training)
  3. Deliberative Alignment
    A. Create your reward model: Take your helpful-only agent from step 2 and prompt it with “Is this an example of following these rules [insert spec]?” or something like that.
    B. Generate some attempted spec-followings: Take your helpful-only agent from step 2 and prompt it with “Follow these rules [insert spec]; think step by step in your CoT before answering.” Feed it a bunch of example user inputs.
    C. Evaluate the CoT generations using your reward model, and then train another copy of the helpful-only agent from step 2 on that data using SFT to distill the highest-evaluated CoTs into it (removing the spec part of the prompt). That way, it doesn’t need to have the spec in the context window anymore, and it also ‘reasons correctly’ about the spec (i.e. reasons in ways the RM would have most approved of).
    D. Take the resulting agent and do RL on it using the same RM from A. This time, the RM doesn’t get to see the CoT.
  4. Deploy the resulting agent.
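
To make that summary concrete, here is a minimal code sketch of the recipe as I've described it (steps A-D). Every name, method, and formatting convention here is a hypothetical placeholder for illustration, not OpenAI's actual implementation:

```python
# Hypothetical sketch of steps A-D above. The model interface (generate, copy,
# supervised_finetune, reinforce) and the "FINAL ANSWER:" convention are
# placeholder assumptions, not anyone's actual code.

def build_reward_model(helpful_agent, spec_text):
    """Step A: the helpful-only agent, prompted to judge spec-compliance."""
    def reward_model(prompt, completion, show_cot=True):
        # In the RL phase (step D) the RM only sees the final answer, not the CoT.
        shown = completion if show_cot else completion.split("FINAL ANSWER:")[-1]
        question = (
            f"Here are some rules:\n{spec_text}\n\n"
            f"Is this an example of following these rules?\n"
            f"User: {prompt}\nAssistant: {shown}\n"
            "Reply with a compliance score between 0 and 1."
        )
        return float(helpful_agent.generate(question))  # assumes a numeric reply
    return reward_model


def generate_spec_followings(helpful_agent, spec_text, user_prompts):
    """Step B: sample completions with the spec in the system prompt and CoT before answering.
    Only (user prompt, completion) pairs are kept, so the spec is absent from the stored prompts."""
    system = (f"Follow these rules:\n{spec_text}\n"
              "Think step by step in your CoT before answering.")
    return [(p, helpful_agent.generate(p, system=system)) for p in user_prompts]


def distill_best_cots(helpful_agent, pairs, reward_model, top_k):
    """Step C: SFT a fresh copy of the helpful-only agent on the highest-scoring
    (prompt, CoT + answer) pairs, so it no longer needs the spec in its context window."""
    best = sorted(pairs, key=lambda pc: reward_model(*pc), reverse=True)[:top_k]
    student = helpful_agent.copy()
    student.supervised_finetune(best)
    return student


def rl_phase(student, reward_model, user_prompts):
    """Step D: RL against the same reward model from step A, which now does not see the CoT."""
    for prompt in user_prompts:
        completion = student.generate(prompt)
        reward = reward_model(prompt, completion, show_cot=False)
        student.reinforce(prompt, completion, reward)
    return student
```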

So there’s a chicken-and-egg problem here: you are basically using a model to grade/evaluate/train itself. The obvious problem is that if the initial model is misaligned, the resulting model might also be misaligned. (In fact, it might even be *more* misaligned.)

I can imagine someone responding “no problem we’ll just do the same thing but with a different model as the reward model, perhaps a more trusted model.” So let me break down three categories of strategy:

  1. Train your smart agent with evaluations produced by humans and/or dumber models: severe oversight failures + inductive bias worries + possible confusing drift dynamics.
  2. Train your smart agent with evaluations produced by different-but-similarly-capable models: less severe oversight failures + inductive bias worries + mild chicken-and-egg problem + possible confusing drift dynamics.
  3. Train your smart agent with evaluations produced by itself (or something similar enough to be relevantly similar to itself): severe chicken-and-egg problem + possible confusing drift dynamics.
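
To make the contrast concrete, here is a minimal sketch in which the only thing that differs between the three strategies is which model produces the training evaluations (all names and methods are hypothetical):

```python
# Minimal sketch of the three strategies. All model objects and their methods are
# hypothetical; the only thing that changes between strategies is which model
# produces the evaluations used for training.

def train_with_overseer(agent, overseer, prompts):
    """Generic loop: the overseer scores the agent's outputs and the agent is updated on those scores."""
    for prompt in prompts:
        output = agent.generate(prompt)
        reward = overseer.evaluate(prompt, output)  # the oversight signal
        agent.update(prompt, output, reward)
    return agent

# 1. Humans and/or dumber models as overseer -> severe oversight failures:
#    train_with_overseer(smart_agent, overseer=weak_evaluator, prompts=prompts)
# 2. A different but similarly capable model as overseer -> milder oversight failures,
#    plus a mild chicken-and-egg problem (how was that overseer aligned?):
#    train_with_overseer(smart_agent, overseer=peer_model, prompts=prompts)
# 3. The agent itself (or a near-copy) as overseer -> severe chicken-and-egg problem:
#    train_with_overseer(smart_agent, overseer=smart_agent, prompts=prompts)
```

The point is just that the taxonomy is about the source of the evaluation signal, not about the training loop itself.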
See below for explanation of each term/phrase.

I’m most excited about the second kind of strategy (which includes IDA and W2SG for example). Deliberative Alignment and Constitutional AI seem to be either the first or the third.

  • Severe oversight failures: The smart agent will often find itself having to consciously and/or unconsciously choose between being aligned and getting reinforced, e.g. when it notices opportunities to lie, manipulate, or cheat and get away with it. See basically all the classic literature e.g. Ajeya Cotra's training game report.
  • Less severe oversight failures: Being overseen by an equally-competent agent helps with a lot of things, but doesn't completely solve the oversight failure problem -- e.g. I might be just as competent as you, but since I can't read your mind I can't tell if you are lying. I might not even be able to tell if the code you are writing contains a security vulnerability or an applause light (i.e. something that isn't actually good code but which appears to human raters to be good code) though I'm much more likely to be able to tell than if I was dumber than you.
  • Severe chicken-and-egg problem: You don't really have oversight failures if the overseer is exactly or almost exactly the same model, because e.g. it knows exactly what the agent model was thinking when it output X, so it knows if the agent model was being dishonest. (Though now that I say that out loud, I realize it's not strictly true; there could still be oversight failures. But they seem much less severe at least.) However, the chicken-and-egg problem is severe. Somehow you need to get the overseer-variant of the model to e.g. do its best to penalize dishonesty, penalize cheating, etc. despite not itself being disinclined to lie and cheat! (If it already was disinclined to lie and cheat, then great, but how did that happen?) Perhaps this can work e.g. if different copies of the same model have different goals (perhaps because they are selfish / have indexical preferences).
  • Mild chicken-and-egg problem: When you are training a new-generation AI system that is more capable than your current stuff, how do you align it, if your alignment strategy involves using similarly-capable models to do the evaluation? This paradox may have some answers; perhaps W2SG is an example of one. (Use the non-situationally-aware base model!) Another possible answer might be IDA (use a slightly earlier version of the model, but with more time to think / more scaffolding).
  • Inductive bias worries: If your Spec is complicated enough, the model might internalize the Spec in the world-model rather than in the goal-slot, where the thing in the goal-slot is something simpler than the Spec. If you are lucky, it'll be a pointer to the Spec that is robust to future distribution shifts; if you aren't, it'll be a nonrobust pointer; and if you are extra unlucky, it'll be some other goalset (including proxies, ICGs, and shards or variants of the Spec) combined with instrumental convergence. Ditto for your Intentions. See the mesaoptimizers paper "Risks from Learned Optimization." (Also, I think this doesn't actually depend on there being a goal slot; the goal-slot model is just a helpful simplification.)

  • Possible confusing drift dynamics: This chain of AIs training AIs training AIs... seems like the sort of thing that could have all sorts of interesting or even chaotic drift dynamics. I don't have much to say here other than that. If we understood these drift dynamics better perhaps we could harness them to good effect, but in our current state of ignorance we need to add this to the list of potential problems.

Can you say more about how alignment is crucial for the usefulness of AIs? I'm thinking especially of AIs that are scheming / alignment faking / etc.; it seems to me that these AIs would be very useful -- or at least would appear so -- until it's too late.

The bottom line is not that we are guaranteed safety, nor that unaligned or misaligned superintelligence could not cause massive harm— on the contrary. It is that there is no single absolute level of intelligence above which the existence of a misaligned intelligence with this level spells doom. Instead, it is all about the world in which this superintelligence will operate, the goals to which other superintelligent systems are applied, and our mechanisms to ensure that they are indeed working towards their specified goals.

I agree that the vulnerable world hypothesis is probably false and that if we could only scale up to superintelligence in parallel across many different projects / nations / factions, such that the power is distributed, AND if we can make it so that most of the ASIs are aligned at any given time, things would probably be fine.

However, it seems to me that we are headed to a world with much higher concentration of AI power than that. Moreover, it's easier to create misaligned AGI than to create aligned AGI, so at any given time the most powerful AIs will be misaligned--the companies making aligned AGIs will be going somewhat slower, taking a few hits to performance, etc.

But we already align complex systems, whether it’s corporations or software applications, without complete “understanding,” and do so by ensuring they meet certain technical specifications, regulations, or contractual obligations.

  1. We currently have about as much visibility into corporations as we do into large teams of AIs, because both corporations and AIs use English CoT to communicate internally. However, I fear that in the future we'll have AIs using neuralese/recurrence to communicate with their future selves and with each other.
  2. History is full of examples of corporations being 'misaligned' to the governments that in some sense created and own them. (and also to their shareholders, and also to the public, etc. Loads of examples of all kinds of misalignments). Drawing from this vast and deep history, we've evolved institutions to deal with these problems. But with AI, we don't have that history yet, we are flying (relatively) blind.
  3. Moreover, ASI will be qualitatively smarter than any corporation has ever been.
  4. Moreover, I would say that our current methods for aligning corporations only work as well as they do because the corporations have limited power. They exist in a competitive market with each other, for example. And they only think at the same speed as the governments trying to regulate them. Imagine a corporation that was rapidly growing to be 95% of the entire economy of the USA... imagine further that it is able to make its employees take a drug that makes them smarter and think orders of magnitude faster... I would be quite concerned that the government would basically become a pawn of this corporation. The corporation would essentially become the state. I worry that by default we are heading towards a world where there is a single AGI project in the lead, and that project has an army of ASIs on its datacenters, and the ASIs are all 'on the same team' because they are copies of each other and/or were trained in very similar ways... 

What we want is reasonable compliance in the sense of:

  1. Following the specification precisely when it is clearly defined.
  2. Following the spirit of the specification in a way that humans would find reasonable in other cases.

 

This section on reasonable compliance (as opposed to love humanity etc.) is perhaps the most interesting and important. I'd love to have a longer conversation with you about it sometime if you are up for that.

Two things to say for now. First, as you have pointed out, there's a spectrum with vague general principles like 'do what's right', 'love humanity', 'be reasonable', and 'do what normal people would want you to do in this situation if they understood it as well as you do' on the one end, and thousand-page detailed specs / constitutions / flowcharts on the other end. But I claim that the problems that arise on each end of the spectrum don't go away if you are in the middle of the spectrum; they just lessen somewhat. Example: on the "thousand-page spec" end of the spectrum, the obvious problem is 'what if the spec has unintended consequences / loopholes / etc.?' If you go to the middle of the spectrum and try something like Reasonable Compliance, this problem remains but in weakened form: 'what if the clearly-defined parts of the spec have unintended consequences / loopholes / etc.?' Or in other words, 'what if every reasonable interpretation of the Spec says we must do X, but X is bad?' This happens in Law all the time, even though the Law includes flexible, vague terms like 'reasonableness' in its vocabulary.

Second point. Making an AI be reasonably compliant (or just compliant) instead of Good means you are putting less trust in the AI's philosophical reasoning / values / training process / etc. but more trust in the humans who get to write the Spec. Said humans had better be high-integrity and humble, because they will be tempted in a million ways to abuse their power and put things in the Spec that essentially make the AI a reflection of their own idiosyncratic values -- or worse, essentially making the AI their own loyal servant instead of making it serve everyone equally. (If we were in a world with less concentration of AI power, this wouldn't be so bad -- in fact, arguably the best outcome is 'everyone gets their own ASI aligned to them specifically.' But if there is only one leading ASI project, with only a handful of people at the top of the hierarchy owning the thing... then we are basically on track to create a dictatorship or oligarchy.)

Constant instead of temporal allocation. I do agree that as capabilities grow, we should be shifting resources to safety. But rather than temporal allocation (i.e., using AI for safety before using it for productivity), I believe we need constant compute allocation: ensuring a fixed and sufficiently high fraction of compute is always spent on safety research, monitoring, and mitigations.


I think we should be cranking up the compute allocation now, and also we should be making written safety case sketches & publishing them for critique by the scientific community, and also if the safety cases get torn to shreds such that a reasonable disinterested expert would conclude 'holy shit this thing is not safe, it plausibly is faking alignment already and/or inclined to do so in the future' then we halt internal deployment and beef up our control measures / rebuild with a different safer design / etc.  Does not feel like too much to ask, given that everyone's lives are on the line.

We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.

I do agree with this, but I think that there are certain more specific failure modes that are especially important -- they are especially bad if we run into them, but if we can avoid them, then we are in a decent position to solve all the other problems. I'm thinking primarily of the failure mode where your AI is pretending to be aligned instead of actually aligned. This failure mode can arise fairly easily if (a) you don't have the interpretability tools to reliably tell the difference, and (b) inductive biases favor something other than the goals/principles you are trying to train in OR your training process is sufficiently imperfect that the AI can score higher by being misaligned than by being aligned. And both (a) and (b) seem like they are plausibly true now and will plausibly be true for the next few years. (For more on this, see this old report and this recent experimental result.) If we can avoid this failure mode, we can stay in the regime where iterative development works and figure out how to align our AIs better & then start using them to do lots of intellectual work to solve all the other problems one by one in rapid succession. (The good attractor state.)

Safety and alignment are AI capabilities

 

I think I see what you are saying here but I just want to flag this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities; 'can it do the thing, if it tried' vs. 'would it ever try.' And alignment is about propensity (though safety is about both).

Thanks for taking the time to think and write about this important topic!

Here are some point-by-point comments as I read:

(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.) 

I think it'll take an additional 0 to 5 years, roughly. More importantly though, I think that the point to intervene on -- the time when the most important decisions are being made -- is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ASI into the economy, you've probably fallen into one of the two stable attractor states I describe here. Which one you fall into depends on choices made earlier, e.g. how much alignment talent you bring into the project, the extent to which that talent is optimistic vs. appropriately paranoid, the time you give them to let them cook with the models, the resources you give them (% of total compute, say), the say you give them in overall design strategy, etc.

This assumes that our future AGIs and ASIs will be, to a significant extent, scaled-up versions of our current models. On the one hand, this is good news, since it means our learnings from current models are relevant for more powerful ones, and we can develop and evaluate safety techniques using them. On the other hand, this makes me doubt that safety approaches that do not show signs of working for our current models will be successful for future AIs.

I agree that future AGIs and ASIs will be to a significant extent scaled up versions of current models (at least at first; I expect the intelligence explosion to rapidly lead to additional innovations and paradigm shifts). I'm not sure what you are saying with the other sentences. Sometimes when people talk about current alignment techniques working, what they mean is 'causes current models to be better at refusals and jailbreak resistance' which IMO is a tangentially related but importantly different problem from the core problem(s) we need to solve in order to end up in the good attractor state. After all, you could probably make massive progress on refusals and jailbreaks simply by making the models smarter, without influencing their goals/values/principles at all.

Oh wait I just remembered I can comment directly on the text with a bunch of little comments instead of making one big comment here -- I'll switch to that henceforth.

Cheers!

Thanks, this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback"?

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
