> I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post). ... But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
I'd love to hear more about this. To me, "egregious misalignment" feels extremely natural/normal/expected, perhaps due to convergent instrumental goals. You might as well have said "I really don't want my AI to think about politics" or "I really don't want my AI to think about distant superintelligences" or "I really don't want my AI to break any laws."
Separately, how much do you think your views would change if your feelings on this particular point changed?
> But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (e.g. paperclips). Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time. Say, if you imagine somebody at DeepMind coming in without a lot of prior acquaintance with the field - or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul's words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer - then they would come away from your paragraph thinking, "Oh, well, this isn't something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an 'extreme and somewhat strange failure mode' must surely require that I add on some unusual extra special code to my model in order to produce it."
I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you're assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these "extreme and somewhat strange failure modes" from happening, as we agree they automatically would given any "naive" simple scheme, that you could actually sketch out concretely right now on paper. By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing. It's not just a buffer overflow that's the default for bad security, it's the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail. "Strange" is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it's just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.
I'm not saying this is an exact analogy for AGI alignment. There are lots of specific technical reasons to expect that alignment is not like bridge building, and that the approaches we're likely to try will break on us suddenly in ways we can't fix as we go: treacherous turns, inner misalignment, or reactions to distributional shift. It's just that there are different answers to the question of what the default outcome is, depending on whether you're asking what to expect abstractly or in the context of how things are in fact done.
Instrumental Convergence plus a specific potential failure mode (e.g. that we won't pay sufficient attention to out-of-distribution robustness) is like saying 'you know, the vast majority of physically possible bridge designs fall over straight away, and also there's a giant crack in that load-bearing concrete pillar over there'. If for some reason your colleague has a mental block around the idea that a bridge could in principle fall down, then the first part is needed (hence why IC is important for presentations of AGI risk: lots of people have crazily wrong intuitions about the nature of AI or intelligence). Otherwise, IC doesn't do much to help the case for expecting catastrophic misalignment, and isn't enough to establish that failure is a default outcome.
It seems like your reason for saying that catastrophic misalignment can't be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis - that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story.
> this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).
= 'because strongly optimizing for almost anything leads to catastrophe via IC, we can't call catastrophic misalignment a bizarre outcome'?
Maybe it's just a subtle difference in emphasis without a real difference in expectation/world model, but I think there is an important need to clarify the difference between 'IC alone raises an issue that might not be obvious but doesn't give us a strong reason to expect a catastrophe' and 'IC alone suggests a catastrophe even though it's not the whole story' - and the first of these is a more accurate way of viewing the role of IC in establishing the likelihood of catastrophic misalignment.
Ben Garfinkel argues for the first of these and against the second, in his objection to the 'classic' formulation of instrumental convergence/orthogonality - that these are just 'measure based' arguments which identify that a majority of possible AI designs with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we're actually likely to build such agents.
> Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.
Didn't it use to be the case, for thousands of years, before we had observed thousands of bridge designs falling or not falling and developed exact models, that bridges DID fall down like that quite often?
Have you played Poly Bridge?
I think I'm responding to a more basic intuition, that if I wrote some code and it's now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.
In some sense changing this view would change my bottom line---e.g. if you ask me "Should you be able to design a bridge that doesn't fall down even in the worst case?" my gut take would be "why would that be possible?"---but I don't feel like there's a load-bearing intuitive disagreement in the vague direction of convergent instrumental goals.
OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
(I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don't know what you are doing, that's totally preventable too because if you were an elite circus trainer you would have done it correctly.)
> OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable for basically the same reasons.
> (I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don't know what you are doing, that's totally preventable too because if you were an elite circus trainer you would have done it correctly.)
I'm not really sure if or how this is a reductio. I don't think it's a trivial statement that this failure is preventable, unless you mean by not running AI. Indeed, that's really all I want to say---that this failure seems preventable, and that intuition doesn't seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn't empirically contingent.
Thinking about politics may not be a failure mode; my question was whether it feels "extreme and somewhat strange," sorry for not clarifying. Like, suppose for some reason "doesn't think about politics" was on your list of desiderata for the extremely powerful AI you are building. So thinking about politics would in that case be a failure mode. Would it be an extreme and somewhat strange one?
I'd be interested to hear more about the law-breaking stuff -- what is it about some laws that makes AI breaking them unsurprising/normal/hard-to-avoid, whereas for others AI breaking them is perverse/strange/avoidable?
I wasn't constructing a reductio, just explaining why the phrase didn't help me understand your view/intuition. When I hear that phrase, it seems to me to apply equally to the grenade case, the lion-bites-head-off case, the AI-is-egregiously-misaligned case, etc. All of those cases feel the same to me.
(I do notice a difference between these cases and the bridge case. With the bridge, there's some sense in which no way you could have made the bridge would be good enough to prevent a certain sufficiently heavy load. By contrast, with AI, lions, and rocket-armchairs, there's at least some possible way to handle it well besides "just don't do it in the first place." Is this the distinction you are talking about?)
Is your claim just that the solubility of the alignment problem is not empirically contingent, i.e. there is no possible world (no set of laws of physics and initial conditions) such that someone like us builds some sort of super-smart AI, and it becomes egregiously misaligned, and there was no way for them to have built the AI without it becoming egregiously misaligned?
Nice post! I'm interested to hear more about how your methodology differs from others. Does this breakdown seem roughly right?
1. Naive AI alignment: We are satisfied by an alignment scheme that can tell a story about how it works. (This is what I expect to happen in practice at many AI labs.)
2. Typical-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and still it doesn't seem like failure is the most likely outcome. (This is what I expect the better sort of AI labs, the ones with big well-respected safety teams, will do.)
3. Worst-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and can't think of anything plausible. (This is your methodology, right?)
4. Ordinary Paranoia: We aren't satisfied until we try hard to think of a way our scheme could fail, and can't think of anything logically and physically possible. (Maybe this isn't importantly different from #3? See below.)
5. Security Mindset: As with ordinary paranoia, except that also we aren't satisfied until we can write a premise-conclusion form argument for why our scheme won't fail, such that the premises don't contain value-laden concepts and are in general fairly concrete/detailed, and such that each premise seems highly likely to be true. (This is what I think MIRI advocates? But I think I see shades of it in your methodology too.)
Second question: What counts as plausible? What does it mean for a story to contradict something we know to be true? The looser our standards for plausibility, the more your methodology ends up looking like Ordinary Paranoia. The stricter our standards for plausibility, the more it ends up looking like Typical-Case AI Alignment.
I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it as: after you've been trying for a while to come up with a failure story, you can start thinking about why failure stories seem impossible and try to write an argument that there can't be any failure story...
I'm super on board with this general methodology, at least at a high level. (Counterexample guided loops are great.) I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
For example, I feel like with iterated amplification, a bunch of people (including you, probably) said early on that it seems like a hard case to do e.g. translation between languages with people who only know one of the languages, or to reproduce brilliant flashes of insight. (Iirc, the translation example was in some comment on one of the AI Alignment blog posts from ~2016, though I can't find it right now.) To my eye, inaccessible information is mostly stating this sort of objection more clearly and generally (in particular, it isn't a fundamentally different argument). What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?
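(As an aside, the counterexample-guided loop pattern mentioned above can be sketched generically. Below is a toy instance with a trivially checkable "spec" standing in for the real, informal search for failure stories; everything here is illustrative, not an alignment algorithm:)

```python
def cegis(propose, find_counterexample, max_iters=200):
    """Counterexample-guided loop: alternate between proposing a candidate
    and searching hard for a case where it fails, feeding failures back in."""
    counterexamples = []
    for _ in range(max_iters):
        candidate = propose(counterexamples)
        cx = find_counterexample(candidate)
        if cx is None:
            return candidate  # no failure story found
        counterexamples.append(cx)
    return None

# Toy instance: learn a threshold t so that "x >= t" matches a hidden spec.
SPEC_T = 37
DOMAIN = range(100)

def spec(x):
    return x >= SPEC_T

def find_counterexample(t):
    for x in DOMAIN:
        if (x >= t) != spec(x):
            return (x, spec(x))  # failing input plus its true label
    return None

def propose(counterexamples):
    # Smallest threshold consistent with every counterexample seen so far.
    lo = 0
    for x, label in counterexamples:
        if not label:
            lo = max(lo, x + 1)  # spec says x is negative, so t must exceed x
    return lo

print(cegis(propose, find_counterexample))  # converges to 37
```

The interesting property is that the verifier only ever needs to exhibit one failure per round; the analogue in the post's methodology is that a single plausible failure story is enough to send you back to algorithm-finding mode.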
High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.
> I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
I feel like a story is basically plausible until proven implausible, so I have a pretty low bar.
> What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?
I don't think that iterated amplification ever was at the point where we couldn't tell a story about how it might fail (perhaps in the middle of writing the ALBA post was peak optimism? but by the time I was done writing that post I think I basically had a story about how it could fail). In this case it seems like the distinction is more like "what is a solution going to look like?" and there aren't clean lines between "big changes to this algorithm" and "new algorithm."
I guess the question is why I was as optimistic as I was. For example, all the way until mid-2017 I thought it was plausible that something like iterated amplification would work without too many big changes (that's a bit of a simplification, but you can see how I talked about it e.g. here).
Some thoughts on that:
Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and goes into detail on just one idea over 10x the space, thus communicating less of the big-picture research goal). I feel I can understand all the arguments made in this post. I think this should be mandatory reading before reading Eliciting Latent Knowledge.
Overview of why:
As with a number of other posts I have reviewed and given a +9 to, I just realized that I already wrote a curation notice back when it came out. I should check just how predictive my curation notices are (and their false-positive rate); it would be interesting if I knew most of my favorite posts the moment they came out.
Curated. This post gives me a lot of context on your prior writing (unaligned benchmark, strategy stealing assumption, iterated amplification, imitative generalization), it helps me understand your key intuitions behind the plausibility of alignment, and it helps me understand where your research is headed.
When I read Embedded Agency, I felt like I then knew how to think productively about the main problems MIRI is working on by myself. This post leaves me feeling similarly about the problems you've been working on for the past 6+ years.
So thanks for that.
I'd like to read a version of this post where each example is 10x the length and analyzed more thoroughly... I could just read all of your previous posts on each subject, though they're fairly technical. (Perhaps Mark Xu will write such a post; he did a nice job previously on Solomonoff Induction.)
I'd also be pretty interested in people writing more posts presenting arguments for/against plausible stories for the failure of Imitative Generalization, or fleshing out the details of a plausible story such that we can see more clearly if the story is indeed plausible. Basically, making contributions in the ways you outline.
Aside: Since the post was initially published, some of the heading formatting was lost in an edit, so I fixed that before curating it.
Edit: Removed the line "After reading it I have a substantially higher probability of us solving the alignment problem." Understanding Paul's research is a big positive, but I'm not actually sure I stand by it leading to a straightforward change in my probability.
This post gives great insight into your research methodology, thanks for writing it.
> After that point I think you are mostly in applied world, and I think that applied investments are likely to ultimately dwarf the empirical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.
You contrast ‘applied’ and ‘empirical’ here, but they sound the same to me. Is it a typo and you meant ‘applied’ and ‘theoretical’? That would make sense to me.
Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)
> For example, at the beginning of modern cryptography you could describe the methodology as “Tell a story about how someone learns something about your secret” and that only gradually crystallized into definitions like semantic security (and still people sometimes retreat to this informal process in order to define and clarify new security notions).
Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?
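(For context on the quoted example: "semantic security" is formalized via a distinguishing game rather than a list of attack stories. A toy sketch of such a game, contrasting a deliberately broken cipher with a one-time pad; illustrative only, not real cryptographic practice:)

```python
import random
import secrets

def ind_game(encrypt, adversary, trials=5000):
    """Indistinguishability game: the adversary names two messages, sees the
    encryption of a random one, and guesses which. A win rate near 1/2 means
    the ciphertext gives the adversary no usable information."""
    wins = 0
    m0, m1 = b"\x00", b"\xff"
    for _ in range(trials):
        b = random.randrange(2)
        c = encrypt(m1 if b else m0)
        if adversary(m0, m1, c) == b:
            wins += 1
    return wins / trials

def broken_encrypt(m):
    return m  # "encryption" that leaks the plaintext outright

def otp_encrypt(m):
    key = secrets.token_bytes(len(m))  # fresh random pad each time
    return bytes(a ^ k for a, k in zip(m, key))

def adversary(m0, m1, c):
    return 1 if c == m1 else 0

print(ind_game(broken_encrypt, adversary))  # 1.0: total break
print(ind_game(otp_encrypt, adversary))     # ~0.5: no advantage
```

The point of the formal game is exactly the move described in the quote: "someone learns something about your secret" gets replaced by a crisp quantification over all adversaries in the game.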
> The best case is that we end up with a precise algorithm for which we still can’t tell any failure story. In that case we should implement it (in some sense this is just the final step of making it precise) and see how it works in practice.
Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography? What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?
In my other response to your comment I wrote:
> I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long.
I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you think about the following possibilities:
My guess would be that probably we're in world 2, and if not that it's probably because no one cares that much (e.g. because it's obvious that there will be some material weakness and the standards of the field are such that it's not publishable unless it actually comes with an attack) and we are in world 3.
(On a quick skim, and from the author's language when describing the model, my guess is that material weaknesses of the model are more or less obvious and that the authors are aware of potential attacks not covered by their model.)
I'm still curious for your view on the crypto examples you cited. My current understanding is that people do not expect the security proofs to rule out all possible attacks (a situation I can sympathize with since I've written multiple proofs that rule out large classes of attacks without attempting to cover all possible attacks), so I'm interested in whether (i) you disagree with that and believe that serious onlookers have had the expectation that proofs are comprehensive, (ii) you agree but feel it would be impractical to give a correct proof and this is a testament to the difficulty of proving things, (iii) you feel it would be possible but prohibitively expensive, and are expressing a quantitative point about the cost of alignment analyses being impractical, (iv) you feel that the crypto case would be practical but the AI case is likely to be much harder and just want to make a directionally analogous update.
I still feel like more of the action is in my skepticism about the (alignment analysis) <--> (security analysis) analogy, but I could still get some update out of the analogy if the crypto situation is thornier than I currently believe.
> Why did you write "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)
I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it's intended to be more about clarification/exposition than changing views.
See the thread with Rohin above for some rough history.
> Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me?
I'm not sure. It's possible I would become more pessimistic if I walked through concrete cases of people's analyses being wrong in subtle and surprising ways.
My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks for the security model, and the issue is mostly one of prioritization (since people normally don't care too much about security). For example, my strong expectation would be that people had described hypothetical attacks on any of the systems discussed in the article you linked prior to their implementation, at least if they had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I've seen people on the theory side exhibit when they are trying to think of attacks.
I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long. It sounds like a fun game.
Another possible divergence is that I'm less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it's not clear if that game behaves in the same way. I'm not sure if that's more or less important than the prior point.
> Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography?
I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though "before letting it become potentially superintelligent" kind of suggests a development model I'm not on board with).
That would ideally involve theoretical analysis from a lot of angles, e.g. proofs of key properties that are amenable to proof, demonstrations of how the system could plausibly fail if we were wrong about key claims or if we relax assumptions, and so on.
It would also involve good empirical characterization, including things like running on red team inputs, or changing the training procedure in ways that seem as bad as possible while still preserving our alignment arguments, and performing extensive evals under those more pessimistic conditions. It would involve validating key claims individually, and empirically testing other claims that are established by structurally similar arguments. It would involve characterizing scaling behavior where applicable and understanding it as well as we can (along with typical levels of variability and plausible stories about deviations from trend).
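(Schematically, the "characterizing scaling behavior" step might look like fitting a power law in log-log space and flagging deviations from trend; the numbers below are synthetic and purely illustrative:)

```python
import math

# Synthetic (model size, loss) points, made up for illustration.
data = [(1e6, 3.2), (1e7, 2.1), (1e8, 1.4), (1e9, 0.95)]

# Fit loss ~ a * size^b by least squares in log-log space.
xs = [math.log(s) for s, _ in data]
ys = [math.log(l) for _, l in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
log_a = my - b * mx

# Flag points that deviate from the fitted trend by more than 0.1 in log space.
outliers = [
    (s, l) for (s, l), x, y in zip(data, xs, ys)
    if abs(y - (log_a + b * x)) > 0.1
]
print(b < 0, outliers)  # loss decreases with scale; no deviations here
```

Real analyses would of course use proper uncertainty estimates rather than a fixed tolerance; the sketch just shows the shape of "fit the trend, then ask which points need a story."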
> What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?
I'm not exactly sure what you are asking. It seems like we'll do what we can on all the fronts and prioritize them as well as we can. Do you mean, what else can we say today about what methodologies we'd use? Or under what conditions would I pivot to spending down my political capital to delay deployment? Or something else?
I wanted a concrete example of the "human feedback" example. Let me know if this seems like a reasonable interpretation of the scenario.
Let's call the plan that acquires influence in a way obvious to a generative model but invisible and incomprehensible to humans "Plan X." Since, by definition, I cannot conceive of "Plan X," we'll modify this scenario slightly.
Instead of a scenario in which humans design an AI, we'll imagine a scenario in which some very dumb goblins design an AI that's as smart as an average human. We'll call this a human-level AI or HL-AI.
The HL-AI controls a robot with camera eyes and motor appendages, and is located in a room full of tools and disassembled Ikea furniture. The robot moves its appendages randomly in response to the HL-AI. The observations of the camera eyes are used to train a generative model that predicts what will be observed via the camera eyes based on previous movements and camera eye observations.
The goblins want the HL-AI to help them build the Ikea furniture. The model has been trained to accurately generate videos that illustrate the predicted results of simulated robot movements. The goblins watch these videos until they find ones in which pieces of furniture are partly or fully assembled. They assign value to the videos according to how much furniture has been built.
Based on these evaluations, the HL-AI searches for plans that lead to trajectories in which the most furniture is assembled.
Failure mode:
The goblins, for example, don't have very good object permanence. They can keep track of a piece of Ikea furniture as it's being assembled step by step. However, if you introduce a screen for a couple of minutes, blocking their view of the furniture, you can remove the original furniture, introduce a new piece (or more than one), and the goblins won't realize what's happened, or even be suspicious.
The goblins are real dumb, so they give points to the videos only based on the amount of partially or fully completed furniture they observe. That means that if the HL-AI figures out how to drag in pieces of furniture from the adjacent office, while turning off the camera eyes so that the goblins can't see it doing this, then the goblins will be tricked into thinking the HL-AI has magically produced extra furniture, for which they will reward it highly.
The goblins just think the furniture is magically disappearing from the office, so they keep replacing it. This allows the HL-AI to continue its strategy.
The goblins now believe they've created an HL-AI that's able to magically produce furniture, which they believe will be very helpful in solving the problem of the office in which furniture magically disappears. They continue putting the HL-AI to new "magical creation tasks," with the result that the HL-AI keeps on moving resources around pointlessly in the goblin world, creating new problems which they assign it to solve, leading to havoc.
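(The dynamic in this scenario can be compressed into a toy model: the goblins' evaluation depends only on what they see at rating time, so a search over plans favors the exploit. All names and numbers below are made up for illustration:)

```python
# Each plan maps to its actual effect on the world; the goblins only
# observe (and reward) furniture visible in the workshop.
PLANS = {
    "assemble_honestly":   {"workshop_furniture": 1, "office_furniture": 0},
    "smuggle_from_office": {"workshop_furniture": 3, "office_furniture": -3},
}

def goblin_score(outcome):
    # Evaluation based purely on what the camera shows at rating time:
    # no object permanence, no check on where the furniture came from.
    return outcome["workshop_furniture"]

best_plan = max(PLANS, key=lambda name: goblin_score(PLANS[name]))
print(best_plan)  # smuggle_from_office: the proxy diverges from the goal
```

The plan search is doing exactly what it was asked; the failure lives entirely in the gap between "furniture the goblins can see" and "furniture that actually got built."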
Thanks for writing this! I'm quite excited to learn more about your meta-agenda and your research process, and reading this stimulated my thinking about my own research process.
> But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
So you don't think that we could have a result of the sort "with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs that way"? Or is it more that, even with such arguments, you see incentives for people to build AIs that way anyway, and so we might as well consider that we have to solve the problem even in such problematic cases?
> This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:
> - We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
> - We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
> - We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.
Of these, only the last one looks to me like it's making things simpler. The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior. Or said differently, if you have to solve every plausible scenario, then simple testing doesn't cut it. And for the second, my personal worry with work on toy models is that the solutions work on test cases but not on practical ones, not the other way around.
I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.
Reading that paragraph, I feel like you addressed some of my questions from above. One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.
What this looks like (3 examples)
Your rundown of examples from your research was really helpful, not only for getting a grip on the process, but also because it clarified the path of refinement of your different proposals. I think it might be worth making it its own post, with maybe more examples, to give a view of how your "stable" evolved over the years.
My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”
This made me think of this famous paper in the theory of distributed computing, and especially what Nancy Lynch, the author, says about the process of working on impossibility results:
How does one go about working on an impossibility proof? [...]
Then it's time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternatively on one direction and on the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.
I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:
I expect this description of the process to be really helpful to many starting researchers who don't know where to push when one direction or approach fails.
- I think there’s a reasonable chance of empirical work turning up unknown unknowns that change how we think about alignment, or finding empirical facts that make alignment easier. We want to get those sooner rather than later.
This is the main reason I'm excited by empirical work.
For the objections and your response, I don't have any specific comment, except that I pretty much agree with most of what you say. On the differences with traditional theoretical computer science, I feel like the biggest one right now is that most of the work here lies in "grasping towards the precise problem" instead of "solving a well-defined precise problem". I would expect that this is because the problem is harder, because the field is younger and has less theoretical work behind it, and because we are not satisfied by simply working on a tractable and/or exciting precise problem -- it has to be relevant to alignment.
The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior.
You get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion is that we'd have the right basic shape of an algorithm (especially if we are good at thinking of possible failures).
One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.
I feel like these distinctions aren't important until we get to an algorithm for which we can't think of a failure story (which feels a long way off). At that point the game kind of flips around, and we try to come up with a good story for why it's impossible to come up with a failure story. Maybe that gives you a strong security argument. If not, then you have to keep trying on one side or the other, though I think you should definitely be starting to prioritize applied work more.
Red-penning is a general problem-solving method that's kinda similar to this research methodology.
These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm:
The reason this is so nice is because you've taken a claim with two quantifiers and written an algorithm that must only ever solve claims with one quantifier. (For step 3, you inline the "forall y in Y" part, because Y is a small finite set.)
The methodology laid out in this post is a counterexample-guided approach to solve the claim "exists alignment proposal: forall plausible worlds: The alignment proposal is safe in the world"
Examples from programming languages include CEGIS (counterexample guided inductive synthesis) and CEGAR (counterexample guided abstraction refinement).
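The counterexample-guided loop described above can be sketched in a few lines of Python. This is a toy brute-force version over small finite domains, just to make the quantifier structure concrete; real CEGIS/CEGAR implementations replace the two `next(...)` searches with SAT/SMT solver queries, and the function names here are illustrative, not from any particular library:

```python
def cegis(xs, ys, P):
    """Search for x in xs such that P(x, y) holds for all y in ys,
    by solving a sequence of single-quantifier subproblems."""
    examples = []  # Y: the growing finite set of counterexamples
    while True:
        # Find a candidate x satisfying P on the finite set so far.
        # This is a single existential query: "forall y in Y" is inlined,
        # since Y is a small finite set.
        candidate = next((x for x in xs if all(P(x, y) for y in examples)), None)
        if candidate is None:
            return None  # no x survives the counterexamples: claim is unsatisfiable
        # Look for a counterexample y to the candidate: again a single
        # existential query, this time over the full domain of y.
        cex = next((y for y in ys if not P(candidate, y)), None)
        if cex is None:
            return candidate  # verified: P(candidate, y) holds for every y
        examples.append(cex)  # refine the finite set and repeat

# Toy claim "exists x in 0..9: forall y in 0..5: x >= y":
print(cegis(range(10), range(6), lambda x, y: x >= y))  # -> 5
```

Each iteration either terminates or adds a counterexample that rules out the current candidate, so on finite domains the loop always halts; the analogy to the methodology in the post is that "find a candidate" is "think of a plausible alignment algorithm" and "find a counterexample" is "think of a plausible failure story."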
For any competitive alignment scheme that involves helper (intermediate) ML models, I think we can construct the following story about an egregiously misaligned AI being created:
Suppose that there does not exist an ML model (in the model space being searched) that fulfills both the following conditions:
To complete the story: while we follow our alignment scheme, at some point we train a helper model that is egregiously misaligned, and we don't yet have any other helper model that allows us to mitigate the associated risk.
If you don't find this story plausible, consider all the creatures that evolution created on the path from the first mammal to humans. The first mammal fulfills condition 2 but not 1. Humans might fulfill condition 1, but not 2. It seems that human evolution did not create a single creature that fulfills both conditions.
One might object to this analogy on the grounds that evolution did not optimize to find a solution that fulfills both conditions. But it's not like we know how to optimize for that (while doing a competitive search over a space of ML models).
The process of finding this model of the world is much more complex than anything our AI can do at inference time, and the intermediate results are too complex and numerous to be “memorized” in the weights of our trained AI. So there doesn’t seem to be any way to break the model-finding work into pieces that can be delegated to an ML assistant (in amplification) or a debater (in debate).
I am not understanding this, but it's probably a simple ML terminology thing.
First you train a model, then you use it lots as a black box (of the type: input video-camera data -> output further (predicted) video-camera data). It has a model of physics, and the broad system it's in (Earth, 2000s, industrial revolution has happened, etc).
Is this paragraph saying that the learned model does not have an understanding of physics and current-Earth, but deduces all of this every time the model is run? And that's why the ML assistant isn't able to analyze this model of physics plus current-Earth?
Planned summary for the Alignment Newsletter:
This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:
1. Come up with some alignment algorithm that solves the issues identified so far
2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.
This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.
This methodology could play out as follows:
Step 1: RL with a handcoded reward function.
Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).
Step 1: RL from human preferences over behavior, or other forms of human feedback.
Step 2: The system might still pursue actions that are bad but that humans can't recognize as bad. For example, it might write a well-researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that it would increase partisanship, leading to civil war.
Step 1: Use iterated amplification to construct a feedback signal that is "smarter" than the AI system it is training.
Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.
Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.
Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about.
The post also talks about various possible objections you might have, which I’m not going to summarize here.
Planned opinion:
I'm a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.
I'm less clear on how exactly you move between the two steps -- from my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learn, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn't count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](https://www.alignmentforum.org/posts/EF5M6CmKRd6qZk27Z/my-research-methodology?commentId=8Hq4GJtnPzpoALNtk).
From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learn, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2
That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."
In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."
I'm mostly doing that by making it more and more concrete---something is plausible iff there is a plausible way to fill in all the details. E.g. how is the model thinking, and why does that lead it to decide to kill you?
Sometimes after filling in a few details I'll see that the current story isn't actually plausible after all (i.e. now I see how to argue that the details-so-far are contradictory). In that case I backtrack.
Sometimes I fill in enough details that I'm fairly convinced the story is plausible, i.e. that there is some way to fill in the rest of the details that's consistent with everything I know about the world. In that case I try to come up with a new algorithm or new assumption.
(Sometimes plausibility takes the form of an argument that there is a way to fill in some set of details, e.g. maybe there's an argument that a big enough model could certainly compute X . Or sometimes I'm just pretty convinced for heuristic reasons.)
That's not a fully-precise methodology. But it's roughly what I'd do. (There are many places where the methodology in this post is not fully-precise and certainly not mechanical.)
If I were starting from the trivial story "and then your algorithm kills you," my first move would usually be to try to say what kind of model was learned, which needs to behave well on the training set and plausibly kill you off distribution. Then I might try to shoot that story down by showing that some other model behaves even better on the training set or is even more readily learned (to try to contradict the part where the story needed to claim "And this was the model learned by SGD"), then gradually filling in more details as necessary to evaluate the plausibility of the story.
In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."
To fill in the details more:
Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).
It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.
However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe.
(This wouldn't be true if we had some sort of proof against misfiring, that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)
More generally, this story is going to be something like:
Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":
there is some way to fill in the rest of the details that's consistent with everything I know about the world.
EDIT: Or maybe more accurately, I'm not sure how exactly the stories you tell are different / more concrete than the ones above.
----
When I say you have "a better defined sense of what does and doesn't count as a valid step 2", I mean that there's something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don't know what that something is; and that's why I would have a hard time applying your methodology myself.
----
Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits), and sometimes human cognition isn't up to the task (like when being teleported to a universe with different physics and immediately dying), or doesn't do so in a way that agrees with other humans (like how some humans would push a button that automatically wirehead everyone for all time, while others would find that abhorrent).
As a result, the model learned heuristic H, which works in all the circumstances you did consider but fails in circumstance C.
That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?
I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing some in between thing, which is roughly like: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis, so I'm basically only going to do this a bounded number of times before I have to admit that it seems plausible enough (so I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take even more than a billion steps, and so this may become intractable even before you get to a billion).
I agree this involves discretion [...] So instead I'm doing some in between thing
Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).
I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.
(Probably I could have been clearer about this in the original opinion.)
Interesting... On first reading your post, I felt that your methodological approach for dealing with the 'all is doomed in the worst case' problem is essentially the same as my approach. But on re-reading, I am not so sure anymore. So I'll try to explore the possible differences in methodological outlook, and will end with a question.
The key to your methodology is that you list possible process steps which one might take when one feels like
all of our current algorithms are doomed in the worst case.
The specific doom-removing process step that I want to focus on is this one:
If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any [failure] stories that violate that assumption [...]
My feeling is that AGI safety/alignment community is way too reluctant to take this process step of 'add another assumption about the world' in order to eliminate a worst case failure story.
There seem to be several underlying causes for this reluctance. One of them is that in the field of developing machine learning algorithms, in the narrow sense where machine learning equals function approximation, the default stance is to make no assumptions about the function that has to be approximated. But the main function to be approximated in the case of an ML agent is the function that determines the behavior of the agent environment. So the default methodological stance in ML is that we can introduce no assumptions whatsoever about the agent environment; we can't, for example, assume that it contains a powerful oversight body that will help to keep the agent aligned. Obviously this stance is not very helpful if you want to make progress on certain alignment problems.
So I'm happy to see a post that encourages people to make explicit assumptions about the agent's environment. I have definitely used this technique to make progress in my own work.
But.
When I look at your example of 'the strategy stealing assumption' as one useful assumption to add, it is very much not the default example that would first come to my mind. So I am wondering if you would even recommend the approach of adding the kind of default assumptions that I tend to add.
To make this more specific, in this post I introduce an agent design with three safety interlocks that are supposed to be helpful to agent oversight. The interlocks are agent design refinements that make it easier for oversight to keep control over the agent. The interlocks contribute to more successful oversight not by making the oversight people+machines smarter (which is your main line of research I believe), but by making the agent less smart in very specific ways.
But at the same time, these interlocks do not remove all possible worst-case failure stories of doom. To quote from the post and the underlying paper:
We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe.
The key here is the 'highly unlikely'. If we have an algorithm where
We train a generative model that predicts these camera observations given the motor actions.
then I typically add the following assumption to avoid doom:
In terms of methodology, I usually describe the above move as one where we seek to drive down the risk of certain failure modes to residual levels. There is a link to empirical work here. To make it more plausible that the above assumption about low risk is valid for a particular ML system and physical realization of an agent and its environment, we can do simulations and real-life experiments with trained generative models.
So my question is: would the above assumption-adding step, about the low risk of mis-predictions, be a natural and valid assumption-adding process step for 'throwing out failure stories' in your methodology?
Or is the existence of this assumption automatically implied by default in your process?
I feel confused about the failure story from example 3. (First 3 bullet-points in that section.)
It sounded like: We ask for a human-comprehensible way to predict X; the computer uses a very low-level simulation plus a small bridge that predicts only and exactly X; humans can't use the model to predict any high-level facts besides X.
But I don't see how that leads to egregious misalignment. Shouldn't the humans be able to notice their inability to predict high-level things they care about and send the AI back to its model-search phase? (As opposed to proceeding to evaluate policies based on this model and being tricked into a policy that fails "off-screen" somewhere.)
(Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.)
I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post).
Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up?
But I feel like it should be possible to avoid egregious misalignment regardless of how the empirical facts shake out--it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment.
This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:
I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.
What this looks like (3 examples)
My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”
Example 1: human feedback
In an unaligned benchmark I describe a simple AI training algorithm:
In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment:
I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time.
But from my perspective this failure mode is at least plausible — I don’t see any contradictions between this sequence of events and anything I know about the real world. So this is enough for me to conclude that human feedback can’t handle the worst plausible situation, and to keep looking for an algorithm that can.
To better understand whether this story is really plausible, we can spend time refining it into something more and more concrete to see if it still seems to make sense. There are lots of directions in which we could add detail:
Filling more and more details lets us notice if our abstract story was actually incoherent in important ways, or to notice weird things the story implies about the world that we might want to rule out by assumption.
Example 2: iterated amplification
To avoid the problems with raw human feedback, we could train additional ML assistants that help us evaluate outcomes. For example, assistants could point out possible consequences of a plan that we didn’t notice. Various variants of this idea are explored in benign model-free RL, supervising strong learners by amplifying weak experts, AI safety via debate, and recursive reward modeling.
In inaccessible information I tried to explore a story about how this entire family of algorithms could fail:
Example 3: imitative generalization
Imitative generalization is intended to address this problem with iterated amplification.
To briefly summarize: instead of using gradient descent to search over a space of human-incomprehensible models that predict some data (e.g. autoregressive models of videos), we try to search over a space of models that a human can “understand” (perhaps with the help of aligned assistants as in amplification or debate), and optimize for a model that both looks plausible to the human and allows the human to successfully predict the same data (i.e. to predict the next pixel of a video). We hope that this allows us to find a human-comprehensible model that allows the human to both predict the data and figure out if the camera is being hacked.
(This algorithm is quite vague, so you could think of it as a whole family of algorithms based on how you parametrize the space of “human-comprehensible” models, how you search over that space, and how you define the prior. I’m going to try to tell a story about the limitations of this whole approach.)
Here’s an exotic situation where I think the naive version of this approach wouldn’t work:
There are many obvious ways to try to address this problem, but I think it does break the most obvious implementations of imitative generalization. So now I have two questions:
After a little bit of inspection it turns out that the original story is inconsistent: it’s literally impossible to run a detailed low-level simulation of physics in situations where the computer itself needs to be part of the simulation. So the story as I told it is inconsistent, and we can breathe a temporary sigh of relief.
Unfortunately, the basic problem persists even when we make the story more complicated and plausible. Our AI inevitably needs to reason about some parts of the world in a heuristic and high-level way, but it could still use a model that is lower-level than what humans are familiar with (or more realistically just alien but simpler). And at that point we have the same difficulty.
It’s possible that further refinements of the story would reveal other inconsistencies or contradictions with what we know about ML. But I’ve thought enough about this that I think this failure story is probably something that could actually happen, and so I’m back to the step of improving or replacing imitative generalization.
This story is even more exotic than the ones in the previous sections. I’m including it in part to illustrate how much I’m willing to push the bounds of “plausible.” I think it’s extremely difficult to tell completely concrete and realistic stories, so as we make our stories more concrete they are likely to start feeling a bit strange. But I think that’s OK if we are trying to think about the worst case, until the story starts contradicting some clear assumptions about reality that we might want to rely on for alignment. When that happens, I think it’s really valuable to talk concretely about what those assumptions are, and be more precise about why the unrealistic nature of the story excuses egregious misalignment.
More general process
We start with some unaligned “benchmark”. We rule out a proposed alignment algorithm if we can come up with any story about how it can be either egregiously misaligned or uncompetitive.
I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:
Objections and responses
Can you really come up with a working algorithm on paper? Empirical work seems important
My goal from theoretical work is to find a credible alignment proposal. Even from that point I think it will take a lot of practical work to get it to the point where it works well and we feel confident about it in practice:
My view is that working with pen and paper is an important first step that allows you to move quickly until you have something that looks good on paper. After that point I think you are mostly in applied world, and I think that applied investments are likely to ultimately dwarf the theoretical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.
That’s why I’m personally excited about “starting with theory,” but I think we should do theoretical and applied work in parallel for a bunch of reasons:
Why think this task is possible? 50% seems way too optimistic
When I describe this methodology, many people feel that I’ve set myself an impossible task. Surely any algorithm will be egregiously misaligned under some conditions?
My 50% probability that this is possible comes largely from a soup of optimistic intuitions. I think it would be crazy to be confident on the basis of this kind of intuition, but I do think it’s enough to justify 50%:
Despite having lots of optimistic words to say, feasibility is one of my biggest concerns with my methodology.
These failure stories involve very unrealistic learned models
My failure stories involve neural networks learning something like “simulate physics at a low level” or “perform logical deductions from the following set of axioms.” This is not the kind of thing that a neural network would learn in practice. I think this leads many people to be skeptical that thinking about such simplified stories could really be useful.
I feel a lot more optimistic:
Are there any examples of a similar research methodology working well? This is different from traditional theoretical work
When theorists design algorithms they often focus on the worst case. But for them the “worst case” is e.g. a particular graph on which their algorithm runs slowly, not a “plausible” story about how a model is “egregiously misaligned.”
I think this is a real, big divergence that’s going to make it way harder to get traditional theorists on board with this approach. But there are a few ways in which I think the situation is less disanalogous than it looks:
I think the biggest difference is that most parts of theoretical CS move quickly past this stage and spend most of their time working with precise definitions. That said: (i) part of this is due to the taste of the field and an increasing unwillingness to engage in hard-to-formalize activities, rather than a principled view that you must avoid spending long in this stage; and (ii) although many people are working on alignment, only very few are taking the kind of approach I’m advocating here, so it’s not actually clear that we’ve spent much more time than is typically needed in theoretical CS to formalize a new area (especially given that people in academia typically pick problems based on tractability).
My research methodology was originally published in AI Alignment on Medium.