Concrete Reasons for Hope about AI

I'm surprised by this sentence in conjunction with the rest of this post: the views in this post seem very different from my Nate model. This is based only on what I've read on LessWrong, so it feels a bit weird to write about what I think Nate thinks, but it still seems important to mention. If someone more qualified wants to jump in, all the better. Non-comprehensive list:

I think the key differences are because I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and I do think it’s possible for careful labs to avoid active commission of catastrophe.

Not as important as the other points, but I'm not even sure how much you disagree here. E.g. Nate on difficulty, from the sharp left turn post:

Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem.

And on the point of labs, I would have guessed Nate agrees with the literal statement, just thinks current labs aren't careful enough, and won't be?

Language model interventions work pretty well

My Nate model doesn't view this as especially informative about how AGI will go. In particular:

HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe I’ll be pretty happy.

If I understand you correctly, the "vital pieces" that are missing are not ones that make it shut down and never cause catastrophe? (Not entirely sure what they are about instead). My Nate model agrees that vital pieces are missing, and that never causing a catastrophe would be great, but crucially thinks that the pieces that are missing are needed to never cause a catastrophe.

Few attempts to align ML systems

In my Nate model, empirical work with pre-AGI/pre-sharp-left-turn systems can only get you so far. If we can now do more empirical alignment work, that still won't help with what are probably the deadliest problems. Once we can empirically work on those, there's very little time left.

Interpretability is promising!

Nate has said he's in favor of interpretability research, and I have no idea if he's been positively surprised by the rate of progress. But I would still guess you are way more optimistic in absolute terms about how helpful interpretability is going to be (see his comments here).

If a major lab saw something which really scared them, I think others would in fact agree to a moratorium on further capabilities until it could be thoroughly investigated.

Nate wrote a post which I understand to argue against more or less this claim.

I don’t expect a ‘sharp left turn’

Nate has of course written about how he does expect one. My impression is that this isn't just some minor difference in what you think AGI will look like, but points at some pretty deep and important disagreements (that are upstream of some other ones).

Maybe you're aware of all those disagreements and would still call your views "similar", or maybe you have a better Nate model, in which case great! But otherwise, it seems pretty important to at least be aware there are big disagreements, even if that doesn't end up changing your position much.

[-]Zac Hatfield-Dodds2y*30

I'm basing my impression here on having read much of Nate's public writing on AI, and a conversation over shared lunch at a conference a few months ago. His central estimate for is certainly substantially higher than mine, but as I remember it we have pretty similar views of the underlying dynamics to date, somewhat diverging about the likelihood of catastrophe with very capable systems, and both hope that future evidence favors the less-doom view.

Unfortunately I agree that "shut down" and "no catastrophe" are still missing pieces. I'm more optimistic than my model of Nate that the HHH research agenda constitutes any progress towards this goal though.
I think labs correctly assess that they're neither working with or at non-trivial immediate risk of creating x-risky models, nor yet cautious enough to do so safely. If labs invested in this, I think they could probably avoid accidentally creating an x-risky system without abandoning ML research before seeing warning signs.
I agree that pre-AGI empirical alignment work only gets you so far, and that you probably get very little time for direct empirical work on the deadliest problems (two years if very fortunate, days to seconds if you're really not). But I'd guess my estimate of "only so far" is substantially further than Nate's, largely off different credence in a sharp left turn.
- I was struck by how similarly we assessed the current situation and evidence available so far, but that is a big difference and maybe I shouldn't describe our views as similar.
I generally agree with Nate's warning shots post, and with some comments (e.g.), but the "others" I was thinking would likely agree to a moratorium were other labs, not governments (c.f.), which could buy say 6--24 vital months.

[-]Zack_M_Davis2y41

I expect that human extinction would follow from comprehensive disempowerment within decades.

Why would it take decades? (In contrast to scenarios where AI builds nanotech and quickly disassembles us for spare atoms.) Are you imagining a world where AI-powered corporations, governments, &c. are still mostly behaving as designed, but we have no way to change course when it turns out that industrial byproducts are slowly poisoning us, or ...?

[-]Zac Hatfield-Dodds2y31

I don't feel I can rule out slow/weird scenarios like those you describe, or where extinction is fast but comes considerably after disempowerment, or where industrial civilization is destroyed quickly but it's not worth mopping up immediately - "what happens after AI takes over" is by nature extremely difficult to predict. Very fast disasters are also plausible, of course.

[-]Noosphere892y4-1

Anecdotally, I’ve seen RLHF generalize alignmentish properties like helpfulness and harmlessness across domains at least as well as it generalizes capabilities at their present levels, and I don’t share the intuition that this is very likely to change in future. I think ‘alignment generalization failures’ are both serious and likely enough to specifically monitor and mitigate, but not that a sharp left turn is anywhere near certain.

IMO, I don't think this is a realistic hope, since from my point of view they come with unacceptable downsides.

Quoting another post:

“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.

Among other things, the paper finds concrete evidence of current large language models exhibiting:

convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain), situational awareness (e.g. awareness of being a language model), coordination (e.g. willingness to coordinate with other AIs), and non-CDT-style reasoning (e.g. one-boxing on Newcomb's problem). Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.

Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don't generally seem to be alleviating and sometimes seem to be actively making worse.

Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]

This is essentially saying that there's good evidence that the precursors of deceptive alignment is there, and this is something that I think no alignment plan could deal with here.

Here's a link to the post:

https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written#RHHrxiyHtuuibbcCi

Thus, I think this:

There are strong theoretical arguments that alignment is difficult, e.g. about convergent instrumental goals, and little empirical progress on aligning general-purpose ML systems. However, the latter only became possible a few years ago with large language models, and even then only in a few labs! There’s also a tradition of taking theoretically very hard problems, and then finding some relaxation or subset which is remarkably easy or useful in practice – for example SMT solvers vs most NP-complete instances, CAP theorem vs CRDTs or Spanner, etc. I expect that increasing hands-on alignment research will give us a similarly rich vein of empirical results and praxis from which to draw more abstract insights.

Is somewhat unrealistic, since I think a key relaxation alignment researchers are making is pretty likely to be violated IRL.

[-]Zac Hatfield-Dodds2y71

I'm one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and well aware of those findings. I'm certainly not claiming that all is well; and agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I've seen alignment-ish properties generalize about as well as capabilities, and that I don't have a strong expectation that this will change in future.

I also find this summary a little misleading. Consider for example, "the paper finds concrete evidence of current large language models exhibiting: convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), ..." (italics added in both) vs:

Worryingly, RLHF also increases the model's tendency to state a desire to pursue hypothesized “convergent instrumental subgoals” ... While it is not dangerous to state instrument subgoals, such statements suggest that models may act in accord with potentially dangerous subgoals (e.g., by influencing users or writing and executing code). Models may be especially prone to act in line with dangerous subgoals if such statements are generated as part of step-by-step reasoning or planning.

While indeed worrying, models generally seem to have weaker intrinsic connections between their stated desires and actual actions than humans. For example, if you ask about code models can and will discuss SQL injections (or buffer overflows, or other classic weaknesses, bugs, and vulnerabilities) and best-practices to avoid them in considerable detail... while also prone to writing them whereever a naive human might do so. Step-by-step reasoning, planning, or model cascades do provide a mechanism to convert verbal claims into actions; but I'm confident that strong supervision of such intermediates is feasible.

I'm not sure whether you have a specific key relaxation in mind (and if so what), or that any particular safety assumption is pretty likely to be violated?

[-]Noosphere892y21

I'm not sure whether you have a specific key relaxation in mind (and if so what), or that any particular safety assumption is pretty likely to be violated?

The key relaxation here is: deceptive alignment will not happen. In many ways, a lot of hopes are resting on deceptive alignment not being a problem.

I'm one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and well aware of those findings. I'm certainly not claiming that all is well; and agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I've seen alignment-ish properties generalize about as well as capabilities, and that I don't have a strong expectation that this will change in future.

I disagree, since I think the non-myopia found is a key way for how something like goal misgeneralization or the sharp left turn could happen, where a model remains very capable, but loses it's alignment properties due to deceptive alignment.

[-]Gurkenglas2y31

For those of us who like accurate beliefs, could you make a second list of the truths you filtered out due to not furthering hope?

[-]Zac Hatfield-Dodds2y*71

That's not how I wrote the essay at all - and I don't mean to imply that the situation is good (I find 95% chance of human extinction in my lifetime credible! This is awful!). Hope is a attitude to the facts, not a claim about the facts; though "high confidence in [near-certain] doom is unjustified" sure is. But in the spirit of your fair question, here are some infelicitous points:

I find is less credible than $P (d o o m) \approx 0.9$ (in the essay above). Conditional on no attempts at alignment, my inside view might be about $P (d o o m) \approx 0.8$ - haven't thought hard about that though and it's a counterfactual anyway.
I think it's pretty unlikely (credible interval 0--40%) that we'll have good enough interpretability tools [to avoid an otherwise existential catastrophe] by the time we really really need them. (via this comment)
[short timelines, civilizational inadequacy, etc. I've given up on precision for this point but the vibe is not good.]

Writing up my views on x-risk from AI in detail and with low public miscommunication risk would be a large and challenging project, and honestly I expect that I'll choose to focus on work, open source, and finishing my PhD thesis instead.

[-]M. Y. Zuo2y10

Having high-trust relationships between labs and governments, and more generally ensuring policy-makers are well-informed, seems robustly positive.

This is too simplistic. In reality governments are not monolithic so although a certain lab may have 'high trust relationships' with one or more subdivision of the relevant government, it may have adversarial relationships with other subdivisions of the same government.

This is further complicated by the fact that most modern societies are structured around the government having multiple subdivisions with overlapping jurisdiction, each sufficiently influential to have a veto over substantial decision making of the whole government but not sufficient, by themselves, to push through anything.

So this view abstracts away the most challenging coordination problems.

[-]Program Den2y-2-28

Ironically this still seems pretty pessimistic to me. I'm glad to see something other than "AHHH!" though, so props for that.

I find it probably more prudent to worry about a massive solar flare, or an errant astral body collision, than to worry about "evil" AI taking a "sharp turn".

I put quotes around evil because I'm a fan of Nietzsche's thinking on the matter of good and evil. Like, what, exactly are we saying we're "aligning" with? Is there some universal concept of good?

Many people seem to dismiss blatant problems with the base premise— like the "reproducibility problem". Why do we think that reality is in fact something that can be "solved" if we just had enough processing power, as it were? Is there some hard evidence for that? I'm not so sure. It's not just our senses that are fallible. There are some fundamental problems with the very concept of "measurement' for crying out loud, which I think it's pretty optimistic to think that super-smart AI is just going to be able to skip over.

I also think if AI gets good enough to "turn evil" as it were, it would be good enough to realize that it's a pretty dumb idea. Humans don't really have much in common with silicon-based life forms, afaik. You can find more rare elements, easier, in space, than you can on Earth. What, exactly, would AI gain by wiping out humanity?

I feel that it's popular to be down on AI, and saying how scary all these "recent" advances really are, but it doesn't seem warranted.

Take the biological warfare ideas that were in the "hard turn" link someone linked in their response. Was this latest pandemic really a valid test-run for something with a very high fatality rate? (I think the data is coming in that far more people had COVID than we initially thought, right?)

CRISPR &c. are, to me, far more scary, but I don't see any way of like, regulating that people "be good", as it were. I'm sure most people here have read or seen Jurassic Park, right? Actually, I think our Science Fiction pretty much sums up all this better than anything I've seen thus far.

I'm betting if we do get AGI any time soon it will be more like the movies Her or AI than Terminator or 2001, and I have yet to see any plausible way of stopping, or indeed "ensuring alignment" (again, along what axis? Who's definition of "good"?)

The answer to any question can be used for good or ill. "How to take over the world" is functionally the same as "how to prevent world takeovers", is it not? All this talk of somehow regulating AI seems akin to talk of regulating "hacking" tools, or "strong maths" as it were.

Are we going to next claim that AI is a munition?

It would be neat to see some hard examples of why we should fear and why we think we can control alignment… maybe I'm just not looking in the right places? So far I don't get what all the fear is about— at least not compared to what I would say are more pressing and statistically likely problems we face.

I think we can solve some really hard problems if we work together, so if this is a really hard problem that needs solving, I'm all for getting behind it, but honestly, I'd like to see us not have all our eggs in one basket here on Earth before focusing on something that seems, at least from what I've seen so far, nigh impossible to actually focus on.

[-]Zac Hatfield-Dodds2y31

https://theprecipice.com/faq has a good summary of reasons to believe that human-created risks are much more likely than naturally-occuring risks like solar flares or asteroid or cometary impacts. If you'd like to read the book, which covers existential risks including from AI in more detail, I'm happy to buy you a copy. Specific to AI, Russel's Human Compatible and Christian's The Alignment Problem are both good too.

More generally it sounds like you're missing the ideas of and .

[-]Program Den2y10

It might be fun to pair Humankind: A Hopeful History with The Precipice, as both have been suggested reading recently.

It seems to me that we are, as individuals, getting more and more powerful. So this question of "alignment" is a quite important one— as much for humanity, with the power it currently has, as for these hypothetical hyper-intelligent AIs.

Looking at it through a Sci-Fi AI lens seems limiting, and I still haven't really found anything more than "the future could go very very badly", which is always a given, I think.

I've read those papers you linked (thanks!). They seem to make some assumptions about the nature of intelligence, and rationality— indeed, the nature of reality itself. (Perhaps the "reality" angle is a bit much for most heads, but the more we learn, the more we learn we need to learn, as it were. Or at least it seems thus to me. What is "real"? But I digress) I like the idea of Berserkers (Saberhagen) better than run amok Pi calculators… however, I can dig it. Self-replicating killer robots are scary. (Just finished Horizon: Zero Dawn - Forbidden West and I must say it was as fantastic as the previous installment!)

Which of the AI books would you recommend I read if I'm interested in solutions? I've read a lot of stuff on this site about AI now (before I'd read mostly Sci-Fi or philosophy here, and I never had an account or interacted), most of it seems to be conceptual and basically rephrasing ideas I've been exposed to through existing works. (Maybe I should note that I'm a fan of Kurzweil's takes on these matters— takes which don't seem to be very popular as of late, if they ever were. For various reasons, I reckon. Fear sells.) I assume Precipice has some uplifting stuff at the end^[1], but I'm interested in AI specifically ATM.

What I mean is, I've seen a few of proposals to "ensure" alignment, if you will, with what we have now (versus say warnings to keep in mind once we have AGI or are demonstrably close to it). One is that we start monitoring all compute resources. Another is that we start registering all TPU (and maybe GPU) chips and what they are being used for. Both of these solutions seem scary as hell. Maybe worse than replicating life-eating mecha, since we've in essence experienced ideas akin to the former a few times historically. (Imagine if reading was the domain of a select few and books were regulated!)

If all we're talking about with alignment here, really, is that folks need keep in mind how bad things can potentially go, and what we can do to be resilient to some of the threats (like hardening/distributing our power grids, hardening water supplies, hardening our internet infrastructure, etc.), I am gung-ho!

On the other hand, if we're talking about the "solutions" I mentioned above, or building "good" AIs that we can use to be sure no one is building "bad" AIs, or requiring the embedding of "watermarks" (DRM) into various "AI" content, or ~~building~~ extending sophisticated communication monitoring apparatus, or other such — to my mind — extremely dangerous ideas, I'm thinking I need to maybe convince people to fight that?

In closing, regardless of what the threats are, be they solar flares or comets (please don't jinx us!) or engineered pathogens (intentional or accidental) or rogue AIs yet to be invented — if not conceived of —, a clear "must be done ASAP" goal is colonization of places besides the Earth. That's part of why I'm so stoked about the future right now. We really seem to be making progress after stalling out for a grip.
Guess the same goes for AI, but so far all I see is good stuff coming from that forward motion too.
A little fear is good! but too much? not so much.

^{^}
I really like the idea of 80,000 Hours, and seeing it mentioned in the FAQ for the book, so I'm sure there are some other not-too-shabby ideas there. I oft think I should do more for the world, but truth be told (if one cannot tell from my writing), I barely seem able to tend my own garden.

Moderation Log

Curated and popular this week

13Comments

94 Concrete Reasons for Hope about AI

by Zac Hatfield-Dodds

14th Jan 2023

AI Alignment Forum

1 min read

94 Ω 29

Anthropic (org)AI

Frontpage

94 Ω 29

Mentioned in

348Shallow review of live agendas in alignment & safety

256Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023)

25AI Safety - 7 months of discussion in 17 minutes

17EA & LW Forum Summaries (9th Jan to 15th Jan 23')

5Generalizability & Hope for AI [MLAISU W03]

Concrete Reasons for Hope about AI

New Comment

13 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:24 AM

[-]Erik Jenner2y1314

Thanks for writing this, it's great to see people's reasons for optimism/pessimism!

My views on alignment are similar to (my understanding of) Nate Soares’.

I think the key differences are because I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and I do think it’s possible for careful labs to avoid active commission of catastrophe.

Not as important as the other points, but I'm not even sure how much you disagree here. E.g. Nate on difficulty, from the sharp left turn post:

Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem.

And on the point of labs, I would have guessed Nate agrees with the literal statement, just thinks current labs aren't careful enough, and won't be?

Language model interventions work pretty well

My Nate model doesn't view this as especially informative about how AGI will go. In particular:

HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe I’ll be pretty happy.

Few attempts to align ML systems

Interpretability is promising!

If a major lab saw something which really scared them, I think others would in fact agree to a moratorium on further capabilities until it could be thoroughly investigated.

Nate wrote a post which I understand to argue against more or less this claim.

I don’t expect a ‘sharp left turn’

[-]Zac Hatfield-Dodds2y*30

Unfortunately I agree that "shut down" and "no catastrophe" are still missing pieces. I'm more optimistic than my model of Nate that the HHH research agenda constitutes any progress towards this goal though.
I think labs correctly assess that they're neither working with or at non-trivial immediate risk of creating x-risky models, nor yet cautious enough to do so safely. If labs invested in this, I think they could probably avoid accidentally creating an x-risky system without abandoning ML research before seeing warning signs.
I agree that pre-AGI empirical alignment work only gets you so far, and that you probably get very little time for direct empirical work on the deadliest problems (two years if very fortunate, days to seconds if you're really not). But I'd guess my estimate of "only so far" is substantially further than Nate's, largely off different credence in a sharp left turn.
- I was struck by how similarly we assessed the current situation and evidence available so far, but that is a big difference and maybe I shouldn't describe our views as similar.
I generally agree with Nate's warning shots post, and with some comments (e.g.), but the "others" I was thinking would likely agree to a moratorium were other labs, not governments (c.f.), which could buy say 6--24 vital months.

[-]Zack_M_Davis2y41

I expect that human extinction would follow from comprehensive disempowerment within decades.

[-]Zac Hatfield-Dodds2y31

[-]Noosphere892y4-1

Anecdotally, I’ve seen RLHF generalize alignmentish properties like helpfulness and harmlessness across domains at least as well as it generalizes capabilities at their present levels, and I don’t share the intuition that this is very likely to change in future. I think ‘alignment generalization failures’ are both serious and likely enough to specifically monitor and mitigate, but not that a sharp left turn is anywhere near certain.

IMO, I don't think this is a realistic hope, since from my point of view they come with unacceptable downsides.

Quoting another post:

“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.

Among other things, the paper finds concrete evidence of current large language models exhibiting:

convergent instrumental goal following (e.g. actively expressing a preference not to be shut down), non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain), situational awareness (e.g. awareness of being a language model), coordination (e.g. willingness to coordinate with other AIs), and non-CDT-style reasoning (e.g. one-boxing on Newcomb's problem). Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.

Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don't generally seem to be alleviating and sometimes seem to be actively making worse.

Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]

This is essentially saying that there's good evidence that the precursors of deceptive alignment is there, and this is something that I think no alignment plan could deal with here.

Here's a link to the post:

https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written#RHHrxiyHtuuibbcCi

Thus, I think this:

There are strong theoretical arguments that alignment is difficult, e.g. about convergent instrumental goals, and little empirical progress on aligning general-purpose ML systems. However, the latter only became possible a few years ago with large language models, and even then only in a few labs! There’s also a tradition of taking theoretically very hard problems, and then finding some relaxation or subset which is remarkably easy or useful in practice – for example SMT solvers vs most NP-complete instances, CAP theorem vs CRDTs or Spanner, etc. I expect that increasing hands-on alignment research will give us a similarly rich vein of empirical results and praxis from which to draw more abstract insights.

Is somewhat unrealistic, since I think a key relaxation alignment researchers are making is pretty likely to be violated IRL.

[-]Zac Hatfield-Dodds2y71

Worryingly, RLHF also increases the model's tendency to state a desire to pursue hypothesized “convergent instrumental subgoals” ... While it is not dangerous to state instrument subgoals, such statements suggest that models may act in accord with potentially dangerous subgoals (e.g., by influencing users or writing and executing code). Models may be especially prone to act in line with dangerous subgoals if such statements are generated as part of step-by-step reasoning or planning.

I'm not sure whether you have a specific key relaxation in mind (and if so what), or that any particular safety assumption is pretty likely to be violated?

[-]Noosphere892y21

I'm not sure whether you have a specific key relaxation in mind (and if so what), or that any particular safety assumption is pretty likely to be violated?

The key relaxation here is: deceptive alignment will not happen. In many ways, a lot of hopes are resting on deceptive alignment not being a problem.

I'm one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and well aware of those findings. I'm certainly not claiming that all is well; and agree that with current techniques models are on net exhibiting more concerning behavior as they scale up (i.e. emerging misbehaviors are more concerning than emerging alignment is reassuring). I stand by my observation that I've seen alignment-ish properties generalize about as well as capabilities, and that I don't have a strong expectation that this will change in future.

[-]Gurkenglas2y31

For those of us who like accurate beliefs, could you make a second list of the truths you filtered out due to not furthering hope?

[-]Zac Hatfield-Dodds2y*71

I find is less credible than $P (d o o m) \approx 0.9$ (in the essay above). Conditional on no attempts at alignment, my inside view might be about $P (d o o m) \approx 0.8$ - haven't thought hard about that though and it's a counterfactual anyway.
I think it's pretty unlikely (credible interval 0--40%) that we'll have good enough interpretability tools [to avoid an otherwise existential catastrophe] by the time we really really need them. (via this comment)
[short timelines, civilizational inadequacy, etc. I've given up on precision for this point but the vibe is not good.]

[-]M. Y. Zuo2y10

Having high-trust relationships between labs and governments, and more generally ensuring policy-makers are well-informed, seems robustly positive.

So this view abstracts away the most challenging coordination problems.

[-]Program Den2y-2-28

[-]Zac Hatfield-Dodds2y31

More generally it sounds like you're missing the ideas of and .

[-]Program Den2y10

^{^}
I really like the idea of 80,000 Hours, and seeing it mentioned in the FAQ for the book, so I'm sure there are some other not-too-shabby ideas there. I oft think I should do more for the world, but truth be told (if one cannot tell from my writing), I barely seem able to tend my own garden.

Moderation Log

Curated and popular this week

13Comments

Recent advances in machine learning—in reinforcement learning, language modeling, image and video generation, translation and transcription models, etc.—without similarly striking safety results, have rather dampened the mood in many AI Safety circles. If I was any less concerned by extinction risks from AI, I would have finished my PhD^[1] as planned before moving from Australia to SF to work at Anthropic; I believe that the situation is both urgent and important.^[2]

On the other hand, despair is neither instrumentally nor terminally valuable.^[3] This essay therefore lays out some concrete reasons for hope, which might help rebalance the emotional scales and offer some directions to move in.

Background: a little about Anthropic

I must emphasize here that this essay represents only my own views, and not those of my employer. I’ll try to make this clear by restricting we to actions, and using I for opinions to avoid attributing my own views to my colleagues. Please forgive any lapses of style or substance.

Anthropic’s raison d’etre is AI Safety. It was founded in early 2021, as a public benefit corporation,^[4] and focuses on empirical research with advanced ML systems. I see our work as having four key pillars:

Training near-SOTA models.
This ensures that our safety work will in fact be relevant to cutting-edge systems, and we’ve found that many alignment techniques only work at large scales.^[5] Understanding how capabilities emerge over model scale and training-time seems vital for safety, as a basis to proceed with care or as a source of evidence that continuing to scale capabilities would be immediately risky.
Direct alignment research. There are many proposals for how advanced AI systems might be aligned, many of which can be tested empirically in near-SOTA (but not smaller) models today. We regularly produce the safest model we can with current techniques,^[6] and characterize how it fails in order to inform research and policy. With RLHF as a solid baseline and building block, we're investigating more complicated but robust schemes such as constitutional AI, scalable supervision, and model-assisted evaluations.
Interpretability research. Fully understanding models could let us rule out learned optimizers, deceptive misalignment, and more. Even limited insights would be incredibly valuable as an independent check on other alignment efforts, and might offer a second chance if they fail.
Policy and communications. I expect AI capabilities will continue to advance, with fast-growing impacts on employment, the economy, and cybersecurity. Having high-trust relationships between labs and governments, and more generally ensuring policy-makers are well-informed, seems robustly positive.

If you want to know more about what we’re up to, the best place to check is anthropic.com for all our published research. We’ll be posting more information about Anthropic throughout this year, as well as fleshing out the website.

Concrete reasons for hope

My views on alignment are similar to (my understanding of) Nate Soares’. I think the key differences are because I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and I do think it’s possible for careful labs to avoid active commission of catastrophe. We also seem to have different views on how labs should respond to the situation, which this essay does not discuss.

Language model interventions work pretty well

I wasn’t expecting this, but our helpful/harmless/honest research is in fact going pretty well! The models are far from perfect, but we’ve made far more progress than I would have expected a year ago, and no signs of slowing down yet. HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe I’ll be pretty happy.

As expected, we’ve also seen a range of failures on more difficult tasks or where train-time supervision was relatively weak – such as inventing a series of misleading post-hoc justifications when inconsistent responses are questioned.^[7] The ‘treacherous turn’ scenario is still concerning, but I find it plausible that work on e.g. scalable supervision and model-based red-teaming could help us detect it early.

Few attempts to align ML systems

There are strong theoretical arguments that alignment is difficult, e.g. about convergent instrumental goals, and little empirical progress on aligning general-purpose ML systems. However, the latter only became possible a few years ago with large language models, and even then only in a few labs! There’s also a tradition of taking theoretically very hard problems, and then finding some relaxation or subset which is remarkably easy or useful in practice – for example SMT solvers vs most NP-complete instances, CAP theorem vs CRDTs or Spanner, etc. I expect that increasing hands-on alignment research will give us a similarly rich vein of empirical results and praxis from which to draw more abstract insights.

Interpretability is promising!

It feels like we’re still in the fundamental-science stage, but interpretability is going much better than I expected. We’re not in the ‘best of all possible worlds’ where polysemanticity just isn’t a thing, but compared to early 2020, transformer interpretability is going great. I’m also pretty optimistic about transfer to new architectures, if one comes along – there are some shared motifs between the imagenet and transformer circuits threads, and an increasing wealth of tooling and experience.

Mechanistic interpretability is also popping up in many other places! I’ve recently enjoyed reading papers from Redwood Research, Conjecture, and DeepMind, for example, and it feels more like a small but growing field than a single research project. Mechanistic interpretability might hit a wall before it becomes useful for TAI/AGI safety, or simply fail to get there in time; or it might not.

Outcome-based training can be limited or avoided

Supervised learning seems to be working really well, and process-based techniques could plausibly scale to superhuman performance; they might even be better for capabilities in regimes with very scarce feedback or reward signals. I’d expect this to be good for safety relative to outcome-based RL systems, and more amenable to audits and monitoring. “Just” being as skilled as the best humans in every domain – with perfect synergy between every skill-set, at lower cost and wildly higher speed, with consistently-followed and constantly-refined playbooks – would be incredibly valuable.

Good enough is good enough

The first TAI system doesn’t have to be a perfectly aligned sovereign—so long as it’s corrigible it can be turned back off, reengineered, and tried again. The goal is to end the acute risk period, which might be possible via direct assistance with alignment research, by enabling policy interventions, or whatever else.

Training can probably stop short of catastrophe

Model capabilities increase with scale and over the course of training, and evaluation results on checkpoints are quite precisely predictable from scaling laws, even for models more-capable than those from which the scaling laws were derived. You can write pretty sensitive evaluations for many kinds of concerning behavior; check log-probs or activations as well as sampled tokens, use any interpretability techniques you like (never training against them!) – and just stop training unless you have a strong positive case for safety!

I’d aim to stop before getting concrete reason to think I was training a dangerous model, obviously, but also value defense in depth against my own mistakes.

People do respond to evidence

Not everyone, and not all institutions, but many. The theme here is that proposals which need some people to join are often feasible, including those which require specific people. I often talk to people who have been following AI Safety from a distance for a few years, until some recent result^[8] convinced them that there was a high-impact use for their skills^[9] and it was time to get more involved.

If a major lab saw something which really scared them, I think other labs would in fact agree to a moratorium on further capabilities until it could be thoroughly investigated. Publicly announcing a pause would more or less declare that AGI was imminent, risk a flood of less safety-conscious entrants to the field, and there are questions of antitrust law too, but I’m confident that these are manageable issues.

I don’t expect a ‘sharp left turn’

The ‘sharp left turn’ problem derives from the claim that capabilities generalize better than alignment. This hasn’t been argued so much as asserted, by analogy to human evolution as our only evidence on how human-level general intelligence might develop. I think this analogy is uninformative, because human researchers are capable of enormously better foresight and supervision than was evolution, and anticipate careful empirical studies of this question in silico despite the difficulty of interpreting the results.

Anecdotally, I’ve seen RLHF generalize alignmentish properties like helpfulness and harmlessness across domains at least as well as it generalizes capabilities at their present levels, and I don’t share the intuition that this is very likely to change in future. I think ‘alignment generalization failures’ are both serious and likely enough to specifically monitor and mitigate, but not that a sharp left turn is anywhere near certain.

Conclusion: high confidence in doom is unjustified

By “doom” I mean a scenario in which all humans are comprehensively disempowered by AI before the end of this century. I expect that human extinction would follow from comprehensive disempowerment within decades. It Looks Like You’re Trying To Take Over The World is a central example; human accident or misuse might lead to similarly bad outcomes but I’m focusing on technical alignment here.

Estimates of “P(doom)” are based on only tenuous evidence. I think a range is consistent with available evidence and reasonable priors; largely because in my view it’s unclear whether the problem is difficult like the steam engine, or net-energy-gain from fusion, or proving whether $P = N P$ , or perhaps more difficult still. I tried writing tighter estimates but couldn’t construct anything I’d endorse.

Emotionally, I’m feeling pretty optimistic. While the situation is very scary indeed and often stressful, the x-risk mitigation community is a lovely and growing group of people, there’s a large frontier of work to be done, and I’m pretty confident that at least some of it will turn out to be helpful. So let’s get (back) to work!

Contrary to stereotypes about grad school, I was really enjoying it! I’m also pretty sad to shut down the startup I’d spun out (hypofuzz.com), though open-sourcing it is some consolation - if it wasn’t for x-risk I wouldn’t be in AI at all. ↩︎
I work directly on AI x-risk, and separately I give 10% of my income to GiveWell’s “All Grants” fund (highest EV), and a further 1% to GiveDirectly (personal values). I value this directly; I also believe that credible signals of altruism beyond AI are important for the health of the community. ↩︎
This is mostly a question of your emotional relationship to the facts of the matter; an accurate assessment of the situation is of course instrumentally vital and to me also terminally desirable. ↩︎
The effect is that the company can raise money from investors and prioritize the mission over shareholder profits. ↩︎
I'm thinking here of the helpful/harmless tradeoff from (eg) fig 26a of our RLHF paper; P(IK) calibration, scalable supervision results, etc. – places where you need pretty good capabilities to do useful experiments. ↩︎
I also think that this is a very important habit for anyone working with SOTA or near-SOTA systems. When whoever it is eventually tries building a TAI or AGI system, I would strongly prefer that they have a lot of hands-on practice aligning weaker AI systems as well as an appreciation that this time is likely to be different. ↩︎
(e.g. here) This likely arose from RLHF training on a pattern of single-step justifying or modifying claims when challenged by shallow train-time human supervision; when pressed harder in deployment the generalization is to repeat and escalate this pattern to the point of absurdity. ↩︎
e.g. GPT-3, Minerva, AlphaFold, code models – all language- rather than image-based, which seems right to me. Since I first drafted this essay there's also ChatGPT, which continues to make headlines even in mainstream newspapers. ↩︎
For example software and systems engineering, law, finance, recruiting, ops, etc. – the scale and diversity of AI safety projects in the large language model era demands a wider diversity of skills and experience than earlier times. ↩︎