My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.
I share similar concerns that Anthropic doesn't seem to institutionally prioritize thinking about the future or planning, and their public outputs to date are not encouraging evidence of careful thinking about misalignment. That said, I'm pretty sympathetic to the idea that while this isn't great, this isn't that bad because careful thinking about exactly what needs to happen in the future isn't that good of an approach for driving research direction or organization strategy. My biggest concern is that while careful planning is perhaps not that important now (you can maybe get most of the value by doing things that seem heuristically good), we'll eventually need to actually do a good job thinking through what exactly should be implemented when the powerful AIs are right in front of us. I don't think we'll be able to make the ultimate strategic choices purely based on empirical feedback loops.
It feels vaguely reasonable to me to have a belief as low as 15% on "Superalignment is Real Hard in a way that requires like a 10-30 year pause." And, at 15%, it still feels pretty crazy to be oriented around racing the way Anthropic is.
I don't really see why this is a crux. I'm currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn't really change my strategic orientation. Maybe you're focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.
The relevant questions are:
(By 1 year pause, I mean devoting 1 year of delay worth of resources to safety. This would include scenarios where progress is halted for a year and all AI company resources are devoted to safety. It would also count if 50% of compute and researchers are well targeted toward safety for 2 years at around the point of full AI R&D automation, which would yield a roughly 1 year slow down. By "well timed", I'm including things like deciding to be slower for 2 years or fully paused for 1.)
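A minimal sketch of the conversion being used in this parenthetical, with $f$ the fraction of resources devoted to safety and $T$ the duration (labels mine, not from the comment):

$$\text{delay-equivalent safety effort} \approx f \cdot T = 0.5 \cdot 2\ \text{years} = 1\ \text{year}.$$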
I would guess that going from a token effort to a well-done 1 year pause for safety reasons (at an organization taking AI risk somewhat seriously and which is moderately competent) reduces risk from around 40% to 20%. (Very low confidence numbers.) And, it's much easier to go from a token level of effort to a 1 year pause than to get a 20 year pause, so focusing on the ~1 year pause is pretty reasonable. For what it's worth, it's not super obvious to me that Anthropic thinks of themselves as going for a 1 year pause. I don't think there's a different strategy Anthropic could follow which would increase the chance of successfully getting buy-in for a 20 year pause to a sufficient degree that focusing on this over the 1 year pause makes sense.
I'm pretty skeptical of the "extreme philosophical competence" perspective. This is basically because we "just" need to be able to hand off to an AI which:
Creating such a system seems unlikely to require "extreme philosophical competence" even if such a system has to apply "extreme philosophical competence" to sufficiently align successors.
Josh has some posts from a similar perspective, though I don't fully agree with him (at least at a vibes level), see: here, here, and here.
There is also obviously a case that a 20 year pause is bad if poorly managed, though I don't think this is very important for this discussion. ↩︎
I'm pretty skeptical of the "extreme philosophical competence" perspective. This is basically because we "just" need to be able to hand off to an AI which is seriously aligned (e.g., it faithfully pursues our interests on long open-ended and conceptually loaded tasks that are impossible for us to check).
The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.
Looking over the comments of the first joshc post, it seems like that one also basically asserted it wasn't necessary, by fiat. And, the people who actively believe in "alignment is philosophically loaded" showed up to complain that this ignored the heart of the problem.
My current summary of the arguments (which I put ~60% on, and I think Eliezer/Oli/Wentworth treat much more confidently and maybe believe a stronger version of) is something like:
I think my current model of you (Ryan) is like:
"Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from "don't think about scheming ever" to "secretly think about scheming" to avoid getting caught, and that probably just won't work?"
(or, in the example of the second joshc post, they need to learn to be really reliably good at truth-tracking patterns and at articulating their reasoning; after internalizing that for thousands of reps, an AI is just gonna have a hard time jumping to reasoning that isn't truth-tracking).
I don't have a clear model of how you respond to point #4 – that we'll just reliably find them to be scheming if we succeed at the interpretability steps, and not have a good way of dealing with it. (Maybe you just don't think this is as overwhelmingly likely?)
Interested in whatever Real You's cruxes are, 1-2 steps removed.
The "extreme philosophical competence" hypothesis is that you need such competence to achieve "seriously aligned" in this sense. It sounds like you disagree, but I don't know why since your reasoning just sidesteps the problem.
Yes, my reasoning is definitely part, but not all, of the argument. Like, the thing I said is a sufficient crux for me. (If I thought we had to directly use human labor to align AIs which were qualitatively wildly superhuman in general, I would put much more weight on "extreme philosophical competence".)
Looking over the comments of the first joshc post, it seems like that one also basically asserted it wasn't necessary, by fiat.
I agree that Josh's first post doesn't argue about this, but the later two posts kinda do? That said, I think Josh doesn't really state the argument very clearly in a single place anywhere (at least not in the way I would when trying to cover this objection).
"Training models to do specific things, cleverly, actually just makes it pretty hard for them to develop scheming or other motivated misalignments – they have to jump all the way from "don't think about scheming ever" to "secretly think about scheming" to avoid getting caught, and that probably just won't work?"
Hmm, no this isn't really that much of my model. Or at least, I don't feel myself thinking thoughts like this.
Interested in whatever Real You's cruxes are, 1-2 steps removed.
I think this is probably not the best place to discuss this, but I'll try laying out the quick version of the argument, mostly so I can practice explaining myself here.
Our goal is to make a trustworthy AI that we defer to. This AI needs to be:
We'll do this using two components:
I'd first claim that we at least have mechanisms (that don't require extreme philosophical competence) for greatly reducing the risk of egregious misalignment in expectation. The strategies available are:
Analogous setups include:
If we had a large range of pretty good test beds of analogous naturally occurring misalignment that cover the failure modes we're most worried about, then iterating on countermeasures in these test beds seems quite useful. (The countermeasures would be drawn from the principled methods.)
A serious problem is that we're not necessarily able to consistently detect misalignment. We also need to be able to test whether misalignment is less likely after some countermeasure, so being able to reproducibly catch the misalignment at sufficiently low cost is important.[1] This is probably the central difficulty in test beds. We also need to ensure that our countermeasures aren't effectively training against our tests.[2]
(To get a lot of signal, our test beds would need to be representative of the failure modes we're worried about, or there would need to be some reason to expect the countermeasures we try to generalize to all failure modes.)
Another objection is: "Actually, we don't have any methods available to try which could end up with a model that (always) isn't egregiously misaligned. So, even if you can iterate a bunch, you'll just either find that nothing works or you'll just fool yourself." I'm somewhat skeptical of this objection if we actually have a bunch of signal in the test beds and can afford to iterate. It feels to me like in most worlds where we have serious misalignment problems, we can get a good enough set of methods to try that we "just" need like <5 bits. As in, if we were to randomly sample sets of measures (and parameters for these measures, etc.) from stuff that seems promising, it seems unlikely we'd need more than 32 tries before something actually worked. This means you maybe don't need to apply crazy amounts of optimization pressure. I don't have a cached quick justification.
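A quick way to unpack the bits-to-tries arithmetic here (my gloss of the usual convention, not necessarily Ryan's exact framing): picking one working option out of $N$ roughly equally promising candidates takes about $\log_2 N$ bits of selection, so

$$\log_2(32) = 5 \;\Rightarrow\; \text{“fewer than 5 bits of selection”} \approx \text{“at most }{\sim}32\text{ candidate method-sets to try.”}$$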
Oops, I ran out of my time box for this comment. I didn't get to principled measures. I can send you my doc though.
Oops, I ran out of my time box for this comment. Hopefully this section is somewhat comprehensible though I don't get into detail.
We need to test that our AI does reasonable and aligned stuff on really hard to check tasks. So, how can we get confidence in this given that the tasks are hard to check?
We'd then need to iterate against these tests in a way that doesn't overfit.
Technically, we just need some signal here, but if catching the misalignment was very costly initially, P(catch) increases with spend, and our initial catch doesn't make it much easier to catch perturbations of the model that might behave slightly differently, then we probably wouldn't get enough signal. That said, if we can sometimes detect misalignment in some models, iteration still could be quite useful. ↩︎
More precisely, we need to ensure that our countermeasures aren't just shifting from a type of misalignment we can detect to a type we can't. Qualitatively analyzing the countermeasures and our tests should help here. ↩︎
I don't really see why this is a crux. I'm currently at like ~5% on this claim (given my understanding of what you mean), but moving to 15% or even 50% (while keeping the rest of the distribution the same) wouldn't really change my strategic orientation. Maybe you're focused on getting to a world with a more acceptable level of risk (e.g., <5%), but I think going from 40% risk to 20% risk is better to focus on.
I think you kinda convinced me here that this reasoning isn't (as stated) very persuasive.
I think my reasoning had some additional steps like:
IMO actively torch the "long pause" worlds
Not sure how interesting this is to discuss, but I don't think I agree with this. Stuff they're doing does seem harmful to worlds where you need a long pause, but it feels like, at the very least, Anthropic is a small fraction of the torching, right? Like, if you think Anthropic is making this less likely, surely they are a small fraction of the people pushing in this direction, such that they aren't making this that much worse (and can probably still pivot later given what they've said so far).
Thanks. I'll probably reply to different parts in different threads.
For the first bit:
My guess is that the parts of the core leadership of Anthropic which are thinking actively about misalignment risks (in particular, Dario and Jared) think that misalignment risk is like ~5x smaller than I think it is while also thinking that risks from totalitarian regimes are like 2x worse than I think they are. I think the typical views of opinionated employees on the alignment science team are closer to my views than to the views of leadership. I think this explains a lot about how Anthropic operates.
The rough numbers you give are helpful. I'm not 100% sure I see the dots you're intending to connect with "leadership thinks 1/5-ryan-misalignment and 2x-ryan-totalitarianism" / "rest of alignment science team closer to ryan" -> "this explains a lot."
Is this just the obvious "whelp, leadership isn't bought into this risk model and calls most of the shots, but conversations with several employees engage more with misalignment"? Or was there a more specific dynamic you thought it explained?
Yep, just the obvious. (I'd say "much less bought in" than "isn't bought in", but whatever.)
I don't really have dots I'm trying to connect here, but this feels more central to me than what you discuss. Like, I think "alignment might be really, really hard" (which you focus on) is less of the crux than "is misalignment that likely to be a serious problem at all?" in explaining how Anthropic operates. Another way to put this is that I think "is misalignment the biggest problem" is maybe more of the crux than "is misalignment going to be really, really hard to resolve in some worlds". I see why you went straight to your belief though.
My high-level skepticism of their approach is A) I don't buy that it's possible yet to know how dangerous models are, nor that it is likely to become possible in time to make reasonable decisions, and B) I don't buy that Anthropic would actually pause, except under a pretty narrow set of conditions which seem unlikely to occur.
As to the first point: Anthropic's strategy seems to involve Anthropic somehow knowing when to pause, yet as far as I can tell, they don't actually know how they'll know that. Their scaling policy does not list the tests they'll run, nor the evidence that would cause them to update, just that somehow they will. But how? Behavioral evaluations aren't enough, imo, since we often don't know how to update from behavior alone—maybe the model inserted the vulnerability into the code "on purpose," or maybe it was an honest mistake; maybe the model can do this dangerous task robustly, or maybe it just got lucky this time, or we phrased the prompt wrong, or any number of other things. And these sorts of problems seem likely to get harder with scale, i.e., insofar as it matters to know whether models are dangerous.
This is just one approach for assessing the risk, but imo no currently-possible assessment results can suggest "we're reasonably sure this is safe," nor come remotely close to that, for the same basic reason: we lack a fundamental understanding of AI. Such that ultimately, I expect Anthropic's decisions will in fact mostly hinge on the intuitions of their employees. But this is not a robust risk management framework—vibes are not substitutes for real measurement, no matter how well-intentioned those vibes may be.
Also, all else equal I think you should expect incentives might bias decisions the more interpretive-leeway staff have in assessing the evidence—and here, I think the interpretation consists largely of guesswork, and the incentives for employees to conclude the models are safe seem strong. For instance, Anthropic employees all have loads of equity—including those tasked with evaluating the risks!—and a non-trivial pause, i.e. one lasting months or years, could be a death sentence for the company.
But in any case, if one buys the narrative that it's good for Anthropic to exist roughly however much absolute harm they cause—as long as relatively speaking, they still view themselves as improving things marginally more than the competition—then it is extremely easy to justify decisions to keep scaling. All it requires is for Anthropic staff to conclude they are likely to make better decisions than e.g., OpenAI, which I think is the sort of conclusion that comes pretty naturally to humans, whatever the evidence.
This sort of logic is even made explicit in their scaling policy:
It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold without implementing measures equivalent to the Required Safeguards such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards.
Personally, I am very skeptical that Anthropic will in fact end up deciding to pause for any non-trivial amount of time. The only scenario where I can really imagine this happening is if they somehow find incontrovertible evidence of extreme danger—i.e., evidence which not only convinces them, but also their investors, the rest of the world, etc.—such that it would become politically or legally impossible for any of their competitors to keep pushing ahead either.
But given how hesitant they seem to commit to any red lines about this now, and how messy and subjective the interpretation of the evidence is, and how much inference is required to e.g. go from the fact that "some model can do some AI R&D task" to "it may soon be able to recursively self-improve," I feel really quite skeptical that Anthropic is likely to encounter the sort of knockdown, beyond-a-reasonable-doubt evidence of disaster that I expect would be needed to convince them to pause.
I do think Anthropic staff probably care more about the risk than the staff of other frontier AI companies, but I just don't buy that this caring does much. Partly because simply caring is not a substitute for actual science, and partly because I think it is easy for even otherwise-virtuous people to rationalize things when the stakes and incentives are this extreme.
Anthropic's strategy seems to me to involve a lot of magical thinking—a lot of, "with proper effort, we'll surely surely figure out what to do when the time comes." But I think it's on them to demonstrate to the people whose lives they are gambling with, how exactly they intend to cross this gap, and in my view they sure do not seem to be succeeding at that.
I agree with this (and think it's good to periodically say all of this straightforwardly).
I don't know that it'll be particularly worth your time, but, the thing I was hoping for with this post was to ratchet the conversation-re-Anthropic forward in, like, "doublecrux-weighted-concreteness." (i.e. your arguments here are reasonably crux-and-concrete, but don't seem to be engaging much with the arguments in this post that seemed more novel and representative of where Anthropic employees tend to be coming from, instead just repeating, AFAICT, your cached arguments against Anthropic)
I don't have much hope of directly persuading Dario, but I feel some hope of persuading both current and future-prospective employees who aren't starting from the same prior of "alignment is hard enough that this plan is just crazy", and for that to have useful flow-through effects.
My experience talking at least with Zac and Drake has been "these are people with real models, who share many-but-not-all-MIRI-ish assumptions but don't intuitively buy that Anthropic's downsides are high, and would respond to arguments that were doing more to bridge perspectives." (I'm hoping they end up writing comments here outlining more of their perspective/cruxes, which they'd expressed interest in in the past, although I ended up shipping the post quickly without trying to line up everything)
I don't have a strong belief that contributing to that conversation is a better use of your time than whatever else you're doing, but it seemed sad to me for the conversation to not at least be attempted.
(I do also plan to write 1-2 posts that are more focused on "here's where Anthropic/Dario have done things that seem actively bad to me and IMO are damning unless accounted for," that are less "attempt to maintain some kind of discussion-bridge", but, it seemed better to me to start with this one)
Dario/Anthropic-leadership are at least reasonably earnestly trying to do good things within their worldview
I think as stated this is probably true of the large majority of people, including e.g. the large majority of the most historically harmful people. "Worldviews" sometimes reflect underlying beliefs that lead people to choose actions, but they can of course also be formed post-hoc, to justify whatever choices they wished to make.
In some cases, one can gain evidence about which sort of "worldview" a person has, e.g. by checking it for coherency. But this isn't really possible to do with Dario's views on alignment, since to my knowledge, excepting the Concrete Problems paper he has actually not ever written anything about the alignment problem.[1] Given this, I think it's reasonable to guess that he does not have a coherent set of views which he's neglected to mention, so much as the more human-typical "set of post-hoc justifications."
(In contrast, he discusses misuse regularly—and ~invariably changes the subject from alignment to misuse in interviews—in a way which does strike me as reflecting some non-trivial cognition).
Counterexamples welcome! I've searched a good bit and could not find anything, but it's possible I missed something.
I think... agree denotationally and (lean towards) disagreeing connotationally? (like, seems like this is implying "and because he doesn't seem like he obviously has coherent views on alignment-in-particular, it's not worth arguing the object level?")
(to be clear, I don't super expect this post to affect Dario's decisionmaking models, esp. directly. I do have at least some hope for Anthropic employees to engage with these sorts of models/arguments, and my sense from talking to them is that a lot of the LW-flavored arguments have often missed their cruxes)
No, I agree it's worth arguing the object level. I just disagree that Dario seems to be "reasonably earnestly trying to do good things," and I think this object-level consideration seems relevant (e.g., insofar as you take Anthropic's safety strategy to rely on the good judgement of their staff).
I think (moderately likely, though not super confident) it makes more sense to model Dario as:
"a person who actually is quite worried about misuse, and is making significant strategic decisions around that (and doesn't believe alignment is that hard)"
than as "a generic CEO who's just generally following incentives and spinning narrative post-hoc rationalizations."
Yeah, I buy that he cares about misuse. But I wouldn't quite use the word "believe," personally, about his acting as though alignment is easy—I think if he had actual models or arguments suggesting that, he probably would have mentioned them by now.
I don't particularly disagree with the first half, but your second sentence isn't really a crux for me for the first part.
It feels vaguely reasonable to me to have a belief as low as 15% on "Superalignment is Real Hard in a way that requires like a 10-30 year pause." And, at 15%, it still feels pretty crazy to be oriented around racing the way Anthropic is.
Yeah, I think the only way I maybe find the belief combination "15% that alignment is Real Hard" and "racing makes sense at this moment" compelling is if someone thinks that pausing now would be too late and inefficient anyway. (Even then, it's worth considering the risks of "What if the US aided by AIs during takeoff goes much more authoritarian to the point where there'd be little difference between that and the CCP?") Like, say you think takeoff is just a couple of years of algorithmic tinkering away and compute restrictions (which are easier to enforce than prohibitions against algorithmic tinkering) wouldn't even make that much of a difference now.
However, if pausing now is too late, we should have paused earlier, right? So, insofar as some people today justify racing via "it's too late for a pause now," where were they earlier?
Separately, I want to flag that my own best guess on alignment difficulty is somewhere in between your "Real Hard" and my model of Anthropic's position. I'd say I'm overall closer to you here, but I find the "10-30y" thing a bit too extreme. I think that's almost like saying, "For practical purposes, we non-uploaded humans should think of the deep learning paradigm as inherently unalignable." I wouldn't confidently put that below 15% (we simply don't understand the technology well enough), but I likewise don't see why we should be confident in such hardness, given that ML at least gives us better control of the new species' psychology than, say, animal taming and breeding (e.g., Carl Shulman's arguments somewhere -- iirc -- in his podcasts with Dwarkesh Patel). Anyway, the thing that I instead think of as the "alignment is hard" objection to the alignment plans I've seen described by AI companies, is mostly just a sentiment of, "no way you can wing this in 10 hectic months while the world around you goes crazy." Maybe we should call this position "alignment can't be winged." (For the specific arguments, see posts by John Wentworth, such as this one and this one [particularly the section, "The Median Doom-Path: Slop, Not Scheming"].)
The way I could become convinced otherwise is if the position is more like, "We've got the plan. We think we've solved the conceptually hard bits of the alignment problem. Now it's just a matter of doing enough experiments where we already know the contours of the experimental setups. Frontier ML coding AIs will help us with that stuff and it's just a matter of doing enough red teaming, etc."
However, note that even when proponents of this approach describe it themselves, it sounds more like "we'll let AIs do most of it ((including the conceptually hard bits?))" which to me just sounds like they plan on winging it.
My own take is that I do endorse a version of the "pausing now is too late" objection. More specifically, I think that for most purposes we should assume pauses are too late to be effective when thinking about technical alignment. A big portion of the reason is that I don't think we will be able to convince many people that AI is powerful enough to need governance without them first seeing massive job losses firsthand, and at that point we are well past the point of no return for when we could control AI as a species.
In particular, I think Eliezer is probably vindicated/made a correct prediction about how people would react to AI in "There's No Fire Alarm for AGI" (more accurately: the fire alarm will go off way too late to serve as a fire alarm).
More here:
I don't disagree that totalitarian AI would be real bad. It's quite plausible to me that the "global pause" crowd are underweighting how bad it would be.
I think an important crux here is on how bad a totalitarian AI would be compared to a completely unaligned AI. If you expect a totalitarian AI to be enough of an s-risk that it is something like 10 times worse than an AI that just wipes everything out, then racing starts making a lot more sense.
The broad thrust of my questions are:
Anthropic Research Strategy
"Is Technical Philosophy actually that big a deal?"
Governance / Policy Comms
Does Anthropic shorten timelines by working on automating AI research?
I think "at least a little", though not actually that much.
There's a lot of other AI companies now, but not that many of them are really frontier labs. I think Anthropic's presence in the race still puts marginal pressure on OpenAI and other companies to rush things out the door a bit with less care than they might have otherwise. (Even if you model other labs as caring ~zero about x-risk, there are still ordinary security/bugginess reasons to delay releases so you don't launch a broken product. Having more "real" competition seems like it'd make people more willing to cut corners to avoid getting scooped on product releases)
(I also think earlier work by Dario at OpenAI, and the founding of Anthropic in the first place, probably did significantly shorten timelines. But, this factor isn't significant at this point, and while I'm mad about the previous stuff it's not actually a crux for their current strategy)
Subquestions:
How many other companies are actually focused on automating AI research, or pushing frontier AI in ways that are particularly relevant? If it's a small number, then I think Anthropic's contribution to this race is larger and more costly. I think the main mechanism here might be Anthropic putting pressure on OpenAI in particular (by being one of 2-3 real competitors on 'frontier AI', which pushes OpenAI to release things with less safety testing)
Is Anthropic institutionally capable of noticing "it's really time to stop our capabilities research," and doing so, before it's too late?
I know they have the RSP. I think there is a threshold of danger where I believe they'd actually stop.
The problem is, before we get to "if you leave this training run overnight it might bootstrap into deceptive alignment that fools their interpretability and then either FOOMs, or gets deployed" territory, there will be a period of "Well, maybe it might do that but also The Totalitarian Guys Over There are still working on their training and we don't want to fall behind". And meanwhile, it's also just sort of awkward/difficult[10] to figure out how to reallocate all your capabilities researchers onto non-dangerous tasks.
How realistic is it to have a lead over "labs at more dangerous companies?" (Where "more dangerous" might mean more reckless, or more totalitarian)
This is where I feel particularly skeptical. I don't get how Anthropic's strategy of race-to-automate-AI can make sense without actually expecting to get a lead, and with the rest of the world also generally racing in this direction, it seems really unlikely for them to have much lead.
Relatedly... (sort of a subquestion but also an important top-level question)
Does racing towards Recursive Self Improvement make timelines worse (as opposed to "shorter")?
Maybe Anthropic pushing the frontier doesn't shorten timelines (because there's already at least a few other organizations who are racing with each other, and no one wants to fall behind).
But, Anthropic being in the race (and, also publicly calling for RSI in a fairly adversarial way, i.e. "gaining a more durable advantage") might cause there to be more companies and nations explicitly racing for full AGI, and doing so in a more adversarial way, and generally making the gameboard more geopolitically chaotic at a crucial time.
This seems more true to me than the "does Anthropic shorten timelines?" question. I think there are currently few enough labs doing this that a marginal lab going for AGI does make that seem more "real," and give FOMO to other companies/countries.[11]
But, given that Anthropic has already basically stated they are doing this, the subquestion is more like:
Assuming Anthropic got powerful but controllable ~human-genius-ish level AI, can/will they do something useful with it to end the acute risk period?
In my worldview, getting to AGI only particularly matters if you leverage it to prevent other people from creating reckless/powerseeking AI. Otherwise, whatever material benefits you get from it are short lived.
I don't know how Dario thinks about this question. This could mean a lot of things. Some ways of ending the acute risk period are adversarial, or unilateralist, and some are more cooperative (either with a coalition of groups/companies/nations, or with most of the world).
This is the hardest to have good models about. Partly it's just, like, quite a hard problem for anyone to know what it looks like to handle this sanely. Partly, it's the sort of thing people are more likely to not be fully public about.
Some recent interviews have had him saying "Guys this is a radically different kind of technology, we need to come together and think about this. It's bigger than one company should be deciding what to do with." There's versions of this that are a cheap platitude more than earnest plea, but, I do basically take him at his word here.
He doesn't talk about x-risk, or much about uncontrollable AI. The "Core views on AI safety" post lists "alignment might be very hard" as a major plausibility they are concerned with, and implies it ends up being like 1/3 or something of their probability mass.
Subquestions:
I am kinda intrigued by how controversial this post seems (based on seeing the karma creep upwards and then back down over the past day). I am curious if the downvoters tend more like:
One crux is how soon do we need to handle the philosophical problems? My intuition says that something, most likely corrigibility in the Max Harms sense, will enable us to get pretty powerful AIs while postponing the big philosophical questions.
Are there any pivotal acts that aren’t philosophically loaded?
My intuition says there will be pivotal processes that don't require any special inventions. I expect that AIs will be obedient when they initially become capable enough to convince governments that further AI development would be harmful (if it would in fact be harmful).
The combination of worried governments and massive AI-enhanced surveillance seems likely to be effective.
If we need a decades-long pause, then the world will need to successfully notice and orient to that fact. By default I expect tons of economic and political pressure towards various actors trying to get more AI power even if there's broad agreement that it's dangerous.
I expect this to get easier to deal with over time. Maybe job disruptions will get voters to make AI concerns their top priority. Maybe the AIs will make sufficiently convincing arguments. Maybe a serious mistake by an AI will create a fire alarm.
I expect that AIs will be obedient when they initially become capable enough to convince governments that further AI development would be harmful (if it would in fact be harmful).
Seems like "the AIs are good enough at persuasion to persuade governments and someone is deploying them for that" is right when you need to be very high confidence they're obedient (and, don't have some kind of agenda). If they can persuade governments, they can also persuade you of things.
I also think it gets to a point where I'd sure feel way more comfortable if we had more satisfying answers to "where exactly are we supposed to draw the line between 'informing' and 'manipulating'?" (I'm not 100% sure what you're imagining here tho)
I'm assuming that the AI can accomplish its goal by honestly informing governments. Possibly that would include some sort of demonstration of the AI's power that would provide compelling evidence that the AI would be dangerous if it wasn't obedient.
I'm not encouraging you to be comfortable. I'm encouraging you to mix a bit more hope in with your concerns.
So, I have a lot of complaints about Anthropic, and about how EA / AI safety people often relate to Anthropic (i.e. treating the company as more trustworthy/good than makes sense).
At some point I may write up a post that is focused on those complaints.
But after years of arguing with Anthropic employees, and reading the few public writings they've done, my sense is Dario/Anthropic-leadership are at least reasonably earnestly trying to do good things within their worldview.
So I want to just argue with the object-level parts of that worldview that I disagree with.
I think the Anthropic worldview is something like:
My current understanding of the corresponding Anthropic strategy is:
I haven't seen a clear plan for what comes exactly after step #5. I've seen Dario say in interviews "we will need to Have Some Kind of Conversation About AI, Together." This is pretty vague. It might be vague because Dario doesn't have a plan there, or it might be that the asks are too Overton-shattering to say right now and he is waiting till there's clearer evidence to throw at people.
I think it's plausible Dario's take is like "look it's not that useful to have a plan in advance here because it'll depend a ton on what the gameboard looks like at the time."
I: Arguments for "Technical Philosophy"
The crucial questions in my mind here are:
Technical
Geopolitical
Tying together in:
I don't disagree that totalitarian AI would be real bad. It's quite plausible to me that the "global pause" crowd are underweighting how bad it would be.
But I'm personally at like "it's at least ~60% that superalignment is Real Hard", and that the Anthropic playbook and research culture is not well suited for solving it. And this makes the "how bad is totalitarian AI?" question kind of moot.
It feels vaguely reasonable to me to have a belief as low as 15% on "Superalignment is Real Hard in a way that requires like a 10-30 year pause." And, at 15%, it still feels pretty crazy to be oriented around racing the way Anthropic is.
I once talked to Zac Hatfield-Dodds who agreed with me about "15% on 'superalignment requires 10+ year pause' being a crux," but their number was like 5%[5]. And, yeah – if I earnestly believed 5%, I would honestly be unsure how to weigh misalignment x-risk against the risk of catastrophic misuse. I was pretty happy to find such a crisp disagreement – a crux! a double crux even!
But, my guess is that Anthropic employees have biased reasons for thinking the risk is so low, or can be treated as low in practice. See Optimistic Assumptions, Longterm Planning, and "Cope". But, over my years of arguing with them, I've found it harder to present a slam-dunk case than I initially expected, and I think it'd be good to actually argue some of the object level details.
I will articulate some arguments here, and mostly I am hoping for more argument and back-and-forth in the comments. I think it'd be cool if this ended with someone representative of Anthropic having more of an extended debate with someone with stronger takes on the technical side of things than me.
10-30 years of serial research, or "extreme philosophical competence."
Anthropic seems like they are making costly enough signals of "we may need to slow down for 6-12 months or so, and we want other people to be ready for that too", that I believe they take that earnestly.
But they aren't saying anything like "hey everyone, we maybe will have to stop for like over a decade in order to figure things out." And this is the thing I most wish they would do.
The notion of "we need a lot of serial research time" was written up by Nate Soares in a post on differential technological development. The argument is: even though some types of research will get easier to do when we have more "realistic/representative" AI, or more researchers available to focus on the problem, there are some kinds of research that really require one person to do a lot of deep thinking, integrating concepts, synthesizing, which historically has seemed to take blocks of time that are measured in decades.
Right now, we're in an "iterative empiricism" regime with AI, where you can see some problems, try some obvious solutions, see how those solutions work, and iterate on the next generation. The problem is that at some point, AI will probably hit a critical threshold where it rapidly recursively self-improves.
Does your alignment process safely scale to infinity?
A lot of safeguards that hold up when the next generation of AI is only slightly more powerful, break down when the next generation has scaled to "unboundedly powerful."
I think Dario agrees with this part. My sense from scattered conversations and public comms and reading between the lines is "Anthropic will be able to notice and slow down before we get to this point, and make sure they have a good scalable oversight system in place before you begin the recursive self-improvement process."
Where I think we disagree is how badly the Anthropic playbook is suited for handling this shift.
I think for the shift from "near human" to "massively superhuman" to go safely, your alignment team needs to be very competent at "technical philosophy."
By "technical philosophical competence," I think I mostly mean: "Ability to define and reason about abstract concepts with extreme precision. In some cases, about concepts which humanity has struggled to agree on for thousands of years. And, do that defining in a way that is robust to extreme optimization, and is robust to ontological updates by entities that know a lot more about us and the universe than we know."
Posts like AGI Ruin: A List of Lethalities (and, basically the entire LessWrong sequences) have tried to argue this point. Here is my attempt at a quick recap of why this is important:
We need AI (or, at least some extremely powerful technology) that can help us end the acute risk period.
"Okay, but what does the alignment difficulty curve look like at the point where AI is powerful enough to start being useful for Acute Risk Period reduction?"
A counterpoint I got from Anthropic employee Drake Thomas (note: not speaking for Anthropic or anything, just discussing his own models), is something like:
I think this is a kind of reasonable point and I'm interested in takes on it from people in the more MIRI-ish crowd. But, I still feel quite worried.
Are there any pivotal acts that aren't philosophically loaded?
A lot of the hope (I think both for me, and in my current understanding of Drake's take) is "can you build an AI that helps invent powerful (non-AI) technology, where it's sort of straightforward to use the technology to stop people from building reckless AI?"
The only technology that I've imagined thus far that feels plausible is "invent uploading or bio enhancements that make humans a lot smarter" (essentially approaching "aligned superintelligence" by starting from humanity, rather than ML).
This does feel intuitively plausible to me, but:
1. I think, to build AI powerful enough to invent such technology quickly[7], you need to be able to point the AI at abstract targets that are robust under ontological change. I don't actually expect these are any easier than the more stereotypical "[moral] philosophy questions people have argued about for thousands of years."
2. Even if powerful enough AI to help invent such technology is (currently) safe, it's probably at least pretty close to "would be actively dangerous with slightly different training or scaffolding."
3. You still end up needing to solve the problem of "prevent humans or AIs from building reckless AI, indefinitely", even if you kick the can to uplifted human successors. And even if you've gotten a lot of buy-in from major governments[8] and such, you do need to be capable of robustly stopping many smart actors, who work persistently over a long time. This requires a lot of optimization power, enough to be worrying even if it's not getting into "unboundedly huge" territory. I dunno, if your plan is routing through uplifted human successors figuring it out, I still think you need to start contending with the details of this now.
4. If we need a decades-long pause, then the world will need to successfully notice and orient to that fact. By default I expect tons of economic and political pressure towards various actors trying to get more AI power even if there's broad agreement that it's dangerous. If AI is "technically philosophically challenging", then it's important for Anthropic to understand that and to use its "seat at the table" strategy to try to (ideally) help convey this (or, maybe, "argue from authority").
5. Even if you pull all this off, you probably want to get actual fully superhuman AI solving tons of problems.
Your org culture needs to handle the philosophy
Anthropic is banking on "Medium-strong AI can help us handle Strong AI." I'm not sure whether the leadership is thinking more about using medium-AI to "figure out how to align strong-AI" or more like using it to "do narrower things that buy us more time."
To be clear: I am also banking on leveraging "medium-strong AI" to help, at this point. But, I think leveraging it to help align superhuman AI requires directly tackling the philosophically hard parts.
I get a sense that Anthropic research culture is sort of allergic to philosophy. I think this allergy is there for a reason – there's a lot of people making longwinded arguments that are, in fact, BS. The bitter lesson taught the ML community that simple, scalable methods that leverage computation tend to ultimately outperform hand-crafted, theoretically sophisticated approaches.
But, when you get to the point where you're within one "accidentally left the AI training too long" away from superintelligence, I think you really do need world-class technically-grounded philosophy of a kind that I'm not sure humanity has even seen yet.
Philosophers-as-a-field have not reached consensus on a lot of important issues. This includes moral "what is The Good" kinds of questions, but also more basic-seeming things like "what even is an object?". IMO, this doesn't mean "you can't trust philosophy", it means "you should expect it to be hard, and you need to deal with it anyway."
I'm actually worried that the default result of trying to leverage LLM agents to help solve these sorts of problems is to make people stupider, instead of smarter. My experience is that LLMs have nudged me towards trying to solve problems of the shape that LLMs are good at solving. There's already a massive set of biases nudging people to try to substitute easier versions of the alignment problem that don't help much, and LLMs will exacerbate that.
I can buy "when you have human-level AI, you get to run tons of empirical experiments that tell you a lot about an AI's psychology and internals", that you don't get to run on humans. But, those experiments need to pointed towards having a deep, robust understanding in order for it to be safe to scale AI past the human levels. When I hear vague descriptions of the sort of experiments Anthropic seems to be running, they do not seem pointed in that direction.
(what would count as the right sort of experiments? I don't really know. I am hoping for some followup debate between prosaic and agent-foundations-esque researchers on the details of this)
Also, like, you should be way more pessimistic about how this is organizationally hard
I wrote "Carefully Bootstrapped Alignment" is organizationally hard and Recursive Middle Manager Hell kind of with the explicit goal of slowing down how rapidly Anthropic scaled, because I think it's very hard for large orgs to handle this kind of nuance well.
That didn't work, at least in the sense of "not hiring a lot of people." Two years ago, when I last talked to a bunch of Anthropic peeps about this, my sense was that there were indeed some of the pathologies I was worried about.
Since then, I've been at least somewhat pleasantly surprised by the vague vibes I've gotten about how Dario runs Anthropic (that is to say: it seems like he tries to actually run it with his models).
I still think it is extremely hard to steer a large organization, and my sense is Dario's attention is still naturally occupied by tons of things about "how to run a generally successful company" (which is already quite hard), which I doubt leaves nearly enough room for either the theoretical question of "what is really needed to align superintelligence?" or the question of "what implications does that have for a company that needs to hit a very narrow research target in a short timeline?"
Listing Cruxes & Followup Debate
Recapping the overall point here:
If alignment were sufficiently hard in particular ways, I don't think Anthropic's cluster of strategies makes sense. If alignment is hard, it doesn't matter if China wins the race. China still loses and so does everyone else. The most important thing would be somehow slowing down across the board.
It doesn't matter if we get a few years of beneficial technical progress if some US lab or government project eventually runs an AI with fewer safeguards, or if humanity cedes control of its major institutions to AI processes.
The thing I am hoping to come out of the comments here is getting more surface area on people's cruxes, while engaging with the entirety of the problem. I'd like it if people got more specific about where they disagree.
If you disagree with my framing here, that's fine, but if so I'd like to see your own framing that engages with the entirety of the problem (and framed around "what would be sufficient to think Anthropic should significantly change its research or policy comms?").
Drake's arguments about "what capability levels do we actually need to execute useful pivotal acts? What does the alignment-difficulty curve look like right around that area?" were somewhat new-to-me, and I don't think I've seen a thorough writeup from the "AI is very hard" crowd that really engaged with that.
I'll be writing a top-level comment that lists out many of my own cruxes.
Anthropic people self report as thinking "alignment may be hard", but, I'm comparing this to the MIRI cluster who are like "it is so hard your plan is fundamentally bad, please stop."
This is a guess after talking with Drake Thomas about his sense of the Anthropic view on comms strategy during the drafting of this post, not a deeply-integrated piece of Ray's model of Anthropic.
This is based on comments like in Dario's post on export controls:
This is noticeably different than what I'd be spending chips on.
He added recently:
The Anthropic beta-readers objected to this one. One said:
I can imagine AI that merely correctly finds existing relevant facts and arguments and is logically coherent, without being deeply good at original thinking, which doesn't require extreme philosophical competence. But, I don't think this will be enough to invent novel tech faster than someone else will build reckless AI.
I certainly want this, but I don't know that we'll actually get it.