Joshua Achiam is the OpenAI Head of Mission Alignment

I start off this post with an apology for two related mistakes from last week.

The first is the easy correction: I incorrectly thought he was the head of ‘alignment’ at OpenAI rather than his actual title ‘mission alignment.’

Both roles are important, and make one’s views important, but they are very different.

The more serious error, which got quoted elsewhere, was this: In the section about OpenAI, I noted some past comments from Joshua Achiam, and interpreted them as him lecturing EAs that misalignment risk from AGI was not real.

While I believe that is a reasonable way to interpret the quote in isolation, this issue is important to get right, especially if I’m going to say things like that, and looking at it only that way was wrong. I both used a poor method to contact Joshua for comment that failed to reach him when I had better options, and failed to search for additional past comments that would have provided additional context.

I should have done better on both counts, and I’m sorry.

Indeed, exactly because OpenAI is so important, and to counter the potential spread of inaccurate information, I’m offering this deep dive into Joshua Achiam’s public statements. He has looked at a draft of this to confirm it has no major errors.

Here is a thread Joshua wrote in November 2022 giving various links to AI safety papers and resources. The focus is on concrete, practical ‘grounded’ stuff, though it also includes a course by Dan Hendrycks that covers both levels.


Joshua Achiam Has a Very Different Model of AI Existential Risk

Having looked at many additional statements, I can say that Joshua clearly believes misalignment risk from AGI is real. He has said so, and he has been working on mitigating that risk. And he’s definitely been in the business many times of pointing out when those skeptical of existential risk get sufficiently far out of line and make absolute statements or unfair personal or cultural attacks.

He does appear to view some models and modes of AI existential risk, including Yudkowsky-style models of AI existential risk, as sufficiently implausible or irrelevant as to be effectively ignorable. And he’s shown a strong hostility in the x-risk context to the rhetoric, arguments, tactics and suggested actions of existential risk advocates more broadly.

So for example we have these:

Joshua Achiam (March 23, 2024): I think the x-risk discourse pendulum swung a little too far to “everything is fine.” Total doomerism is baseless and doomer arguments generally poor. But unconcerned optimism – or worse, “LeCun said so” optimism – is jarring and indefensible.

Joshua Achiam (June 7, 2023): see a lot of talk about “arguments for” or “arguments against” x-risk and this is not sensible imho. talk about likelihoods of scenarios, not whether they definitely will or definitely won’t happen. you don’t know.

Joshua Achiam (April 25, 2023): I also think a hard take off is extremely unlikely and largely ruled out on physical grounds, but Yann [LeCun], saying “that’s utterly impossible!” has gotta be like, the least genre-savvy thing you can do.

Also x-risk is real, even if unlikely. Vulnerable world hypothesis seems true. AGI makes various x-risks more likely, even if it does not create exotic nanotech gray goo Eliezerdoom. We should definitely reduce x-risk.

Joshua Achiam (March 14, 2021): If we adopt safety best practices that are common in other professional engineering fields, we’ll get there. Surfacing and prioritizing hazards, and making that analysis legible, has to become the norm.

I consider myself one of the x-risk people, though I agree that most of them would reject my view on how to prevent it.

I think the wholesale rejection of safety best practices from other fields is one of the dumbest mistakes that a group of otherwise very smart people has ever made. Throwing all of humanity’s future babies out with the bathwater.

On the one hand, the first statement is a very clear ‘no, everything will not automatically be fine’ and correctly identifies unconcerned optimism as indefensible. The others are helpful as well. On the other hand, the first statement also continues the characterization of those worried as mostly ‘doomers’ with generally poor arguments.

The second is correct in principle as well. And the third makes a fair point: if there’s one thing Yann LeCun isn’t, it’s genre savvy.

Joshua is Strongly Dismissive of Alternative Models of AI X-Risk

In practice, however, the ‘consider the likelihood of each particular scenario’ approach tends to default everything to the ‘things turn out OK’ bracket minus the particular scenarios one can come up with.

It is central to my perspective that you absolutely cannot do that. I am very confident that the things being proposed do not default to good outcomes. Good outcomes are possible, but to get them we will have to engineer them.

There is no contradiction between ‘existential risk is unlikely’ and ‘we should reduce existential risk.’ It is explicit that Joshua thinks such risks are unlikely. Have we seen him put a number on it? Yes, but I found only the original quote I discussed last time and a clarification thereof, which was:

Joshua Achiam: Ah – my claims are

* P(everyone dead in 10 years) is extremely small (1e-6),

* P(everyone dead in 100 years) is much less than 100%,

* Most discourse around x-risk neglects to consider or characterize gradual transformations of humanity that strike me as moderately plausible.

I also think x-risk within 100 years could potentially have AGI in the causal chain without being an intentional act by AGI (eg, humans ask a helpful, aligned AGI to help us solve a scientific problem whose solution lets us build a superweapon that causes x-risk).

This makes clear he is dismissing in particular ‘all humans are physically dead by 2032’ rather than ‘the world is on a path by 2032 where that outcome (or another where all value is lost) is inevitable.’ I do think this very low probability is highly alarming, and in this situation I don’t see how you can possibly have model error as low as 1e-6 (!), but it is less crazy given it is more narrow.
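
To spell out the model-error point with made-up illustrative numbers (mine, not his): whatever your inside-view model says, your stated probability has a floor set by the chance that the model itself is wrong.

```latex
P(\text{everyone dead in 10 years})
  \;\ge\; P(\text{model wrong}) \times P(\text{dead} \mid \text{model wrong})
  \;\approx\; 10^{-3} \times 10^{-2} \;=\; 10^{-5} \;>\; 10^{-6}.
```

Even those two generous placeholder numbers already overshoot the stated estimate by a factor of ten.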

The ‘much less than 100%’ doom number in 100 years doesn’t rule out my own number. What it tells me more than anything, on its own, is that he’s grown understandably exhausted with dealing with people who do put 99% or 99.9% in that spot.

But he’s actually making much stronger claims here, in the context of an EA constructive criticism thread basically telling them not to seek power because EA was too dysfunctional (which makes some good points and suggestions, but also proves far too much, which points to what I think is wrong in the thread more broadly):

Joshua Achiam: (This is not to say there are no x-risks from AGI – I think there are – but anyone who tells you probabilities are in the 5-10% range or greater that AGI will immediately and intentionally kill everyone is absolutely not thinking clearly)

The idea that a 5% probability of such an outcome, as envisioned by someone else for some other person’s definition of AGI, proves they are ‘not thinking clearly,’ seems like another clear example of dismissiveness and overconfidence to me. This goes beyond not buying the threat model that creates such predictions, which I think is itself a mistake. Similarly:

Joshua Achiam (November 12, 2022): Again, this is not a claim that x-risk isn’t real, that AGI doesn’t lead to x-risk, or that AGI doesn’t have potentially catastrophic impacts, all of which I think are plausible claims. But the claimed timelines and probabilities are just way, way out of connection to reality.

At this point, I’ve heard quite a lot of people at or formerly at OpenAI in particular, including Sam Altman, espouse the kinds of timelines Joshua here says are ‘way, way out of connection to reality.’ So I’m curious what he thinks about that.

The fourth earlier claim, that AI could be a link in the causal chain to x-risk without requiring the AI to commit an intentional act, seems very obviously true. If anything it highlights that many people place importance on there being an ‘intentional act’ or similar, whereas I don’t see that distinction as important. I do think that the scenario he’s describing there, where the superweapon becomes possible but we otherwise have things under control, is a risk level I’d happily accept.

The third claim is more interesting. Most of the talk I hear about ‘we’ll merge with the machines’ or whatnot doesn’t seem to me to make sense on any meaningful level. I see scenarios where humanity has a ‘gradual transformation’ as ones where we successfully solve ‘phase one’ and have the alignment and control issues handled, but then weird dynamics or changes happen in what I call ‘phase two,’ when we have to get human dynamics in that world into some form of long-term equilibrium, and current humanity turns out not to be it.

I do agree, or at least notice I am confused about, which of those worlds count as valuable versus not. I’ve mentally been putting them mostly into the ‘win’ bucket; if you don’t do that, then doom estimates go up.

Would Ordinary Safety Practices Be Sufficient for AI?

I would hope we can all agree they are necessary. They don’t seem sufficient to me.

Consider Joshua’s belief (at least in 2021) that if we adopt general safety best practices from other industries, we’ll ‘get there.’ While they are much better than nothing, and better than current practices in AI, I very strongly disagree with this. I do think that given what else is happening at OpenAI, someone who believes strongly in ‘general best practices’ for safety is providing large value above replacement.

Standard safety policies cannot be assumed. Some major labs fall well short of this, and have made clear they have no intention of changing course. There is clear and extreme opposition, from many circles (though not from Joshua), to any regulatory requirements that say ‘you must apply otherwise ordinary safety protocols to AI.’

It seems clearly good to not throw out these standard policies, on the margin? It would be a great start to at least agree on that. If nothing else those policies might identify problems that cause us to halt and catch fire.

But I really, really do not think that approach will get it done on its own, other than perhaps via ‘realize you need to stop.’ The threat models this time are too expansive and too different. I’d certainly go so far as to say that if someone assigns a very high probability to that approach being sufficient, then they are, to my mind, not thinking clearly.

Consider also this statement:

Joshua Achiam (August 9, 2022): hot take: no clear distinction between alignment work and capabilities work yet. might not be for a year or two.

Joshua Achiam (March 5, 2024): hot take: still true.

The obvious way to interpret this statement is, in addition to the true statement that much alignment work also enhances capabilities, that the alignment work that isn’t also capabilities work isn’t real alignment work? Downthread he offers good nuance. I do think that most current alignment work does also advance capabilities, but that the distinction should mostly be ‘clear,’ even if there are important shades of gray and you cannot precisely define a separator.

Visions of the Future

In terms of ‘things that seem to me like not thinking clearly’:

Joshua Achiam (August 17, 2023): It bears repeating: “Her (2013)” is the only AI movie that correctly predicts the future.

interstice: I agree with Robin Hanson’s take that it’s like a movie about a world where schoolchildren can buy atom bombs at the convenience store, but is bizarrely depicted as otherwise normal, with the main implication of the atom bombs being on the wacky adventures of the kids.

Joshua Achiam: It’s about the world where prosaic alignment works well enough to avoid doom, but leads to the AIs wanting to do their own thing, and the strange messy consequences in the moment where humans and AIs realize that their paths diverge.

Caleb Moses: I’d say this is mainly because it’s primarily concerned with predicting humans (which we know a lot about) rather than AI (which we don’t know a lot about)

Joshua Achiam: 1000%.

So that’s the thing, right? Fictional worlds like this almost never actually make sense on closer examination. The incentives and options and actions are based on the plot and the need to tell human stories rather than following good in-universe logic. The worlds in question are almost always highly fragile, the worlds really should blow up, and the AIs ensure the humans work out okay in some sense ‘because of reasons,’ because it feels right to a human writer and their sense of morality or something, rather than because that is what would actually happen.

I worry this kind of perspective is load-bearing, given he thinks the movie is ‘correctly predicting the future’: the idea that ‘prosaic alignment’ will push AIs strongly enough toward some common-sense-morality-style not harming of the humans that things turn out fine by default, despite all the competitive dynamics among AIs and the various other things they value and grow to value, in worlds that to me seem past their point of no return and infinitely doomed unless you think the AIs themselves have value.

Alternatively, yes, Her is primarily about predicting the humans. And perhaps it is a good depiction of how humans would react to and interact with AI if that scenario took place. But it does a very poor job predicting the AIs, which is the part that actually matters here?

For the opposite perspective, see for example Eliezer Yudkowsky here last month.

Joshua Achiam versus Eliezer Yudkowsky

We definitely have a pattern of Joshua taking rhetorical pot-shots at Yudkowsky. Here’s a pretty bad one:

Joshua Achiam (March 29, 2023): Eliezer is going to get AI researchers murdered at some point, and his calls for extreme violence have no place in the field of AI safety. We are now well past the point where it’s appropriate to take him seriously, even as a charismatic fanfiction author.

No, I do not mean as a founder of the field of alignment. You don’t get to claim “field founder” status if you don’t actually work in the field. Calling for airstrikes on rogue datacenters is a direct call for violence, a clear message that violence is an acceptable solution.

His essays are completely unrelated to all real thrusts of effort in the field and almost all of his object-level technical predictions over the past twenty years have been wrong. Founder of rationalism? Sure. Alignment? Absolutely not.

I think this kind of rhetoric about ‘calls for violence’ is extremely bad and wrong. Even for example here, where the thread’s primary purpose is to point out that certain accusations against EA (that they ‘underemphasized AI x-risk’ and pretended to care about other things) are indeed quite ridiculous, you see him refer to Eliezer “Bomb the Datacenters” Yudkowsky.

What Yudkowsky said was that if there was an international agreement that you don’t develop AGI, you would, if you ran out of other alternatives, use physical force to enforce that agreement. That is how every law and every treaty or agreement works, and indeed the only way they can work.

Richard Ngo (replying to Joshua): You are fundamentally misunderstanding how policy discussions work and thereby propagating the meme that you’re trying to suppress.

You cannot interpret international policy proposals as calls for individual action otherwise *any* opinion on IR == advocating murder.

Joshua Achiam: I don’t think this is a misunderstanding. If you say, “The people who are building AI are going to kill my children and your children, violence is acceptable and even necessary to shut this down,” that is not. Goddamn. Okay.

Richard Ngo: EVERY person who wants the police to arrest people who break the law is in favor of violence. But advocating for laws is not advocating for violence. The same is true about advocating for international treaties. You’re creating the meme we’re both trying to prevent. Please stop.

This is so frustrating. Unlawful violence quite obviously is unacceptable even if it would work, and also it very obviously wouldn’t work. And we keep saying that.

In terms of Yudkowsky’s technical predictions, I noticed I disagreed, and rather than argue details I asked Claude and o1-preview this question (fully one-shot only):

“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’”

Skipping to the conclusions (you can of course try the prompt yourself, or try your own):

Claude: Overall evaluation: On a scale from “mostly true” to “mostly false,” I would rate Yudkowsky’s predictions as leaning towards “mostly true,” but with important caveats. Many of his core concerns and predictions about the challenges of AI development and safety have been validated by ongoing research and developments in the field. However, some of his more extreme or urgent predictions about rapid capability jumps or imminent existential risks have not yet materialized in the ways he initially described.

o1-preview: Considering the above assessments, Eliezer Yudkowsky’s predictions are a mix of validated concerns and speculative forecasts. The trends in AI development up to March 2023 provide partial support for his views, particularly regarding the rapid advancement of AI capabilities and the challenges of alignment.

On a scale from “mostly true” to “mostly false,” I would evaluate these predictions as leaning towards “mostly true.” While not all predictions have been conclusively proven, the trajectory of AI research and the growing acknowledgment of AI safety issues suggest that his insights are largely valid and warrant serious consideration.

Given how difficult predictions are to make, especially about the future, that’s not bad, and certainly quite different from ‘almost all wrong’ to the point of needing everyone else to dismiss him as a thinker.
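
For anyone who wants to rerun the comparison themselves, here is a minimal sketch of one way to do the Claude half one-shot via the Anthropic Python SDK. The model name below is illustrative rather than a record of what I actually used, and the OpenAI side would be analogous.

```python
# Minimal one-shot reproduction sketch. Assumes ANTHROPIC_API_KEY is set in
# the environment; the model name is illustrative, not the one used above.
import anthropic

PROMPT = (
    "Based on your understanding of AI technical developments as of March 29, "
    "2023, evaluate the most important known object-level predictions of "
    "Eliezer Yudkowsky on the subject, and which ones seemed true versus false. "
    "Afterwards, evaluate those predictions as a group, on a scale from "
    "'mostly true' to 'mostly false.'"
)

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{"role": "user", "content": PROMPT}],
)
print(message.content[0].text)  # the model's one-shot assessment
```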

One of Eliezer’s key concepts is instrumental convergence. In this thread Achiam argues against the fully maximalist form of instrumental convergence:

Joshua Achiam (March 9, 2023): For literally every macrostate goal (“cause observable X to be true in the universe”) you can write an extended microstate goal that specifies how it is achieved (“cause observable X to be true in the universe BY MEANS OF action series Y”).

It doesn’t seem clear or obvious whether the space of microstates is dense in undesired subgoals. If the space of goals that lead to instrumental drives is a set of measure zero in this space, slight misalignment is almost surely never going to result in the bad thing.

And that claim – “We don’t know if goal space is dense in inert goals or dense in goals that lead to instrumental drives” – is the main point here. WE DON’T KNOW.

The alignment X-risk world takes “instrumental goals are inevitable” as a shibboleth, an assumption that requires no proof. But it is an actual question that requires investigation! Especially if claims with huge ramifications depend on it.

It is technically true that you can impose arbitrarily strict implementation details and constraints on a goal, such that instrumental convergence ceases to be a useful means of approaching the goal, and thus you should expect not to observe it.

Without getting into any technical arguments, it seems rather absurd to suggest the set of goals that imply undesired subgoals within plausibly desired goal space would have measure zero? I don’t see how this survives contact with common sense or relation to human experience or typical human situations. Most humans spend most of their lives pursuing otherwise undesired subgoals and subdrives that exist due to other goals, on some level. The path to achieving almost any big goal, or pursuing anything maximalist, will do the same.
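
To make the shape of that claim concrete, here is a deliberately rigged toy sketch (my construction for illustration, not anything Joshua or anyone else has endorsed): a tiny graph world in which most goals happen to sit behind one shared ‘resource’ node, so that for most goals you could be handed, the shortest plan grabs the resource first. It illustrates the structure of the instrumental convergence claim, not how dense such bottlenecks are in realistic goal spaces, which is exactly the thing in dispute.

```python
from collections import deque

# Deliberately rigged toy: ten goal rooms, eight of which are only reachable
# through a shared "key" node. This only illustrates the structure of the
# claim; how dense such bottlenecks are in realistic goal spaces is the
# actual open question.
EDGES = {
    "start": ["key", "g0", "g1"],                       # two goals are open
    "key":   ["start"] + [f"g{i}" for i in range(10)],  # the rest need the key
}

def shortest_path(goal):
    """Plain BFS from 'start'; returns the first (shortest) path to `goal`."""
    queue = deque([["start"]])
    seen = {"start"}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

goals = [f"g{i}" for i in range(10)]
via_key = sum("key" in shortest_path(g) for g in goals)
print(f"{via_key}/{len(goals)} assignable goals route through the shared resource")
# prints: 8/10
```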

When I think about how an AI would achieve a wide variety of goals humans might plausibly assign to it, I see the same result. We’ve also now seen observations (at least as I interpret them) of instrumental convergence in existing models when given goals, reasonably consistently across the reports I see that give the model reason to exhibit it.

Am I holding out some probability that instrumental convergence mostly won’t be a thing for highly capable AIs? I have to, because this is not a place you can ‘prove’ anything as such. But it would be really boggling to me for it to almost never show up, if we assigned various complex and difficult tasks, and gave the models capabilities where instrumental convergence was clearly the ‘correct play,’ without any active attempts to prevent instrumental convergence from showing up.

I agree we should continue to test and verify, and would even if everyone agreed it was super likely. But convergence failing to show up would blow my mind hardcore.

People Are Going to Give AI Power

In the before times, people said things like ‘oh you wouldn’t connect your AI to the internet, you’d be crazy.’ Or they’d say ‘you wouldn’t make your AI into an agent and let it go off with your [crypto] wallet.’

Those predictions did not survive contact with the enemy, or with reality. Whoops.

Joshua Achiam (April 28, 2023): 🌶A problem in the AI safety discourse: many are assuming a threat model where the AI subtly or forcibly takes resources and power from us, and this is the thing we need to defend against. This argument has a big hole in it: it won’t have to take what it is given freely.

The market is selecting for the development and deployment of large-scale AI models that will allow increasingly-complex decisions and workflows to be handled by AI with low-to-no human oversight. The market *explicitly wants* to give the AI power.

If your strategy relies on avoiding the AI ever getting power, influence, or resources, your strategy is dead on arrival.

This seems insanely difficult to avoid. As I have tried to warn many times, once AI is more effective at various tasks than humans, any humans who don’t turn those tasks over to AIs get left behind. That’s true for individuals, groups, corporations and nations. If you don’t want the AIs to be given that power, you have two options: You can prevent the AIs from being created, or you can actively bar anyone from giving the AIs that power, in a way that sticks.

Indeed, I would go further. The market wants the AIs to be given as much freedom and authority as possible, to send them out to compete for resources and influence generally, for various ultimate purposes. And the outcome of those clashes and the various selection effects and resource competitions would, by default, doom us.

Your third option is the one Joshua suggests, that you assume the AIs get the power and plan accordingly.

Joshua Achiam: You should be building tools that ensure AI behavior in critical decision-making settings is robust, reliable, and well-specified.

Crucially this means you’ll need to develop domain knowledge about the decisions it will actually make. Safety strategies that are too high-level – “how do we detect power-seeking?” are useless by comparison to safety strategies that are exhaustive at object level.

How do we get it to make financial decisions in ways that don’t create massive wipeout risks? How do we put limits on the amount of resources that it can allocate to its own compute and retraining? How do we prevent it from putting a political thumb on the scale?

In every domain, you’ll have to build datasets, process models, and appropriate safety constraints on outcomes that you can turn into specific training objectives for the model.

Seems really hard on multiple levels. There is an implicit ‘you build distinct AIs to handle distinct narrow tasks where you can well-define what they’re aiming for’ but that is also not what the market wants. The market wants general purpose agents that will go out and do underspecified tasks to advance people’s overall situations and interests, in ways that very much want them to do all the things the above wants them not to do. The market wants AIs advising humans on every decision they make, with all the problems that implies.

If you want AIs to only do well-specified things in well-specified domains according to socially approved objectives and principles, how do you get to that outcome? How do you deal with all the myriad incentives lining up the other way, all the usual game theory problems? And that’s if you actually know how to get the AIs to be smart enough and perceptive enough to do their work yet respond to the training sets in a way that gets them to disregard, even under pressure, the obviously correct courses of action on every other level.

These are super hard and important questions and I don’t like any of the answers I’ve seen. That includes Joshua’s suggested path, which doesn’t seem like it solves the problem.

The place it got weird was in this follow-up.

Joshua Browder: I decided to outsource my entire personal financial life to GPT-4 (via the DoNotPay chat we are building).

I gave AutoGPT access to my bank, financial statements, credit report, and email.

Here’s how it’s going so far (+$217.85) and the strange ways it’s saving money.

Joel Lehman: Welp.

Joshua Achiam: To put a fine point on it – this is one of the reasons I think x-risk from the competition-for-resources scenario is low. There just isn’t a competition. All the conditions are set for enthusiastic collaboration. (But x-risk from accidents or human evil is still plausible.)

Roon: Yeah.

But that’s exactly how I think the competition for resources x-risk thing manifests. Browder outsources his everything to AgentGPT-N. He tells it to go out and use his money to compete for resources. So does everyone else. And then things happen.

So the argument is that these AIs will then ‘enthusiastically collaborate’ with each other? Why should we expect that? Is this an AIs-will-use-good-decision-theory claim? Something else? If they do all cooperate fully with each other, how does that not look like them taking control to maximize some joint objective? And so on.

In good news that is not directly related but is relevant to similar issues, he notes that some people are indeed ‘writing the spec,’ which is the kind of work he seems to think is most important?

Joshua Achiam (Dec 31, 2022): “We just have to sit down and actually write a damn specification, even if it’s like pulling teeth. It’s the most important thing we could possibly do,” said almost no one in the field of AGI alignment, sadly.

Joshua Achiam (Dec 10, 2023): this has changed in a year! alignment folks are talking about building the spec now. bullish on this.

Tegmark just gave a lightning talk on it. Also @davidad’s agenda aims in this direction

I do think it’s very cool that several people are taking a crack at writing specifications. I have no idea how their specs could be expected to work and solve all these problems, but yes people are at least writing some specs.

Value is Complicated

Here is a thread by Joshua Achiam from July 2023, which I believe represents both a really huge central unsolved problem and also a misunderstanding:

Joshua Achiam: this is coming from a place of love: I wish more people in the alignment research universe, who care deeply that AI will share human values, would put more effort into understanding and engaging with different POVs that represent the wide umbrella of human values.

And, sort of broadly, put more effort into embracing and living human values. A lot of alignment researchers seem to live highly out-of-distribution lives, with ideas and ideals that reject much of what “human values” really has to offer. Feels incongruous. People notice this.

“excuse me SIR, the fundamental problem we’re trying to solve is to get it to NOT KILL LITERALLY EVERYONE, and we can worry about those cultural values when we’ve figured that out” ultimate cop out, you’re avoiding the central thing in alignment.

If you can’t get the AI to share human cultural values, your arguments say we’re all going to die. how do you expect to solve this problem if you don’t really try to understand the target? what distinguishes human values from other values?

Are you trying to protect contemporary human aesthetics? the biological human form? our sociopolitical beliefs? if you are trying to protect our freedom to voluntarily change these at will, what counts as sufficiently free? our opinions are staggeringly path-dependent.

Are some opinion-formation paths valid according to your criteria and some paths invalid? When you fear AI influence, do you have a theory for what kind of influence is legitimate and what isn’t?

That said, to avoid misinterpretation: this is not a diss, alignment is an important research field, and x-risk from AGI is nonnegligible. I think the field will surface important results even if it fails in some ways. but this failure lowkey sucks and I think it is a tangible obstacle to success for the agenda of many alignment researchers. you often seem like you don’t know what you are actually trying to protect. this is why so many alignment research agendas come across as incredibly vague and underspecified.


I would strongly disagree, and say this is the only approach I know of that takes the problem of what we value seriously, and that a false sense of exactly what you are protecting, or trying to aim now at protecting a specific specified target, would make things less likely to work. You’d pay to know what you really think. We old-school rationalists, starting with Eliezer Yudkowsky, have been struggling with the ‘what are human values’ problem as central to alignment for a long time.

Sixteen years ago, Eliezer Yudkowsky wrote the Value Theory sequence, going deep on questions like what makes things have value to us, how to reconcile when different entities (human or otherwise) have very different values, and so on. If you’re interested in these questions, this is a great place to start. I have often tried to emphasize that I continue to believe that Value is Fragile, whereas many who don’t believe in existential risk think value is not fragile.

It is a highly understood problem among our crowd that ‘human values’ is both very complex and a terrifyingly hard thing to pin down, and that people very strongly disagree about what they value.

Also it is a terrifyingly easy thing to screw up accidentally, and we have often said that this is one of the important ways to build AGI and lose – that you choose a close and well-meaning but incorrect specification of values, or your chosen words get interpreted that way, or someone tries to get the AGI to find those values by SGD or other search and it gets a well-meaning but incorrect specification.

Thus, the idea to institute Coherent Extrapolated Volition, or CEV, which is very roughly ‘what people would collectively choose as their values, given full accurate information and sufficient time and skill to contemplate the question.’

In calculating CEV, an AI would predict what an idealized version of us would want, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together”. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI’s utility function.

Why would you do that? Exactly because of the expectation that if you do almost anything else, you’re not only not taking everyone’s values into account, you don’t even understand your own well enough to specify them. I certainly don’t. I don’t even have confidence that CEV, if implemented, would result in that much of the things that I actually value, although I’d take it. And yes, this whole problem terrifies me even in good scenarios.

What am I fighting to preserve right now? I am fighting for the ability to make those choices later. That means the humans stay alive and they stay in control. And I am choosing to be less concerned about exactly which humans get to choose which humans get to choose, and more concerned with humans getting to properly choose at all.

Because I expect that if humans don’t make an active choice, or use a poor specification of preferences that gets locked in, then the value that results is likely zero. Whereas if humans do choose intentionally, even humans whose values I strongly disagree with and that are being largely selfish, I do expect those worlds to have strongly positive value. That’s a way in which I think value isn’t so fragile. So yes, I do think the focus should be ensuring someone gets to choose at all.

Also, I strongly believe for these purposes in a form of the orthogonality thesis, which here seems obviously true to me. In particular: Either you can get the AI to reflect the values of your choice, or you can’t. You don’t need to know which values you are aiming for in order to figure that part out. And once you figure that part out you can and should use the AI to help you figure out your values.

Meanwhile, yes, I spend rather a lot of time thinking about what is actually valuable to me and others, without expecting us humans to find the answer on our own, partly because one cannot avoid doing so, and partly because it is decision-relevant in questions like ‘how much existential risk should we accept in the name of beating China?’

In a world where everyone wants the AI to ‘do our alignment homework’ for us, one must ask: What problems must we solve before asking the AI to ‘do our homework’ for us, versus which questions then allow us to safely ask the AI to do that? Almost everyone agrees, in some form, that the key is solving the problems that clear the way to letting AI fully help solve our other problems.

And no, I don’t like getting into too much detail about my best guess about what I value or we collectively should value in the end, both because I think value differences should be respected and because I know how distracting and overwhelming those discussions and fights get if you let them start.

Mostly, I try to highlight those who are expressing values I strongly disagree with – in particular, those that favor or are fine with human extinction. I’m willing to say I’m not okay with that, and I don’t find any of the ‘but it’s fine’ proposals to be both acceptable and physically realistic so far.

Is all this a ‘cop out’? I would say, absolutely not.

Do people ‘notice’ that you are insufficiently focused on these questions? Oh, sure. They notice that you are not focused on those political fights and arguments. Some of them will not like that, because those questions are what they care about. The alternative is that they notice the opposite. That’s worse.

Others appreciate that you are focused on solving problems and growing or preserving the pie, rather than arguing values and focusing on political battles.

Yes, if we succeed in getting to continue to live, then as he says here we will have to agree on how to divide the bounty and do the realignments (I would add, voluntarily or otherwise), same as we do today. But the parties aren’t in a position to negotiate about this now: we don’t know what is available, we don’t know what we want, and we don’t have anyone who could credibly negotiate for any of the sides or interests, and so on. Kicking the ‘who benefits’ can down the road is a time-tested thing to do when inventing new technologies and ensuring they’re safe to deploy.

Conclusion

The interactions I’ve had with Joshua after my initial errors leave me optimistic for continued productive dialogue. Whatever our disagreements, I believe Joshua is trying to figure things out and ensure we have a good future, and that all the public statements analyzed above were intended to be helpful. That is highly refreshing.

Those statements do contain many claims with which I very strongly disagree. We have very different threat models. We take very different views of various predictions and claims, about both the past and the future. At least in the recent past, he was highly dismissive of commonly expressed timeline projections, risk models and risk level assessments, including my own and even more those of many of his colleagues. At core, while I am very happy he at least does think ‘ordinary safety practices’ are necessary and worthwhile, he thinks ‘ordinary safety practices’ would ‘get us there’ and I very much do not expect this. And I fear the views he expresses may lead to shutting out many of those with the most important and strongest concerns.

These disagreements have what seem like important implications, so I am glad I took the time to focus on them and lay them out in detail, and hopefully start a broader discussion.

Comments

“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’“

I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says

Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."

Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.

However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.

It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.

I think Claude likes saying nice things about people, so it's worth trying to control for that.

In case Joshua Achiam ends up reading this post, my question for him is:

My understanding is that you think P(misaligned AGI will kill all humans by 2032) is extremely low, like 1e-6.

Is this because:

  • We won't have AGI by 2032?
  • Much-smarter-than-human AIs could take over, but AGI couldn't, and much-smarter-than-human AIs won't exist prior to 2032?
  • Misalignment could cause problems, but not extinction or literal AI takeover? (E.g., because misalignment won't be this bad.)
  • AI takeover is plausible but wouldn't kill every person?

What is your P(violent misaligned AI takeover | AGI by 2032)?[1]


Some clarification:

I'm using AGI to mean top-human-expert level AI. (As in, can obsolete top human experts in most non-physical tasks.) OpenAI typically uses a similar definition.

By "misalignment", I mean "AIs that conspire against you and your countermeasures", not "AIs which aren't sufficiently robust to jailbreaks".

My guess is that your view is that "misalignment" could be bad, but not existential, while other risks from AGI (e.g. new superweapons used in great power conflict) could be existential.

My view is that a misaligned AI that succeeds in takeover probably wouldn't kill literally every person, though takeover has a high probability of killing billions, killing everyone is plausible, and this would eliminate human control of the future which is likely extremely bad.

So, I prefer to talk about the chances of "violent misaligned AI takeover" where violent means "a bunch of people die or are harmed".

(This is based on this twitter thread I wrote.)


  1. Note that this includes takeover by later systems which are more powerful than AGI. I'm just fixing AGI being created before some date.

The more serious error, which got quoted some elsewhere, was: In the section about OpenAI, I noted some past comments from Joshua Achiam, and interpreted them as him lecturing EAs that misalignment risk from AGI was not real.

Uhh, with the additional context from this post in mind, I would argue that your initial interpretation was entirely correct. Like, this is precisely what I'd expected from reading that first quote.

Not the worst-case scenario of Yann LeCun, admittedly, but getting there.

Edit: Actually it's a bit worse than I'd expected. "Advocating for an international treaty is a call for violence" is just an embarrassing take.

My view on this:

Edit: Actually it's a bit worse than I'd expected. "Advocating for an international treaty is a call for violence" is just an embarrassing take.

The context is fairly critical here, and I broadly share iceman's perspective regarding the article's call for an international treaty:

https://www.lesswrong.com/posts/gZkYvA6suQJthvj4E/#SCnooEfdjJaQHS4TB

Which is that while he didn't explicitly call for violence, he did argue for very drastic actions, and, contra others, treaties are not usually enforced with nuclear war, for obvious reasons.

Eliezer did not call for enforcement via nuclear war. He said that:

Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs.

and later clarified:

My TIME piece did not suggest nuclear strikes against countries that refuse to sign on to a global agreement against large AI training runs. It said that, if a non-signatory country is building a datacenter that might kill everyone on Earth, you should be willing to preemptively destroy that datacenter; the intended reading is that you should do this even if the non-signatory country is a nuclear power and even if they try to threaten nuclear retaliation for the strike. This is what is meant by "Make it explicit... that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs."

Violently enforcing certain particularly important principles on non-signatories is entirely within the norm, the ban on international-trade-endangering piracy being the prime example. The idea that applying a qualitatively similar standard to AI risk is "deranged" is only valid if you don't believe that catastrophic AI risk is real: if you don't believe that a rogue superintelligence somewhere in North Korea can hurt you in the US.

Anyway, that's not even the crux here. The crux is that there's a day-and-night difference between:

  1. Arguing that the geopolitical entities, whose monopoly on violence we already accept as foundational to the social contract keeping our civilization together, should add another point to the list of things they enforce.
  2. Arguing for violating the social contract to carry out unilateral violent action.

The difference between those is far beyond the fine points of whether it's okay or not to enforce an international treaty on nukes-having non-signatories. And the worst falsehoods being spread are those misrepresenting (1) as (2), and Joshua Achiam's quotes above likewise seem to fail to see the difference between the two (though I don't think he's doing that maliciously).

Violently enforcing certain particularly important principles on non-signatories is entirely within the norm

True as stated, though I'm not aware of examples of this being enforced on non-signatories which are nuclear powers. This is just quantitatively riskier, not a notable change in norms.

And I agree this seems clearly non-outrageous if we replace AGI training datacenter with something like "wet lab credibly planning on developing a virus that would kill literally every human if it escaped where this wet lab isn't planning on taking any serious precautions against a lab leak".

I think this is a disagreement I have:

This is just quantitatively riskier, not a notable escalation.

I think this is actually a fairly extreme escalation compared to how states deal with threats, and whether or not you think this is good policy, it is a very, very important escalation step. This is evidence for it being a very extreme escalation:

True as stated, though I'm not aware of examples of this being enforced on non-signatories which are nuclear powers. 

Sorry, I actually meant "not a notable change in norms". I agree that it is quantitatively much costlier from the perspective of the US.

Hm, a disagreement I have is that the norms around escalating to nuclear war are way, way stronger than basically any other norm in international relations, and there's a reason basically all states do their actions through proxies/covert wars, because the taboo on nuclear war is way stronger than a lot of other norms in the international setting.

I agree that it is norms violating for a country to respond to a conventional strike on their datacenter with a nuclear response. This is different from the statement that the conventional strike from the other country is norms violating.

I don't think conventional strikes on military assets of nuclear power are that norms violating. In fact, recently, a huge number of missiles were launched at a nuclear power. (Iran launched them at Israel which is widely believed to have nukes.)

(I believe the US has never directly launched a strike on a nuclear power within their territory. However, it has indirectly assisted with such strikes in the Russia Ukraine war and participated in proxy wars.)

Yes, I was solely referring to nuclear strikes.

The point that would justify an airstrike isn't violation of a treaty, but posing an immediate and grave risk to the international community. The treaty is only the precondition that makes effective and coordinated action possible.

I do think that most current alignment work does also advance capabilities, but that the distinction should mostly be ‘clear’ even if there are important shades of gray and you cannot precisely define a separator.


For a large subclass of work, I actually disagree with this claim; I think one area where alignment work wouldn't accelerate capabilities is work on reducing deceptively-aligned-AI takeover risk / sharp-left-turn risk, for the reasons that @RogerDearnaley stated here:

https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1#jaqADvsmqmqMKRimH

So that’s the thing, right? Fictional worlds like this almost never actually make sense on closer examination. The incentives and options and actions are based on the plot and the need to tell human stories rather than following good in-universe logic. That the worlds in question are almost always highly fragile, the worlds really should blow up, and the AIs ensure the humans work out okay in some sense ‘because of reasons’ because it feels right to a human writer and their sense of morality or something rather than that this would happen.

I worry this kind of perspective is load bearing, given he thinks it is ‘correctly predicting the future,’ the idea that ‘prosaic alignment’ will result in sufficiently strong pushes to doing some common sense morality style not harming of the humans, despite all the competitive dynamics among AIs and various other things they value and grow to value, that things turn out fine by default, in worlds that to me seem past their point of no return and infinitely doomed unless you think the AIs themselves have value.

I think one key crux here is whether you think partial alignment successes are possible. If AI alignment ends up binary, I would agree that Her is basically an incoherent description of the future.

If AI alignment ends up more of a continuous quantity such that reasonably large partial successes are probable, then Her is more coherent as a plausible future.

I admit I tend to favor the continuous side of the debate more than the discrete side, and tend to see discreteness as an abstraction over the actual continuous outcomes.

To be clear, Her is not a totally coherent story, but I do think that relatively minor changes are enough to make it more coherent.

On this:


Without getting into any technical arguments, it seems rather absurd to suggest the set of goals that imply undesired subgoals within plausibly desired goal space would have measure zero? I don’t see how this survives contact with common sense or relation to human experience or typical human situations.

The answer is that probabilities get weird with infinite spaces: the probability of an event can be 0%, even if the event is possible, and the probability of an event can be 100%, even if there are outcomes on which the event does not happen.

One perfect example: if you pick a random real number, you have a 0% chance of getting any particular real number, no matter which one it is, and if you check whether you got an irrational number, you have a 100% chance of getting an irrational real number, but that doesn't mean you can't sample a rational number.
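
To state that precisely (a standard measure-theory fact, nothing specific to the AI argument):

```latex
X \sim \mathrm{Uniform}(0,1): \qquad
P(X = q) = 0 \ \text{for every fixed } q, \qquad
P(X \in \mathbb{Q}) = \sum_{q \in \mathbb{Q} \cap (0,1)} P(X = q) = 0.
```

So P(X is irrational) = 1, even though every rational in (0,1) remains a possible outcome; probability zero is not the same as impossible.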

I don't think that it is actually measure 0, but that's for different reasons.

My response to Joshua Achiam's point on whether we get instrumental convergence is that o1 exists.

While it often doesn't want to use its instrumental convergence, it does have the capability to exhibit basic instrumental convergence, and in AI scaling, noisy capabilities foreshadow robust and powerful capabilities.

Indeed, I would go further. The market wants the AIs to be given as much freedom and authority as possible, to send them out to compete for resources and influence generally, for various ultimate purposes. And the outcome of those clashes and various selection effects and resource competitions, by default, dooming us.

I think it depends on whether we are assessing it using X-risk standards or GCR standards.

On the one hand, I absolutely believe that we could well get into GCR territory, for something like this reason below:

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher

But I don't think it will get into X-risk territory, because I both expect AI agents to be more controlled than the molochian story tells us, and also because I expect some winners who will be able to go to the stars and have their values imprint on the stars.

Sixteen years ago, Eliezer Yudkowsky wrote the Value Theory sequence, going deep on questions like what makes things have value to us, how to reconcile when different entities (human or otherwise) have very different values, and so on. If you’re interested in these questions, this is a great place to start. I have often tried to emphasize that I continue to believe that Value is Fragile, whereas many who don’t believe in existential risk think value is not fragile.

It is a highly understood problem among our crowd that ‘human values’ is both very complex and a terrifyingly hard thing to pin down, and that people very strongly disagree about what they value.

Also it is a terrifyingly easy thing to screw up accidentally, and we have often said that this is one of the important ways to build AGI and lose – that you choose a close and well-meaning but incorrect specification of values, or your chosen words get interpreted that way, or someone tries to get the AGI to find those values by SGD or other search and it gets a well-meaning but incorrect specification.

I actually disagree with both Joshua Achiam and a lot of LWers on the assumption that value is complicated (at least in its generative structure). I also don't think this failure mode, conditional on no deceptive/proxy misalignment, is likely to happen at all, because capabilities people are incentivized to make our specifications better, and I think the function of human values is actually fairly simple once we abstract away the complicated mechanisms.

I agree with the literal claim in the Value is Fragile post that this is probably true with reasonably high probability, but not so much with the other conclusions often held in that cluster.

Also, I think one very large meta-crux that underlies a lot of other cruxes is whether the best strategy is to do locally optimal things and iterate, closest to the model-free RL approaches, or whether you should just build an explicit model and optimize in the model, closest to model-based RL/search approaches.

Cf these comments:

https://www.lesswrong.com/posts/gmHiwafywFo33euGz/aligned-foundation-models-don-t-imply-aligned-systems#xoe3e8uJp4xnx4okn

https://www.lesswrong.com/posts/8A6wXarDpr6ckMmTn/another-argument-against-utility-centric-alignment-paradigms#8KMQEYCbyQoLccbPa

I think this is a meta-crux underlying all other disagreements in strategies for AI alignment, and I think unfortunately we will probably have to solve this the hard way by executing both strategies in reality in parallel, to see which one wins out as AI progresses.

Edit:

I have to say something about this:

That is how every law and every treaty or agreement works, and indeed the only way they can work.

No, this isn't how every treaty works.

Treaty violations amongst states are usually not enforced with the threat of war, for fairly obvious reasons. Instead, they are settled some other way.