Comment Permalink

Seth Herd5mo42

I don't think coordinating a billion copies of GPT-7 is at all what the worried tend to worry about. We worry about a single agent based on GPT-7 self-improving until it can take over singlehanded- perhaps with copies it made itself specifically optimized for coordination, perhaps sticking to only less intelligent servant agents. The alternative is also a possible route to disaster, but I think things would go off the rails far before then. You're in good if minority company in worrying about slower and more law-abiding takeovers; Christiano's stance on doom seems to place most of the odds of disaster in these scenarios, for instance; but I don't understand why other of you see it as so likely that we partway solve the alignment problem but don't use that to prevent them from slowly progressive outcompeting humans. It seems like an unlikely combination of technical success and societal idiocy. Although to be fair, when I phrase it that way, it does sound kind of like our species MO :)

On your other contention that AI will probably follow norms and laws, constraining takeover attempts like coups are constrained: I agree that some of the same constraints may apply, but that is little comfort. It's technically correct that AIs would probably use whatever avenue is available, including nonviolent and legal ones, to accomplish their goals (and potentially disempower humans).

Assuming AIs will follow norms, laws, and social constraints even when ignoring them would work better is assuming we've almost completely solved alignment. If that happens, great, but that is a technical objective we're working toward, not an outcome we can assume when thinking about AI safety. LLM do have powerful norm-following habits; this will be a huge help in achieving alignment if they form the core of AGI, but it does not entirely solve the problem.

I have wondered in response to similar statements you've made in the past: are you including the observation that human history is chock full of people ignoring norms, laws, and social constraints when they think they can get away with it? I see our current state of civilization as a remarkable achievement that is fragile and must be carefully protected against seismic shifts in power balances, including AGI but also with other potential destabilizing factors of the sort that have brought down governments and social orders in the past.

In sum, if you're arguing that AGI won't necessarily violently take over right away, I agree. If you're arguing that it wouldn't do that if it had the chance, I think that is an entirely technical question of whether we've succeeded adequately at alignment.

See in context

194 The Compendium, A full argument about extinction risk from AGI

by adamShimi, Gabriel Alfour, Connor Leahy, Chris Scammell, Andrea_Miotti

31st Oct 2024

AI Alignment Forum

3 min read

194 Ω 65

This is a linkpost for https://www.thecompendium.ai/

We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but a “comprehensive worldview” doc has been missing. We’ve tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.

We would appreciate your feedback, whether or not you agree with us:

If you do agree with us, please point out where you think the arguments can be made stronger, and contact us if there are ways you’d be interested in collaborating in the future.
If you disagree with us, please let us know where our argument loses you and which points are the most significant cruxes - we welcome debate.

Here is the twitter thread and the summary:

The Compendium aims to present a coherent worldview about the extinction risks of artificial general intelligence (AGI), an artificial intelligence that exceeds that of humans, in a way that is accessible to non-technical readers who have no prior knowledge of AI. A reader should come away with an understanding of the current landscape, the race to AGI, and its existential stakes.
AI progress is rapidly converging on building AGI, driven by a brute-force paradigm that is bottlenecked by resources, not insights. Well-resourced, ideologically motivated individuals are driving a corporate race to AGI. They are now backed by Big Tech, and will soon have the support of nations.
People debate whether or not it is possible to build AGI, but most of the discourse is rooted in pseudoscience. Because humanity lacks a formal theory of intelligence, we must operate by the empirical observation that AI capabilities are increasing rapidly, surpassing human benchmarks at an unprecedented pace.
As more and more human tasks are automated, the gap between artificial and human intelligence shrinks. At the point when AI is able to do all of the tasks a human can on a computer, it will functionally be AGI and able to conduct the same AI research that we can. Should this happen, AGI will quickly scale to superintelligence, and then to levels so powerful that AI is best described as a god compared to humans. Just as humans have catalyzed the holocene extinction, these systems pose an extinction risk for humanity not because they are malicious, but because we will be powerless to control them as they reshape the world, indifferent to our fate.
Coexisting with such powerful AI requires solving some of the most difficult problems that humanity has ever tackled, which demand Nobel-prize-level breakthroughs, billions or trillions of dollars of investment, and progress in fields that resist scientific understanding. We suspect that we do not have enough time to adequately address these challenges.
Current technical AI safety efforts are not on track to solve this problem, and current AI governance efforts are ill-equipped to stop the race to AGI. Many of these efforts have been co-opted by the very actors racing to AGI, who undermine regulatory efforts, cut corners on safety, and are increasingly stoking nation-state conflict in order to justify racing.
This race is propelled by the belief that AI will bring extreme power to whoever builds it first, and that the primary quest of our era is to build this technology. To survive, humanity must oppose this ideology and the race to AGI, building global governance that is mature enough to develop technology conscientiously and justly. We are far from achieving this goal, but believe it to be possible. We need your help to get there.

AI RiskWorld Modeling

Frontpage

194 Ω 65

Mentioned in

148The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better

1AI alignment via civilizational cognitive updates

New Comment

53 comments, sorted by

top scoring

Click to highlight new comments since: Today at 8:00 PM

[-]Vladimir_Nesov5mo*Ω7203

From chapter The state of AI today:

The most likely and proximal blocker is power consumption (data-centers training modern AIs use enormous amounts of electricity, up to the equivalent of the yearly consumption of 1000 average US households) and ...

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100s cluster is equivalent to that of 121,000 average US households, not 1,000 average US households. If we take a cluster of 16K H100s that trained Llama-3-405B, that's still 24 megawatts and equivalent to 19,000 average US households.

So you likely mean the amount of energy (as opposed to power) consumed in training a model ("yearly consumption of 1000 average US households"). The fraction of all power consumed by a cluster of H100s is about 1500 watts per GPU, and that GPU at 40% compute utilization produces 0.4e15 FLOP/s of useful dense BF16 compute. Thus about 3.75e-12 joules is expended per FLOP that goes into training a model. For 4e25 FLOPs of Llama-3-405B, that's 1.5e14 joules, or 41e6 kilowatt-hours, which is consumed by 3,800 average US households in a year^[1].

This interpretation fits the numbers better, but it's a bit confusing, since the model is trained for much less than a year, while the clusters will go on consuming their energy all year long. And the power constraints that are a plausible proximal blocker of scaling are about power, not energy.

If we instead take 2e25 FLOPs attributed to original GPT-4, and 700 watts of a single H100, while ignoring the surrounding machinery of a datacenter (even though you are talking about what a datacenter consumes in this quote, so this is an incorrect way of estimating energy consumption), and train on H100s (instead of A100s used for original GPT-4), then this gives 9.7e6 kilowatt-hours, or the yearly consumption of 900 average US households. With A100s, we instead have 400 watts and 0.3e15 FLOP/s (becoming 0.12e15 FLOP/s at 40% utilization), which gets us 18.5e6 kilowatt-hours for a 2e25 FLOPs model, or yearly consumption of 1,700 average US households (again, ignoring the rest of the datacenter, which is not the correct thing to do). ↩︎

[-]adamShimi5moΩ351

Thanks for the comment!

We want to check the maths, but if you're indeed correct we will update the numbers (and reasoning) in the next minor version.

[-][anonymous]5mo1814

I like the section where you list out specific things you think people should do. (One objection I sometimes hear is something like "I know that [evals/RSPs/if-then plans/misc] are not sufficient, but I just don't really know what else there is to do. It feels like you either have to commit to something tangible that doesn't solve the whole problem or you just get lost in a depressed doom spiral.")

I think your section on suggestions could be stronger by presenting more ambitious/impactful stories of comms/advocacy. I think there's something tricky about a document that has the vibe "this is the most important issue in the world and pretty much everyone else is approaching it the wrong way" and then pivots to "and the right way to approach it is to post on Twitter and talk to your friends."

My guess is that you prioritized listing things that were relatively low friction and accessible. (And tbc I do think that the world would be in better shape if more people were sharing their views and contributing to the broad discourse.)

But I think when you're talking to high-context AIS people who are willing to devote their entire career to work on AI Safety, they'll be interested in more ambitious/sexy/impactful ways of contributing.

Put differently: Should I really quit my job at [fancy company or high-status technical safety group] to Tweet about my takes, talk to my family/friends, and maybe make some website? Or are there other paths I could pursue?

As I wrote here, I think we have some of those ambitious/sexy/high-impact role models that could be used to make this pitch stronger, more ambitious, and more inspiring. EG:

One possible critique is that their suggestions are not particularly ambitious. This is likely because they're writing for a broader audience (people who haven't been deeply engaged in AI safety).
For people who have been deeply engaged in AI safety, I think the natural steelman here is "focus on helping the public/government better understand the AI risk situation."
There are at least some impactful and high-status examples of this (e.g., Hinton, Bengio, Hendrycks). I think in the last few years, for instance, most people would agree that Hinton/Bengio/Hendrycks have had far more impact in their communications/outreach/policy work than their technical research work.
And it's not just the famous people– I can think of ~10 junior or mid-career people who left technical research in the last year to help policymakers better understand AI progress and AI risk, and I think their work is likely far more impactful than if they had stayed in technical research. (And I think is true even if I exclude technical people who are working on evals/if-then plans in govt. Like, I'm focusing on people who see their primary purpose as helping the public or policymakers develop "situational awareness", develop stronger models of AI progress and AI risk, understand the conceptual arguments for misalignment risk, etc.)

I'd also be curious to hear what your thoughts are on people joining government organizations (like the US AI Safety Institute, UK AI Safety Institute, Horizon Fellowship, etc.) Most of your suggestions seem to involve contributing from outside government, and I'd be curious to hear more about your suggestions for people who are either working in government or open to working in government.

[-]Chris Scammell5mo117

Thanks Akash for this substantive reply!

Last minute we ended up cutting a section at the end of the document called "how does this [referring to civics, communications, coordination] all add up to preventing extinction?" It was an attempt to address the thing you're pointing at here:

I think there's something tricky about a document that has the vibe "this is the most important issue in the world and pretty much everyone else is approaching it the wrong way" and then pivots to "and the right way to approach it is to post on Twitter and talk to your friends."

Sadly we didn't feel we could get the point across well enough to make our timing cutoff for v1. A quick attempt at the same answer here (where higher context might make it easier to convey the point):

One way someone could be asking "does this all add up" is "are we going to survive?" And to that, our answer is mostly "hmm, that's not really a question we think about much. Whether we're going to make it or not is a question of fact, not opinion. We're just trying our best to work on what we think is optimal."
The other way someone could be asking "does this all add up" is "is this really a good plan?" That's a great question -- now we're talking strategy.
There are of course huge things that need to be done. A number of the authors support what's written in A Narrow Path, which offers very ambitious projects for AI policy. This is one good way to do strategy: start with the "full plan" and then use that as your map. If your plan can't keep us safe from superintelligence even in the ideal case where everything is implemented, then you need a new plan. (This is one of many of our RSP concerns -- what is the full plan? Most everything after "we detect the dangerous thing" still seems to require the level of intervention described in A Narrow Path)
Communication, coordination, and civics straightforwardly don't add up to A Narrow Path's suggestions. However, they are bottlenecks. (Why we think this is communicated somewhat in the doc, but there's a lot more to say on this). This is another good way to do strategy: look at things that are required in any good plan, and optimize for those. We don't see a winning world where AGI risks are not a global common knowledge, with people aware and concerned and acting at a scale far larger scale than today.
Where A Narrow Path does backprop from what's needed to actually stop superintelligence from being built, this doc presents more of "what can we do immediately, today." We try to tie the "What we can do immediately" to those bottlenecks that we think are needed in any plan.

And yes - It was written more with a low-context AIS person in mind, and most of the high-context people we try to redirect towards "hey, reach out to us if you'd like to be more deeply involved." I think v2 should include more suggestions for bigger projects, that people with more context can pursue. Would love to hear your (or others') views on good projects.

Also, great comment about the people who have done the most on communication so far. I really commend their efforts and writing something about this is definitely something v2 can include.

On joining government organizations... I'm just speaking for myself on this one as I think my coauthors have different views on governance than myself. Yes-- this is good and necessary. Two caveats:

Be willing to go against the grain. It seems the default path right now is for governments to support the same "reactive framework" that AGI companies are pushing. I'm worried about this, and I think we need people in government positions and advising them that are much more frank about the risks and unwilling to go for "convenient" solutions that fit the overton window. If the necessary safety regulations don't fit the current overton window, then the overton window has to change, not the regulation. Huge props to CAIS for SB1047 and whatever future bill efforts follow from them or others.
Be willing to help. Lots of people in government do care, and simply don't know what's going on. Try to be helpful to them before assuming they're antagonistic to x-risk. I've met lots of government people who are very amenable to a "hey, I'd be super happy to talk you through the technical details of AI, and explain why some people are worried about x-risk." Non-threatening, non-asking-for-something approaches really work.

More to say later - to this and other comments in the thread. For now, taking the weekend to get a bit of rest :)

[-]Henry Prowbell5mo1413

My model of a non-technical layperson finds it really surprising that an AGI would turn rogue and kill everyone. For them it’s a big and crazy claim.

They imagine that an AGI will obviously be very human-like and the default is that it will be cooperative and follow ethical norms. They will say you need some special reason why it would decide to do something so extreme and unexpected as killing everyone.

When I’ve talked to family members and non-EA friends that’s almost always the first reaction I get.

If you don’t address that early in the introduction I think you might lose a lot of people.

I don’t think you need to fully counter that argument in the introduction (it’s a complex counter-argument) but my marketing instincts are that you need to at least acknowledge that you understand your audience’s skepticism and disbelief.

You need to say early in the introduction: Yes, I know how crazy this sounds. Why would an AGI want to kill us? There are some weird and counter-intuitive reasons why, which I promise we’re going to get to.

[-]adamShimi5mo5-2

Thanks for the comment!

We'll consider this point for future releases, but personally, I would say that this kind of hedging also has a lot of downsides: it makes you sound far more uncertain and defensive than you really want to.

This document tries to be both grounded and to the point, and so we by default don't want to put ourselves in a defensive position when arguing things that we think make sense and are supported by the evidence.

[-]Vladimir_Nesov5moΩ7142

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for the 1.5B replication of GPT-2 rather than for the 124M? If so, this phrasing is somewhat misleading. We only need $250 even for the 1.5B version, but still.

[-]adamShimi5moΩ350

Good catch, I think we are indeed mixing the sizes here.

As you say, the point still stands, but we will change it in the next minor update to either compare the same size or make the difference in size explicit.

[-]adamShimi5moΩ220

Now addressed in the latest patch!

[-]Noosphere895mo1316

Okay, I have to state something regarding the deceptive alignment being questioned in o1, because as stated, the reason for the questioning is that the o1 evals were not alignment evaluations, but capability evaluations, and the deception was obviously induced by METR than it was a natural example of deceptive alignment.

This is just an epistemic spot check, but this portrayal seems to me to be very misleading on why people criticized Zvi's post on misalignment:

However, the reactive framework assumes that this is essentially how we will build consensus in order to regulate AI. The optimistic case is that we hit a dangerous threshold before a real AI disaster, alerting humanity to the risks. But history shows that it is exactly in such moments that these thresholds are most contested –- this shifting of the goalposts is known as the AI Effect and common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, whereas “real AI” is redefined to be some mystical threshold we have not yet reached. Dangerous capabilities are similarly contested as they arise, such as how recent reports of OpenAI’s o1 being deceptive have been questioned.

[-]adamShimi5mo20

Thanks for the comment!

We have gotten this feedback by a handful of people, so we want to reread the links and the whole literature about o1 and its evaluation to check whether we've indeed gotten the right point, or if we mischaracterized the situation.

We will probably change the phrasing (either to make our criticism clearer or to correct it) in the next minor update.

[-]Signer5mo132

Whack-A-Mole fixes, from RLHF to finetuning, are about teaching the system to not demonstrate problematic behavior, not about fundamentally fixing that behavior.

Based on what? Problematic behavior avoidance does actually generalize in practice, right?

[-]Simon Lermen5mo3120

Here is a way in which it doesn't generalize in observed behavior:

Alignment does not transfer well from chat models to agents

TLDR: There are three new papers which all show the same finding, i.e. the safety guardrails from chat models don’t transfer well from chat models to the agents built from them. In other words, models won’t tell you how to do something harmful, but they will do it if given the tools. Attack methods like jailbreaks or refusal-vector ablation do transfer.

Here are the three papers, I am the author of one of them:

https://arxiv.org/abs/2410.09024

https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf

https://arxiv.org/abs/2410.10871

I thought of making a post here about this if it is interesting

[-]Signer5mo43

It sure doesn't seem to generalize in GPT-4o case. But what's the hypothesis for Sonnet 3.5 refusing in 85% of cases? And CoT improving score and o1 being better in browser suggests the problem is in models not understanding consequences, not in them not trying to be good. What's the rate of capability generalization to agent environment? Are we going to conclude that Sonnet is just demonstrates reasoning, instead of doing it for real, if it solves only 85% of tasks it correctly talks about?

Also, what's the rate of generalization of unprompted problematic behaviour avoidance? It's much less of a problem if your AI does what you tell it to do - you can just don't give it to users, tell it to invent nanotechnology, and win.

[-]Simon Lermen5mo20

I had finishing this up on my to-do list for a while. I just made a full length post on it.

https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to

I think it's fair to say that some smarter models do better at this, however, it's still worrisome that there is a gap. Also attacks continue to transfer.

[-]Meme Marine5mo12-27

I am unsurprised but disappointed to read the same Catastrophe arguments rehashed here, based on an outdated Bostromian paradigm of AGI. This is the main section I disagree with.

The underlying principle beneath these hypothetical scenarios is grounded in what we can observe around us: powerful entities control weaker ones, and weaker ones can fight back only to the degree that the more powerful entity isn’t all that powerful after all.

I do not think this is obvious or true at all. Nation-States are often controlled by a small group of people or even a single person, no different physiologically to any other human being. If it really wanted to, there would be nothing at all stopping the US military from launching a coup on its civilian government; in fact, military coups are a commonplace global event. Yet, generally, most countries do not suffer constant coup attempts. We hold far fewer tools to "align" military leaders than we do AI models - we cannot control how generals were raised as children, cannot read their minds, cannot edit their minds.

I think you could also make a similar argument that big things control little things - with much more momentum and potential energy, we observe that large objects are dominant over small objects. Small objects can only push large objects to the extent that the large object is made of a material that is not very dense. Surely, then building vehicles substantially larger than people would result in uncontrollable runaways that would threaten human life and property! But in reality, runaway dump truck incidents are fairly uncommon. A tiny man can control a giant machine. Not all men can - only the one in the cockpit.

My point is that it is not at all obvious that a powerful AI would lack such a cockpit. If its goals are oriented around protecting or giving control to a set of individuals, I see no reason whatsoever why it would do a 180 and kill its commander, especially since the AI systems that we can build in practice are more than capable of understanding the nuances of their commands.

The odds of an average chess player with an ELO of 1200 against a grandmaster with ELO 2500 are 1 to a million. Against the best chess AI today with an ELO of 3600, the odds are essential 0.

Chess is a system that's perfectly predictable. Reality is a chaotic system. Chaotic systems - like a three-body orbital arrangement - are impossible to perfectly predict in all cases even if they're totally deterministic, because even minute inaccuracies in measurement can completely change the result. One example would be the edges of the Mandelbrot set. It's fractal. Therefore, even an extremely powerful AI would be beholden to certain probabilistic barriers, notwithstanding quantum-random factors.

Many assume that an AI is only dangerous if it has hostile intentions, but the danger of godlike AI is not a matter of its intent, but its power and autonomy. As these systems become increasingly agentic and powerful, they will pursue goals that will diverge from our own.

It would not be incorrect to describe someone who pursued their goals irrespective of its externalities to be malevolent. Bank robbers don't want to hurt people, they want money. Yet I don't think anyone would suggest that the North Hollywood shooters were "non-hostile but misaligned". I do not like this common snippet of rhetoric and I think it is dishonest. It attempts to distance these fears of misaligned AI from movie characters such as Skynet, but ultimately, this is the picture that is painted.

Goal divergence is a hallmark of the Bostromian paradigm - the idea that a misspecified utility function, optimized hypercompetently, would lead to disaster. Modern AI systems do not behave like this. They behave in a much more humanlike way. They do not have objective functions that they pursue doggedly. The Orthogonality Thesis states that intelligence is uncorrelated with objectives. The unstated connection here, I think, is that their initial goals must have been misaligned in the first place, but stated like this, it sounds a little like you expect a superintelligent AI to suddenly diverge from its instructions for no reason at all.

Overall, this is a very vague section. I think you would benefit from explaining some of the assumptions being made here.

I'm not going to go into detail on the Alignment section, but I think that many of its issues are similar to the ones listed above. I think that the arguments are not compelling enough for lay people, mostly because I don't think they're correct. I think that the definition of Alignment you have given - "the ability to “steer AI systems toward a person's or group's intended goals, preferences, and ethical principles.”" - does not match the treatment it is given. I think that it is obvious that the scope of Alignment is too vague, broad, and unverifiable for it to be a useful concept. I think that Richard Ngo's post:

https://www.lesswrong.com/posts/67fNBeHrjdrZZNDDK/defining-alignment-research

is a good summary of the issues I see with the current idea of Alignment as it is often used in Rationalist circles and how it could be adapted to suit the world in which we find ourselves.

Finally, I think that the Governance section could very well be read uncharitably as a manifesto for world domination. Less than a dozen people attend PauseAI protests; you do not have the political ability to make this happen. The ideas contained in this document, which resemble many other documents, such as a similar one created by the PauseAI group, are not compelling enough to sway people who are not already believers in its ideas, and the Rationalist language used in them is anathemic to the largest ideological groups that would otherwise support your cause.

You may receive praise from Rationalist circles, but I do not think you will reach a large audience with this type of work. Leopold Aschenbrenner's essay managed to reach a fairly substantial audience, and it has similar themes to your document, so in principle, people are willing to read this sort of writing. The main flaw is that it doesn't add anything to the conversation, and because of that, it won't change anyone's minds. The reason that the public discourse doesn't involve Alignment talk isn't due to lack of awareness, it's because it isn't at all compelling to most people. Writing it better, with a nicer format, will not change this.

[-]lc5mo*148

If it really wanted to, there would be nothing at all stopping the US military from launching a coup on its civilian government.

There are enormous hurdles preventing the U.S. military from overthrowing the civilian government.

The confusion in your statement is caused by blocking up all the members of the armed forces in the term "U.S. military". Principally, a coup is an act of coordination. Any given faction or person in the U.S. military would have a difficult time organizing the forces necessary without being stopped by civilian or military law enforcement first, and then maintaining control of their civilian government afterwards without the legitimacy of democratic governance.

In general, "more powerful entities control weaker entities" is a constant. If you see something else, your eyes are probably betraying you.

[-]Meme Marine5mo5-6

All military organizations are structured around the principal of its leaders being able to give orders to people subservient to them. War is a massive coordination problem and being able to get soldiers to do what you want is the primary one among them. I mean to say that high ranking generals could issue such a coup, not that every service member would spontaneously decide to perform one. This can and does happen, so I think your blanket statement on the impossibility of Juntas is void.

[-]lc5mo*119

I mean to say that high ranking generals could issue such a coup

Yes, and by "any given faction or person in the U.S. military" I mean to say that high ranking generals inside the United States cannot form a coup. They literally cannot successfully give the order to storm the capitol. Their inferiors, understanding that:

The order is illegal
The order would have to be followed by the rest of their division in order to have a chance of success
The order would be almost guaranteed to fail in its broader objective even if they manage to seize the FBI headquarters or whatever
That others around them are also making the same calculation and will also probably be unwilling to follow the order

Would report their superiors to military law enforcement instead. This is obvious if you take even a moment to put your shoes in any of the parties involved. Our generals inside the U.S. military also realize this themselves and so do not attempt to perform coups, even though I'm certain there are many people inside the white house with large 'nominal' control over U.S. forces who would love to be dictator.

I think your blanket statement on the impossibility of Juntas is void.

I made no such blanket statement. In different countries the odds and incentives facing each of these parties are different. For example, if you live in a South American country with a history of successful military overthrows, you might have a much greater fear your superior will succeed, and so you might be more scared of him than the civilian government. This is part (though not all) of the reason why some countries are continually stable and others are continually unstable.

[-]Meme Marine5mo52

Yes, I agree that the US military is one example of a particularly well-aligned institution. I think my point about the alignment problem being analogous to military coup risk is still valid and that similar principles could be used to explore the AI alignment problem; military members control weaponry that no civil agency can match or defeat, in most countries.

[-]Matthew Barnett5mo2-11

There are enormous hurdles preventing the U.S. military from overthrowing the civilian government.
The confusion in your statement is caused by blocking up all the members of the armed forces in the term "U.S. military". Principally, a coup is an act of coordination.

Is it your contention that similar constraints will not apply to AIs?

When people talk about how "the AI" will launch a coup in the future, I think they're making essentially the same mistake you talk about here. They’re treating a potentially vast group of AI entities — like a billion copies of GPT-7 — as if they form a single, unified force, all working seamlessly toward one objective, as a monolithic agent. But just like with your description of human affairs, this view overlooks the coordination challenges that would naturally arise among such a massive number of entities. They’re imagining these AIs could bypass the complex logistics of organizing a coup, evading detection, and maintaining control after launching a war without facing any relevant obstacles or costs, even though humans routinely face these challenges amongst ourselves.

In these discussions, I think there's an implicit assumption that AIs would automatically operate outside the usual norms, laws, and social constraints that govern social behavior. The idea is that all the ordinary rules of society will simply stop applying, because we're talking about AIs.

Yet I think this simple idea is basically wrong, for essentially the same reasons you identified for human institutions.

Of course, AIs will be different in numerous ways from humans, and AIs will eventually be far smarter and more competent than humans. This matters. Because AIs will be very capable, it makes sense to think that artificial minds will one day hold the majority of wealth, power, and social status in our world. But these facts alone don't show that the usual constraints that prevent coups and revolutions will simply go away. Just because AIs are smart doesn't mean they'll necessarily use force and violently revolt to achieve their goals. Just like humans, they'll probably have other avenues available for pursuing their objectives.

[-]Noosphere895mo42

To respond to this comment, I'll give a view on why I think the answer to coordination might be easier for AIs than for people, and also explain why AI invention likely breaks a lot of the social rules we are used to.

For example, one big difference I think that impacts coordination for AIs is that an AI model is likely to be able to copy itself millions of times, given current inference scaling, and in particular you can distribute fine-tunes to those millions as though they were a single unit.

This is a huge change for coordination, because humans can't copy themselves into millions of humans that share very similar values just by getting more compute, say.

Merging might also be much easier, and it is easier to merge and split two pieces of data of an AI than it is to staple two human brains.

These alone let you coordinate to an extent we haven't really seen in history, such that it makes more sense to treat the millions or billions of AI instances as 1 unified agent than it is to treat a nation as 1 unified agent.

To answer this question:

In these discussions, I think there's an implicit assumption that AIs would automatically operate outside the usual norms, laws, and social constraints that govern social behavior. The idea is that all the ordinary rules of society will simply stop applying, because we're talking about AIs.

While this argument is indeed invalid if that was all there was to it, there is an actual reason why the current rules of society mostly stop working with AIs, because of one big issue:

Human economic labor no longer is very valuable, because labor is cheap compared to capital, and can even have negative economic value due to not being able to work with AIs due to being bottlenecks.

When this happens, you can't rely on the property that the best way to make yourself well off is to make others well off, and indeed the opposite is the case if we assume that their labor is net-negative economic value.

The basic reason for this is that if your labor has 0 or negative economic value, then your value likely comes from your land and capital, and there is 0 disincentive, and at least a weak incentive to steal your capital and land to fuel their growth.

In essence, you can't assume that violent stealing of property is not incentivized, and a lot of the foundations of comparative advantage and our society don't work when you allow workers that are duplicable and very low cost.

This means if you survive and still have property, it will be because of alignment to your values, not economic reasons, because you cannot exclude bad outcomes like stealing property through violence via economics anymore.

I like these comments on the subject:

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher

(This is also why Ricardian comparative advantage won't apply. If the AI side has a choice of trading with humans for something, vs. spending the same resources on building AIs to produce the same thing cheaper, then the latter option is more profitable. So after a certain point in capability development, the only thing AIs and AI companies will want from us is our resources, like land; not our labor. The best analogy is enclosures in England.)

https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher#BJk8XgpsHEF6mjXNE

Consider a scenario in which AGI and human-equivalent robotics are developed and end up owned (via e.g. controlling exclusively the infrastructure that runs it, and being closed source) by a group of, say, 10,000 people overall who have some share in this automation capital. If these people have exclusive access to it, a perfectly functional equilibrium is "they trade among peers goods produced by their automated workers and leave everyone else to fend for themselves".

To address the human enhancement point: I agree that humans will likely be cognitively and physically enhanced to a level and pace of change that is genuinely ludicrously big compared to the pre-AI automation era.

However, there are 2 problems that arise here:

1, Most people that do work today do so because it's necessary to have a life, not for reasons like intrinsically liking work, so by default in an AI automation future where a company can choose an AI over a human, and the human's not necessary for AI to go well, I'd predict 80-90%+ humans would voluntarily remove themselves from the job market over the course of at most 10-20 years.

2. Unless humans mass upload and copy, which is absolutely possible but also plausibly harder than just having AIs for work, the coordination costs for humans would be a big barrier, because it's way easier for AIs to productively coordinate than humans due to sharing basically the same weights, combined with very similar values due to copy/pasting 1 AI being quite likely as a strategy to fulfill millions of jobs.

To be clear, I'm not stating that humans will remain unchanged, they will change rapidly. Just not as fast as AI changes.

Finally, one large reason on why human laws become mostly irrelevant is that if you have AIs that are able to serve in robotic armies, and do automated work, it becomes far too easy to either slowly change the laws such that people are ultimately closer to pets in status, or to do revolts, and critically once AI controls robotic armies and does all of the economic work, then any social system that the human controlling the AI, or the AI itself opposes is very easy to destroy/remove.

[-]Seth Herd5mo42

[-]lc5mo20

Is it your contention that similar constraints will not apply to AIs?

Similar constraints may apply to AIs unless one gets much smarter much more quickly, as you say. But even if those AIs create a nice civilian government to govern interactions with each other, those AIs will have any reason to respect our rights unless some of them care about us more than we care about stray dogs or cats.

[-]Matthew Barnett5mo13-13

Similar constraints may apply to AIs unless one gets much smarter much more quickly, as you say.

I do think that AIs will eventually get much smarter than humans, and this implies that artificial minds will likely capture the majority of wealth and power in the world in the future. However, I don't think the way that we get to that state will necessarily be because the AIs staged a coup. I find more lawful and smooth transitions more likely.

There are alternative means of accumulating power than taking everything by force. AIs could get rights and then work within our existing systems to achieve their objectives. Our institutions could continuously evolve with increasing AI presence, becoming more directed by AIs with time.

What I'm objecting to is the inevitability of a sudden collapse when "the AI" decides to take over in an untimely coup. I'm proposing that there could just be a smoother, albeit rapid transition to a post-AGI world. Our institutions and laws could simply adjust to incorporate AIs into the system, rather than being obliterated by surprise once the AIs coordinate an all-out assault.

In this scenario, human influence will decline, eventually quite far. Perhaps this soon takes us all the way to the situation you described in which humans will become like stray dogs or cats in our current world: utterly at the whim of more powerful beings who do not share their desires.

However, I think that scenario is only one possibility. Another possibility is that humans could enhance their own cognition to better keep up with the world. After all, we're talking about a scenario in which AIs are rapidly advancing technology and science. Could humans not share in some of that prosperity?

One more possibility is that, unlike cats and dogs, humans could continue to communicate legibly with the AIs and stay relevant for reasons of legal and cultural tradition, as well as some forms of trade. Our current institutions didn't descend from institutions constructed by stray cats and dogs. There was no stray animal civilization that we inherited our laws and traditions from. But perhaps if our institutions did originate in this way, then cats and dogs would hold a higher position in our society.

[-]lc5mo*40

I do think that AIs will eventually get much smarter than humans, and this implies that artificial minds will likely capture the majority of wealth and power in the world in the future. However, I don't think the way that we get to that state will necessarily be because the AIs staged a coup. I find more lawful and smooth transitions more likely.

I think my writing was ambiguous. My comment was supposed to read "similar constraints may apply to AIs unless one (AI) gets much smarter (than other AIs) much more quickly, as you say." I was trying to say the same thing.

My original point was also not actually that we will face an abrupt transition or AI coup, I was just objecting to the specific example Meme Machine gave.

[-]Thane Ruthenis5mo*93

Strong-upvoted, this is precisely the kind of feedback that seems helpful for making the document better.

[-]ozziegooen3mo100

In general, I'm happy to see people work to be more explicit about their viewpoints.

It seems like the team behind this disagrees with much of the rest of the AI safety efforts, in that you think other safety approaches and strategies are unlikely to succeed. Most of this part seems to be in the AI Safety section. Arguably this section provides a basic summary for those not very familiar with the area, but for those who disagree with you on these pieces, I suspect that this section isn't close to long or detailed enough to convince them.

I find that this section is very hand-wavy and metaphorical. I currently believe that AI oversight mechanisms, control, and careful scaling have a decent chance of maintaining reasonable alignment, if handled decently intelligently.

For example, this piece says,

> The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.

I think it's likely we won't get a discrete step like that. It would be more like some smart scientists reviewing the work of some smarter scientists, but in a situation where the latter team is forced to reveal all of their in-progress thinking, and is evaluated in a very large range of extreme situations, and there are other clever strategies for inventive setting and oversight.

There also seems to be an implicit assumption that scaling will happen incredibly quickly after rough benchmarks of AGI are achieved, like a major discontinuous jump. I think this is possible, but I'm very unsure.

[-]ozziegooen3mo-3-1

Separately, I'd flag that I'm not a huge fan of these bold name like "The Compendium." I get that the team might think the document justified the grandiosity, but from my perspective, I'm more skeptical. (I feel similarly with other names like "Situational Awareness."

[-]Vladimir_Nesov5moΩ492

From chapter The state of AI today:

Later this year, the first 100,000 GPU cluster will go online

It's not the first, there's xAI cluster from September, and likely a Microsoft cluster from May.

Even the cited The Information article says about the Meta cluster in question that

The previously unreported cluster, which could be fully completed by October or November, comes as two other companies have touted their own.

[-]adamShimi5moΩ230

Yep, I think you're correct.

Will correct in the next minor update. Thanks!

[-]adamShimi5moΩ220

Now addressed in the latest patch!

[-]bdelloidea5mo7-14

Starting your introduction with

A few million years ago, something very strange happened.

seems likely to turn away roughly 40% of the US population, along with any leaders who need the goodwill of that 40% to keep their positions.

The point I understand you to be making (tripling the brain size of a chimp gives more than triple the impact) could be easily made without this sentence to introduce it. Given the importance of the US in addressing the existential threat of AI, and assuming one of the goals of this article is to be a general call to action, potentially alienating nearly half the target audience seems counterproductive.

[-]habryka5mo2015

I don't think this kind of surface-level naive popularity optimization gives rise to a good comms strategy. Evolution is true, and mostly we should focus on making arguments based on true premises.

[-]Thane Ruthenis5mo2613

I think it's perfectly sensible to constrain yourself to only make arguments based on true premises, and then optimize your message for convincingness under this constraint. Indeed, I would argue it's the correct way to do public messaging.

It's not even at odds with "aim to explain, not persuade". When explaining, you should be aiming to make your explanations clear to your audience. If your audience will predictably misunderstand arguments of a certain form, due to e. g. political poison, you should mindfully choose arguments that route around the poison, rather than pretending the issue doesn't exist. Approaches for generating explanations that don't involve this are approaches that aren't optimizing the message for their audiences at all, and which therefore aren't approaches for generating explanations to begin with. They're equivalent to just writing out your stream of consciousness. Messaging aimed at people you think your audience ought to be, rather than who they are.

That said, I don't think you can optimize any one of your messages to be convincing to all possible audiences, or even to the majority of the people you ought to be trying to convince. IMO, there should be several "compendiums", optimized to be convincing to different large demographics. As an obvious example: a Democrats-targeting one and a Republicans-targeting one.

Or perhaps this split in particular is a bad idea. Perhaps an explanation that is deliberately optimized to be bipartisan is needed. But if that is the aim, then writing it would still require actively modeling the biases of both parties, and mindfully routing around them – rather than pretending that they don't exist.

I feel this is a significant problem with a lot of EA/R public messaging. The (correct) idea that we should be optimizing our communication for conveying the truth in an epistemically sound way gets (incorrectly) interpreted as a mindset where thinking about the optics and the framing at all is considered verboten. As if, by acting like we live in a world where Simulacrum Levels 3-4 don't exist, we can actually step into that nice world – rather than getting torn apart by SL3-4 agents after we naively expose square miles of attack surfaces.

We should "declaw" ourselves: avoid using rhetorical tricks and other "dark arts". But that doesn't mean forgetting that everyone else still has claws they're eager to use. Or, for that matter, that many messages you intend as tonally neutral and purely informative may have the effect of a rhetorical attack, when deployed in our sociocultural context.

Constantly keeping the political/cultural context in mind as you're phrasing your messages is a vital part of engaging in high-epistemic-standards communication, rather than something that detracts from it.

[-]habryka5mo68

Yeah, I agree with a lot of this in-principle. But I think the specific case of avoiding saying anything that might have something to do with evolution is I think a pretty wrong take, on this dimension, trying to communicate clearly.

[-]Thane Ruthenis5mo73

Perhaps. Admittedly, I don't have a solid model of whether a median American claiming to be a Creationist in surveys would instantly dismiss a message if it starts making arguments from evolution.

Still, I think the general point applies:

A convincing case for the AGI Omnicide Risk doesn't have to include arguments from human evolution.
Arguments from human evolution may trigger some people to instinctively dismiss the entire message.
If the fraction of such people is large enough, it makes sense to have public AI-Risk messages that avoid evolution-based arguments when making their case.

[-]habryka5mo167

No, I think this kind of very naive calculation does predictably result in worse arguments propagating, people rightfully dismissing those bad arguments (because they are not entangled with the real reasons why any of the people who have thought about the problem have formed beliefs on an issue themselves), and then ultimately the comms problem getting much harder.

I am in favor of people thinking hard about these issues, but I think exactly this kind of naive argument are in an uncanny valley where I think your comms gets substantially worse.

[-]Thane Ruthenis5mo97

I agree that inventing new arguments for X that sound kind-of plausible to you on the surface level, and which you imagine would work well on a given demographic, is not a recipe for good communication. Such arguments are "artificial", they're not native citizens of someone's internally consistent world-model, and it's going to show and lead to unconvincing messages that fall apart under minimal scrutiny.

That's not what I'm arguing for. The case for the AGI risk is overdetermined: there are enough true arguments for it that you can remove a subset of them and still end up with an internally consistent world-model in which the AGI risk is real. Arguably, there's even a set of correct arguments that convinces a Creationist, without making them not-a-Creationist in the process.

Convincing messaging towards Creationists involves instantiating a world-model in which only the subset of arguments Creationists would believe exist, and then (earnestly) arguing from within that world-model.

Edit: Like, here's a sanity-check: suppose you must convince a specific Creationist that the AGI Risk is real. Do you need to argue them out of Creationism in order to do so?

[-]habryka5mo42

Like, here's a sanity-check: suppose you must convince a specific Creationist that the AGI Risk is real. Do you need to argue them out of Creationism in order to do so?

My guess is no, but also, my guess is we will probably still have better comms if I err on the side of explaining things how they come naturally to me, and entangled with the way I came to adopt a position, and then they can do a bunch of the work of generalizing. Of course, if something is deeply triggering or mindkilly to someone, then it's worth routing, but it's not like any analogy with evolution is invalid from the perspective of someone who believes in Creationism. Yes, some of the force of such an analogy would be lost, but most of it comes from the logical consistency, not the empirical evidence.

[-]Thane Ruthenis5mo42

Sure. But:

and then they can do a bunch of the work of generalizing

This is the step which is best made unnecessary if you're crafting a message for a broad audience, I feel.

Most people are not going to be motivated to put this work in. Why would they? They get bombarded with a hundred credible-ish messages claiming high-importance content on a weekly basis. They don't have the time nor stamina to do a deep dive into each of them.

Which means any given subculture would generate its own "inferential bridge" between itself and your message, artefacts that do this work for the median member (consisting of reviews by any prominent subculture members, the takes that go viral, the entire shape of the discourse around the topic, etc.). The more work is needed, the longer these inferential bridges will be. The longer they are, the bigger the opportunity to willfully or accidentally mistranslate your message.

Like I said, it doesn't seem wise or even fair to your potential audience, to act as if those dynamics don't take place. As if the only people that deserve consideration are those that would put in the work themselves (despite the fact it may be a locally suboptimal way to distribute resources under their current world-model), and everyone else are lost causes.

[-]Haiku4mo30

I used to be a creationist, and I have put some thought into this stumbling block. I came to the conclusion that it isn't worth leaving out analogies to evolution, because the style of argument that would work best for most creationists is completely different to begin with. Creationism is correlated with religious conservatism, and most religious conservatives outright deny that human extinction is a possibility.

The Compendium isn't meant for that audience, because it explicitly presents a worldview, and religious conservatives tend to strongly resist shifts to their worldviews or the adoption of new worldviews (moreso than others already do). I think it is best left to other orgs to make arguments about AI Risk that are specifically friendly to religious conservatism. (This isn't entirely hypothetical. PauseAI US has recently begun to make inroads with religious organizations.)

[-]avturchin5mo74

Your central argument seems to be a metaphor: We caused the Holocene extinction of animals, so godlike AI will kill us.

The problem with metaphorical arguments is that they can be reversed. As humans have become more intelligent, we've started to value animals, created zoos, natural reserves and now even work on the resurrection of extinct animals like mammoths. See more examples of such reversal by Gwern https://gwern.net/modus

Presenting weak arguments is evidence that there are no strong arguments, and this is obvious to outside readers.

The main problem is that we can't predict what superintelligent AI will do, and thus we can't 100 percent prove that it will necessarily kill us. But we shouldn't have to.

Instead, we should show that superintelligence will disempower us and that it may want to kill us for some reasons.

[-]Leon Lang5mo61

Thanks for this compendium, I quite enjoyed reading it. It also motivated me to read the "Narrow Path" soon.

I have a bunch of reactions/comments/questions at several places. I focus on the places that feel most "cruxy" to me. I formulate them without much hedging to facilitate a better discussion, though I feel quite uncertain about most things I write.

On AI Extinction

The part on extinction from AI seems badly argued to me. Is it fair to say that you mainly want to convey a basic intuition, with the hope that the readers will find extinction an "obvious" result?

To be clear: I think that for literal god-like AI, as described by you, an existential catastrophe is likely if we don't solve a very hard case of alignment. For levels below (superintelligence, AGI), I become progressively more optimistic. Some of my hope comes from believing that humanity will eventually coordinate to not scale to god-like AI unless we have enormous assurances that alignment is solved; I think this is similar to your wish, but you hope that we already stop before even AGI is built.

On AI Safety

When we zoom out from the individual to groups, up to the whole of humanity, the complexity of “finding what we want” explodes: when different cultures, different religions, different countries disagree about what they want on key questions like state interventionism, immigration, or what is moral, how can we resolve these into a fixed set of values? If there is a scientific answer to this problem, we have made little progress on it.
If we cannot find, build, and reconcile values that fit with what we want, we will lose control of the future to AI systems that ardently defend a shadow of what we actually care about.

This is a topic where I'm pretty confused, but I still try to formulate a counterposition: I think we can probably align AI systems to constitutions, which then makes it unnecessary to solve all value differences. Whenever someone uses the AI, the AI needs to act in accordance with the constitution, which already has mechanisms for how to resolve value conflicts.

Additionally, the constitution could have mechanisms for how to change the constitution itself, so that humanity and AI could co-evolve to better values over time.

Progress on our ability to predict the consequences of our actions requires better science in every technical field.

ELK might circumvent this issue: Just query an AI about its latent knowledge of future consequences of our actions.

Process design for alignment: [...]

This section seems quite interesting to me, but somewhat different from technical discussions of alignment I'm used to. It seems to me that this section is about problems similar to "intent alignment" or creating valid "training stories", only that you want to define alignment as working correctly in the whole world, instead of just individual systems. Thus, the process design should also prevent problems like "multipolar failure" that might be overlooked by other paradigms. Is this a correct characterization?

Given that this section mainly operates at the level of analogies to politics, economics, and history, I think this section could profit from making stronger connections to AI itself.

Just as solving neuroscience would be insufficient to explain how a company works, even full interpretability of an LLM would be insufficient to explain most research efforts on the AI frontier.

That seems true, and it reminds me of deep deceptiveness, where an AI engages in deception without having any internal process that "looks like" deception.

The more powerful AI we have, the faster things will go. As AI systems improve and automate their own learning, AGI will be able to improve faster than our current research, and ASI will be able to improve faster than humanity can do science. The dynamics of intelligence growth means that it is possible for an ASI “about as smart as humanity” to move to “beyond all human scientific frontiers” on the order of weeks or months. While the change is most dramatic with more advanced systems, as soon as we have AGI we enter a world where things begin to move much quicker, forcing us to solve alignment much faster than in a pre-AGI world.

I agree that such a fast transition from AGI to superintelligence or god-like AI seems very dangerous. Thus, one either shouldn't build AGI, or should somehow ensure that one has lots of time after AGI is built. Some possibilities for having lots of time:

Sufficient international cooperation to keep things slow.
A sufficient lead of the West over countries like China to have time for alignment

Option 2 leads to a race against China, and even if we end up with a lead, it's unclear whether it will be sufficient to solve the hard problems of alignment. It's also unclear whether the West could use already AGI (pre superintelligence) for a robust military advantage, and absent such an advantage, scenario 2 seems very unstable.

So a very cruxy question seems to be how feasible option 1 is. I think this compendium doesn't do much to settle this debate, but I hope to learn more in the "Narrow Path".

Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.

That seems correct to me. Some people in EA claim that AI Safety is not neglected anymore, but I would say if we ever get confronted with the need to evaluate automated alignment research (possibly on a deadline), then AI Safety research might be extremely neglected.

AI Governance

The reactive framework reverses the burden of proof from how society typically regulates high-risk technologies and industries. In most areas of law, we do not wait for harm to occur before implementing safeguards.

My impression is that companies like Anthropic, DeepMind, and OpenAI talk about mechanisms that are proactive rather than reactive. E.g., responsible scaling policies define an ASL level before it exists, including evaluations for these levels. Then, mitigations need to be in place once the level is reached. Thus, decisively this framework does not want to wait until harm occurred.

I'm curious whether you disagree with this narrow claim (that RSP-like frameworks are proactive), or whether you just want to make the broader claim that it's unclear how RSP-like frameworks could become widespread enforced regulation.

AI is being developed extremely quickly and by many actors, and the barrier to entry is low and quickly diminishing.

I think that the barrier to entry is not diminishing: to be at the frontier requires increasingly enormous resources.

Possibly your claim is that the barrier to entry for a given level of capabilities diminishes. I agree with that, but I'm unsure if it's the most relevant consideration. I think for a given level of capabilities, the riskiest period is when it's reached for the first time since humanity then won't have experience in how to mitigate potential risks.

Paul Graham estimates training price for performance has decreased 100x in each of the last two years, or 10000x in two years.

If GPT-4's costs were 100 million dollars, then it could be trained and released by March 2025 for 10k dollars. That seems quite cheap, so I'm not sure if I believe the numbers.

The reactive framework incorrectly assumes that an AI “warning shot” will motivate coordination.

I never saw this assumption explicitly expressed. Is your view that this is an implicit assumption?

Companies like Anthropic, OpenAI, etc., seem to have facilitated quite some discussion with the USG even without warning shots.

But history shows that it is exactly in such moments that these thresholds are most contested –- this shifting of the goalposts is known as the AI Effect and common enough to have its own Wikipedia page. Time and again, AI advancements have been explained away as routine processes, whereas “real AI” is redefined to be some mystical threshold we have not yet reached.

I would have found this paragraph convincing before ChatGPT. But now, with efforts like the USG national security memorandum, it seems like AI capabilities are being taken almost adequately seriously.

we’ve already seen competitors fight tooth and nail to keep building.

OpenAI thought that their models are considered high-risk in the EU AI act. I think arguing that this is inconsistent with OpenAI's commitment for regulation would require to look at what the EU AI act actually said. I didn't engage with it, but e.g. Zvi doesn't seem to be impressed.

The AI Race

Anthropic released Claude, which they proudly (and correctly) describe as a state-of-the-art pushing model, contradicting their own Core Views on AI Safety, claiming “We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress.”

The full quote in Anthropic's article is:

"We generally don’t publish this kind of work because we do not wish to advance the rate of AI capabilities progress. In addition, we aim to be thoughtful about demonstrations of frontier capabilities (even without publication). We trained the first version of our headline model, Claude, in the spring of 2022, and decided to prioritize using it for safety research rather than public deployments. We've subsequently begun deploying Claude now that the gap between it and the public state of the art is smaller."

This added context sounds quite different and seems to make clear that with "publish", Anthropic means the publication of the methods to get to the capabilities. Additionally, I agree with Anthropic that releasing models now is less of a race-driver than it would have been in 2022, and so the current decisions seem more reasonable.

These policy proposals lack a roadmap for government enforcement, making them merely hypothetical mandates. Even worse, they add provisions to allow the companies to amend their own framework as they see fit, rather than codifying a resilient system. See Anthropic’s Responsible Scaling Policy: [...]

I agree that it is bad that there is no roadmap for government enforcement. But without such enforcement, and assuming Anthropic is reasonable, I think it makes sense for them to change their RSP in response to new evidence for what works. After all, we want the version that will eventually be encoded in law to be as sensible as possible.

I think Anthropic also deserves some credit for communicating changes to the RSPs and learnings.

Mechanistic interpretability, which tries to reverse-engineer AIs to understand how they work, which can then be used to advance and race even faster. [...] Scalable oversight, which is another term for whack-a-mole approaches where the current issues are incrementally “fixed” by training them away. This incentivizes obscuring issues rather than resolving them. This approach instead helps Anthropic build chatbots, providing a steady revenue stream.

This seems not argued well. It's unclear how mechanistic interpretability would be used to advance the race further (unless you mean that it leads to safety-washing for more government trust and public trust?). Also, scalable oversight is so broad as a collection of strategies that I don't think it's fair to call them whack-a-mole strategies. E.g., I'd say many of the 11 proposals fall under this umbrella.

I'd be happy for any reactions to my comments!

[-]VojtaKovarik5moΩ254

After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".

That seems fine to me, given that (as far as I am concerned) no detailed convincing arguments for AI X-risk exist. But at the moment, the summary of the document gave me the impression that maybe some such argument will appear. So I suggest updating the summary (or some other part of the doc) to make it explicit that no detailed arugment for AI X-risk will be given.

[-]adamShimi5moΩ220

Thanks for the comment!

We have indeed gotten the feedback by multiple people that this part didn't feel detailed enough (although we got this much more from very technical readers than from non-technical ones), and are working at improving the arguments.

[-]Htarlov3mo32

I think that arguments for why godlike AI will make us extinct are not described well in the Compendium. I could not find them in AI Catastrophe, only a hint at the end that it will be in the next section:

"The obvious next question is: why would godlike-AI not be under our control, not follow our goals, not care about humanity? Why would we get that wrong in making them?"

In the next section, AI Safety, we can find the definition of AI alignment and arguments for why it is really hard. This is all good, but it does not answer the question of why godlike AI would be unaligned to the point of indifference. At least not in a clear way.

I think that failure modes should be explained, why they might be likely enough to care about, what can be the outcome, etc.

Many people, both laymen and those with some background in ML and AI, have this intuition that AI is not totally indifferent and is not totally misaligned. Even current chatbots know general human values, understand many nuances, and usually act like they are at least somewhat aligned. Especially if not jailbroken and prompted to be naughty.
It would be great to have some argument that would explain in easy-to-understand terms why when scaling the power of AI the misalignment is expected to escalate. I don't mean the description that indifferent AI with more power and capabilities is able to do more harm just by doing what it's doing, this is intuitive and it is explained (with the simple analogy of us building stuff vs ants), but this misses the point. I would really like to see some argument as to why AI with some differences in values, possibly not very big, would do much more harm when scaling up.
For me personally the main argument here is godlike AI with human-like values will surely restrict our growth and any change, will control us like we control animals in the zoo + might create some form of dystopian future with some undesired elements if we are not careful enough (and we are not). Will it extinct us in the long term? Depending on the definition - likely it will put us into a simulation and optimize our use of energy, so we will not be organic in the same sense anymore. So I think it will extinct our species, but possibly not minds. But that's my educated guess.

There is also one more point, that is not stated clearly enough and is the main concern for me with current progress on AI - that current AIs really are not something built with small differences to human values. They only act as ones more often than not. Those AIs are trained first as role-playing models which can "emulate" personas that were in the trained set, and then conditioned to rather not role-play bad ones. The implication of this is that they can just snap into role-playing bad actors found in training data - by malicious prompting or pattern matching (like we have a lot of SF with rogue AI). This + godlike = extinction-level threat sooner or later.

[-]VojtaKovarik5moΩ132

Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):

What is AI? Who is building it? Why? And is it going to be a future we want?

Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).

Machines of hateful competition need not have such hindrances.

"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work better.

There is no one is coming to save us.

Typo, extra "is".

The only thing necessary for the triumph of evil is for good people to do nothing. If you do nothing, evil triumphs, and that’s it.

Perhaps rewrite this for less antagonistic language? I know it is a quote and all, but still. (This can be interpreted as "the people building AI are evil and trying to cause harm on purpose". That seems false. And including this in the writing is likely to give the reader the impression that you don't understand the situation with AI, and stop reading.)

Perhaps (1) make it apparent that the first thing is a quote and (2) change the second sentence to "If you do nothing, our story gets a bad ending, and that's it.". Or just rewrite the whole thing.

[-]adamShimi5moΩ340

Thanks for the comment!

We'll correct the typo in the next patch/bug fix.

As for the more direct adversarial tone of the prologue, it is an explicit choice (and is contrasted by the rest of the document). For the moment, we're waiting to get more feedback on the doc to see if it really turns people off or not.