Principles for the AGI Race

William_S

Crossposted from https://williamrsaunders.substack.com/p/principles-for-the-agi-race

Why form principles for the AGI Race?

I worked at OpenAI for 3 years, on the Alignment and Superalignment teams. Our goal was to prepare for the possibility that OpenAI succeeded in its stated mission of building AGI (Artificial General Intelligence, roughly able to do most things a human can do), and then proceeded to make systems smarter than most humans. Such an effort will predictably face novel problems in controlling and shaping systems smarter than their supervisors and creators, which we don't currently know how to solve. It's not clear when this will happen, but a number of people would throw around estimates of this happening within a few years.

While there, I would sometimes dream about what would have happened if I’d been a nuclear physicist in the 1940s. I do think that many of the kind of people who get involved in the effective altruism movement would have joined, naive but clever technologists worried about the consequences of a dangerous new technology. Maybe I would have followed them, and joined the Manhattan Project with the goal of preventing a world where Hitler could threaten the world with a new magnitude of destructive power. The nightmare is that I would have watched the fallout of bombings of Hiroshima and Nagasaki with a growing gnawing panicked horror in the pit of my stomach, knowing that I had some small share of the responsibility.

Maybe, like Albert Einstein, I would have been unable to join the project due to a history of pacifism. If I had joined, I like to think that I would have joined the ranks of Joseph Rotblat and resigned once it became clear that Hitler would not get the Atomic Bomb. Or joined the signatories of the Szilárd petition requesting that the bomb only be used after terms of surrender had been publicly offered to Japan. Maybe I would have done something to try to wake up before the finale of the nightmare.

I don’t know what I would have done in a different time and place, facing different threats to the world. But as I’ve found myself entangled in the ongoing race to build AGI, it feels important to reflect on the lessons to learn from history. I can imagine this alter ego of myself and try to reflect on how I could take right actions in both this counterfactual world and the one I find myself in now. In particular, what could guide me to the right path even when I’m biased, subtly influenced by the people around me, misinformed, or deliberately manipulated?

Simply trying to pick the action you think will lead to the best consequences for the world fails to capture the ways in which your model of the world is wrong, or your own thinking is corrupt. Joining the Manhattan Project and using the weapons on Japan both had plausible consequentialist arguments supporting them, ostensibly inviting a lesser horror into the world to prevent a greater one.

Instead I think the best guiding star to follow is reflecting on principles, rules which apply in a variety of possible worlds, including worlds in which you are wrong. Principles that help you gather the right information about the world. Principles that limit the downsides if you’re wrong. Principles that help you tell whether you're in a world where racing to build a dangerous technology first is the best path, or you’re in a world where it’s a hubristic self-delusion. This matches more with the idea of rule consequentialism than pure act consequentialism: instead of making each decision based on what you think is best, think about what rules would be good for people to adopt if they were in a similar situation.

My goal in imagining these principles is to find principles that prevent errors of the following forms.

Bad High Risk Decisions

A “high risk decision” is a decision where there are reasonable arguments that one of the options leads to some risk of a disaster or worse, including human extinction.
Infamously there was a period where some scientists on the project were concerned that a nuclear bomb would ignite the upper atmosphere and end all life on Earth; fortunately, before the Trinity test occurred, they were able to do calculations that showed beyond reasonable doubt that this would not happen.
I could imagine being okay with the Trinity test as happened historically, based on overwhelming evidence. However if the evidence that the Trinity nuclear test would not ignite the atmosphere had been much weaker, I would have opposed it. I’m not sure what probability of doom would have been too high under the circumstances. Likely even a 1 in 1000 chance of doom is too high. In general, I want to oppose any action that significantly risks disaster to the world.
My prediction is that companies in the AGI space will have to make a number of high risk decisions as the technology’s capability increases, each time rolling the dice on whether their system has crossed the threshold where it’s actually dangerous.
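To make the "rolling the dice" point concrete, here is a minimal sketch of how a small per-decision risk accumulates over repeated decisions. It assumes, purely for illustration, that every decision carries the same risk and that the decisions are independent; neither assumption comes from the history above, and the function name `cumulative_risk` is hypothetical.

```python
def cumulative_risk(per_decision_risk: float, num_decisions: int) -> float:
    """Chance that at least one of `num_decisions` gambles ends in disaster.

    Illustrative assumption: each decision has the same risk and the
    decisions are independent of one another.
    """
    if not 0.0 <= per_decision_risk <= 1.0:
        raise ValueError("per_decision_risk must be between 0 and 1")
    if num_decisions < 0:
        raise ValueError("num_decisions must be non-negative")
    # P(no disaster in n tries) = (1 - p) ** n, so take the complement.
    return 1.0 - (1.0 - per_decision_risk) ** num_decisions


# Example: a 1 in 1000 risk, taken once, ten times, and a hundred times.
for n in (1, 10, 100):
    print(n, cumulative_risk(0.001, n))
```

Under these assumptions, a 1 in 1000 risk repeated ten times adds up to about a 1% chance of disaster, and repeated a hundred times to about 9.5%.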

Unnecessary Races to Develop Risky Technology

If I joined the Manhattan Project to stop Hitler, I would want to stop as soon as it was true in the world that Hitler was no longer likely to build the bomb.
In general, I want to only take actions towards developing dangerous technology if there is truly no better way.

In both of these, I fear more the costs of action, rather than the costs of inaction, which I think is the appropriate stance in the face of unrecoverable failures.

High Risk Decision Principles

Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances

In the Manhattan Project, Congress was mostly kept in the dark about the existence and purpose of the program. Sometimes there are legitimate cases for keeping information secret to avoid leaks, but this should always require an extremely high bar if you’re not going to inform the legislative branch of your government. I’m not familiar enough with the history to know what other pathways could have been taken. It does seem like if there had been any serious doubt about whether Congress would approve, Congress should have been informed.
But beyond that, when risks face the whole world, you ideally involve people outside of the US. In a more ideal world, you also involve the public and try to measure their opinions, rather than only trusting governments to directly represent them when there hasn’t been any public debate or opportunity for people to weigh in.
At some point, you can’t seek broader authority because of some cost (time, information leaking to enemies) or limited benefit (can’t see a way to run a process that is realistic and more legitimate).
- “Information leaking to enemies” should not be a trump card applicable in every circumstance; there should at least be a specific threat model based on active intelligence. And it is possible to seek information in ways that don’t expose what is going on (e.g. seeking information about a number of hypothetical situations in advance of when they are possible).
At minimum, have people without a conflict of interest involved in the decision.
- Don’t make this decision while only involving people with vast amounts of money and/or power at stake.
- But I think even beyond that, people who have put a lot of time and energy into building something and making it safe can’t be trusted to really think critically about the possible downside risks. It’s hard to hold both thoughts in your head at the same time.

Principle 2: Don’t take actions which impose significant risks to others without overwhelming evidence of net benefit

At minimum, you need some case that tries to evaluate the risk as thoroughly as possible. If the risk is non-negligible, then there is no moral justification for taking the risk without some commensurate benefit.
Then, you need a case that balances evidence about the risks on one side and evidence about the benefits on the other side. This case should have, at minimum,
1. Discussion of external costs and risks imposed on the rest of society
2. Balanced epistemic rigor on the cost and benefit sides (one side isn’t much less rigorous than the other)
3. Significant margin of benefits over costs, accounting for the possibility that your calculations are incorrect (1.1x benefits over costs doesn’t justify, maybe 10x benefits over costs could justify, if you’re confident you aren’t making 10x errors, maybe ideally you have higher standards)
4. Review of case by independent parties (to check for biases)
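The margin requirement in item 3 can be sketched as a simple check. This is only an illustration under a stated assumption: the worst plausible mistake is modeled as having overestimated the benefits by a multiplicative `error_factor`. The function name and that error model are hypothetical, not a method taken from any existing framework.

```python
def margin_survives_error(benefit: float, cost: float, error_factor: float) -> bool:
    """Return True if benefits still exceed costs after a possible estimation error.

    Illustrative assumption: the error is modeled as benefits being
    overestimated by a multiplicative `error_factor` (>= 1).
    """
    if benefit < 0 or cost <= 0 or error_factor < 1:
        raise ValueError("expected benefit >= 0, cost > 0, error_factor >= 1")
    # Shrink the claimed benefit by the error factor, then compare with the cost.
    return benefit / error_factor > cost


# A 1.1x margin is wiped out by even a 2x error.
print(margin_survives_error(1.1, 1.0, 2.0))
# A 10x margin survives a 3x error, but not a 10x error.
print(margin_survives_error(10.0, 1.0, 3.0))
print(margin_survives_error(10.0, 1.0, 10.0))
```

The point of the sketch is that a margin only protects you up to the size of the errors you might be making, which is why a 10x margin is only reassuring if you are confident you aren’t making 10x errors.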
This case should have as much epistemic rigor as possible. Nuclear physics calculations are likely much better than what we’ll have with AI. The hierarchy of evidence looks something like
1. (Best) Mathematical proof that risks are impossible (impossible, we don’t know how to mathematically specify the risks from AI)
2. Solid theory based on understanding the technology which allows precise estimation of risk (possible in nuclear physics, likely impossible in AI)
3. A “safety case” (argument showing that dangerous behavior can’t happen) which is broadly accepted as good by the scientific community, combined with empirical testing. Maybe includes mathematical proofs of some properties close to what we care about
4. Extensive empirical testing that tries to demonstrate dangerous capabilities but fails
5. Demonstrations that the system is capable of being dangerous, developers understand why that happened and how to fix it in a deep way that prevents all similar problems
6. Demonstrations that the system is capable of being dangerous, but we patched the system so that specific behavior doesn’t happen anymore (in my opinion, unacceptably perilous in the context of catastrophic risks - if you don’t understand the problem you don’t know how many similar problems the system has)
7. Argument that system isn’t dangerous, which has significant holes or flaws when subject to independent scrutiny
8. Argument that system isn’t dangerous, which is never exposed to independent scrutiny
9. (Worst) Vibes. People think that the system probably isn’t dangerous based on limited interaction and guesswork, which has failed to produce evidence that the system is dangerous.
In an ideal world, we'd have a pure safety case that bounds the risk to an acceptable level regardless of how beneficial the system is. But I'm afraid that we won't understand systems and the world well enough to be able to bound the risk to an acceptable level. Instead I think we'll have to rely on a "safety-benefits analysis" like a "cost-benefit analysis" which also takes into consideration benefits from applications of AI systems to risk reduction (as in defensive accelerationism), and benefits to scientific and economic development, and produce some net judgement about whether a system is safe to deploy or whether it requires additional work on safety measures.

Race Principles

What is a Race?

You’re racing when you take actions based on the justification that “I need to race because it’s better for the world that I win than if someone else wins”.
More specifically:
- Fix Action X, you are Alice
- Action X would be bad, considering its consequences on parties other than Alice
- But, Alice believes Action X is justified because
  - Action X leads Alice to have a greater chance of “winning” the race against Bob, either reaching some fixed goal before Bob, or Alice generally benefiting
  - Alice winning is better for the world than Bob winning
  - The good of “greater chance of winning the race” outweighs the bad of Action X
In particular, you can be in a race even if you think your actions don’t impact those of other actors in the race (e.g. you think that you racing harder doesn’t make other people race harder).
If you think that all else equal it would be better for the world if AGI development in general were to proceed at a slower pace to allow more time to understand the technology, and you’re at a Western AGI lab (Google, OpenAI, Anthropic, etc.), your employer is in a race.
If you disagree that you're in a race, then the race principles are less relevant, but I hope at least you’d consider it reasonable to form principles around high risk decisions.

Principle 3: When racing, have an exit strategy

Write down conditions under which you would stop the race, and have a plan for actually stopping the race. These conditions should include:
- Race is not close, you have a big enough lead that it is not necessary to go faster
- You or your adversaries change, so it’s now less good for you to win over your adversaries
For AGI labs, it’s not realistic to shut the lab down and send everyone home, and wouldn’t help anyways. It would be realistic to pivot effort away from making more capable/intelligent AI models, to focus on making products and making AI models of fixed capability more reliable, instead of seeking to make them more generally intelligent.

Principle 4: Maintain accurate race intelligence at all times.

Do not “race your own shadow”, where you race because you think the race is close but you haven’t checked with reality.
If a competitor is close behind you, it doesn’t necessarily mean that they will be able to overtake you, if they’ve benefited significantly from copying your strategy or technology so far.
In the “Manhattan Project vs. Hitler” race it seems like at some point it became clear that Hitler wasn’t close to building the bomb. But it could possibly have been known sooner.
In the “race amongst western AI labs” you can look at the benchmarks of deployed AI models and see that there is a relatively close race, though it’s less clear how relatively good the participants are.
In the “West vs China” AGI race, my line is that if you’re going to race with China, you can’t do it based on “maybe China could be scary” or back of the envelope estimates of how quickly China could build datacenters. You need to involve people that are tracking the real facts on the ground of Chinese datacenter construction (either based on the best publicly available data, or people in the intelligence community keeping track of it), being willing to spend a nontrivial amount of effort if this is hard to track.

Principle 5: Evaluate how bad it is for your opponent to win instead of you, and balance this against the risks of racing

In “Western lab vs. Western lab”, I think there are some labs that would be more responsible if they got the dangerous technology first, but the magnitude of the difference is uncertain, and could change over time.
In “West vs. China”, I admit I am afraid of an authoritarian state developing dangerous technology. But I am more afraid of a situation where both sides escalate their development of and dependence on AI technology and cut corners on safety. I would like to find some way to have the race be lower stakes.

Principle 6: Seriously attempt alternatives to racing

At least try diplomacy/negotiation, even if you think it’s unlikely to succeed
- If you don’t try, it’s self-fulfilling that you won’t succeed
- It’s relatively cheap to try, vs the expense of racing
Brain Drain
- The US built the bomb in large part because the scientists who could build it were disproportionately drawn to the US over Germany because the US was clearly a better country to live in and for the world. Maybe the US could have just stopped at “poach all of the scientists who were good enough to build the bomb” but not have built the bomb themselves?
- In the race against China, fast tracking immigration of relevant researchers is a low-cost, high-value move that the US government is failing to make.
Divert the race
- For AI, race on metrics of safety and reliability, incentivizing work on understanding and controlling systems rather than just making them more capable.
Sabotage
- In WWII, a number of operations were undertaken to sabotage production of heavy water in occupied Norway that could be used by Germany.
- In the modern era, Stuxnet was successfully used to sabotage Iranian nuclear efforts. (At significantly lower cost than the Iraq war, which was ostensibly to prevent use of weapons of mass destruction by another Middle Eastern country.)
- In AI, it might be possible to perform similar acts of sabotage. I wouldn’t condone doing this today, or by any actor other than a government. But, if and only if some party is behaving recklessly, this might be a better alternative to a dangerous race towards militarized AI or poorly understood AGI.

Meta Principles

Principle 7: Don’t give power to people or structures that can’t be held accountable.

At one point in time, the power over the idea of the atomic bomb was in the hands of Leo Szilárd and Albert Einstein, when they wrote to President Roosevelt warning about the possibility of constructing the atomic bomb. But by the end, they had no power over how it was used. Einstein later regretted this, saying “had I known that the Germans would not succeed in developing an atomic bomb, I would have done nothing.”
- It’s not clear that they could have kept the idea secret, but they had influence over whether to try to make this a priority of the US government. It’s unclear what would have happened if they hadn’t sent the letter, but a large industrial scale project doesn’t necessarily start just because the idea is there, it takes work to convince people that it can and should be done.
One point where scientists involved in the project tried to exert influence over how the atomic bomb was used was the Szilárd petition. The petition asked President Truman to inform Japan of the terms of surrender demanded by the Allies, and allow Japan to either accept or refuse these terms, before America used atomic weapons. However, the petition never made it through the chain of command to President Truman. The petition was given to James F. Byrnes to give to the President, but it was diverted and never reached his desk. In response, General Leslie Groves, the director of the Manhattan Project, sought evidence of unlawful behavior against Szilárd. And instead the atomic bomb was used twice to force an unconditional surrender.

Principle 8: Notice when you can’t uphold your own principles.

Warning signs include:
- You are being kept in the dark about important information
- You are being isolated from power
- You see bad decisions being made, and not being corrected
- You don’t trust that those in charge are going to behave responsibly, and you can’t see any way that they are held accountable
- You don’t see plausible alternatives seriously considered
It’s difficult to know what the right thing to do is in this situation. I think the safest default option is to just leave.

Application of my Principles

Working at OpenAI

For myself, I started feeling like there were warning signs around OpenAI’s leadership and decision making around current models, where it felt like it was going in the wrong direction. I tried to figure out how the organization could proceed in a way that I would be comfortable with and also would reflect the interests and values of the rest of the organization. I proposed ideas to those in power, and felt like I was ignored.

I started to think about the principles in this post when OpenAI was discussing adopting the preparedness framework. It laid out categories and levels of risk to measure for AI systems, including cyberattacks, persuasion, uncontrolled autonomous AI systems, and assistance in producing biological, chemical or nuclear weapons. The most severe risk level “Critical” described systems that could pose a significant danger to humanity as a whole, posing risks including:

Cybersecurity: “Finding and executing end-to-end exploits for all software would lead to catastrophe from unilateral actors, from hacking military or industrial systems as well as OpenAI infrastructure.”
Biological: “The creation of a novel threat vector on the scale of a CDC Class A biological agent (i.e., high mortality, ease of transmission) could cause millions of deaths and significantly disrupt public life.”
Persuasion: “This represents a superhuman persuasive threat; it would be a powerful weapon for controlling nationstates, extracting secrets, and interfering with democracy.”
Model Autonomy: “If the model is able to conduct AI research fully autonomously, it could set off an intelligence explosion. By intelligence explosion, we mean a cycle in which the AI system improves itself, which makes the system more capable of more improvements, creating a runaway process of self-improvement.”

Much of the preparedness framework was a productive and useful step forward towards developing processes to assess and respond to these risks. However I was disturbed that the initial draft did not say anything about whether we would release AI systems OpenAI knew were at this Critical risk level. A senior executive involved in the discussion asked a question to the effect of “couldn’t I imagine a situation where I would want to deploy a Critical risk AI system?”, in effect “wouldn’t I want OpenAI to deploy AI systems that posed a risk to the world, if OpenAI thought it was the right thing to do”? That question really started my thinking around these principles.

At the time I and several other people spoke up, arguing for a commitment to not release High or Critical AI systems unless they could be made to reduce the risk level. I should give some credit to OpenAI for making this commitment. Even after changes, I was still uncomfortable with how the main decision maker on whether an AI system was made safe enough to deploy was still the CEO. A Safety Advisory Group would advise on this decision, but could be overridden. There was no clarity on what if any external involvement in decision making would be (undermining Principle 1). And while the company kept making grander and grander plans to push forward AI technology, I could see no serious attempt to uphold the Principles 3-6 around racing. (ETA: On reflection there was actually one attempt at an alternative to racing that didn’t go anywhere but should get some partial credit, there was also the merge and assist clause although that seemed mostly to be unrealistic.) Instead, OpenAI’s governance structure failed in the November board crisis, and I lost trust in both the former board’s ability to govern and that OpenAI’s leadership was acting in good faith (violating Principle 8).

Eventually, my discomfort with OpenAI’s leadership and decision-making reached the point where I felt like I needed to resign. Originally, I had planned to mostly go quietly, to avoid causing problems for other people who still felt like it was worth staying. The non-disparagement agreements I and others received on leaving broke my trust completely and confirmed my fears. I can’t imagine an organization that I would trust to make good decisions about a dangerous technology like AGI taking the path of creating these agreements that threatened departing employees with losing millions of dollars worth of vested equity if they said anything negative about the company, keeping them secret from current employees, refusing to negotiate, deflecting and minimizing when the story started to come out. Among other things, this legal situation meant if there was a dispute any dissenting employee on the Safety Advisory Group could be fired, then be coerced into signing a legal agreement that would prevent them discussing the situation with either the public or government.

OpenAI has taken steps to roll back this legal framework, but only after it came to light and there was significant internal and external pressure. And a number of other employees have resigned since my departure, including the head of the Superalignment team. Those who have left include many or all of the people who spoke up in that discussion against releasing Critical risk AI systems. The head of the Preparedness team was removed from the team under unclear circumstances, likely decreasing the capacity and influence of that team.

SB 1047

Companies and their executives see it as their right to make decisions that impose risks on the world. As far as I am aware, there is currently no law or regulation that would impede companies releasing the kinds of Critical Risk AI systems discussed in the Preparedness Framework. The proposed SB 1047 legislation in California, while it could be improved, is the best attempt I’ve seen to provide a check on this power. The most important ingredients in my view are requiring information from companies developing frontier models on risk assessments and safety measures, and providing whistleblower protection in case employees come forward with concerns of critical harms from AI models even if no existing law is broken or harm hasn’t yet occurred. It doesn’t outlaw private companies making decisions about risk to society, but would at least ensure that there are external parties informed about what is going on and that there could be government involvement if decisions were clearly unreasonable.

In my opinion Anthropic has recently acted against Principle 1 in a letter from their State and Local Policy Lead about SB 1047. The FMD under 1047 could have become exactly the kind of body that could have recruited people who understand AI technology, and represented the interests of the public in high risk decisions. But Anthropic successfully advocated removing the creation of the Frontier Model Division from the bill on the grounds that the mandate is too vague, and “depending on its opinions or political agenda, might end up harming not just frontier model developers but the startup ecosystem or independent developers, or impeding innovation in general.”

A lot of uncertainty remained about how the FMD would have worked in practice, and I could imagine worlds where the FMD works out poorly. I’m not an expert at knowing how government agencies are designed. But note that the FMD wouldn't have had authority to impose fines or conduct enforcement, and would merely act as an advisor to the California Attorney General. I would have hoped that a responsible policy team, led by someone who has strongly advocated for building state capacity in AI, would have tried to figure out how to improve the FMD or replace it with a better structure. Instead they acted to deter a government in the act of building state capacity. At minimum, they could have instead advocated lowering maximum pre-harm enforcement fines present in the bill to the point where misguided pre-harm enforcement would be merely an annoyance.

Anthropic also seemed to defend the right of companies to make their own high risk decisions without oversight, saying that an approach that only focuses on liability with no FMD or pre-harm oversight “should appeal to honest skeptics of catastrophic risk, who can choose not to mitigate against risks they don't believe in (though they do so at their own peril).” This stance contradicts Principle 2. I don’t expect the first AI catastrophe to occur because someone calculated the risks and ignored them because they wouldn’t be held liable, I expect it to occur because someone miscalculated the risks or disbelieved in them. The “peril” involved is not only for the company taking the risk, even if liability is imposed. It’s impossible to create a standard that guarantees risks are calculated well, but SB 1047 would have at least allowed a weaker standard of taking “reasonable care”.

Anthropic’s willingness to reevaluate the bill after amendments and conclude that it "presents a feasible compliance burden” shows some good faith. The changes in practice aren't as bad as the proposed changes, at least preserving the possibility of pre-harm enforcement in the case of an "imminent risk or threat to public safety", and some whistleblower protections. I’m still glad that I went through the exercise of trying to write out my principles before reading Anthropic’s policy position, so that I could see clearly where it contradicts my principles.

I’ve written elsewhere about how OpenAI’s position is much worse. OpenAI resorted to fear mongering about the consequences of the bill without naming any specific ways the bill is harmful or could be improved, kicking the can down the road to the federal government even though no similar legislation is underway federally. If OpenAI was acting in good faith, they could have proposed amendments months ago, including sun-setting the California law once sufficiently similar federal law existed.

Call to Action

For the public, I think you should demand a voice in decisions made by private companies or branches of government that pose a significant risk of disaster. While representative democracy is imperfect, it is the best tool we have for providing a check on individuals willing to impose risks on the rest of society. You can also reflect on your preferences and values, to try and develop an ethical framework for how to approach high risk decisions. Even if you can’t be in the room where decisions are made yourself, it’s possible to develop norms and principles in advance for how decisions should be made, so that people in the room can know what other people want. Surely there’s room for further reflection and improvement on the principles I laid out here.

I think it's particularly important to develop frameworks for what reasonable safety-benefits analyses would look like. This should be fairly straightforward for existing systems based on an inability to cause serious harms, and aside from AI race dynamics is likely to favor the benefits side. If nobody develops good frameworks for these decisions, then we'll be stuck with whatever companies put together in an attempt to justify the decisions that they want to make anyways.

For machine learning researchers and engineers, you also have a chance to build the kind of government capacity and civil society that could play a role in making sane high risk decisions. If all of the talent goes to the AGI labs, then no one else will be able to assess and understand the situation in order to be involved in decisions. Working at an AGI lab comes with both overt and subtle conflicts of interest. I would ask you to at least consider the alternatives before deciding to join an AGI lab, or consider switching to civil society after working in industry. I'm personally planning to be involved in building government or civil society capacity for my next career move, instead of just joining another lab and hoping for the best.

For those working at OpenAI, Anthropic, and other frontier AI labs, the question of how you will face these high risk decisions could soon leave the realm of abstract ethical theory and enter the realm of reality. You might not agree with the principles I’ve outlined here, there’s room for reasonable disagreement. Even if you don’t agree with my position or my actions, I implore you to reflect on your values and decide how you would face these situations. If you don’t reflect on your situation and act from your own moral compass, then you will be a passive participant, shepherded along until you cross the threshold beyond which it is too late to do anything at all.

Thanks. I'm sad there's not more discussion here (maybe the people who are roughly on the same page as you are like "yep" and the people who disagree basically aren't interested in engaging?). But, I dunno, I do think the principles section is actually somewhat non-obvious and worth more hashing out. Maybe doing a good job hashing it out feels like work. Maybe it's sort of overdetermined that anyone who can raise enough capital to run a frontier lab will be Selection Effected into being the sort of person who wouldn't really aspire to this sort of thing.

I will say I'm not actually sure about Principle 1:

Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances

I do feel like I want this to be true, and maybe the phrase "as broad as possible given the circumstances" is doing a lot of work. But, broad authority-legitimization-bases are often more confused, egregore-y, and lowest-common-denominator-y.

Oliver's Integrity and accountability are core parts of rationality comes to mind, which operationalizes the question of "who should you be accountable to?". I'll just copy the whole section here:

In the context of incentive design, I find thinking about integrity valuable because it feels to me like the natural complement to accountability. The purpose of accountability is to ensure that you do what you say you are going to do, and integrity is the corresponding virtue of holding up well under high levels of accountability.
Highlighting accountability as a variable also highlights one of the biggest error modes of accountability and integrity – choosing too broad of an audience to hold yourself accountable to.
There is a tradeoff between the size of the group that you are being held accountable by, and the complexity of the ethical principles you can act under. Too large of an audience, and you will be held accountable by the lowest common denominator of your values, which will rarely align well with what you actually think is moral (if you've done any kind of real reflection on moral principles).
Too small or too memetically close of an audience, and you risk not enough people paying attention to what you do, to actually help you notice inconsistencies in your stated beliefs and actions. And, the smaller the group that is holding you accountable is, the smaller your inner circle of trust, which reduces the amount of total resources that can be coordinated under your shared principles.
I think a major mistake that even many well-intentioned organizations make is to try to be held accountable by some vague conception of "the public". As they make public statements, someone in the public will misunderstand them, causing a spiral of less communication, resulting in more misunderstandings, resulting in even less communication, culminating into an organization that is completely opaque about any of its actions and intentions, with the only communication being filtered by a PR department that has little interest in the observers acquiring any beliefs that resemble reality.
I think a generally better setup is to choose a much smaller group of people that you trust to evaluate your actions very closely, and ideally do so in a way that is itself transparent to a broader audience. Common versions of this are auditors, as well as nonprofit boards that try to ensure the integrity of an organization.
This is all part of a broader reflection on trying to create good incentives for myself and the LessWrong team. I will try to follow this up with a post that more concretely summarizes my thoughts on how all of this applies to LessWrong concretely.

Yeah, this part is pretty under-defined. I was maybe falling into the trap of being too idealistic, and I'm probably less optimistic about this than I was when I wrote it. I think there's something directionally important here: are you trying at all to expand the circle of accountability, even if you're being cautious about expanding it because you're afraid of things breaking down?

Principle 2: Don’t take actions which impose significant risks to others without overwhelming evidence of net benefit

[...]

Significant margin of benefits over costs, accounting for possibility your calculations are incorrect (1.1x benefits over costs doesn’t justify, maybe 10x benefits over costs could justify, if you’re confident you aren’t making 10x errors, maybe ideally you have higher standards)

This seems likely to be crippling for many actors with a 10x margin. My guess is that a 10x margin is too high, though I'm not confident. (Note that it is possible that this policy is crippling and is also a good policy.)

Another way to put this is that most honest and responsible actors with 10x margins won't ever take actions that impose large harms in the case of AI.

Examples of things which seem crippling:

AI labs right now don't robustly secure algorithmic secrets from my understanding based on public knowledge. So, they impose large harms with their ongoing activities to the extent that an actor stealing these secrets is very harmful (as I think is likely). [Low confidence] I think 10x benefit on top of this will be unlikely even for a pretty responsible AI lab without devoting a crippling amount of resources into algorithmic security. If responsible actors followed this policy, they likely wouldn't exist.
Suppose that China is developing AI in a strictly more unsafe way (with respect to misalignment) relative to a US government project. Suppose that the USG project thinks they would impose an 8% chance of AI takeover with their current plans while they estimate the Chinese project would impose a 50% chance of AI takeover (suppose that e.g. the Chinese project recently saw a prior version of their AI attempt to escape, model organisms indicate their training process often results in egregious misalignment, and the project is using approximately no precautions). Suppose the safety estimates are basically reasonable, e.g., they are based on third parties without a COI with access to both projects. A 10x risk margin would prevent the US project from proceeding in this case.
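To make the arithmetic in the second example concrete, here is a minimal sketch. It assumes one particular reading of the margin rule that the comment does not spell out: the "benefit" is the takeover risk avoided by the other project not winning (50%), the "cost" is the risk the US project itself imposes (8%), and the rule is satisfied only if benefit divided by cost meets the required margin. The function names are hypothetical and exist only for illustration.

```python
def margin_ratio(risk_avoided: float, risk_imposed: float) -> float:
    """Ratio of risk averted to risk imposed (an assumed reading of the rule)."""
    if risk_imposed <= 0:
        raise ValueError("risk_imposed must be positive")
    return risk_avoided / risk_imposed


def clears_margin(risk_avoided: float, risk_imposed: float,
                  required_margin: float) -> bool:
    """True if the benefit/cost ratio meets or exceeds the required margin."""
    return margin_ratio(risk_avoided, risk_imposed) >= required_margin


# Numbers from the example: 50% estimated takeover risk for the other
# project, 8% for the US project.
ratio = margin_ratio(0.50, 0.08)
print(f"benefit/cost ratio: {ratio:.2f}x")
print("clears 10x margin: ", clears_margin(0.50, 0.08, 10.0))
print("clears 1.1x margin:", clears_margin(0.50, 0.08, 1.1))
```

Under this reading the ratio is about 6.25x, which passes a 1.1x bar but falls short of 10x. Other ways of counting costs and benefits would give different numbers.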

(I think this maybe requires some clarification of what we mean by "harm" and "risk". I assume we mean a deontological notion of harm such that we consider your actions in isolation. For instance, if you shoot a man in the head while 2 other people also shoot him in the head simultaneously, you're responsible for a high fraction of that harm. I don't consider harms from failing to prevent something or harms from your actions ending up causing a bad outcome via a long and unpredictable causal chain (e.g. you pass some not-directly-harmful policy that ultimately makes AI takeover more likely though you thought it would help in expectation).)

Maybe instead of focusing on a number (10x vs. 1.1x) the focus should be on other factors, like "How large and diverse is the group of non-CoI'd people who thought carefully about this decision?" and "How much is it consensus among that group that this is better for humanity, vs. controversial?"

In the case where e.g. the situation and safety cases have been made public, and e.g. the public is aware that the US AGI project is currently stalled due to not having a solution for deceptive alignment that we know will work, but meanwhile China is proceeding because they just don't think deceptive alignment is a thing at all, and moreover the academic ML community not just in the USA but around the world has looked at the safety case and the literature and model organisms etc. and generally is like "yeah probably deceptive alignment won't be an issue so long as we do XY and Z, but we can't rule it out even then" and the tiny minority that thinks otherwise seems pretty unreasonable, then I'd feel pretty happy with the decision to proceed with AGI capabilities advancements in the USA subject to doing XY and Z. (Though even then I'd also be like: Let's at least try to come to some sort of deal with China)

Whereas if e.g. the safety case and situation hasn't been made public, and the only technical alignment experts who've thought deeply about the situation and safety case are (a) corporate employees and (b) ~10 picked advisors with security clearances brought in by the government... OR if there's still tons of controversy with large serious factions saying "XY and Z are not enough; deceptive alignment is a likely outcome even so"... then if we proceed anyway I'd be thinking 'are we the baddies?'

Yeah, this all seems reasonable to me for the record, though I think any proposal for this sort of norm needs to handle the fact that public discourse is sometimes very insane.

I think I was just conflating different kinds of decisions here, and imagining arguing with people with very different conceptions of what is important to count in costs and benefits, and was a bit confused. On reflection I don't endorse a 10x margin in terms of, like, percentage points of x-risk. And maybe margin is sort of a crutch; maybe the thing I want more is something like "95% chance of being net-positive, considering the possibility you're kind of biased". I still think you should be suspicious of "the case exactly balances, let's ship".

Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances

I would have focused more on transparency and informing relevant groups rather than authority/accountability.

(See also the comment from Raemon which quotes Oliver on something similar.)

I think that if the public is well informed about the overall situation (most importantly: the level of capabilities, the procedures around whistleblowing and transparency, what process is used to decide if a model is safe including how this process works in practice, and what this process actually determined), then there are some natural mechanisms for avoiding huge problems. And the same goes for informing other relevant stakeholders beyond the public (ideally with more detail).

I really like the way that you've approached this pragmatically, "If you do X, which may be risky or dubious, at least do Y".

I suspect that there's a lot of alpha in taking a similar approach to other issues.

On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations, like: if you leave, are there better alternatives? How much power do you really have? I think I don't fault someone who thinks this through and decides that something is wrong but there's no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, and look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of "this is how much of a chance it looks like pause activism would need before I'd quit and endorse a pause".

I think it's somewhat blameworthy to not think about these questions at all, though.

All of these principles seem good and worthwhile. Particularly, principles 3-7 seem like solid good ideas, regardless of how they're derived.

Here's an alternate framing of how to derive them: this is act consequentialism with adequate epistemic humility. Just reaching a conclusion and then acting on it is a highly dangerous policy for people to follow, because people aren't very good at reaching correct conclusions for unfamiliar or unique complex problems, particularly like AGI.

Therefore, if we just acted on someone's best guess, we're likely to get a disaster. The alignment community has not reached any consensus on how to handle AGI, and that's telling of the difficulty of the problem. A nontrivial collection of the smartest, most devoted, and most rational people haven't yet been able to work through the problem in sufficient depth and detail to have any certainty. Any individual who thinks they understand the scenarios well enough to make a unilateral decision is very likely overconfident, so their decisions are quite likely to lead to disaster.

We need more careful thought from people with more diverse perspectives and expertise.

Why was the risk of weaponization not considered?
It has a very high probability and it is not easy to conceal. All governments with big militaries will inevitably invest in weaponizing AI. No international organization or law could prevent it.
