Thanks. I'm sad there's not more discussion here (maybe the people who are roughly on the same page as you are like "yep" and the people who disagree basically aren't interested in engaging?). But, I dunno, I do think the principles section is actually somewhat non-obvious and worth more hashing out. Maybe doing a good job hashing it out feels like work. Maybe it's sort of overdetermined that anyone who can raise enough capital to run a frontier lab will be Selection Effected into being the sort of person who wouldn't really aspire to this sort of thing.
I will say I'm not actually sure about Principle 1:
Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances
I do feel like I want this to be true, and maybe the phrase "as broad as possible given the circumstances" is doing a lot of work. But, broad authority-legitimization-bases are often more confused, egregore-y, and lowest-common-denominator-y.
Oliver's "Integrity and accountability are core parts of rationality" comes to mind, which operationalizes the question of "who should you be accountable to?". I'll just copy the whole section here:
In the context of incentive design, I find thinking about integrity valuable because it feels to me like the natural complement to accountability. The purpose of accountability is to ensure that you do what you say you are going to do, and integrity is the corresponding virtue of holding up well under high levels of accountability.
Highlighting accountability as a variable also highlights one of the biggest error modes of accountability and integrity – choosing too broad of an audience to hold yourself accountable to.
There is a tradeoff between the size of the group that you are being held accountable by, and the complexity of the ethical principles you can act under. Too large of an audience, and you will be held accountable by the lowest common denominator of your values, which will rarely align well with what you actually think is moral (if you've done any kind of real reflection on moral principles).
Too small or too memetically close of an audience, and you risk not enough people paying attention to what you do, to actually help you notice inconsistencies in your stated beliefs and actions. And, the smaller the group that is holding you accountable is, the smaller your inner circle of trust, which reduces the amount of total resources that can be coordinated under your shared principles.
I think a major mistake that even many well-intentioned organizations make is to try to be held accountable by some vague conception of "the public". As they make public statements, someone in the public will misunderstand them, causing a spiral of less communication, resulting in more misunderstandings, resulting in even less communication, culminating in an organization that is completely opaque about any of its actions and intentions, with the only communication being filtered by a PR department that has little interest in the observers acquiring any beliefs that resemble reality.
I think a generally better setup is to choose a much smaller group of people that you trust to evaluate your actions very closely, and ideally do so in a way that is itself transparent to a broader audience. Common versions of this are auditors, as well as nonprofit boards that try to ensure the integrity of an organization.
This is all part of a broader reflection on trying to create good incentives for myself and the LessWrong team. I will try to follow this up with a post that more concretely summarizes my thoughts on how all of this applies to LessWrong concretely.
Principle 2: Don’t take actions which impose significant risks to others without overwhelming evidence of net benefit
[...]
Significant margin of benefits over costs, accounting for possibility your calculations are incorrect (1.1x benefits over costs doesn’t justify, maybe 10x benefits over costs could justify, if you’re confident you aren’t making 10x errors, maybe ideally you have higher standards)
This seems likely to be crippling for many actors with a 10x margin. My guess is that a 10x margin is too high, though I'm not confident. (Note that it is possible that this policy is crippling and is also a good policy.)
Another way to put this is that most honest and responsible actors with 10x margins won't ever take actions that impose large harms in the case of AI.
Examples of things which seem crippling:
(I think this maybe requires some clarification of what we mean by "harm" and "risk". I assume we mean a deontological notion of harm such that we consider your actions in isolation. For instance, if you shoot a man in the head while 2 other people also shoot him in the head simultaneously, you're responsible for a high fraction of that harm. I don't consider harms from failing to prevent something or harms from your actions ending up causing a bad outcome via a long and unpredictable causal chain (e.g. you pass some not-directly-harmful policy that ultimately makes AI takeover more likely though you thought it would help in expectation).)
Maybe instead of focusing on a number (10x vs. 1.1x) the focus should be on other factors, like "How large and diverse is the group of non-CoI'd people who thought carefully about this decision?" and "How much is it consensus among that group that this is better for humanity, vs. controversial?"
In the case where e.g. the situation and safety cases have been made public, and e.g. the public is aware that the US AGI project is currently stalled due to not having a solution for deceptive alignment that we know will work, but meanwhile China is proceeding because they just don't think deceptive alignment is a thing at all, and moreover the academic ML community not just in the USA but around the world has looked at the safety case and the literature and model organisms etc. and generally is like "yeah probably deceptive alignment won't be an issue so long as we do XY and Z, but we can't rule it out even then" and the tiny minority that thinks otherwise seems pretty unreasonable, then I'd feel pretty happy with the decision to proceed with AGI capabilities advancements in the USA subject to doing XY and Z. (Though even then I'd also be like: Let's at least try to come to some sort of deal with China)
Whereas if e.g. the safety case and situation haven't been made public, and the only technical alignment experts who've thought deeply about the situation and safety case are (a) corporate employees and (b) ~10 picked advisors with security clearances brought in by the government... OR if there's still tons of controversy with large serious factions saying "XY and Z are not enough; deceptive alignment is a likely outcome even so"... then if we proceed anyway I'd be thinking 'are we the baddies?'
Yeah, this all seems reasonable to me for the record, though I think any proposal for norms of this sort needs to handle the fact that public discourse is sometimes very insane.
Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances
I would have focused more on transparency and informing relevant groups rather than authority/accountability.
(See also the comment from Raemon which quotes Oliver on something similar.)
I think that if the public is well informed about the overall situation (most importantly: the level of capabilities, the procedures around whistleblowing and transparency, what process is used to decide if a model is safe including how this process works in practice, and what this process actually determined), then there are some natural mechanisms for avoiding huge problems. And the same goes for informing other relevant stakeholders beyond the public (ideally with more detail).
I really like the way that you've approached this pragmatically, "If you do X, which may be risky or dubious, at least do Y".
I suspect that there's a lot of alpha in taking a similar approach to other issues.
All of these principles seem good and worthwhile. Particularly, principles 3-7 seem like solid good ideas, regardless of how they're derived.
Here's an alternate framing of how to derive them: this is act consequentialism with adequate epistemic humility. Just reaching a conclusion and then acting on it is a highly dangerous policy for people to follow, because people aren't very good at reaching correct conclusions for unfamiliar or unique complex problems, particularly ones like AGI.
Therefore, if we just act on someone's best guess, we're likely to get a disaster. The alignment community has not reached any consensus on how to handle AGI, and that's telling of the difficulty of the problem. A nontrivial collection of the smartest, most devoted, and most rational people haven't yet been able to work through the problem in sufficient depth and detail to have any certainty. Any individual who thinks they understand the scenarios well enough to make a unilateral decision is very likely overconfident, so their decisions are quite likely to lead to disaster.
We need more careful thought from people with more diverse perspectives and expertise.
Why was the risk of weaponizing not considered?
It has a very high probability and it is not easy to conceal. All governments with big militaries will inevitably invest in weaponizing AI. No international organization or law could prevent it.
Crossposted from https://williamrsaunders.substack.com/p/principles-for-the-agi-race
Why form principles for the AGI Race?
I worked at OpenAI for 3 years, on the Alignment and Superalignment teams. Our goal was to prepare for the possibility that OpenAI succeeded in its stated mission of building AGI (Artificial General Intelligence, roughly able to do most things a human can do), and then proceeded to make systems smarter than most humans. Such work will predictably face novel problems in controlling and shaping systems smarter than their supervisors and creators, which we don't currently know how to solve. It's not clear when this will happen, but a number of people would throw around estimates of this happening within a few years.
While there, I would sometimes dream about what would have happened if I’d been a nuclear physicist in the 1940s. I do think that many of the kind of people who get involved in the effective altruism movement would have joined, naive but clever technologists worried about the consequences of a dangerous new technology. Maybe I would have followed them, and joined the Manhattan Project with the goal of preventing a world where Hitler could threaten the world with a new magnitude of destructive power. The nightmare is that I would have watched the fallout of bombings of Hiroshima and Nagasaki with a growing gnawing panicked horror in the pit of my stomach, knowing that I had some small share of the responsibility.
Maybe, like Albert Einstein, I would have been unable to join the project due to a history of pacifism. If I had joined, I like to think that I would have joined the ranks of Joseph Rotblat and resigned once it became clear that Hitler would not get the Atomic Bomb. Or joined the signatories of the Szilárd petition requesting that the bomb only be used after terms of surrender had been publicly offered to Japan. Maybe I would have done something to try to wake up before the finale of the nightmare.
I don’t know what I would have done in a different time and place, facing different threats to the world. But as I’ve found myself entangled in the ongoing race to build AGI, it feels important to reflect on the lessons to learn from history. I can imagine this alter ego of myself and try to reflect on how I could take right actions in both this counterfactual world and the one I find myself in now. In particular, what could guide me to the right path even when I’m biased, subtly influenced by the people around me, misinformed, or deliberately manipulated?
Simply trying to pick the action you think will lead to the best consequences for the world fails to capture the ways in which your model of the world is wrong, or your own thinking is corrupt. Joining the Manhattan Project and using the weapons on Japan both had plausible consequentialist arguments supporting them, ostensibly inviting a lesser horror into the world to prevent a greater one.
Instead I think the best guiding star to follow is reflecting on principles, rules which apply in a variety of possible worlds, including worlds in which you are wrong. Principles that help you gather the right information about the world. Principles that limit the downsides if you’re wrong. Principles that help you tell whether you're in a world where racing to build a dangerous technology first is the best path, or you’re in a world where it’s a hubristic self-delusion. This matches more with the idea of rule consequentialism than pure act consequentialism: instead of making each decision based on what you think is best, think about what rules would be good for people to adopt if they were in a similar situation.
My goal in imagining these principles is to find principles that prevent errors of the following forms.
Bad High Risk Decisions
Unnecessary Races to Develop Risky Technology
In both of these, I fear more the costs of action, rather than the costs of inaction, which I think is the appropriate stance in the face of unrecoverable failures.
High Risk Decision Principles
Principle 1: Seek as broad and legitimate authority for your decisions as is possible under the circumstances
Principle 2: Don’t take actions which impose significant risks to others without overwhelming evidence of net benefit
Race Principles
What is a Race?
Principle 3: When racing, have an exit strategy
Principle 4: Maintain accurate race intelligence at all times.
Principle 5: Evaluate how bad it is for your opponent to win instead of you, and balance this against the risks of racing
Principle 6: Seriously attempt alternatives to racing
Meta Principles
Principle 7: Don’t give power to people or structures that can’t be held accountable.
Principle 8: Notice when you can’t uphold your own principles.
Application of my Principles
Working at OpenAI
For myself, I started feeling like there were warning signs around OpenAI’s leadership and decision making around current models, where it felt like it was going in the wrong direction. I tried to figure out how the organization could proceed in a way that I would be comfortable with and also would reflect the interests and values of the rest of the organization. I proposed ideas to those in power, and felt like I was ignored.
I started to think about the principles in this post when OpenAI was discussing adopting the preparedness framework. It laid out categories and levels of risk to measure for AI systems, including cyberattacks, persuasion, uncontrolled autonomous AI systems, and assistance in producing biological, chemical or nuclear weapons. The most severe risk level “Critical” described systems that could pose a significant danger to humanity as a whole, posing risks including:
Much of the preparedness framework was a productive and useful step forward towards developing processes to assess and respond to these risks. However I was disturbed that the initial draft did not say anything about whether we would release AI systems OpenAI knew were at this Critical risk level. A senior executive involved in the discussion asked a question to the effect of “couldn’t I imagine a situation where I would want to deploy a Critical risk AI system?”, in effect “wouldn’t I want OpenAI to deploy AI systems that posed a risk to the world, if OpenAI thought it was the right thing to do”? That question really started my thinking around these principles.
At the time I and several other people spoke up, arguing for a commitment to not release High or Critical AI systems unless they could be modified to reduce their risk level. I should give some credit to OpenAI for making this commitment. Even after changes, I was still uncomfortable that the main decision maker on whether an AI system had been made safe enough to deploy was still the CEO. A Safety Advisory Group would advise on this decision, but could be overridden. There was no clarity on what, if any, external involvement in decision making there would be (undermining Principle 1). And while the company kept making grander and grander plans to push forward AI technology, I could see no serious attempt to uphold Principles 3-6 around racing. (ETA: On reflection there was actually one attempt at an alternative to racing that didn’t go anywhere but should get some partial credit. There was also the merge and assist clause, although that seemed mostly unrealistic.) Instead, OpenAI’s governance structure failed in the November board crisis, and I lost trust in both the former board’s ability to govern and the good faith of OpenAI’s leadership (violating Principle 8).
Eventually, my discomfort with OpenAI’s leadership and decision-making reached the point where I felt like I needed to resign. Originally, I had planned to mostly go quietly, to avoid causing problems for other people who still felt like it was worth staying. The non-disparagement agreements I and others received on leaving broke my trust completely and confirmed my fears. I can’t imagine an organization that I would trust to make good decisions about a dangerous technology like AGI taking the path of creating these agreements that threatened departing employees with losing millions of dollars worth of vested equity if they said anything negative about the company, keeping them secret from current employees, refusing to negotiate, and deflecting and minimizing when the story started to come out. Among other things, this legal situation meant that, if there was a dispute, any dissenting employee on the Safety Advisory Group could be fired, then coerced into signing a legal agreement that would prevent them from discussing the situation with either the public or government.
OpenAI has taken steps to roll back this legal framework, but only after it came to light and there was significant internal and external pressure. And a number of other employees have resigned since my departure, including the head of the Superalignment team. Those who have left include many or all of the people who spoke up in that discussion against releasing Critical risk AI systems. The head of the Preparedness team was removed from the team under unclear circumstances, likely decreasing the capacity and influence of that team.
SB 1047
Companies and their executives see it as their right to make decisions that impose risks on the world. As far as I am aware, there is currently no law or regulation that would impede companies releasing the kinds of Critical Risk AI systems discussed in the Preparedness Framework. The proposed SB 1047 legislation in California, while it could be improved, is the best attempt I’ve seen to provide a check on this power. The most important ingredients in my view are requiring information from companies developing frontier models on risk assessments and safety measures and providing whistleblower protection in case employees come forward with concerns of critical harms from AI models even if no existing law is broken or harm hasn’t yet occurred. It doesn’t outlaw private companies making decisions about risk to society, but would at least ensure that there are external parties informed about what is going on and that there could be government involvement if decisions were clearly unreasonable.
In my opinion Anthropic has recently acted against Principle 1 in a letter from their State and Local Policy Lead about SB 1047. The Frontier Model Division (FMD) under SB 1047 could have become exactly the kind of body that could have recruited people who understand AI technology, and represented the interests of the public in high risk decisions. But Anthropic successfully advocated removing the creation of the FMD from the bill on the grounds that the mandate is too vague, and “depending on its opinions or political agenda, might end up harming not just frontier model developers but the startup ecosystem or independent developers, or impeding innovation in general.”
A lot of uncertainty remained about how the FMD would have worked in practice, and I could imagine worlds where the FMD works out poorly. I’m not an expert at knowing how government agencies are designed. But note that the FMD wouldn't have had authority to impose fines or conduct enforcement, and would merely act as an advisor to the California Attorney General. I would have hoped that a responsible policy team, led by someone who has strongly advocated for building state capacity in AI, would have tried to figure out how to improve the FMD or replace it with a better structure. Instead they acted to deter a government in the act of building state capacity. At minimum, they could have instead advocated lowering maximum pre-harm enforcement fines present in the bill to the point where misguided pre-harm enforcement would be merely an annoyance.
Anthropic also seemed to defend the right of companies to make their own high risk decisions without oversight, saying that an approach that only focuses on liability with no FMD or pre-harm oversight “should appeal to honest skeptics of catastrophic risk, who can choose not to mitigate against risks they don't believe in (though they do so at their own peril).” This stance contradicts Principle 2. I don’t expect the first AI catastrophe to occur because someone calculated the risks and ignored them because they wouldn’t be held liable; I expect it to occur because someone miscalculated the risks or disbelieved in them. The “peril” involved is not only for the company taking the risk, even if liability is imposed. It’s impossible to create a standard that guarantees risks are calculated well, but SB 1047 would have at least allowed a weaker standard of taking “reasonable care”.
Anthropic’s willingness to reevaluate the bill after amendments and conclude that it “presents a feasible compliance burden” shows some good faith. The changes in practice aren't as bad as the proposed changes, at least preserving the possibility of pre-harm enforcement in the case of an "imminent risk or threat to public safety", and some whistleblower protections. I’m still glad that I went through the exercise of trying to write out my principles before reading Anthropic’s policy position, so that I could see clearly where it contradicts my principles.
I’ve written elsewhere about how OpenAI’s position is much worse. OpenAI resorted to fear mongering about the consequences of the bill without naming any specific ways the bill is harmful or could be improved, kicking the can down the road to the federal government even though no similar legislation is underway federally. If OpenAI was acting in good faith, they could have proposed amendments months ago, including sun-setting the California law once sufficiently similar federal law existed.
Call to Action
For the public, I think you should demand a voice in decisions made by private companies or branches of government that pose a significant risk of disaster. While representative democracy is imperfect, it is the best tool we have for providing a check on individuals willing to impose risks on the rest of society. You can also reflect on your preferences and values, to try and develop an ethical framework for how to approach high risk decisions. Even if you can’t be in the room where decisions are made yourself, it’s possible to develop norms and principles in advance for how decisions should be made, so that people in the room can know what other people want. Surely there’s room for further reflection and improvement on the principles I laid out here.
I think it's particularly important to develop frameworks for what reasonable safety-benefits analyses would look like. This should be fairly straightforward for existing systems based on an inability to cause serious harms, and aside from AI race dynamics is likely to favor the benefits side. If nobody develops good frameworks for these decisions, then we'll be stuck with whatever companies put together in an attempt to justify the decisions that they want to make anyways.
For machine learning researchers and engineers, you also have a chance to build the kind of government capacity and civil society that could play a role in making sane high risk decisions. If all of the talent goes to the AGI labs, then no one else will be able to assess and understand the situation in order to be involved in decisions. Working at an AGI lab comes with both overt and subtle conflicts of interest. I would ask you to at least consider the alternatives before deciding to join an AGI lab, or consider switching to civil society after working in industry. I'm personally planning to be involved in building government or civil society capacity for my next career move, instead of just joining another lab and hoping for the best.
For those working at OpenAI, Anthropic, and other frontier AI labs, the question of how you will face these high risk decisions could soon leave the realm of abstract ethical theory and enter the realm of reality. You might not agree with the principles I’ve outlined here; there’s room for reasonable disagreement. Even if you don’t agree with my position or my actions, I implore you to reflect on your values and decide how you would face these situations. If you don’t reflect on your situation and act from your own moral compass, then you will be a passive participant, shepherded along until you cross the threshold beyond which it is too late to do anything at all.