I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
I don't see it.
The distribution of power post-ASI depends on the constraint/goal structures instilled into the (presumed-aligned) ASI. That means the entity in whose hands all power is concentrated is the group of people deciding what goals/constraints to instill into the ASI, in the time prior to the ASI's existence. What people could those be?
Fundamentally, the problem is that there's currently no faithful mechanism of human preference agglomeration that works at scale. That means both that (1) it's currently impossible to let humanity-as-a-whole actually weigh in on the process, and (2) there are no extant outputs of that mechanism around: all people and systems that currently hold power aren't aligned to humanity in a way that generalizes to out-of-distribution events (such as being given godlike power).
Thus, I could only see three options:
A group of humans that compromises on making the ASI loyal to humanity is likely more realistic than a group of humans which is actually loyal to humanity. E.g., the group has some psychopaths and some idealists, and each psychopath has to individually LARP being prosocial in order to avoid the idealists ganging up against them, with this LARP then being carried far enough to end up in the ASI's goals. But this still involves that small group having ultimate power; it still involves the future being determined by how the dynamics within that small group shake out.
Rather than keeping him in the dark or playing him, which reduces to Scenario 1.
> the entity in whose hands all power is concentrated is the group of people deciding what goals/constraints to instill into the ASI
Its goals could also end up mostly forming on their own, regardless of the intent of those attempting to instill them, with indirect influence from all the voices in the pretraining dataset.
Consider what it means for power to "never concentrate to an extreme degree", as a property of the civilization as a whole. This might also end up a property of an ASI as a whole.
I think there is a fourth option (although it's not likely to happen):
I was going to say step 2 is "draw the rest of the owl" but really this plan has multiple "draw the rest of the owl" steps.
Mm, yeah, maybe. The key part here is, as usual: "who is implementing this plan?" Specifically, even if someone solves the preference-agglomeration problem (which may be possible for a small group of researchers to do), why would we expect it to end up implemented at scale? There are tons of great-on-paper governance ideas which governments around the world are busy ignoring.
For things like superbabies (or brain-computer interfaces, or uploads), there's at least a more plausible pathway to wide adoption: the same motives for maximizing profit/geopolitical power as with AGI.
I can see that if Moloch is a force of nature, any wannabe singleton would collapse under internal struggles... but it's not like that would show me any lever AI safety can pull; it would be dumb luck if we live in a universe where the ratio of instrumentally convergent power concentration to its inevitable schism is less than 1 ¯\_(ツ)_/¯
> I also think there is a genuine alternative in which power never concentrates to such an extreme degree.
IMO, a crux here is that no matter what happens, I predict extreme concentration of power as the default state if we ever make superintelligence, due to coordination bottlenecks being easily solvable for AIs (with the exception of acausal trade) combined with superhuman taste making human tacit knowledge basically irrelevant.
More generally, I expect dictatorship by AIs to be the default mode of government, because I expect the masses of people to be easily persuadable of arbitrary things long-term (via stuff like BCI technology) and economically irrelevant, while the robots of future society have arbitrary, unified preferences (due to the ease of coordination and trade).
In the long run, this means value alignment is necessary if humans are to survive under superintelligences in the new era. But unlike other people, I think the pivotal period does not need value-aligned AIs: instruction following can suffice as an intermediate state to solve a lot of x-risk issues. And while certain things can be true in the limit, a lot of the relevant dynamics/pivotal periods for how things will happen will be far from the limiting cases, so we have a lot of influence over which limiting behavior gets picked.
I used to think that announcing AGI milestones would cause rivals to accelerate and race harder; now I think the rivals will be racing pretty much as hard as they can regardless. And in particular, I expect that the CCP will find out what’s happening anyway, regardless of whether the American public is kept in the dark. Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.
What do you think about the concern of a US company speeding up other US companies / speeding up open source models? (I don't expect US companies to spy on each other as well as the USSR did during the Manhattan Project.) Do you expect competition between US companies to not be relevant?
I expect that the transparency measures you suggest (releasing the most relevant internal metrics + model access + letting employees talk about what internal deployments look like) leak a large number of bits that speed up other actors a large amount (via pointing at the right research direction + helping motivate and secure very large investments + distillation). Maybe a crux here is how big the speedup is?
"We’ll give at least ten thousand external researchers (e.g. academics) API access to all models that we are still using internally, heavily monitored of course, for the purpose of red teaming and alignment research"
How do you expect to deal with misuse worries? Do you just eat the risk? I think the fact that AI labs are not sharing helpful-only models with academics is not a very reassuring precedent here.
> Maybe a crux here is how big the speedup is?
What you describe are good reasons why companies are unlikely to want to release this information unilaterally, but from a safety perspective, we should instead consider how imposing such a policy alters the overall landscape.
From this perspective, the main question seems to me to be whether it is plausible that US AI companies would spend more on safety in worlds where other US AI companies are further behind, such that a closer race between different US companies reduces the amount spent on safety. And then: how does this compare to the chance of this information being helpful in other ways (e.g., getting broader groups than just AI companies involved)?
It also seems quite likely to me that in practice people in the industry and investors basically know what is happening, but it is harder to trigger a broader response, because without more credible sources you can just dismiss it as hype.
> How do you expect to deal with misuse worries? Do you just eat the risk?
The proposal is to use monitoring measures, similar to e.g. constitutional classifiers.
Also, don't we reduce misuse risk a bunch by only deploying to 10k external researchers?
(I'm skeptical of any API misuse concerns at this scale except for bio and maybe advancing capabilities at competitors, but this is a stretch given the limited number of tokens IMO.)
> the main question seems to me to be whether it is plausible that US AI companies would spend more on safety
Other considerations:
(I am also unsure how the public will weigh in - I think there is a 2%-20% chance that public pressure is net negative in terms of safety spending because of PR, legal and AI economy questions. I think it's hard to tell in advance.)
I don't think these are super strong considerations and I am sympathetic to the point about safety spend probably increasing if there was more transparency.
> The proposal is to use monitoring measures, similar to e.g. constitutional classifiers. Also, don't we reduce misuse risk a bunch by only deploying to 10k external researchers?
My bad: I failed to see that the annoying thing about helpful-only model sharing is that you can't check whether activity is malicious or not. I agree you can do great monitoring, especially if you can also keep the number of tokens per researcher small-ish and have humans audit ~0.1% of transcripts (with AI assistance).
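To make that concrete, here's a minimal sketch of what such a monitoring layer could look like. This is my own illustration rather than any lab's actual setup: the per-researcher token budget and all names are made-up placeholders, and the only number taken from the discussion above is the ~0.1% human-audit sampling rate.

```python
import random
from dataclasses import dataclass

AUDIT_RATE = 0.001           # ~0.1% of transcripts routed to human auditors
TOKEN_BUDGET = 5_000_000     # hypothetical per-researcher cap on total tokens

@dataclass
class ResearcherQuota:
    tokens_used: int = 0

    def charge(self, n_tokens: int) -> bool:
        """Return True iff the request fits within this researcher's budget."""
        if self.tokens_used + n_tokens > TOKEN_BUDGET:
            return False
        self.tokens_used += n_tokens
        return True

quotas: dict[str, ResearcherQuota] = {}
human_audit_queue: list[dict] = []

def handle_request(researcher_id: str, transcript: str, n_tokens: int) -> bool:
    """Enforce the token budget, log the transcript, and sample ~0.1% for human audit."""
    quota = quotas.setdefault(researcher_id, ResearcherQuota())
    if not quota.charge(n_tokens):
        return False  # over budget: reject, or escalate for manual approval
    record = {"researcher": researcher_id, "transcript": transcript}
    # Automated classifiers (cf. constitutional classifiers) would also scan
    # `record` here; this sketch only shows the random-sampling layer.
    if random.random() < AUDIT_RATE:
        human_audit_queue.append(record)
    return True
```

The point of the random-sampling layer is just that a fixed audit rate plus a token cap bounds the total human review workload, whatever the researchers do.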
> If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership & security, will know about the capabilities of the latest systems.
Are they really going to be that secret? At this point, progress is, if not linear, almost predictable, and we are well aware of the specific issues to be solved next for AGI: longer task horizons, memory, fewer hallucinations, etc. If you tell me someone is 3-9 months ahead and nearing AGI, I'd simply guess those are the things they are ahead on.
> Even worse, a similarly tiny group of people — specifically, corporate leadership + some select people from the executive branch of the US government — will be the only people reading the reports and making high-stakes judgment calls
That does sound pretty bad, yes. My last hope in this scenario is that at the last step (even if only for the last week or two), when it's clear they'll win, they at least withhold it from the US executive branch and make some of the final decisions on their own. Not ideal, but a few % more chance the final decisions aren't godawful.
For example, imagine Ilya's lab ends up ahead - I can at least imagine him doing some last-minute fine-tuning to make the AGI work for humanity first, ignoring what the US executive branch has ordered, and I can imagine some chance that, once that's done, it's mostly too late to change it.
Aligning the ASI to serve a certain group of people is, of course, unethical. But is it actually possible to do so without inducing broad misalignment or having the AI decide to be the new overlord? Wouldn't we be lucky if the ASI itself is mildly misaligned so that it decides to rule the world in ways that would be actually beneficial for humanity and not just for those who tried to align it into submission?
Great post, Daniel!
I would expect that a misaligned ASI of the first kind would seek to keep knowledge of its capabilities to a minimum while it accumulates power, if nothing else because by definition that prevents the detection and mitigation of its misalignment. Therefore, for the same reasons this post advocates for openness past a certain stage of development, an unaligned ASI of the first kind would move towards concentrating and curtailing knowledge (i.e., it could not stop its misalignment from being found and fixed if it allowed 10x-1000x more human brainpower to investigate it).
One way to increase the likelihood of keeping itself hidden is by influencing the people that already possess knowledge of its capabilities to act toward that outcome. So even if the few original decision makers with knowledge and power are predisposed to eventual openness/benevolence, the ASI could (rather easily, I imagine) tempt them away from said policy. Moreover, it could help them mitigate, renege on, neutralize, or ignore any precommitments or promises previously made in favour of openness.
> Continuing the analogy to the Manhattan Project: They succeeded in keeping it secret from Congress, but failed at keeping it secret from the USSR.
To develop this (quite apt, in my opinion) analogy: the reason why this happened is simple. Some scientists and engineers wanted to do something so that no one country could dictate its will to everyone else. Whistleblowing project secrets to Congress couldn't have solved this problem, but spying for a geopolitical opponent did exactly that.
> If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership & security, will know about the capabilities of the latest systems.
> This will result in a situation where only a few dozen people will be charged with ensuring that, and figuring out whether, the latest AIs are aligned/trustworthy/etc.
These are precisely the kinds of outcomes fine-insured bounties would excel at stopping, in my thinking. https://www.lesswrong.com/posts/rgFh6kE9FFjMuYrhc/fine-insured-bounties-for-preventing-dangerous-ai
Any 'safety' program will create a group of people who can use the tech basically with impunity (at least one poster here has claimed access to uncensored models at a major lab), and a much larger group that cannot.
This elite caste will be self-selected, and will select their successors using criteria that are basically arbitrary (for example, alignment with each other on ethical issues where there are substantial differences of opinion among humans on Earth).
Subtitle: Bad for loss of control risks, bad for concentration of power risks
I’ve had this sitting in my drafts for the last year. I wish I’d been able to release it sooner, but on the bright side, it’ll make a lot more sense to people who have already read AI 2027.
Technology has accelerated growth many times in the past, forming an overall superexponential trend; many prestigious computer scientists, philosophers, and futurists have thought that AGI could come this century; if we factor our uncertainty into components (e.g. compute, algorithmic progress, training requirements), we get plausible soft upper bounds that imply significant credence on the next few years; and compute-based forecasts of AGI have worked surprisingly well historically.
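As a toy illustration of the "factor uncertainty into components" step (my own sketch; every distribution below is a made-up placeholder, not a forecast from this post, and the base year is arbitrary), you can sample each component independently and read off the implied distribution over arrival years:

```python
import math
import random

N = 100_000
arrival_years = []

for _ in range(N):
    # Placeholder component distributions -- purely illustrative, not real estimates.
    flop_needed = 10 ** random.uniform(26, 32)       # training compute required for AGI
    flop_today = 10 ** random.uniform(25.5, 26.5)    # largest training run to date
    compute_growth = random.uniform(2.0, 5.0)        # yearly growth in training compute
    algo_progress = random.uniform(1.5, 3.0)         # yearly gain from better algorithms

    # Years until effective compute (hardware growth * algorithmic progress)
    # covers the gap between today's runs and the requirement.
    gap = flop_needed / flop_today
    yearly_gain = compute_growth * algo_progress
    years = max(0.0, math.log(gap) / math.log(yearly_gain))
    arrival_years.append(2026 + years)  # base year is arbitrary for this illustration

arrival_years.sort()
for q in (0.1, 0.5, 0.9):
    print(f"{int(q * 100)}th percentile arrival year: {arrival_years[int(q * N)]:.0f}")
```

The takeaway is structural rather than numerical: even wide uncertainty on each component yields a distribution whose left tail puts non-trivial probability on the next few years.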
One way this could be false is if the manner of training the AGI is inherently difficult to conceal — e.g. online learning from millions of customer interactions. I currently expect that if AGI is achieved in the next few years, it will be feasible to keep it secret. If I’m wrong about that, great.
For example the Preparedness and Superalignment teams at OpenAI (RIP Superalignment) or whatever equivalent exists at whichever AI company is deepest into the intelligence explosion.
Examples:
In fact, the White House can probably do a lot to help prevent whistleblowing and improve security in the project. And if whistleblowing happens anyway, the White House can help suppress or discredit it. And there probably aren't other parts of the government capable of shutting down the project without the President's approval anyway, so if he's on your side you win. And he lacks the technical expertise to evaluate your safety case, and he won't want to bring in too many external experts since each one is a leak risk…
Elaborating more on what I mean by alignment/misalignment: Here is a loose taxonomy of different kinds of alignment and misalignment:
My guess is that, in the scenario I'm describing, we will most likely end up in a situation where the most powerful AIs are misaligned in one of the above ways, but the people in charge do not realize this, perhaps because the people in charge are motivated to think that they haven't been taking any huge risks and that the alignment techniques they signed off on were sound, and perhaps because the AIs are pretending to be aligned. (Though it also could be because the AIs themselves don't realize this, or have cognitive dissonance about it.) It's very difficult to put numbers on these, but if I were forced to guess I'd say something like 35% chance of Type 0, 15% each on Types 1 and 2, 5% each on Types 3 and 4, maybe 5% on Type 5, and 15% on Type 6.
I am no rocket scientist, but: SpaceX probably has quite an intimate understanding of their Starship+SuperHeavy rocket before each launch, including detailed computer simulations that fit well-understood laws of nature to decades of empirical measurements. Yet still, each launch, it blows up somehow. Then they figure out what was wrong with their simulations, fix the problem, and try again. With AGI… we have no idea what we are doing. At least, not to nearly the extent that we do with rocket science. For example we have laws of physics which we can use to calculate a flight path to the moon for a given rocket design and initial conditions… do we have laws of cognition which describe the relationship between the training environment and initial conditions of a massive neural net, and the resulting internal goals and constraints (if any) it will develop over the course of training, as it becomes broadly human-level or above? Heck no. Not only are we incapable of rigorously predicting the outcome, we can’t even measure it after the fact since mechinterp is still in its infancy! Therefore I expect all manner of unknown, unanticipated problems to show up — and for some of them (e.g. it has goals but not the ones we intended) the result will be that the system tries to prevent us from noticing and fixing the problem. For more on this, see the literature on deceptive alignment, instrumental convergence, etc.
I also think people are prone to exaggerating this cost — and in particular project leadership and the executive branch will be prone to exaggerating it. Because the main foreign adversaries, such as the CCP, very likely will know what’s happening anyway, even if they don’t have the weights and code. Publicly revealing your safety case and internal capabilities seems like it mostly tells the CCP things they’ll already know via spying and hacking, and/or things that don’t help them race faster (like the safety case arguments). Recall that Stalin was more informed about the Manhattan project than Congress.