- Constitutional AI: AI can be trained by feedback from other AI based on a "constitution" of rules and principles.
- (The number of proposed alignment solutions is very large, so the only ones listed here are the two pursued by OpenAI and Anthropic, respectively. ...)
I think describing Constitutional AI as "the solution pursued by Anthropic" is substantially false. Our 'core views' post describes a portfolio approach to safety research, across optimistic, intermediate, and pessimistic scenarios.
If we're in an optimistic scenario where catastrophic risk from advanced AI is very unlikely, then Constitutional AI or direct successors might be sufficient - but personally I think of such techniques as baselines and building blocks for further research rather than solutions. If we're not so lucky, then future research and agendas like mechanistic interpretability will be vital. This alignment forum comment goes into some more detail about our thinking at the time.
Thank you for the correction. I've changed it to "the only ones listed here are these two, which are among the techniques pursued by OpenAI and Anthropic, respectively."
(Admittedly, part of the reason I left that section small was because I was not at all confident of my ability to accurately describe the state of alignment planning. Apologies for accidentally misrepresenting Anthropic's views.)
It's a good start, but I don't think this is a reasonably exhaustive list, since I don't find myself on it :)
My position is closest to your number 3: "ASI will not want to take over or destroy the world." Mostly because "want" is a very anthropomorphic concept. The Orthogonality Thesis is not false, but inapplicable, since AIs are so different from humans. They did not evolve to survive; they were designed to answer questions.
It will be possible to coordinate to prevent any AI from being given deliberately dangerous instructions, and also any unintended consequences will not be that much of a problem
I do not think it will be possible, and I expect some serious calamities from people intentionally or accidentally giving an AI "deliberately dangerous instructions". I just wouldn't expect it to result in systematic extermination of all life on earth, since the AI itself does not care in the same way humans do. Sure, it's a dangerous tool to wield, but it is not a malevolent one. Sort of 3-b-iv, but not quite.
But mostly the issue I see with doomerism is the Knightian uncertainty on any non-trivial time frame: there will be black swans in all directions, just as there have been lately (for example, no one expected the near-human-level LARPing that LLMs do, while not being in any way close to sentient agents).
To be clear, I expect the world to change quickly and maybe even unrecognizably in the next decade or two, with lots of catastrophic calamities, but the odds of complete "destruction of all value", the way Zvi puts it, cannot be evaluated at this point with any confidence. The only way to get this confidence is to walk the walk. Pausing and being careful and deliberate about each step does not seem to make sense, at least not yet.
I see that as being related to current AIs not being particularly agentic. I agree in the short run, but in the long run there's a lot of pressure to make AIs more agentic and some of those dangerous instructions will be pointed in the direction of increased agency too.
There are around 50 counterarguments here, and if each has only a 1 per cent chance of being true, there is approximately a 39.5% chance that at least one of them is actually true.
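That figure can be checked directly. Here is a quick sketch in Python, under the (strong) assumption that the counterarguments are independent:

```python
# Chance that at least one of n independent counterarguments is true,
# given that each is true with probability p.
def p_at_least_one(n: int, p: float) -> float:
    # Complement rule: all n counterarguments must be false.
    return 1 - (1 - p) ** n

print(p_at_least_one(50, 0.01))  # ≈ 0.395, i.e. about a 39.5% chance
```

With correlated counterarguments the number would be lower, since many of them fail together.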
The main arguments on this list that I mostly agree with are probably these ones:
Many AIs will be developed within a short time, leading to a multipolar situation, and they will have no special ability to coordinate with each other. The various AIs continue to work within and support the framework of the existing economy and laws, and prefer to preserve rights and property for the purpose of precedent, out of self-interest. The system successfully prevents any single AI from taking over, and humanity is protected.
and
The AI Alignment Problem will turn out to be unexpectedly easy, and we will solve it in time. Additionally, whoever is "in the lead" will have enough extra time to implement the solution without losing the lead. Race dynamics won't mess everything up.
I have some quibbles with some of these claims, however. I don't expect there to be a single solution to AI alignment. Rather, I expect that there will be a spectrum of approaches and best practices that work to varying degrees, with none of them being perfect. I would put less emphasis than you do on the actions taken by the actor in the lead, and would point instead to broader engineering insights, norms among labs, and regulations, when explaining why alignment might work out.
Also, I expect AIs will be able to coordinate much better than humans in the long-run. I just doubt this means all AIs will merge into a single agent, dispensing with laws. Even if AIs do merge in such a way, I doubt they would do it in a way that made humanity go extinct, since I think the value alignment part will probably prevent that.
I'm not sceptical of all forms of AI unsafety; I'm just against the claim of mass extinction with high probability.
The classic foom doom argument involves an agentive AI that quickly becomes powerful through recursive self-improvement and has a value/goal system that is unfriendly and incorrigible (i.e. there's an assumption that we only have one chance to get goals that are good enough for a superintelligence, because the seed AI will foom into an ASI, retaining its goals, and goals that are good enough for a dumber AI may be dangerous in a smarter one).
I don't see how the overall argument can have high probability, when it involves so many individual assumptions.
I don't think the orthogonality thesis (OT) is wrong; I do think it doesn't go far enough.
The standard OT is silent on the temporal dynamics or developmental aspects of minds -- meaning that AI doomers fill the gap with their usual assumption of goal stability. The standard OT can be considered a subset of a wider OT, which has the implication that all combinations of intelligence and goal (in)stability are possible: mindspace is not populated solely by goal-stable agents. But the foom doom argument is premised on agents which have stable goals together with the ability to self-improve, so the wider OT weighs against foom doom, and the overall picture is mixed.
Goal stability under self-improvement is not a given: it is not possessed by all mental architectures, and may not be possessed by any, since no one knows how to engineer it, and humans appear not to have it. It is plausible that an agent would desire to preserve its goals, but the desire to preserve goals does not imply the ability to preserve goals. As far as we know, no goal-stable system of any complexity exists on this planet, so goal stability cannot be assumed as a default or a given. The orthogonality thesis is therefore true of momentary combinations of goal and intelligence, given the provisos above, but not necessarily true of stable combinations.
It's also not all that applicable to LLMs, which aren't very agentive: we can build tool AI that is nearly human-level, because we have. We also have Constitutional AI, which shows how AIs can improve their values/goals, contra the Yudkowsky side of the Yudkowsky/Loosemore debate.
I find this analysis to be extremely useful. Obviously anything can be refined and expanded, but this is such a good foundation. Thank you.
I didn't find on the list the view that AI will have human survival as an instrumental goal: for example, keeping humans as workers or, more likely, as a possible trade with aliens or simulation owners. It would preserve humans to demonstrate its general friendliness to possible peers.
AI may also preserve humans for research purposes, such as running experiments in simulations.
Yeah, I think that's another example of a combination of going partway into "why would it do the scary thing?" (3) and "wouldn't it be good anyway?" (5). (A lot of people wouldn't consider "AI takes over but keeps humans alive for its own (perhaps scary) reasons" to be a "non-doom" outcome.) Missing positions like this one is a consequence of trying to categorize into disjoint groups, unfortunately.
To add to the fizzlers: advanced AI may be internally unstable and could suddenly halt. The more advanced the AI, the quicker it halts, as it reaches its goals in shorter and shorter times.
For reference, here is a list of blog posts arguing that AI safety might be less important: https://stampy.ai/?state=87O6_9OGZ-9IDQ-9TDI-8TJV-
I think it might be helpful to have a variant of 3a that likewise says the orthogonality thesis is false, but is not quite so optimistic as to say the alternative is that AI will be "benevolent by default". One way the orthogonality thesis could be false would be that an AI capable of human-like behavior (and which could be built using near-future computing power, say less than or equal to the computing power needed for mind uploading) would have to be significantly more similar to biological brains than current AI approaches, and in particular would have to go through an extended period of embodied social learning similar to children, with this learning process depending on certain kinds of sociable drives along with other similar features like curiosity, playfulness, a bias towards sensory data a human might consider "complex" and "interesting", etc.

This degree of convergence with biological structure and drives might make it unlikely it would end up optimizing for arbitrary goals we would see as boring and monomaniacal like paperclip-maximizing, but wouldn't necessarily guarantee friendliness towards humans either. It'd be more akin to reaching into a parallel universe where a language-using intelligent biological species had evolved from different ancestors, grabbing a bunch of their babies and raising them in human society--they might be similar enough to learn language and engage in the same kind of complex problem-solving as humans, but even if they didn't pursue what we would see as boring/monomaniacal goals, their drives and values might be different enough to cause conflict.
Eliezer Yudkowsky's 2013 post at https://www.facebook.com/yudkowsky/posts/10152068084299228 imagined a "cosmopolitan cosmist transhumanist" who would be OK with a future dominated by beings significantly different from us, but who still wants future minds to "fall somewhere within a large space of possibilities that requires detailed causal inheritance from modern humans" as opposed to minds completely outside of this space like paperclip maximizers (in his tweet this May at https://twitter.com/ESYudkowsky/status/1662113079394484226 he made a similar point). So one could have a scenario where orthogonality is false in the sense that paperclip maximizer type AIs aren't overwhelmingly likely even if we fail to develop good alignment techniques, but where even if the degree of convergence with biological brains is sufficient that we're likely to get a mind that a cosmopolitan cosmist transhumanist would be OK with (they would still pursue science, art etc.), we can't be confident we'll get something completely benevolent by default towards human beings. I'm a sort of Star Trek style optimist about different intelligent beings with broadly similar goals being able to live in harmony, especially in some kind of post-scarcity future of widespread abundance, but it's just a hunch--even if orthogonality is false in the way I suggested, I don't think there's any knock-down argument that creating a new form of intelligence would be free of risk to humanity.
Partly inspired by The Crux List, the following is a non-comprehensive taxonomy of positions which imply that we should not be worried about existential risk from artificial superintelligence.
Each position individually is supposed to be a refutation of AI X-risk concerns as a whole. These are mostly structured as specific points of departure from the regular AI X-risk position, taking the other areas as a given. This may result in skipping over positions which have multiple complex dependencies.
Some positions are given made-up labels, including each of the top-level categories: "Fizzlers", "How-skeptics", "Why-skeptics", "Solvabilists", and "Anthropociders".
(Disclaimer: I am not an expert on the topic. Apologies for any mistakes or major omissions.)
Taxonomy
Overlaps
These positions do not exist in isolation from each other, and lesser versions of several can often combine into working non-doom positions themselves. Examples:
- The belief that AI is somewhat far away, combined with the belief that the danger could be solved in a relatively short period of time;
- expecting some amount of intrinsic moral behaviour, combined with being somewhat more supportive of AI takeover situations;
- expecting a fundamental intelligence ceiling close enough to humanity, combined with some element of how-skepticism;
- expecting AI to be somewhat non-goal-oriented/non-agentic and somewhat limited in capabilities.

And then of course, probabilities multiply: if several positions are each likely to be true, the combined risk of doom is lowered even further. Still, many skeptics hold their views because of a clear position on a single sub-issue.
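The "probabilities multiply" point can be made concrete with a toy calculation; the numbers below are invented purely for illustration, and the positions are treated as independent:

```python
# If several independent anti-doom positions each have some chance of being
# right, doom requires every one of them to fail, so the residual doom
# probability is the product of the failure probabilities.
# These probabilities are made up for illustration only.
p_position_true = [0.5, 0.4, 0.3]

p_doom = 1.0
for p in p_position_true:
    p_doom *= 1 - p  # doom survives only if this position fails

print(round(p_doom, 2))  # 0.5 * 0.6 * 0.7 = 0.21
```

In reality the positions are correlated (e.g. how-skepticism and capability ceilings tend to stand or fall together), so a simple product overstates the reduction.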
Polling
There is a small amount of polling available about how popular each of these opinions is:
Not very much to go off of. It would be interesting to see some more comprehensive surveys of both experts and the general public.