TL;DR: P(doom|superintelligence) is around 99% because AI researchers currently have neither a precise theory of human values nor a precise understanding of the inner structure of current AI, so they can't encode one into the other. And while future AI may have a different architecture and development process than LLMs, it may also be some variation of the same, with all the complications that implies.
What does "doom" mean here? It means "certain death". And P(doom|superintelligence) means "the probability that a superintelligent AI will change the world in a way that is incompatible with human survival".
Apparently there is no unified theory of human values, so I am going to use the descriptions from clearerthinking.org and this post (its value hierarchy seems distinct and well put). To calculate P(doom|superintelligence) I will take the "Immediate survival needs", all 5 of them, with a 0.5 probability of success for each (because they are distinct). But where does "0.5" come from?
Level I: Immediate Survival
1. Oxygen – an open airway (between our lungs and the outside world) and a constant supply of air that is at least 19.5% oxygen (and partial pressure of oxygen of less than 1.4 atm)
2. Functioning – freedom from severe/acute injury, bodily damage, and organ failure
3. Safety – no immediate threats from the environment, including from dangerous human or nonhuman animals
4. Temperature – protection from hypothermia or hyperthermia (i.e., one’s core body temperature needs to stay within 95-104 degrees Fahrenheit at all times)
5. Hydration – at least a few liters of water every three days
It comes from aggregating two schools of thought: natural ethics emergence and the alien actress.
Some people think that morality is a naturally emergent phenomenon and that AI will simply attain it by getting smarter; some even think that a particular version of human ethics that they like will emerge out of human-produced training data. On this view we don't have to worry about things like AI alignment because things will sort themselves out, so P(doom|superintelligence) is very small.
Other people think that there is a "shoggoth" pretending to be "nice and helpful": we can have a somewhat meaningful conversation with it about "justice" and "diversity", but we have no idea whether this entity, upon attaining (or creating) superintelligence, will still be as "nice and helpful" as it appears to be now. These people have a different P(doom|superintelligence), which is a lot bigger.
So P(Oxygen|superintelligence), P(Functioning|superintelligence), P(Safety|superintelligence), P(Temperature|superintelligence), and P(Hydration|superintelligence) are each a simple either/or: 50%. We can safely discard the possibility that superintelligent AIs don't understand (for a given amount of understanding) what they are talking about with us in chat form. Either it understands and will ensure that a certain level of oxygenation is upheld, or it understands but will not ensure it. That's not how probability really works, but that's how lazy I am right now.
And so P(doom|superintelligence) = 96.875% (1 − 0.5×0.5×0.5×0.5×0.5 = 0.96875), or 50% if you don't like oddly precise numbers and prefer to see it as "it will doom us or it will not". Maybe there is a domed existence in our future: nice oxygenated habitats surrounded by a hellish industrial landscape of "making more GPUs" (and some food paste). So it is not a question of the specific circumstances under which an AI would want to kill all humans, but of the specific circumstances of human survival, and of a specific preference for outcomes (or something like that) inside the ASI that keeps humans alive.
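A minimal sketch of the arithmetic above, assuming the five Level I needs are independent 50% "coin tosses" (the per-need probability and the independence are this post's assumptions, not established estimates):

```python
# Toy model: each Level I need is an independent "coin toss" on whether
# a superintelligence preserves it. The 0.5 per need is assumed, as argued above.
LEVEL_I_NEEDS = ["Oxygen", "Functioning", "Safety", "Temperature", "Hydration"]
P_PRESERVED = 0.5

def p_doom(n_needs: int, p_each: float) -> float:
    """P(doom) = 1 - P(all needs preserved), assuming independence."""
    return 1.0 - p_each ** n_needs

print(f"P(doom | superintelligence) = {p_doom(len(LEVEL_I_NEEDS), P_PRESERVED):.3%}")
# -> 96.875%
```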
Why would an ASI do something that depletes oxygen? I don't know. But it is easy to imagine why and how it would try to harness most of the sunlight, with results that kill most humans. Or it will know that humans won't allow it to build that many solar panels and will kill us preemptively.
How can P(doom|superintelligence) be changed for the better? It is the easiest of the Ps to change. We only need to advance interpretability research far enough that the inner structure of numbers that is an AI is as well understood by its developers as the code of Skyrim or Windows XP was understood back in the day. We don't even need a "unified human value theory" for this, because most of the "Immediate survival needs" can be expressed with math quite easily.
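To illustrate "expressed with math": a hedged sketch of a few Level I needs written as numeric constraints. The thresholds are the ones quoted in the list above; the function names and the idea of checking them programmatically are mine, not anything that exists in current AI systems.

```python
# Hypothetical constraint checks for a few "Immediate survival needs".
# Thresholds are the ones quoted in the Level I list above.

def oxygen_ok(o2_fraction: float, o2_partial_pressure_atm: float) -> bool:
    # At least 19.5% oxygen, with partial pressure of oxygen below 1.4 atm
    return o2_fraction >= 0.195 and o2_partial_pressure_atm < 1.4

def temperature_ok(core_temp_f: float) -> bool:
    # Core body temperature within 95-104 degrees Fahrenheit
    return 95.0 <= core_temp_f <= 104.0

def hydration_ok(liters_per_three_days: float) -> bool:
    # "At least a few liters of water every three days"; 2 L is an assumed floor
    return liters_per_three_days >= 2.0

print(oxygen_ok(0.209, 0.21), temperature_ok(98.6), hydration_ok(6.0))
# -> True True True
```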
How can P(doom|superintelligence) be reversed, and not simply "changed for the better"? From 96.875% down to 3.125%? I don't know.
And why are the probabilities independent? Because LLMs are trained, not programmed.
But this was only Level I: Immediate Survival needs. There are 5 levels of needs! Each need on Level II is a 50% probability too! They too can be expressed in math fairly easily, but there are 9 of them, and if the ASI won't care for even one of them... (see the sketch after the list below).
Level II: Extended Survival
6. Noncontamination – no more than trace quantities of most poisons, radiation, and toxins, and the near total absence of those that can kill even at undetectable levels.
7. Lack of infection – the absence of severe infection (or sepsis) with dangerous viruses, bacteria, fungi, protozoans, helminths, and prions
8. Dryness – a sufficiently dry environment at least every few days (or else our skin will macerate)
9. Sleep – at least ~28-56 hours (depending on the person) of sleep per week
10. Energy – at least ~25,000 calories every three weeks
11. Macronutrients – sufficient quantities of the three macronutrients each month (varieties of carbohydrates, fats, and proteins) plus fiber
12. Macrominerals – sufficient quantities of the essential macrominerals every so often (sodium, chloride, potassium, calcium, phosphorus, magnesium, sulfur)
13. Microminerals – at least a tiny bit of the microminerals every so often (which probably include iron, zinc, iodine, selenium, copper, manganese, fluoride, chromium, molybdenum, nickel, silicon, vanadium, cobalt)
14. Vitamins – sufficient quantities of the 13 essential vitamins every few months (vitamins A, C, D, E, K, B1, B2, B3, B5, B6, B7, B9, B12)
Note: these nutrients/minerals/vitamins may not all be truly essential for survival, and there could be other essential nutrients, vitamins, or minerals that science doesn’t yet know about or that we just haven’t heard of.
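Extending the same toy calculation to all 14 needs from Levels I and II (still assuming each is an independent 50% "coin toss", which is this post's simplification rather than an established estimate):

```python
# Same toy model as before, now over Levels I and II combined.
N_LEVEL_I = 5
N_LEVEL_II = 9
P_PRESERVED = 0.5  # still an assumed per-need probability

p_all_preserved = P_PRESERVED ** (N_LEVEL_I + N_LEVEL_II)  # 0.5 ** 14
print(f"P(doom | superintelligence) = {1 - p_all_preserved:.4%}")
# -> 99.9939%
```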
From Level III and higher there will be no "coins" at 50% but "dice" (with 3 or more sides, some of which can be rigged, the way ChatGPT can now be trained not to come up with lawbreaking schemes almost all of the time), because humans disagree on these options. How many permutations are there between different ideas of Urge Satiation (freedom from an intense addiction or urge that can't be satiated, e.g., a heroin user in the throes of a very strong urge to use again, or someone who is very hungry when there is food just out of reach that they can't get to) and Choice (the ability to make choices about what we do and don't do, including freedom from imprisonment, enslavement, and extreme coercion or control)? And we can't forget to keep a side for "the shoggoth understands but does not care to follow".
If we can get Levels I and II exactly right (and how unlikely is that? 14 "coin tosses"), we move from P(doom|superintelligence) to P(dystopia|superintelligence) with 14 tosses and 17 dice throws. After that comes P(weirdtopia|superintelligence) and finally P(utopia|superintelligence). How unlikely does that look now, given the current state of "unified human value theory" and interpretability research?
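A last sketch putting the whole toy model together: 14 coins for Levels I and II plus "dice" for the higher, more contested needs. The 17 throws come from the paragraph above; giving each die 3 sides is purely an illustrative assumption.

```python
# Toy model for the full hierarchy: coin tosses for Levels I-II,
# multi-sided "dice" for the higher, more contested needs.
N_COINS = 14      # Levels I and II, from the lists above
N_DICE = 17       # the "17 throws" mentioned above
DICE_SIDES = 3    # assumed: at least 3 competing interpretations per need

p_coins = 0.5 ** N_COINS              # every coin lands on "preserved"
p_dice = (1 / DICE_SIDES) ** N_DICE   # every die lands on the humanly-preferred side

p_everything_right = p_coins * p_dice
print(f"P(everything goes right) ≈ {p_everything_right:.1e}")   # ~4.7e-13
print(f"P(something goes wrong)  ≈ {1 - p_everything_right}")
```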