It would take many hours to write down all of my alignment cruxes, but here are a handful of related ones I think are particularly important and particularly poorly understood:
Does 'generalizing far beyond the training set' look more like extending the architecture or extending the training corpus? There are two ways I can foresee AI models becoming generally capable and autonomous. One path is something like the scaling thesis: we keep making these models larger or their architectures more efficient until we get enough performance from few enough datapoints for AGI. The other path is suggested by the Chinchilla data scaling rules and uses various forms of self-play to extend and improve the training set, so you get more out of the same number of parameters. Both curves are important, but right now the data scaling curve seems to have the lowest-hanging fruit. We know that large language models extend at least a little bit beyond the training set. This implies it should be possible to extend the corpus slightly out of distribution by rejection sampling with "objective" quality metrics and then tuning the model on the resulting samples.
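To make the shape of that loop concrete, here is a minimal sketch in Python. Everything in it (sample_completions, quality_score, finetune, the threshold, the round count) is a hypothetical placeholder rather than any real library's API; the point is only the structure: sample, filter on an objective metric, fold the survivors back into the corpus, tune.

```python
# Minimal sketch of a corpus-extension loop via rejection sampling.
# The three helpers are hypothetical stand-ins, not any real library's API.

def sample_completions(model, prompts, n_per_prompt):
    """Placeholder: draw candidate texts from the current model."""
    ...

def quality_score(text):
    """Placeholder: an 'objective' quality metric, e.g. a verifier or test suite."""
    ...

def finetune(model, samples):
    """Placeholder: tune the model on the accepted samples."""
    ...

def extend_corpus(model, corpus, threshold=0.9, rounds=10):
    for _ in range(rounds):
        # Sample candidate continuations that sit slightly out of distribution.
        candidates = sample_completions(model, prompts=corpus, n_per_prompt=16)
        # Rejection step: keep only samples that pass the quality metric.
        accepted = [c for c in candidates if quality_score(c) >= threshold]
        # Fold the survivors back into the corpus and tune on them.
        corpus = corpus + accepted
        model = finetune(model, accepted)
    return model, corpus
```

Each round's accepted batch is ordinary human-readable text, which is what makes the auditing step described below possible.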
This is a crux because it's probably the strongest controlling parameter for whether "capabilities generalize farther than alignment". Nate Soares's implicit model is that architecture extensions dominate. He writes in his post on the sharp left turn that he expects AI to generalize 'far beyond the training set' until it has dangerous capabilities but relatively shallow alignment. This is because the generator of human values is more complex and ad-hoc than the generator of, e.g., physics. So a model which zero-shot generalizes from a fixed corpus about the shape of what it sees will get reasonable approximations of physics, whose flaws interaction with the environment will correct, and less-reasonable approximations of the generator of human values, which are potentially both harder to correct and optional on its part to fix. By contrast, if human-readable training data is being extended in a loop, then it's possible to audit the synthetic data and intervene when it begins to generalize incorrectly. It's the difference between trying to find an illegible 'magic' process that aligns the model in one step vs. doing many steps and checking their local correctness. Eliezer Yudkowsky explains a similar idea in List of Lethalities as there being 'no simple core of alignment' and nothing that 'hits back' when an AI drifts out of alignment with us. Extending the data in a loop resolves the problem by putting humans in a position to 'hit back' and ensure alignment generalization keeps up with capabilities generalization.
A distinct but related question is the extent to which the generator of human values can be learned through self-play. It's important to remember that Yudkowsky and Soares expect 'shallow alignment' because consequentialist-materialist truth is convergent but human values are contingent. For example, there is no objective reason why you should eat the peppers of plants that develop noxious chemicals to stop themselves from being eaten, but humans do this all the time and call them 'spices'. If you have a MuZero-style self-play AI that grinds, say, Lean theorems, and you bootstrap it from human language, then over time a greater and greater portion of the dataset will be Lean theorems rather than anything to do with the culinary arts. A superhuman math agent will probably not care very much about humanity. Therefore, if the self-play process for math is completely unsupervised but the self-play process for 'the generator of human values' requires a comparatively large amount of supervision, then the usual outcome is that aligned AGI loses the race to pure consequentialists pointed at some narrow, orthogonal goal like 'solve math'. Furthermore, if the generator of human values is difficult to compress, then it will take more data to learn and be more fragile to perturbations and damage. That is, rather than ask whether or not there is a 'simple core to alignment', what we care about is the relative simplicity of the generator of human values vs. other forms of consequentialist objective.
My personal expectation is that the generator of human values is probably not a substantially harder math object to learn than human language itself. Nor are the two distinct: human language encodes a huge amount of the mental workspace, and at this point it is clear that language is more a 1D projection of higher-dimensional neural embeddings than 'shallow traces of thought'. The key question, then, is how good an approximation of English large language models actually learn. From a precision-recall standpoint it seems pretty much unambiguous that large language models include an approximate understanding of every subject discussed by human beings. You can get a better intuitive sense of this by asking them to break every word in the dictionary into parts. This implies that their recall over the space of valid English sentences is nearly total. Their precision, however, is still in question. The well-worn gradient-methods doom argument is that if we take superintelligence to have general-search-like Solomonoff structure over plans (i.e. instrumental utilities), then it is not enough to learn a math object whose distribution includes all valid English sentences; it must also exclude invalid sentences that score highly in our goal geometry but imply squiggle-maximization in real terms. That is, Yudkowsky's theory says the learned objective needs to be so robust to adversarial examples that superhuman levels of optimization against it don't yield Goodharted outcomes. My intuition strongly says that real agents avoid this problem by having feedback-loop structure instead of general-search structure (or perhaps a general search whose hypothesis space is constrained by a feedback loop), and that a solution to this problem exists, but I have not yet figured out how to state it rigorously.
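To pin down the precision/recall language above (this is my own notation, not anything from the cited posts): let $V$ be the set of sentences humans would recognize as valid instances of what we meant, and let $M_\tau$ be the set of outputs the learned objective scores above some threshold $\tau$. Then, roughly,

$$\text{recall} \approx \frac{|V \cap M_\tau|}{|V|}, \qquad \text{precision} \approx \frac{|V \cap M_\tau|}{|M_\tau|}.$$

The claim is that recall is already near 1, while precision is the open question, because a strong optimizer will preferentially land in $M_\tau \setminus V$ whenever that set is non-empty and scores highly.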
Stated as claims that I'd endorse with pretty high, but not certain, confidence:
Note that this is not a claim that something like RLHF is somehow impossible. RLHF, and other RL-adjacent techniques whose reward signals would never realistically train a model from scratch, get to select from the capabilities already induced by pretraining. Note also that many 'strong' RL-adjacent techniques involve some form of big world model, operate in some constrained environment, or otherwise have some structure to work with that makes it possible for the optimizer to take useful incremental steps.
One simple story of many, many possible stories:
1. It's 20XY. Country has no nukes but wants second-strike capacity.
2. Nukes are kinda hard to get. Open-weights superintelligences can be downloaded.
3. Country fine-tunes a superintelligence to be an existential threat to everyone else that is activated upon Country being destroyed.
4. Coordination failures occur; Country gets nuked or invaded in a manner sufficient to trigger second strike.
5. There's a malign superintelligence actively trying to kill everyone, and no technical alignment failures occurred. Everything AI-related worked exactly as its human designers intended.
I think this is a great project! Clarifying why informed people have such different opinions on AGI x-risk seems like a useful path to improving our odds. I've been working on a post on alignment difficulty cruxes that covers much of the same ground.
Your list is a good starting point. I'd add:
Time window of analysis: I think a lot of people give a low p(doom) because they're only thinking about the few years after we get real AGI.
Paul Christiano, for instance, adds a substantial chance that we've "irreversibly messed up our future within 10 years of building powerful AI" over and above the odds that we all die from takeover or misuse (in "My views on 'doom'", from April 2023).
Here are my top 4 cruxes for alignment difficulty, which is a different question from p(doom) but highly overlapping with it:
How AGI will be designed and aligned
How well RL alignment will generalize
Whether we need to understand human values better
Whether societal factors are included in alignment difficulty
Other important cruxes are mentioned in Stop talking about p(doom) - basically, what the heck one includes in their calculation, like my first point on time windows.
Here’s an event that would change my p(doom) substantially:
Someone comes up with an alignment method that looks like it would apply to superintelligent entities. They get extra points for trying it and finding that it works, and extra points for society coming up with a way to enforce that only entities that follow the method will be created.
So far none of the proposed alignment methods seem to stand up to a superintelligent AI that doesn’t want to obey them. They don’t even stand up to a few minutes of merely human thought. But it’s not obviously impossible, and lots of smart people are working on it.
In the non-doom case, I think one of the following will be the reason:
—Civilization ceases to progress, probably because of a disaster.
—The governments of the world ban AI progress.
—Superhuman AI turns out to be much harder than it looks, and not economically viable.
—The happy circumstance described above (a workable, enforced alignment method), giving us the marvelous benefits of superintelligence without the omnicidal drawbacks.
Cruxes connected to whether we get human-level A.I. soon:
Do LLM agents become useful in the short term?
How much better is GPT-5 than GPT-4?
Does this generation of robotics startups (e.g. Figure) succeed?
Cruxes connected to whether takeoff is fast:
Are A.I. systems significantly better at self-improving while maintaining alignment of future versions than we are at aligning A.I.?
Cruxes that might change my mind about mech. interp. being doomed:
Can a tool which successfully explains cognitive behavior in GPT-N do the same for GPT-N+1 without significant work?
Last ditch crux:
In high-dimensional spaces, do agents with radically different utility functions actually stomp on each other, or do they trade? When the intelligence of one agent scales far beyond the other's, does trade turn into stomping, or do both just diminish, etc.?
Just wanted to note that I had a similar question here.
Also, DM me if you want to collaborate on making this a real project. I've been slowly working towards something like this, but I expect to focus more on it in the coming months. I'd like to have something like a version 1.0 ready 2-3 months from now. I appreciate you starting this thread, as I think it's ideal for this to be a community effort. My goal is to feed this stuff into the backend of an alignment research assistant system.
[I'm posting this as a very informal community request in lieu of a more detailed writeup, because if I wait to do this in a much more careful fashion then it probably won't happen at all. If someone else wants to do a more careful version that would be great!]
By crux here I mean some uncertainty you have such that your estimate for the likelihood of existential risk from AI - your "p(doom)" if you like that term - might shift significantly if that uncertainty were resolved.
More precisely, let's define a crux as a proposition such that: (a) your estimate for the likelihood of existential catastrophe due to AI would shift a non-trivial amount depending on whether that proposition was true or false; (b) you think there's at least a non-trivial probability that the proposition is true; and (c) you also think there's at least a non-trivial probability that the proposition is false.
Note 1: It could also be a variable rather than a binary proposition, for example "year human-level AGI is achieved". In that case substitute "variable is above some number x" and "variable is below some number y" for "proposition is true" / "proposition is false".
Note 2: It doesn't have to be that the proposition / variable on its own would significantly shift your estimate. If some combination of propositions / variables would shift your estimate, then those propositions / variables are cruxes, at least when combined.
For concreteness let's say that "non-trivial" here means at least 5%. So you need to think there's at least a 5% chance the proposition is true, and at least a 5% chance that it's false, and also that your estimate for p(existential catastrophe due to AI) would shift by at least 5% depending on whether the proposition is true or false.
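One way to write this compactly (my notation, not part of the request): let $D$ be the event of existential catastrophe due to AI and $X$ the candidate proposition, and read "shift by at least 5%" as a difference of at least five percentage points. Then $X$ is a crux when

$$P(X) \ge 0.05, \qquad P(\neg X) \ge 0.05, \qquad \bigl|\,P(D \mid X) - P(D \mid \neg X)\,\bigr| \ge 0.05.$$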
Here are just a few examples of potential cruxes people might have (among many others!):
Listing all your cruxes would be the most useful, but if that is too long a list then just list the ones you find most important. Providing additional details (for example, your probability distribution for each crux and/or how exactly it would shift your p(doom) estimates) is recommended if you can but isn't necessary.
Commenting with links to other related posts on LW or elsewhere might be useful as well.