Edit: Changed the title.
Or, why I no longer agree with the standard LW position on AI risk.
In a sense, this is an unusual post compared to what LW usually publishes on AI.
A lot of this depends on posts that changed my worldview on AI risk; they are linked below:
The deceptive alignment skepticism sequence, especially the 2nd post in the sequence, is here:
Evidence of the natural abstractions hypothesis in action:
https://www.lesswrong.com/posts/BdfQMrtuL8wNfpfnF/natural-categories-update
https://www.lesswrong.com/posts/obht9QqMDMNLwhPQS/asot-natural-abstractions-and-alphazero#comments
Summary: The big update I made was that deceptive alignment is far more unlikely than I thought. Since deceptive alignment was a big part of my model of how AI risk would happen (about 30-60% of my probability mass was on that failure mode), removing most of it takes a large enough bite out of the probability of extinction to make increasing AI capabilities positive in expected value. Combine this with the empirical evidence that at least some form of the natural abstractions hypothesis is being borne out, and I now think the probability of AI risk has steeply declined to only 0.1-10%, and all of that remaining probability mass is plausibly reducible to very low numbers by going to the stars and speeding up technological progress.
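To make the shape of this update concrete, here is a rough sketch of the arithmetic. The prior P(doom) and the residual credence in deceptive alignment are made-up illustrative numbers, not claims from this post; only the 30-60% fraction comes from the paragraph above.

```python
# Illustrative sketch of the update; the prior P(doom) and the discount
# factor are assumptions for the example, not numbers from this post.
prior_p_doom = 0.30            # hypothetical prior probability of AI extinction
frac_deceptive = (0.30, 0.60)  # share of that mass on deceptive alignment (from the post)
discount = 0.10                # assumed residual credence in deceptive alignment after the update

for f in frac_deceptive:
    # Mass from deceptive alignment shrinks by (1 - discount); other failure modes are untouched.
    updated = prior_p_doom * ((1 - f) + f * discount)
    print(f"fraction on deceptive alignment = {f:.0%} -> updated P(doom) ~= {updated:.1%}")
```

The natural abstractions evidence then cuts the remaining mass further, which is why the final 0.1-10% range sits below what this toy arithmetic alone would give.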
In other words, I now assign a significant probability, on the order of 50-70%, to alignment being solved by default.
EDIT: While I explained why I increased my confidence in alignment by default in response to Shiminux, I now believe I was overconfident about the precise probabilities for alignment by default.
What implications does this have, if this rosy picture is correct?
The biggest implication is that technological progress looks vastly more positive than most LWers and the general public think.
This also implies a purpose shift for Lesswrong. For arguably 20 years, the site has focused on AI risk, though that focus exploded once LLMs and real AI capabilities were released.
What it shifts to matters, but assuming this rosy model of alignment is correct, I'd argue a significant part of the field of AI Alignment can and should repurpose itself toward something else.
As for Lesswrong, I'd say we should probably focus more on progress studies in the vein of Jason Crawford's work, and on inadequate equilibria and how to change them.
I welcome criticism and discussion of this post, due to its huge implications for LW.
Obviously this is a broader question than what I said, but from an AI safety perspective, the value of the natural abstractions hypothesis, conditional on it being at least partially right, is the following:
Interpretability becomes easier, since we get at least some guarantees about how AIs form abstractions.
Given that abstractions are lower-dimensional summaries, there's a chance we can understand the abstractions the AI is using even when they are alien to us (a toy sketch of this is below, after the Goodhart point).
As for Goodhart: one scenario is that trying to make the model explain itself pushes us toward a failure mode where we get no real understanding, just simple-sounding summaries that don't reveal much of anything. The natural abstractions hypothesis says that, by default, AIs become more interpretable as they become more capable, which avoids Goodharting our interpretability efforts.
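To gesture at what "lower-dimensional summaries" means in practice, here's a toy sketch in Python. Everything here is synthetic and planted by construction; it's meant to show the kind of structure interpretability work would be looking for if the hypothesis holds, not evidence for the hypothesis itself.

```python
# Toy sketch of "abstractions as lower-dimensional summaries".
# Not a real interpretability method: the "activations" are synthetic and the
# low-dimensional latent structure is planted by construction.
import numpy as np

rng = np.random.default_rng(0)

# Pretend a network's 256-dim activations are driven by 3 underlying "abstractions"
# (e.g. object identity, position, lighting) plus noise.
n_samples, n_dims, n_latents = 1000, 256, 3
latents = rng.normal(size=(n_samples, n_latents))      # the planted "natural abstractions"
mixing = rng.normal(size=(n_latents, n_dims))          # how they show up in the activations
activations = latents @ mixing + 0.1 * rng.normal(size=(n_samples, n_dims))

# PCA via SVD: how much activation variance do a handful of directions explain?
centered = activations - activations.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = (singular_values ** 2) / (singular_values ** 2).sum()

print(f"variance explained by top {n_latents} directions: {explained[:n_latents].sum():.1%}")
# If something like the natural abstractions hypothesis holds, we'd hope real networks
# look more like this (a few directions dominate) than like unstructured noise.
```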