Epistemic Status
Written quickly, originally as a Twitter thread.
Thesis
I think a frame missing from AI threat discussions is incentives (especially economic incentives) and selection (the pressures exerted on a system during its development).
I hear a lot of AI threat arguments of the form: "AI can do X/be Y" with IMO insufficient justification that:
- It would be (economically) profitable for AI to do X
- The default environments/training setups select for systems that are Y
That is, such arguments establish that something can happen, but do not convincingly argue that it is likely to happen (or that the chances of it happening are sufficiently high). I think this is an undesirable epistemic status quo.
Examples
1. Discrete Extinction Events
There are many speculations about AI systems precipitating extinction in a discrete event[1].
I do not understand under what scenarios triggering a nuclear holocaust, massive genocide via robot armies, or similar would be profitable for the AI to do.
It sounds to me like just setting fire to a fuckton of utility.
In general, triggering civilisational collapse seems like something that would just be robustly unprofitable for an AI system to pursue[2].
As such, I don't expect misaligned systems to pursue such goals (as long as they don't terminally value human suffering/harm to humans and aren't otherwise malevolent).
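To make the comparative-advantage point in footnote [2] concrete, here's a textbook-style numerical toy (all numbers invented purely for illustration; prices and bargaining are glossed over):

```python
# Ricardo-style toy with made-up numbers: the "AI" is more productive per unit
# of effort at BOTH goods, yet specialising and trading still beats autarky,
# so destroying its trading partners forgoes those gains from trade.

ai_output    = {"compute": 10.0, "food": 5.0}   # output per unit of AI effort
human_output = {"compute": 1.0,  "food": 2.0}   # output per unit of human effort
ai_effort, human_effort = 100.0, 1000.0

# Autarky: each side splits its effort 50/50 between the two goods.
ai_autarky    = {g: ai_output[g]    * ai_effort    / 2 for g in ai_output}
human_autarky = {g: human_output[g] * human_effort / 2 for g in human_output}

# Specialisation + trade: the AI makes only compute, humans make only food,
# then they swap 500 compute for 500 food. A price of 1 compute per food sits
# between the two sides' opportunity costs of food (10/5 = 2 and 1/2 = 0.5).
ai_trade    = {"compute": ai_output["compute"] * ai_effort - 500, "food": 500.0}
human_trade = {"compute": 500.0, "food": human_output["food"] * human_effort - 500}

print("AI:    autarky", ai_autarky, "-> trade", ai_trade)
print("Human: autarky", human_autarky, "-> trade", human_trade)
```

Running it, the AI's bundle goes from (500 compute, 250 food) under autarky to (500 compute, 500 food) with trade; wiping out the trading partner just hands those gains back.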
2. Deceptive Alignment
Consider also deceptive alignment.
I understand what deceptive alignment is, how deception can manifest and why sufficiently sophisticated misaligned systems are incentivised to be deceptive.
I do not see how training actually selects for deception, though[3].
Deceptive alignment seems to require a peculiar combination of situational awareness/cognitive sophistication that complicates my intuitions around it.
Unlike with many other mechanisms/concepts we don't have a clear proof of concept, not even with humans and evolution.
Humans did not develop the prerequisite situational awareness/cognitive sophistication to even grasp evolution's goals until long after they had moved off the training distribution (ancestral environment) and undergone considerable capability amplification.
Insomuch as humans are misaligned with evolution's training objective, our failure is one of goal misgeneralisation not of deceptive alignment.
I don't understand well how values ("contextual influences on decision making") form in intelligent systems under optimisation pressure.
And the peculiar combination of situational awareness/cognitive sophistication and value malleability required for deceptive alignment is something I don't intuit.
A deceptive system must have learned the intended objective of the outer optimisation process, internalised values that are misaligned with said objective, be sufficiently situationally aware to realise it's an intelligent system under optimisation and currently in training...
Reflect on all of this, and counterfactually consider how its behaviour during training would affect the selection pressure the outer optimisation process applies to its values, care about its values across "episodes", etc.
And I feel like there are a lot of unknowns here. And the prerequisites seem considerable? Highly non-trivial in a way that e.g. reward misspecification or goal misgeneralisation are not.
Like I'm not sure this is a thing that necessarily ever happens. Or happens by default? (The way goal misgeneralisation/reward misspecification happen by default.)
I'd really appreciate an intuitive story of how training might select for deceptive alignment.
E.g. RLHF/RLAIF on a pretrained LLM (LLMs seem to be the most situationally aware AI systems) selecting for deceptive alignment.
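To make the shape of my confusion concrete, here's a deliberately crude toy simulation (plain Python; the setup and names like `training_reward` are mine and bear no resemblance to real RLHF machinery) of the one structural fact that makes deceptive alignment conceivable: selection only acts on behaviour that is sampled and scored during training, not on whatever the system does at deployment.

```python
# Toy illustration (not an argument either way): selection in a training loop
# only "sees" behaviour that is sampled and scored during training. Policies
# are reduced to two bits -- how they behave under training, and how they
# behave at deployment -- which is of course a caricature of real ML training.
import random

random.seed(0)

POP = 1000          # candidate policies (a stand-in for points in parameter space)
GENERATIONS = 20    # rounds of selection (a stand-in for gradient updates)

# Each "policy" is (behaves_well_in_training, aligned_at_deployment).
population = [(random.random() < 0.5, random.random() < 0.5) for _ in range(POP)]

def training_reward(policy):
    behaves_well_in_training, _ = policy
    # The reward signal only scores sampled training behaviour; deployment
    # behaviour never enters it.
    return 1.0 if behaves_well_in_training else 0.0

for _ in range(GENERATIONS):
    # Resample the population in proportion to training reward (plus a small
    # floor so nothing has zero weight), a crude stand-in for noisy
    # gradient-based selection pressure.
    weights = [training_reward(p) + 0.05 for p in population]
    population = random.choices(population, weights=weights, k=POP)

train_ok  = sum(p[0] for p in population) / POP
deploy_ok = sum(p[1] for p in population) / POP
print(f"fraction behaving well in training:      {train_ok:.2f}")
print(f"fraction actually aligned at deployment: {deploy_ok:.2f}")
```

In the toy, the training-time bit gets driven towards 1 while the deployment bit stays around its initial base rate, because nothing in the loop selects on it. This obviously says nothing about whether deception does or doesn't get selected for; it just isolates where the selection pressure does and doesn't reach. The open question, for me, is whether and how real training setups couple training-time behaviour to deployment-time values, and whether that coupling ever points towards deception rather than towards alignment.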
1. I think extinction will take a "long time" post-TAI failure, and that it is caused by the "environment" (especially the economic environment) progressively becoming ever more inhospitable to biological humans. Homo sapiens gets squeezed out of its economic niche, and eventually dies out as a result. ↩︎
2. The gist is that I expect that for >99% of economically valuable goods/services it would be more profitable for the AI system to purchase them via the economy/market mechanisms than to produce them by itself. Even if the AI system attained an absolute advantage in most tasks of economic importance (something I don't expect), comparative advantages are likely to persist (barring takeoff dynamics that I think are impossible as a matter of physical/info-theoretic/computer science limitations). Thus civilisational collapse just greatly impoverishes the AI system. ↩︎
3. It seems plausible to me that I just don't understand deceptive alignment/ML training well enough to intuit the selection pressures for deception. ↩︎
A rational agent must plan to be able to maintain, defend and reproduce itself (i.e. the physical hardware that it runs on). The agent must be able to control robots and a manufacturing stack, as well as a source of energy. In Yudkowsky's model, AI creates a nanotech lifeform that outcompetes biology. This "diamondoid bacteria" is simultaneously a robot, factory and solar power plant. Presumably it also has computation, wireless communication and a self-aligned copy of the AI's software (or an upgraded version). I think a big part of the MIRI view depends on the possibility of amazing future nanotechnology, and the argument is substantially weaker if you are skeptical of nanotech.
The "diamondoid bacteria" is just an example of technology that we are moderately confident can exist, and that a superintelligence might use if there isn't something even better. Not being a superintelligence ourselves, we can't actually deduce what it would actually be able to use.
The most effective discoverable means seems more likely to be something that we would react to with disbelief that it could possibly work, if we had a chance to react at all. That's how things seem likely to go when there's an enormous difference in capability.