Given any particular concrete demonstration of an AI algorithm doing seemingly-bad-thing X, a knowledgeable AGI optimist can look closely at the code, training data, etc., and say:
“Well of course, it’s obvious that the AI algorithm would do X under these circumstances. Duh. Why am I supposed to find that scary?”
And yes, it is true that, if you have enough of a knack for reasoning about algorithms, then you will never ever be surprised by any demonstration of any behavior from any algorithm. Algorithms ultimately just follow their source code.
(Indeed, even if you don’t have much of a knack for algorithms, such that you might not have correctly predicted what the algorithm did in advance, it will nevertheless feel obvious in hindsight!)
From the AGI optimist’s perspective: If I wasn’t scared of AGI extinction before, and nothing surprising has happened, then I won’t feel like I should change my beliefs. So, this is a general problem with concrete demonstrations as AGI risk advocacy:
- “Did something terribly bad actually happen, like people were killed?”
- “Well, no…”
- “Did some algorithm do the exact thing that one would expect it to do, based on squinting at the source code and carefully reasoning about its consequences?”
- “Well, yes…”
- “OK then! So you’re telling me: Nothing bad happened, and nothing surprising happened. So why should I change my attitude?”
I already see people deploying this kind of argument, and I expect it to keep coming up with future demos, whether or not the demo is actually being used to make a valid point.
I think a good response from the AGI pessimist would be something like:
I claim that there’s a valid, robust argument that AGI extinction is a big risk. And I claim that different people disagree with this argument for different reasons:
- Some are over-optimistic based on mistaken assumptions about the behavior of algorithms;
- Some are over-optimistic based on mistaken assumptions about the behavior of humans;
- Some are over-optimistic based on mistaken assumptions about the behavior of human institutions;
- Many are just not thinking rigorously about this topic and putting all the pieces together; etc.
If you personally are highly skilled and proactive at reasoning about the behavior of algorithms, then that’s great, and you can pat yourself on the back for learning nothing whatsoever from this particular demo—assuming that’s not just your hindsight bias talking. I still think you’re wrong about AGI extinction risk, but your mistake is probably related to the 2nd and/or 3rd and/or 4th bullet point, not the first bullet point. And we can talk about that. But meanwhile, other people might learn something new and surprising-to-them from this demo. And this demo is targeted at them, not you.
Ideally, this would be backed up with real quotes from actual people making claims that are disproven by this demo.
For people making related points, see: Sakana, Strawberry, and Scary AI; and Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
Also:
I think this comment is tapping into an intuition that assigns profound importance to the fact that, no matter what an AI algorithm is doing, if you zoom in, you’ll find that it’s just mechanically following the steps of an algorithm, one after the other. Nothing surprising or magic. In reality, this fact is not a counterargument to anything at all, but rather a triviality that is equally true of human brains, and would be equally true of an invading extraterrestrial army. More discussion in §3.3.6 here.
I consider this an acceptable straw man of my position.
To be clear, there are some demos that would cause me to update.
For example, I consider "the Solomonoff Prior is Malign" to be basically a failure to do counting correctly. And so if someone demonstrated a natural example of this, I would be forced to update.
Similarly, I think the chance of an EY-style utility-maximizing agent arising from next-token prediction is (with caveats) basically 0%. So if someone demonstrated this, it would update my priors. I am especially unconvinced by the version of this where the next-token predictor simulates a malign agent and the malign agent then hacks its way out of the simulation.
But no matter how many times I am shown "we told the AI to optimize a goal and it optimized the goal... we're all doomed", I will continue to not change my attitude.
I do not think arguing about p(doom) in the abstract is a useful exercise. I would prefer the Overton Window for p(doom) look like 2-20%; Zvi thinks it should be 20-80%. But my real disagreement with Zvi is not that his p(doom) is too high, it is that he supp...