My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering "consequentialist reasoning").
even an incredibly sophisticated deceptive model which is impossible to detect via the outputs may be easy to detect via interpretability tools (analogy - if I knew that sophisticated aliens were reading my mind, I have no clue how to think deceptive thoughts in a way that evades their tools!)
It seems to me that your analogy is the wrong way around. IE, the right analogy would be "if I knew that a bunch of 5-year olds were reading my mind, I have...actually, a pretty good idea how to think deceptive thoughts in a way that evades their tools".
(For what it's worth, I am not very excited about interpretability as an auditing tool. This -- ie, that powerful AIs might avoid this -- is one half of the reason. The other half is that I am sceptical that we will take audit warnings seriously enough -- ie, we might ignore any scheming that falls short of "clear-cut example of workable plan to kill many people". EG, ignore things like this. Or we might even decide to "fix" these issues by putting the interpretability in the loss function, and just deploying once the loss goes down.)
After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".
That seems fine to me, given that (as far as I am concerned) no detailed convincing arguments for AI X-risk exist. But at the moment, the summary of the document gave me the impression that maybe some such argument will appear. So I suggest updating the summary (or some other part of the doc) to make it explicit that no detailed argument for AI X-risk will be given.
Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):
What is AI? Who is building it? Why? And is it going to be a future we want?
Something is off with the last sentence here (substituting "AI" for "it" makes the sentence ungrammatical).
Machines of hateful competition need not have such hindrances.
"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work better.
There is no one is coming to save us.
Typo, extra "is".
The only thing necessary for the triumph of evil is for good people to do nothing. If you do nothing, evil triumphs, and that’s it.
Perhaps rewrite this for less antagonistic language? I know it is a quote and all, but still. (This can be interpreted as "the people building AI are evil and trying to cause harm on purpose". That seems false. And including this in the writing is likely to give the reader the impression that you don't understand the situation with AI, and stop reading.)
Perhaps (1) make it apparent that the first thing is a quote and (2) change the second sentence to "If you do nothing, our story gets a bad ending, and that's it.". Or just rewrite the whole thing.
I agree that "we can't test it right now" is more appropriate. And I was looking for examples of things that "you can't test right now even if you try really hard".
Good point. Also, for the purpose of the analogy with AI X-risk, I think we should be willing to grant that the people arrive at the alternative hypothesis through theorising. (Similarly to how we came up with the notion of AI X-risk before having any powerful AIs.) So that does break my example somewhat. (Although in that particular scenario, I imagine that a sceptic of Newtonian gravity would come up with alternative explanations for the observation. Not that this seems very relevant.)
I agree with all of this. (And good point about the high confidence aspect.)
The only thing that I would frame slightly differently is that:
[X is unfalsifiable] indeed doesn't imply [X is false] in the logical sense. On reflection, I think a better phrasing of the original question would have been something like: 'When is "unfalsifiability of X is evidence against X" incorrect?'. And this amended version often makes sense as a heuristic --- as a defense against motivated reasoning, conspiracy theories, etc. (Unfortunately, many scientists seem to take this too far, and view "unfalsifiable" as a reason to stop paying attention, even though they would grant the general claim that [unfalsifiable] doesn't logically imply [false].)
I don't think it's foolish to look for analogous examples here, but I guess it'd make more sense to make the case directly.
That was my main plan. I was just hoping to accompany that direct case by a class of examples that build intuition and bring the point home to the audience.
Some partial examples I have so far:
Phenomenon: For virtually any goal specification, if you pursue it sufficiently hard, you are guaranteed to get human extinction.[1]
Situation where it seems false and unfalsifiable: The present world.
Problems with the example: (i) We don't know whether it is true. (ii) Not obvious enough that it is unfalsifiable.
Phenomenon: Physics and chemistry can give rise to complex life.
Situation where it seems false and unfalsifiable: If Earth didn't exist.
Problems with the example: (i) If Earth didn't exist, there wouldn't be anybody to ask the question, so the scenario is a bit too weird. (ii) The example would be much better if it were the case that, if you wait long enough, any planet will produce life.
Phenomenon: Gravity -- all things with mass attract each other. (As opposed to "things just fall in this one particular direction".)
Situation where it seems false and unfalsifiable: If you lived in a bunker your whole life, with no knowledge of the outside world.[2]
Problems with the example: The example would be even better if we somehow had some formal model that: (a) describes how physics works, (b) where we would be confident that the model is correct, (c) and that by analysing that model, we will be able to determine whether the theory is correct or false, (d) but the model would be too complex to actually analyse. (Similarly to how chemistry-level simulations are too complex for studying evolution.)
Phenomenon: Eating too much sweet stuff is unhealthy.
Situation where it seems false and unfalsifiable: If you can't get lots of sugar yet, and only rely on fruit etc.
Problems with the example: The scenario is a bit too artificial. You would have to pretend that you can't just go and harvest sugar from sugar cane and have somebody eat lots of it.
Nitpick on the framing: I feel that thinking about "misaligned decision-makers" as an "irrational" reason for war could contribute to (mildly) misunderstanding or underestimating the issue.
To elaborate: The "rational vs irrational reasons" distinction talks about the reasons using the framing where states are viewed as monolithic agents who act in "rational" or "irrational" ways. I agree that for the purpose of classifying the risks, this is an ok way to go about things.
I wanted to offer an alternative framing of this, though: For any state, we can consider the abstraction where all people in that state act in harmony to pursue the interests of the state. And then there is the more accurate abstraction where the state is made of individual people with imperfectly aligned interests, who each act optimally to pursue those interests, given their situation. And then there is the model where the individual humans are misaligned and make mistakes. And then you can classify the reasons based on which abstraction you need to explain them.
[I am confused about your response. I fully endorse your paragraph on "the AI with superior ontology would be able to predict how humans would react to things". But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me --- meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try to react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]
I think the problem is that there is a difference between:
(1) AI which can predict how things score in human ontology; and
(2) AI which has "select things that score high in human ontology" as part of its goal[1].
And then, in the worlds where natural abstraction hypothesis is false: Most AIs achieve (1) as a by-product of the instrumental sub-goal of having low prediction error / being selected by our training processes / being able to manipulate humans. But us successfully achieving (2) for a powerful AI would require the natural abstraction hypothesis[2].
And this leaves us with two options. First, maybe we just have no write access to the AI's utility function at all. (EG, my neighbour would be very happy if I gave him $10k, but he doesn't have any way of making me (intrinsically) desire doing that.) Second, we might have write access to the AI's utility function, but not in a way that will lead to predictable changes in goals or behaviour. (EG, if you give me full access to the weights of an LLM, it's not like I know how to use that to turn that LLM into an actually-helpful assistant.)
(And both of these seem scary to me, because of the argument that "not-fully-aligned goal + extremely powerful optimisation ==> extinction". Which I didn't argue for here.)
IE, not just instrumentally because it is pretending to be aligned while becoming more powerful, etc.
More precisely: Damn, we need a better terminology here. The way I understand things, "natural abstraction hypothesis" is the claim that most AIs will converge to an ontology that is similar to ours. The negation of that is that a non-trivial portion of AIs will use an ontology that is different from ours. What I subscribe to is that "almost no powerful AIs will use an ontology that is similar to ours". Let's call that "strong negation" of the natural abstraction hypothesis. So achieving (2) would be a counterexample to this strong negation.
Ironically, I believe the strong negation hypothesis because I expect that very powerful AIs will arrive at similar ways of modelling the world --- and those are all different from how we model the world.
(Also, I assume you know Circumventing interpretability: How to defeat mind-readers, but mentioning it just in case.)