To be clear, I agree that the situation is objectively terrifying and it's quite probable that everyone dies. I gave a copy of If Anyone Builds It to two math professors of my acquaintance at San Francisco State University (and gave $1K to MIRI) because, in that context, conveying the fact that we're in danger was all I had bandwidth for (and I didn't have a better book on hand for that).
But in the context of my own writing, everyone who's paying attention to me already knows about existential risk; I want my words to be focused on being rigorous and correct, not scaring policymakers and the public (notwithstanding that policymakers and the public should in fact be scared).
To the end of being rigorous and correct, I'm claiming that the "each of these black shapes is basically just as good at passing that particular test" story isn't a good explanation of why alignment is hard (notwithstanding that alignment is in fact hard), because of the story about deep net architectures being biased towards simple functions.
I don't think "well, I'm pitching to middle schoolers" saves it. If the actual problem is that we don't know what training data would imply the behavior we want, rather than the outcomes of deep learning being intrinsically super-chaotic—which would be an entirely reasonable thing to suspect if it's 2005 and you're reasoning abstractly about optimization without having any empirical results to learn from—then you should be talking about how we don't know what teal shape to draw, not that we might get a really complicated black shape for all we know.
I am of course aware that in the political arena, the thing I'm doing here would mark me as "not a team player". If I agree with the conclusion that superintelligence is terrifying, why would I critique an argument with that conclusion? That's shooting my own side's soldiers! I think it would be patronizing for me to explain what the problem with that is; you already know.
All dimensions that turn out to matter for what? Current AI is already implicitly optimizing people to use the word "delve" more often than they otherwise would, which is weird and unexpected, but not that bad in the grand scheme of things. Further arguments are needed to distinguish whether this ends in "humans dead, all value lost" or "transhuman utopia, but with some weird and unexpected features, which would also be true of the human-intelligence-augmentation trajectory." (I'm not saying I believe in the utopia, but if we want that Pause treaty, we need to find the ironclad arguments that convince skeptical experts, not just appeal to intuition.)
I mean, yes. What else could I possibly say? Of course, yes.
In the spirit of not trying to solve the entire alignment problem at once, I find it hard to say how much my odds would shift without a more specific question. (I think LLMs are doing a pretty good job of knowing and doing what I mean, which implies some form of knowledge of "human values", but it's only a natural-language instruction-follower; it's not supposed to be a sovereign superintelligence, which looks vastly harder and I would rather people not do that for a long time.) Show me the arXiv paper about inductive biases that I'm supposed to be updating on, and I'll tell you how much more terrified I am (above my baseline of "already pretty terrified, actually").
Thanks: my comments about "simplest goals" were implicitly assuming deep nets are more speed-prior-like than Solomonoff-like, and I should have been explicit about that. I need to think more about the deceptive policies already present before we start talking.
The argument is that for all measures that seem reasonable or that we can reasonably well-define, it seems like AI cares about things that are different than human values. Or in other words "most goals are misaligned".
I think it would be clarifying to try to talk about something other than "human values." On the If Anyone Builds It resource page for the frequently asked question "Why would an AI steer toward anything other than what it was trained to steer toward?", the subheader answers, "Because there are many ways to perform well in training." That's the class of argument that I understand Pope and Belrose to be critiquing: that deep learning systems can't be given goals, because all sorts of goals could behave as desired on the training distribution, and almost all of them would do something different outside the training distribution. (Note that "human values" does not appear in the question.)
But the empirical situation doesn't seem as dire as that kind of "counting argument" suggests. In Langosco et al. 2022's "Goal Misgeneralization in Deep Reinforcement Learning", an RL policy trained to collect a coin that was always at the right edge of a video game level learned to go right rather than collect the coin—but randomizing the position of the coin in just a couple percent of training episodes fixed the problem.
It's not a priori obvious that would work! You could imagine that the policy would learn some crazy thing that happened to match the randomized examples, but didn't generalize to coin-seeking. Indeed, there are astronomically many such crazy things—but they would seem to be more complicated than the intended generalization of coin-seeking. The counting argument of the form, "You don't get what you train for, because there are many ways to perform well in training" doesn't seem like the correct way to reason about neural network behavior (which does not mean that alignment is easy or that the humans will survive).
Now you have a counting argument. The counting argument predicts all the things usual counting arguments predict.
I guess I'm not sure what you mean by "counting argument." I understand the phrase to refer to inferences of the form, "Almost all things are wrong, so if you pick a thing, it's almost certainly going to be wrong." For example, most lottery tickets aren't winners, therefore your ticket won't win the lottery.
But the counting works in the lottery example because there's an implied uniform prior: every ticket has the same probability as any other. If we're weighing things by simplicity, what work is being done by counting, as such?
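To make that concrete (a toy prior, not a claim about real networks): if hypotheses are coded as binary strings and each one's prior weight is proportional to 2^-length, then a single hypothesis of length 10 carries as much prior mass as all 2^90 hypotheses of length 100 combined, since 2^-10 = 2^90 · 2^-100. The count of complex hypotheses still enters the calculation, but only after being multiplied by the per-hypothesis weight, and the two factors can wash out exactly.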
Suppose a biased coin is flipped 1000 times, and it comes up 600 Heads and 400 Tails. If someone made a counting argument of the form, "Almost all guesses at what the coin's bias could be are wrong, so if you guess, you're almost certainly going to be wrong", that would be wrong: by the asymptotic equipartition property, I can actually be very confident that the bias is close to 0.6 in favor of Heads. You can't just count the number of things when some of them are much more probable than others. (And if any correct probabilistic argument is a "counting argument", that's a really confusing choice of terminology.)
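The coin concentration is easy to check numerically. Here's a minimal sketch (the uniform Beta(1, 1) prior is my arbitrary choice; any prior that isn't dogmatic gives the same qualitative picture):

```python
from scipy.stats import beta

# 600 Heads and 400 Tails observed; with a uniform Beta(1, 1) prior on the
# bias, the posterior over the Heads-probability is Beta(1 + 600, 1 + 400).
posterior = beta(601, 401)

# Central 95% credible interval for the bias.
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")  # roughly [0.57, 0.63]

# Posterior probability that the bias is within 0.05 of 0.6.
print(f"P(0.55 < bias < 0.65) = {posterior.cdf(0.65) - posterior.cdf(0.55):.3f}")
```

Almost all of the posterior mass sits within a few hundredths of 0.6, even though "almost all" candidate bias values are wrong in the counting sense.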
An objection I didn't have time for in the above piece is something like "but what about Occam, though, and k-complexity? Won't you most likely get the simple, boring, black shape, if you constrain it as in the above?"
This is why I'm concerned about deleterious effects of writing for the outgroup: I'm worried you end up optimizing your thinking for coming up with eloquent allegories to convey your intuitions to a mass audience, and end up not having time for the actual, non-allegorical explanation that would convince subject-matter experts (whose support would be awfully helpful in the desperate push for a Pause treaty).
I think we have a lot of intriguing theory and evidence pointing to a story where the reason neural networks generalize is because the parameter-to-function mapping is not a one-to-one correspondence, and is biased towards simple functions (as Occam and Solomonoff demand): to a first approximation, SGD is going to find the simplest function that fits the training data (because simple functions correspond to large "basins" of approximately equal loss which are easy for SGD to find because they use fewer parameters or are more robust to some parameters being wrong), even though the network architecture is capable of representing astronomically many other functions that also fit the training data but have more complicated behavior elsewhere.
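To gesture at what that bias looks like in the simplest possible setting, here's a minimal sampling sketch in the spirit of the parameter-to-function-map literature (the tiny 7–16–1 ReLU architecture, Gaussian initialization, and sample count are arbitrary choices of mine, and the exact numbers will vary from run to run):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# All 2^7 = 128 binary inputs for a tiny 7-bit domain.
X = np.array([[(i >> b) & 1 for b in range(7)] for i in range(128)], dtype=float)

def random_boolean_function():
    """Sample a random 7-16-1 ReLU network and return the Boolean function it
    computes on the full domain, encoded as a 128-character string."""
    W1 = rng.normal(size=(7, 16))
    b1 = rng.normal(size=16)
    w2 = rng.normal(size=16)
    b2 = rng.normal()
    out = np.maximum(X @ W1 + b1, 0.0) @ w2 + b2
    return "".join("1" if o > 0 else "0" for o in out)

counts = Counter(random_boolean_function() for _ in range(20_000))

# If the parameter-to-function map were anywhere near uniform over the 2^128
# possible Boolean functions, 20,000 draws would essentially never collide.
# In runs like this, a few very simple functions (constant or near-constant
# ones) instead tend to show up again and again.
for fn, n in counts.most_common(5):
    print(f"{n:5d} occurrences; fraction of 1s: {fn.count('1') / 128:.2f}")
```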
But if that story is correct, then "But what about Occam" isn't something you can offhandedly address as an afterthought to an allegory about how misalignment is the default because there are astronomically many functions that fit the training data. Whether the simplest function is misaligned (as posited by List of Lethalities #20) is the thing you have to explain!
We do not have the benefit that a breeding program done on dogs or humans has, of having already "pinned down" a core creature with known core traits and variation being laid down in a fairly predictable manner. There's only so far you can "stretch," if you’re taking single steps at a time from the starting point of "dog" or "human."
But you must realize that this sounds remarkably like the safety case for the current AI paradigm of LLMs + RLHF/RLAIF/RLVR! That is, the reason some people think that current-paradigm AI looks relatively safe is because they think that the capabilities of LLMs come from approximating the pretraining distribution, and RLHF/RLAIF/RLVR merely better elicits those capabilities by upweighting the rewarded trajectories (as evidenced by base models outperforming RL-trained models in pass@k evaluations for k in the hundreds or thousands) rather than discovering new "alien" capabilities from scratch.
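For readers who haven't run into the metric: pass@k measures whether at least one of k sampled attempts solves a task, and it's usually computed with the unbiased estimator popularized by the Codex paper rather than by literally drawing k attempts. A minimal sketch (the example numbers are made up):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of k
    completions, drawn without replacement from n sampled completions of
    which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Made-up numbers: a model that solves a task on 3 of 1,000 samples looks
# weak at pass@1 but strong at pass@500.
print(pass_at_k(1000, 3, 1))    # ~0.003
print(pass_at_k(1000, 3, 500))  # ~0.88
```

The relevance here is that a base model only has to solve a task on some small fraction of samples for pass@k at large k to be high, which is what makes "RL mostly reweights trajectories the base model could already produce" a checkable claim rather than a slogan.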
If anything, the alignment case for SGD looks a lot better than that for selective breeding, because we get to specify as many billions and billions of input–output pairs for our network to approximate as we want (with the misalignment risk being that, as you say, if we don't know how to choose the right data, the network might not generalize the way we want). Imagine trying to breed a dog to speak perfect English the way LLMs do!
There's a lot I don't like about this post (trying to do away with the principle of indifference or with goals is terrible greedy reductionism), but the core point, that goal counting arguments of the form "many goals could perform well on the training set, so you'll probably get the wrong one" seem to falsely imply that neural networks shouldn't generalize (because many functions could perform well on the training set), seems so important and underappreciated that I might have to give it a very grudging +9. (Update: downgraded to +1 in light of Steven Byrnes's comment.)
In the comments, Evan Hubinger and John Wentworth argue for a corrected goal-counting argument, but in both cases, I don't get it: it seems to me that simpler goals should be favored, rather than choosing randomly from a large space of equally probable goals. (This doesn't solve the alignment problem, because the simplest generalization need not be the one we want, per "List of Lethalities" #20.)
I am painfully aware that the problem might be on my end. (I'm saying "I don't get it", not "This is wrong.") Could someone help me out here? What does the correct goal-counting argument look like?
both counting arguments involve an inference from "there are 'more' networks with property X" to "networks are likely to have property X."
This is still correct, even though the "models should overfit" conclusion is false, because simpler functions have more networks (parameterizations) corresponding to them.
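Here's the toy picture I have in mind for "simpler functions have more parameterizations" (my own illustration, not anyone else's argument): take a two-parameter "network" f(x) = a·b·x. Functionally, only the product c = a·b matters, so entire curves in parameter space implement one and the same function, and the region implementing the near-zero (arguably simplest) function is disproportionately large:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-parameter "network" f(x) = a*b*x: the function it computes is determined
# entirely by the product c = a*b, so many parameter settings implement one
# and the same function.
a = rng.uniform(-1, 1, size=1_000_000)
b = rng.uniform(-1, 1, size=1_000_000)
c = a * b

# Fraction of parameter space implementing a function close to the zero
# function, versus an equally wide band of functions away from zero.
print(f"|c| < 0.05:       {np.mean(np.abs(c) < 0.05):.3f}")        # ~0.20
print(f"|c - 0.5| < 0.05: {np.mean(np.abs(c - 0.5) < 0.05):.3f}")  # ~0.035
```

Uniform counting over functions (i.e., over values of c) would treat those two bands identically; weighting by parameter-space volume does not.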
OK, but is there a version of the MIRI position, more recent than 2022, that's not written for the outgroup?
I'm guessing MIRI's answer is probably something like, "No, and that's fine, because there hasn't been any relevant new evidence since 2022"?
But if you're trying to make the strongest case, I don't think the state of debate in 2022 ever got its four layers.
Take, say, Paul Christiano's 2022 "Where I Agree and Disagree With Eliezer", disagreement #18:
I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down at a level very far beyond current human abilities.
If Christiano is right, that seems like a huge blow to the argumentative structure of If Anyone Builds It. You have a whole chapter in your book denying this.
What is MIRI's response to the "but what about selective breeding" objection? I still don't know! (Yudkowsky affirmed in the comment section that Christiano's post as a whole was a solid contribution.) Is there just no response? I'm not seeing anything in the chapter 4 resources.
If there's no response, then why not? Did you just not get around to it, and this will be addressed now that I've left this comment bringing it to your attention?
The question is why that argument doesn't rule out all the things we do successfully use deep learning for. Do image classification, or speech synthesis, or helpful assistants that speak natural language and know everything on the internet "fall nicely out of any analysis of the neural network prior and associated training dynamics"? These applications are only possible because generalization often works out in our favor. (For example, LLM assistants follow instructions that they haven't seen before, and can even follow instructions in other languages despite the instruction-tuning data being in English.)
Again, obviously that doesn't mean superintelligence won't kill the humans for any number of other reasons that we've both read many hundreds of thousands of words about. But in order to convince people not to build it, we want to use the best, most convincing arguments, and "you don't get what you want out of training" as a generic objection to deep learning isn't very convincing if it proves too much.