Review

TL;DR: the author makes some reasonable, good-faith arguments (including an analysis of a neat survey he sent to his friends), but I don't think these arguments are strong enough to meaningfully support his conclusions.

Jascha Sohl-Dickstein is a senior staff research scientist at Google Brain with an impressive research track record; for example, he basically co-invented diffusion models. In March he wrote a blog post titled "The hot mess theory of AI misalignment: More intelligent agents behave less coherently." My notes are below.

The start of the post

  1. Jascha doesn't make any ad hominem or psychoanalysis-based arguments, which is promising. He also cites Katja Grace's "Counterarguments to the basic AI x-risk case." Rohin Shah + Geoffrey Irving reviewed earlier versions of Jascha's post.
  2. Jascha states "Most work on AGI misalignment risk assumes that, unlike us, smart AI will not be a hot mess." I don't know where he gets this impression; I doubt that most AGI alignment researchers would agree with the statement "your research assumes that smart AI will not be a hot mess."
  3. The post contains a decent summary of the classic misaligned brain-in-a-box scenario. Jascha calls this "the common narrative of existential risk from misaligned AGI," which seems misleading since there are other threat models.
  4. In an earlier post, Jascha argued that Goodhart's law makes extreme efficiency perilous.
  5. Jascha is "extremely glad people are worrying about and trying to prevent negative consequences from AI."
  6. But he also thinks "predicting the future is hard, and predicting aspects of the future which involve multiple uncertain steps is almost impossible." This is reasonable and resembles some of Nuño Sempere's concerns.
  7. He goes on to claim that "An accidentally misaligned superintelligence which poses an existential risk to humanity seems about as likely as any other specific hypothesis for the future which relies on a dependency chain of untested assumptions." This conclusion, which very faintly resembles the "safe uncertainty fallacy," seems far too strong to draw from (6) alone.
  8. In footnote 5, Jascha outlines other AI x-risks that seem "at least as plausible" as misaligned AGI:
    1. WW3 with AI-guided WMDs
    2. terrorists use AI to build WMDs
    3. a regime uses AI to make everyone compliant forever
    4. highly addictive AI-generated experiences
    5. massive unemployment with zero social safety nets
    6. one tech corporation wins the AGI race and steers the future of humanity in bad ways
  9. Jascha thinks instrumental convergence assumes the agent "will also monomaniacally pursue a consistent and well-defined goal" and will exhibit "much more coherent behavior than any human or human institution exhibits." I think this claim is false. Imagine giving the US Congress (one of the least coherent organizations in Jascha's survey) the power to achieve ~anything it wants. Or imagine giving a random person (one of the least coherent biological creatures in Jascha's survey) this power. I think these agents would effectively pursue instrumental subgoals despite being hot messes. Therefore, I don't think instrumental convergence requires agents to be supercoherent, just somewhat coherent.
  10. An imaginary debate about whether the world will build agentic/goal-directed systems. (With inspiration from Leo Gao and Michaël Trazzi.)
    1. Jascha: "When large language models behave in unexpected ways, it is almost never because there is a clearly defined goal they are pursuing in lieu of their instructions." 
    2. Jakub: Yes, but I expect people to use LLM technology to make more goal-directed systems, e.g. PaLM-SayCan or TaskMatrix.AI or AutoGPT; indeed, there are some incentives for doing so.
    3. Jascha: Yes, but building such systems will probably require hard, incremental, deliberate work.
    4. Jakub: Sure, but it might not, thanks to AI improving AI?

The experiment

  1. Jascha enlisted help from n=14 "incredibly busy" friends with academic neuroscience and ML backgrounds. 
  2. The first four people brainstormed lists of ML models, non-human organisms, famous humans, and human institutions. These lists contained 60 entities in total.
  3. The next four people (plus one of the original four, so five in total) sorted these 60 entities based on intelligence, and the last six people sorted based on coherence. Jascha's specific questions are in this doc.
  4. Jascha caveats appropriately: "this experiment aggregates the subjective judgements of a small group with homogenous backgrounds."
  5. The results: a strong correlation between incoherence and intelligence in biological creatures, a moderate correlation in human institutions, and a strong correlation in ML systems. 
  6. Note that, judging by rank correlations between raters, "Human judgments of intelligence are consistent across subjects, but judgements of coherence differ wildly."
  7. The experimental data and the analysis Colab are both public.
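The Colab has the actual analysis; purely as an illustration of the kind of statistic behind claims like "a strong correlation between incoherence and intelligence," here is a minimal rank-correlation sketch. The entities and rankings below are invented for this example and are not Jascha's survey data.

```python
# Minimal sketch of a rank-correlation check, in the spirit of the post's analysis.
# The entities and rankings are made up for illustration; the real data and code
# are in Jascha's public Colab.
from scipy.stats import spearmanr

entities = ["ant", "ostrich", "chimpanzee", "human"]  # hypothetical entities
intelligence_ranks = [1, 2, 3, 4]   # hypothetical aggregate intelligence ranks (1 = least)
coherence_ranks    = [4, 3, 2, 1]   # hypothetical aggregate coherence ranks (1 = least)

for name, i_rank, c_rank in zip(entities, intelligence_ranks, coherence_ranks):
    print(f"{name:12s} intelligence rank {i_rank}, coherence rank {c_rank}")

rho, p_value = spearmanr(intelligence_ranks, coherence_ranks)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A strongly negative rho between intelligence and coherence corresponds to the
# post's claim that more intelligent entities are judged less coherent.
```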

The takeaways

  1. Jascha says "If AI models are subtly misaligned and supercoherent, they may seem cooperative until the moment the difference between their objective and human interest becomes relevant, and they turn on us (from our perspective). If models are instead simply incoherent, this will be obvious at every stage of development." I'm quite skeptical of this claim. Was it obvious, at every stage of development, that DALL-E 2 was incoherent? Also, an extremely clever, supercoherent system could just pretend to be less coherent.
  2. Jascha discusses some counterarguments:
    1. Maybe this intelligence-incoherence correlation breaks down for sufficiently intelligent agents.
      1. Yup. I think this is another example of the "lab mice problem" from "AI Safety Seems Hard to Measure."
    2. Maybe the human raters gave bad ratings.
      1. I strongly agree. The coherence rankings "differed wildly" between the six respondents. 
      2. Also, some coherence respondents may have been tracking something like "does this entity have dissociative identities or distinct subpersonalities," and I think a superintelligent agent with these qualities could still cause a catastrophe. For example, it seems like language models can act like different characters, including more goal-directed characters, when prompted appropriately (e.g. see the prompts in the MACHIAVELLI benchmark paper).
    3. Perhaps high intelligence counteracts low coherence. Jascha postulates that "The effective capabilities that an entity applies to achieving an objective is roughly the product of its total capabilities, with the fraction of its capabilities that are applied in a coherent fashion." He recommends building evaluations that can better assess these "effective capabilities." (I write this out as a formula right after this list.)
  3. Jascha advocates for "adding subtlety to how we interpret misalignment" and adjusting research priorities accordingly. He thinks risks similar to "industrial accidents or misinformation" are more likely than a misaligned singleton. In subsequent Twitter discussions, Jascha remarked that losing control of a collection of hot messes seems like a "worryingly plausible" risk scenario.
  4. Footnote 16 makes an intriguing connection between the lack of empirical data in AGI threat models and the struggles of early theories in computational neuroscience, and Jascha remarks that "we seem to have many ideas about AI risk which are only supported by long written arguments." I think Jacob Steinhardt addresses this objection reasonably in his "More is Different" blog post series. Also, I do think we are starting to see empirical evidence for AGI power-seeking.
  5. The final section considers ways to improve the intelligence-coherence survey experiment, including fun ideas such as asking pairwise questions like "Is an ostrich or an ant smarter?" (instead of ranking all the entities) and then assigning each entity an Elo score (a toy version of this appears after the list below). Another idea is to "replace subjective judgements of intelligence and coherence with objective attributes." Jascha notes that finding appropriate empirical measures for coherence "would be a major research contribution on its own." I agree, and I encourage Jascha to try working on this.
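Spelling out the "effective capabilities" postulate from the counterarguments above as a formula (my notation, not Jascha's):

```latex
C_{\text{effective}} \;\approx\; C_{\text{total}} \times f_{\text{coherent}},
\qquad 0 \le f_{\text{coherent}} \le 1
```

where C_total is the entity's total capabilities and f_coherent is the fraction of those capabilities applied coherently toward the objective. On this reading, an evaluation that measures only raw capability will overstate what an incoherent agent can actually direct at a single goal, which is presumably why Jascha wants evaluations that target effective capabilities instead.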
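And here is a toy version of the pairwise-comparison idea: collect "which is smarter?" judgements between pairs of entities and aggregate them into Elo scores. The pairs, outcomes, initial rating, and K-factor below are all invented for illustration; none of this comes from the post.

```python
# Toy Elo aggregation of pairwise "which is smarter?" judgements.
# All data below is hypothetical; it is not from Jascha's survey.

def update_elo(winner_rating: float, loser_rating: float, k: float = 32.0):
    """Standard Elo update: the winner gains more when the win was unexpected."""
    expected_win = 1.0 / (1.0 + 10 ** ((loser_rating - winner_rating) / 400.0))
    delta = k * (1.0 - expected_win)
    return winner_rating + delta, loser_rating - delta

# Hypothetical judgements, each recorded as (winner, loser).
judgements = [("ostrich", "ant"), ("chimpanzee", "ostrich"), ("chimpanzee", "ant")]

ratings = {name: 1000.0 for pair in judgements for name in pair}
for winner, loser in judgements:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser])

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {rating:7.1f}")
```

The same scheme could be run separately for "which is more coherent?" judgements, giving each entity an intelligence Elo and a coherence Elo to correlate.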

Jakub's opinion

In summary, Jascha (1) identifies an important property (coherence) of AGI systems in a particular risk narrative, (2) asks several friends to rate the intelligence and coherence of 60 existing entities, and (3) extrapolates the observed trends to argue that AGI systems probably won't be maximally coherent, challenging the assumptions in the risk narrative.

The experiment is a cool idea, but I think that (a) the sample size was too small, (b) the coherence assessments were probably inaccurate, and (c) the correlations could easily break down for more intelligent systems. Most importantly, I don't think a system needs to be supercoherent to single-handedly cause a catastrophe, and I think there are plenty of other threat models besides the brain-in-a-box scenario. For these reasons, I think Jascha's post provides only weak evidence for AGI x-risk being unlikely.

I especially like this perspective from Lauro Langosco:

Instead of thinking in terms of "coherence" vs. "hot mess", it is more fruitful to think about "how much influence is this system exerting on its environment?". Too much influence will kill humans, if directed at an outcome we're not able to choose.

I'd be curious to hear Jascha's response to Jacob Steinhardt's "More is Different" series (especially the introduction and the first two posts), since he seems quite skeptical of using theoretical reasoning to predict the behavior of future ML systems. I'd also like to hear his thoughts on "The alignment problem from a deep learning perspective," since it provides some empirical examples.


For other people's responses to Jascha's post, see the comments of this LessWrong linkpost.