Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities
1. Summary and overview

LLMs seem to lack the metacognitive skills that help humans catch errors. Improving those skills might be net positive for alignment, despite expanding capabilities in new directions.

Better metacognition would reduce LLM errors both by catching mistakes and by managing complex cognition to produce better answers in the first place. This could stabilize or regularize alignment, allowing systems to avoid actions they would not "endorse on reflection" (in some functional sense).[1]

Better metacognition could also make LLM systems useful for clarifying the conceptual problems of alignment. It would reduce sycophancy, and it would help LLMs organize the complex thinking necessary for clarifying claims and cruxes in the literature. Without such improvements, collaborating with LLM systems on alignment research could be the median doom-path: slop, not scheming. LLMs are sycophantic, agreeing with their users too much, and they produce compelling-but-erroneous "slop". Human brains produce slop and sycophancy too, but we have metacognitive skills, mechanisms, and strategies to catch those errors. Considering our metacognitive skills gives some insight into how they might be developed for LLMs, and how they might help with alignment (§6, §7).

I'm not advocating for this. I'm noting that work is underway, noting the potential for capability gains, and noting the possibility that the benefits for alignment outweigh the danger from capability improvements. I'm writing about this because I think plans for alignment work should take these possibilities into account.[2] I'll elaborate on all of that in turn.

I hypothesize that metacognitive skills constitute a major part of the "dark matter of intelligence"[3] that separates LLMs and LLM agents from human-level competence. I (along with many others) have spent a lot of time wondering why LLMs appear so intelligent in some contexts, but wildly incompetent in others. I now think metacognitive skills are a major part of the answer.
I find it a bit weird that this argument needs to be made at all. But it does, in current company at least.
One argument for current company is: have you actually met people? Or only your EA friends, some of the sweetest (and non-coincidentally most privileged) people ever to exist?
Outside of the EA sphere, I doubt people would be as drawn to the idea that maybe minds are safe and nice by default. There's only a vague and arguable tendency for smarter people to be nicer. And even the nice smart people aren't that nice. And most of history looks suffused with ruthless sociopathy to my eye.
I suspect most humans would be...