An Explication of Alignment Optimism
Some people have been getting more optimistic about alignment. But from a skeptical / high p(doom) perspective, the justifications for this optimism seem lacking. "Claude is nice and can kinda do moral philosophy" just doesn't address the concern that lots of long-horizon RL plus self-reflection will produce misaligned consequentialists (cf. Hubinger). So I think the casual alignment optimists aren't doing a great job of arguing their case.

Still, it feels like there's an optimistic update somewhere in the current trajectory of AI development. It really is kinda crazy how capable current models are, and how much I basically trust them. Paradoxically, most of this trust comes from a lack of capabilities (current models couldn't seize power right now if they tried).

...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb. Slow takeoff implies that we'll get the stupidest possible transformative AI first, and Moravec's paradox points to a similar conclusion. Calling LLMs a "cultural technology" can be a form of AI denialism, but there's still an important truth there: if the secret of our success is culture, then maybe culture(++) is all you need.

Of course, the concern is that soon after we have stupid AI systems, we'll have even less stupid ones. But on my reading, the MIRI types were skeptical that we could get the transformative stuff at all without the dangerous capabilities coming bundled in. I think LLMs and their derivatives have provided substantial evidence that we can.