Historically, I've gotten the impression from the AI Safety community that theoretical alignment work is promising in long-timeline worlds but less relevant in short-timeline ones. As timelines have shrunk and capabilities have accelerated, I feel like I've seen theoretical alignment work appear less frequently and with less fanfare.

However, I think we should challenge this communal assumption. In particular, not all capabilities advance at the same rate, and I expect formally verified mathematical reasoning to accelerate particularly quickly. Formally verified mathematics has a beautifully clean training signal; it feels like the perfect setup for aggressive amounts of RL. We know that DeepMind's AlphaProof can reach IMO silver-medal performance (with caveats) while writing its proofs in Lean, and there's no reason to expect that performance to have plateaued in the months since AlphaProof's announcement. o3's announcement also showed a huge leap on the FrontierMath benchmark, and I expect that this level of performance on general mathematical questions, combined with excellent coding ability, should open the door to Lean-checkable proofs of non-trivial results. I think arms-race dynamics are at a point where we can expect models capable of writing near-superhuman Lean proofs to be released in the coming months.

That expectation... really doesn't feel priced in among my theoretical AI Safety connections right now. Everyone still seems to be conducting the same style of research as three months ago: aware of all the new announcements, but not adjusting in response to them. I claim that a theoretical AI Safety community that had internalized this expectation would be doing everything in its power to make the most of near-superhuman Lean-proof-writing models the moment they become available.

I think the theoretical alignment community has 2-3 months to prepare for its "second chance": a window of opportunity in which superstar PhD students, or maybe even super-von-Neumanns, can be queried on demand, but in which more general long-horizon agentic planning is still comparatively poor. A window in which problem-solving ability could cease to be the bottleneck, replaced instead by question-posing skill. Of all the flavors of alignment research, I think theoretical alignment is going to ride the takeoff curve faster than anything else.

I also claim that nothing short of demanding formal verification for theoretical results will do. Almost all current-generation models obfuscate or minimize the flaws in their proofs when they cannot fully solve a problem, and I expect this to become harder to detect as mathematical capabilities advance. The formal verification community has spent years building tools that completely neutralize the potential of near-superhuman models to deceive us mathematically; it would be nonsensical to ignore those tools.
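To make that concrete, here is a minimal Lean 4 sketch (theorem names are purely illustrative, no dependencies beyond core Lean) of why the kernel leaves no room for obfuscation: a complete proof compiles, and any gap is surfaced as a `sorry` that the checker refuses to let slide.

```lean
-- Minimal sketch, core Lean 4 only. The kernel either accepts a complete
-- proof or flags the gap; there is no middle ground in which to hide a flaw.

-- A complete, machine-checked proof: compiles cleanly.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An incomplete "proof": Lean reports "declaration uses 'sorry'", so a model
-- that couldn't finish the argument cannot quietly paper over the hole.
theorem unproven_claim (a b : Nat) : a * b = a + b := by
  sorry
```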

What do I think the community should do to prepare?

  • Formalize quantities of interest and question statements in every field touched by theoretical AI Safety. We don't need to formalize proofs of already-known theorems; we just need a consensus on formal definitions so we're not wasting time writing those down when the floodgates open.
  • Prepare question banks. Seriously, I think it would make sense for each researcher to have a list of 100-1000 questions so that if DeepMind releases AlphaProof 2.0 tomorrow, people don't waste a week or two thinking of what to ask it. (A sketch of what a single prepared entry might look like follows below.)
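As an illustration of what such a prepared entry could look like, here is a hypothetical Lean 4 sketch (assuming Mathlib; the particular statement, a bound on discounted returns, is a stand-in rather than an actual research question). The researcher writes the definitions and the statement; the `sorry` marks exactly the slot a proof-writing model would later be asked to fill with a kernel-checked proof.

```lean
import Mathlib

-- Hypothetical question-bank entry (illustrative statement only): the total
-- discounted return of a reward stream bounded by 1 is at most 1 / (1 - γ).
-- The statement is written up front; the `sorry` is the slot a future
-- proof-writing model would be asked to fill.
theorem discounted_return_bound
    (r : ℕ → ℝ) (γ : ℝ) (hγ0 : 0 ≤ γ) (hγ1 : γ < 1)
    (hr : ∀ t, |r t| ≤ 1) :
    ∀ T, |∑ t ∈ Finset.range T, γ ^ t * r t| ≤ 1 / (1 - γ) := by
  sorry
```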

Clock's on. Good luck, everyone.
