My January alignment theory Nanowrimo

Dmitry Vaintrob

2nd Jan 2025

Update: list of posts so far. "(s)" denotes shortform.

post 1, post 2, post 3, post 4 (s), post 5 (s), post 6 (s), post 7, post 8, post 9 (s), post 10, post 11, post 12, post 13 (s), post 14, post 15, post 16 (with Lauren Greenspan), post 17, post 18, post 19 (s), post 20, post 21.

*****

This is a quick announcement/commitment post:

I've been working on the PIBBSS Horizon Scanning team (with Lauren Greenspan and Lucas Teixeira), where we have been reviewing some "basic-science-flavored" alignment and interpretability research and doing talent scouting (see this intro doc we've written so far, which we split off from an unfinished larger review). I have also been working on my own research. Aside from active projects, I've accumulated a bit of a backlog of technical writeups and shortforms in draft or "Slack discussion"-level form, with various levels of publishability.

This January, I'm planning to edit and publish some of these drafts as posts and shortforms on LW/the Alignment Forum. To keep myself accountable, I'm committing to publish at least 3 posts per week.

I'm planning to post about (a subset? superset? overlapping set? of) the following themes:

Opinionated takes on a few research directions (I have drafts on polytopes, mode connectivity, and takes on proof vs. other kinds of "principled formalism without proofs").
Notes on grammars and, more generally, on how simpler rules and formal structures can combine into larger ones. This overlaps with a project I'm working on with collaborators, involving a notion of "analogistic circuits": mechanisms that learn to generalize a complex rule "by analogy", without ever encoding the structure itself.
Joint with Lauren Greenspan and Lucas Teixeira: some additional bits of our review, with a focus on interpretability (and ways to think about assumptions and experiments).
Joint with Lauren: some distillation and discussion of QFT methods in interpretability.
Bayesian vs. SGD learning from various points of view. (Closely related to discussions with Kaarel Hänni, Lucius Bushnaq, and others).
Related to the above: extensions of the "Low-Hanging-Fruit" prior post with Nina Panickssery, specifically focusing on non-learnability of parity, and a new notion of "training stories" (this is closely related to some other work we've done with Nina, as well as joint work with Louis Jaburi).
???

I am generally resistant to making announcements before doing writeups. But in this case, I have thought for a while that these drafts might be useful to get out, but have been blocked by not wanting to post unpolished things. I'll be pointing at this announcement when posting this month for the following reasons:

I will appreciate the extra accountability.
Since I'm planning a kind of "nanowrimo" sprint, I'm using this as an excuse to post draft-quality writing (possibly with mistakes, bugs, etc.).
I'm hoping to treat this month as a test run of producing more short, imperfect, and slightly technical takes that straddle the line between distillation, hot takes, and original research (a very ambitious comparison point I have for the format is Terry Tao's blog). Depending on the success and reception of this short project, I might do either more or less of this in the future.
I'm expecting to be wrong about some things, and hoping that more eyes and discussion on the work I and my collaborators have been thinking about will help me find mistakes quickly and debug my thinking more effectively.
