let's call "hard alignment" the ("orthodox") problem, historically worked on by MIRI, of preventing strong agentic AIs from pursuing things we don't care about by default and destroying everything of value to us on the way there. let's call "easy" alignment the set of perspectives where some of this model is wrong — some of the assumptions are relaxed — such that saving the world is easier or more likely to be the default.
what should one be working on? as always, the calculation consists of comparing
- p(hard) × how much value we can get in hard
- p(easy) × how much value we can get in easy
given how AI capabilities are going, it's not unreasonable for people to start playing their outs — that is to say, to start acting as if alignment is easy, because if it's not we're doomed anyways. but i think, in this particular case, this is wrong.
this is the lesson of dying with dignity and bracing for the alignment tunnel: we should be cooperating with our counterfactual selves and continuing to save the world in whatever way actually seems promising, rather than taking refuge in falsehood.
to me, p(hard) is big enough, and my hard-compatible plan seems workable enough, that it makes sense for me to continue to work on it.
let's not give up on the assumptions that are true. there is still work that can be done to actually generate some dignity under them.
the hard problem of alignment is going to hit us like a train in 3 to 12 months, at the same time as some specific capabilities breakthroughs that people have been working on for the entire history of ML finally start working now that there is a weak AGI to apply them to, and suddenly critch's stuff becomes super duper important to understand.