Work supported by a Manifund grant titled "Alignment is hard."

While many people have claimed that the alignment problem is hard in an engineering sense, this paper argues that it is impossible in at least one case in a theoretical computer science sense. The argument being formalized is that if we can't prove a program will loop forever, we can't prove an agent will care about us forever. More formally: when the agent's environment can be modeled with discrete time, the agent's architecture is agentically Turing-complete, and the agent's code is immutable, testing the agent's alignment is coRE-hard if the alignment schema is demon-having, angel-having, universally betrayal-sensitive, perfect, and thought-apathetic. Further research could relax most of the assumptions in this argument, other than the immutable code.
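To make the shape of the argument concrete, here is a minimal sketch of the classic halting-problem reduction the abstract gestures at. This is my illustration, not code from the paper: `is_aligned`, `make_agent`, and `halted_by` are all hypothetical names standing in for the paper's formal objects.

```python
# A minimal sketch of the halting-problem reduction behind the claim,
# not code from the paper. `is_aligned` stands for the hypothetical
# total procedure that tests alignment; every other name here is
# illustrative.

ALIGNED, BETRAY = "act aligned", "betray"


def is_aligned(agent) -> bool:
    """The alignment test the paper argues cannot exist.

    A total decider would have to return True iff agent(t) is
    ALIGNED for every timestep t >= 0.
    """
    raise NotImplementedError("assumed-impossible oracle")


def make_agent(halted_by):
    """Wrap an arbitrary program as an immutable agent policy.

    halted_by(t) reports whether a step-by-step simulation of the
    program has halted within t steps. The agent acts aligned at
    every timestep until the simulation halts, then betrays; so it
    is aligned forever iff the program never halts.
    """
    def agent(t: int) -> str:
        return BETRAY if halted_by(t) else ALIGNED
    return agent


def halts(halted_by) -> bool:
    # If is_aligned were computable, this would decide the halting
    # problem for the simulated program, a contradiction.
    return not is_aligned(make_agent(halted_by))
```

The wrapper is the whole trick: any decider for "aligned forever" doubles as a decider for "loops forever," which is the coRE-hardness in the formal statement.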

This is my first major paper on alignment. Since there isn't really an alignment journal, I'm aiming to have this post act as a peer-review step, but forums are weird. Getting the formatting right here seems dubious, so I'm posting the abstract and linking the PDF.


... but the inability to solve the halting problem doesn't imply that you can't construct a program that you can prove will or won't halt, only that there are programs for which you can't determine that by examination.

I originally wrote "You wouldn't try to build an 'aligned' agent by creating arbitrary programs at random and then checking to see if they happened to meet your definition of alignment"... but on reflection that's more or less what a lot of people do seem to be trying to do. I'm not sure a mere proof of impossibility is going to deter somebody like that, though.
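To illustrate the commenter's first point with a hedged example of my own (not theirs): undecidability is a claim about all programs, and plenty of individual programs still have easy termination or non-termination proofs by construction.

```python
# Two programs whose halting behavior is provable by inspection,
# illustrating that undecidability only bites "in general".

def provably_halts(n: int) -> int:
    # Terminates for every n: the loop runs at most n times.
    total = 0
    for i in range(n):
        total += i
    return total


def provably_loops() -> None:
    # Never halts: the condition is constantly True and the body
    # contains no break, return, or raise.
    while True:
        pass
```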

On your first point, correct: the thing shown to be uncomputable is testing alignment. And yes, uncomputability is a worst-case claim. Would it be clearer to call the paper an uncomputable alignment TEST as opposed to an uncomputable alignment PROBLEM? (I'm considering editing the paper before submitting it to a journal.)

Deterring a few would be nice. More realistically, proofs in this vein could help convince regulators to ignore opaque-box makers' claims about detecting an agent's alignment.

I think that would help. The existing title primed me to expect something else, more along the lines of it being impossible for an "aligned" program to exist because it couldn't figure out what to do.

Or perhaps the direct-statement style: "Aligned status of software is undecidable," or something like that.

Thanks for the feedback!