to prove the AI's goal will remain stable over millions of self-modifications (the harder, and all too frequently ignored, side of the problem)
What's the easier side?
Figuring out what the goal should be (note, I said easier, not easy). You probably know more than I do, but the way I see it, the whole thing breaks down into a philosophy problem and a maths problem. Most people find philosophy more fun than maths, so they spend all their time debating the former.
I don't know if this is a little too far afield even for a Discussion post, but people seemed to enjoy my previous articles (Girl Scouts financial filings, video game console insurance, philosophy of identity/abortion, & prediction market fees), so...
I recently wrote up an idea that has been bouncing around my head ever since I watched Death Note years ago: can we quantify Light Yagami's mistakes? Which mistake was the greatest? How could one do better? We can shed some light on the matter by examining DN with... basic information theory.
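The core trick, for anyone who wants a preview: measure anonymity as the log2 of the size of the suspect pool, so each clue leaks a number of bits equal to how much it shrinks that pool. A minimal sketch in Python (the population figures are rough round numbers I'm using for illustration, not exact values from the essay):

```python
import math

def bits_of_anonymity(candidates: int) -> float:
    """Anonymity in bits: log2 of the size of the suspect pool."""
    return math.log2(candidates)

def bits_leaked(pool_before: int, pool_after: int) -> float:
    """Information a clue leaks = how much it shrinks the suspect pool."""
    return math.log2(pool_before / pool_after)

world = 7_000_000_000  # rough world population (assumed figure)
japan = 128_000_000    # rough population of Japan (assumed figure)
kanto = 43_000_000     # rough population of the Kanto region (assumed figure)

print(f"Starting anonymity: {bits_of_anonymity(world):.1f} bits")      # ~32.7 bits
print(f"'Kira is in Japan' leaks {bits_leaked(world, japan):.1f} bits")  # ~5.8 bits
print(f"'Kira is in Kanto' leaks {bits_leaked(japan, kanto):.1f} more")  # ~1.6 bits
```

Each clue on its own looks harmless, but the bits add up fast against a starting budget of only ~33.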
Presented for LessWrong's consideration: Death Note & Anonymity.