I saw a talk earlier this year that mentioned this 2015 Corrigibility paper as a good starting point for someone new to alignment research. If that's still true, I started writing up some thoughts on a possible generalization of the method in that paper.
Anyway, I'm submitting this draft early in the hope of getting some feedback on whether I'm on the right track:
GeneralizedUtilityIndifference_Draft_Latest.pdf
The new version does better on sub-agent shutdown and eliminates the "managing the news" problem.
(Let me know if someone has already thought of this approach!)
EDIT 2017-11-09: filled in the section on the n-action model.