As someone who is not involved with MIRI, consideration of some FAI-related problems is at least somewhat disincentivized by the likelihood that MIRI already has an answer.
Yeah, sorry about that -- we are taking some actions to close the writing/research gap and make it easier for people to contribute fresh results, but it will take time for those to come to fruition. In the interim, all I can provide is LW karma and textual reinforcement. Nice work!
(We are in new territory now, FWIW.)
I agree with these concerns; specifying US is really hard and making it interact nicely with UN is also hard.
- How does this model extend past the three-timestep toy scenario?
Roughly, you add correction terms f1(a1), f2(a1, o1, a2), etc. for every partial history, where each one is defined as E[Ux|A1=a1, O1=o1, ..., do(On rel Press)]. (I think.)
- Does the model remain stable under assumptions of bounded computational power?
Things are certainly difficult, and the dependence upon this particular agent's expectations is indeed weird/brittle. (For example, consider another agent maximizing this utility function, where the expectations are the first agent's expectations. Now it's probably incentivized to exploit places where the first agent's expectations are known to be incorrect, although I haven't the time right now to figure out exactly how.) This seems like potentially a good place to keep poking.
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.