So8res comments on Introducing Corrigibility (an FAI research subfield) - LessWrong
Yeah, sorry about that -- we are taking some actions to close the writing/research gap and make it easier for people to contribute fresh results, but it will take time for those to come to fruition. In the interim, all I can provide is LW karma and textual reinforcement. Nice work!
(We are in new territory now, FWIW.)
I agree with these concerns; specifying U_S is really hard, and making it interact nicely with U_N is also hard.
Roughly, you add correction terms f_1(a_1), f_2(a_1, o_1, a_2), etc. for every partial history, where each one is defined as E[U_x | A_1=a_1, O_1=o_1, ..., do(O_n rel Press)]. (I think.)
Things are certainly difficult, and the dependence upon this particular agent's expectations is indeed weird/brittle. (For example, consider another agent maximizing this utility function, where the expectations are the first agent's expectations. Now it's probably incentivized to exploit places where the first agent's expectations are known to be incorrect, although I haven't the time right now to figure out exactly how.) This seems like potentially a good place to keep poking.