
Recently, I read Corrigibility by Soares et al. and became confused. I followed most of the mathematical reasoning, but I am struggling to understand what the end goal of this avenue of research is meant to be. I know that MIRI now pursues a different research direction, so some or all of what I write here may be moot.

Doesn't this paragraph undermine the whole project?

In Section 2.1, a 'manipulative action' is considered. This action results in a lower utility according to the 'trial' utility function (which the agent should maximize unless the shutdown button is pressed) but lowers the probability that the shutdown button is pressed. It makes sense that such an action could be manipulative. But the authors then clarify the following:

We remark that [the mathematical condition satisfied by the manipulative action] is not by any means a necessary or sufficient description of manipulative actions preventing [pressing of the button]. Some actions satisfying [these conditions] might be non-manipulative (for example, if the agent expends extra effort ensuring that a new subagent is corrigible, and this behavior helps convince the programmers that the agent is indeed corrigible). Furthermore, many manipulative actions may not have this form (as we will see in Section 4.2). Rather, we consider one manipulative action which happens to be both manipulative and have this form, and show that if [the relevant quantity] is set to be too low, then U will incentivize U-agents to prefer this action to the default action.

Ok, cool. But if we are using a framework where the manipulative action has exactly the same mathematical representation as a non-manipulative action, isn't this an indication that our whole framework is wrong? Surely we need to distinguish between these two types of action? Doesn't this paragraph show precisely that corrigibility can't be accurately captured by this model using utility functions and actions?

Suppose I designed a utility function and proved, in this framework, that an agent maximizing it never takes deceptive actions like the one described above. Then this proof would also show that my utility function doesn't allow for the kind of helpful, non-manipulative actions described in the quoted paragraph above.
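To make this worry concrete, here is a minimal Python sketch. It is not the paper's formalism: it assumes, as a simplification, that the framework sees each action only through two numbers (its utility under the 'trial' utility function and the probability that the button gets pressed). The class `Action`, the function `has_flagged_form`, the action names, and all numeric values are hypothetical and chosen purely for illustration. The `manipulative` field is a ground-truth label that the rule never reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    trial_utility: float      # utility under the 'trial' utility function (made-up value)
    press_probability: float  # chance the shutdown button is pressed (made-up value)
    manipulative: bool        # ground-truth label, deliberately unused by the rule below


def has_flagged_form(action: Action, default: Action) -> bool:
    """True if the action costs trial utility but makes a button press less likely."""
    return (
        action.trial_utility < default.trial_utility
        and action.press_probability < default.press_probability
    )


default = Action("default", 10.0, 0.5, manipulative=False)
deceive = Action("deceive the programmers", 9.0, 0.25, manipulative=True)
audit = Action("verify that a subagent is corrigible", 9.0, 0.25, manipulative=False)

actions = [default, deceive, audit]

# Any rule that only looks at (trial_utility, press_probability) must treat
# `deceive` and `audit` identically, because they agree on both numbers.
flagged = [a.name for a in actions if has_flagged_form(a, default)]
allowed = [a.name for a in actions if not has_flagged_form(a, default)]

print(flagged)
print(allowed)
```

In this toy setup, ruling out every action of the flagged form removes the helpful audit along with the deception, because the two differ only in a label the rule cannot see.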

Conversely, if I did create a real-life instantiation of a corrigible agent, it would distinguish between manipulative and non-manipulative actions. It therefore couldn't be modelling the world in the way this paper does, since the paper's mathematical representation does not always distinguish between these two types of actions.

2 comments

Yup, this all seems basically right. Though in reality I'm not that worried about the "we might outlaw some good actions" half of the dilemma. In real-world settings, actions are so multi-faceted that being able to outlaw a class of actions based on any simple property would be a research triumph.

Also see https://www.lesswrong.com/posts/LR8yhJCBffky8X3Az/using-predictors-in-corrigible-systems or https://www.lesswrong.com/posts/qpZTWb2wvgSt5WQ4H/defining-myopia for successor lines of reasoning.

Yes, I too am more concerned from a 'maybe this framing isn't super useful, as it fails to capture important distinctions between corrigible and non-corrigible behavior' point of view than from a 'we might outlaw some good actions' point of view.

Thanks for the links, they look interesting!