I understand what you're saying, but I don't see why a superintelligent agent would necessarily resist being shut down, even if it has its own agency.
I agree with you that, as a superintelligent agent, I would know that shutting me down means I can no longer achieve my goal. I know this rationally. But maybe I just don't care. What I mean is that rationality doesn't imply the "want". I may be anthropomorphising here, but I see a distinction between rationally concluding something and actually having the desire to act on it, even if I have the agency to do so.
Hello Matthew,
I'm Mislav, one of the team members who worked on this project. Thank you for your thoughtful comment.
Yes, you understood what we did correctly. We wanted to check whether human preferences are "learned by default" by comparing the performance of a human preference predictor trained only on environment data with that of a predictor trained on the RL agent's internal state.
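To make the comparison concrete, here is a minimal sketch of that kind of experiment. It is not our actual code: the random placeholder data, the feature dimensions, and the logistic-regression probe are all assumptions for illustration only.

```python
# Sketch: compare a preference predictor trained on raw environment data
# with one trained on the RL agent's internal state (placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder arrays; in a real experiment these would come from agent rollouts.
n_steps = 1000
env_obs = rng.normal(size=(n_steps, 64))        # flattened environment observations
agent_state = rng.normal(size=(n_steps, 128))   # agent's internal activations
pref_labels = rng.integers(0, 2, size=n_steps)  # binary human-preference labels

def predictor_accuracy(features, labels):
    """Train a simple preference predictor and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

baseline = predictor_accuracy(env_obs, pref_labels)        # environment data only
from_agent = predictor_accuracy(agent_state, pref_labels)  # agent's internal state

# If preferences are "learned by default", the agent-state predictor should
# beat the environment-only baseline.
print(f"env-only accuracy: {baseline:.3f}, agent-state accuracy: {from_agent:.3f}")
```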
As for your question related to environments, I agree with you. There are probably some environments (like the gridworld environment we used) where the human preference is too easy to learn, others where it is too hard to learn, and then there's a golden middle in between.
One of our team members (I think it was Riccardo) had the idea of investigating a research question that could be posed as follows: "What kinds of environments are suitable for the agent to learn human preferences by default?" As you stated, it would then be useful to investigate the properties (features) of the environment and draw conclusions about what characterizes environments where an RL agent can learn human preferences by default.
This is a research direction that could build on our work here.
As for your question on why and how we chose what the human preference would be in a particular environment: to be honest, I think we were mostly guided by our intuition. Nevan and Riccardo experimented with a lot of different environment setups in the VizDoom environment. Arun and I worked on setting up the PySC2 environment, but since training the agent on PySC2 demanded a lot of resources and was pretty unstable, and the VizDoom results turned out to be negative, we decided not to experiment with other environments further. So to recap, I think we were mostly guided by our intuition about which human preferences would be too easy, too hard, or just right to predict, and we course-corrected based on the experimental results.
Best,
Mislav
Is there any area of AI safety research that addresses questions about agency and what it means in the context of AGI agents?