All of Stephen Zhao's Comments + Replies

The first game is the prisoner's dilemma if you read the payoffs as player A/B, which is a bit different from how it's normally presented.
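For reference (the payoff numbers below are the textbook ones, not the ones from the post), a prisoner's dilemma with the payoffs read as (A, B) looks like:

$$
\begin{array}{c|cc}
 & \text{B: cooperate} & \text{B: defect} \\
\hline
\text{A: cooperate} & (3,\,3) & (0,\,5) \\
\text{A: defect} & (5,\,0) & (1,\,1)
\end{array}
$$

Defecting strictly dominates for each player, yet mutual defection leaves both worse off than mutual cooperation, which is also why the game is not zero sum.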

And yes, prisoner's dilemma is not zero sum.

Short comment on the last point - euthanasia is legal in several countries (so there, wanting to die is not prevented, and is even socially accepted), and in my opinion it is the moral choice of action in certain situations.

Thanks for your response - good points and food for thought there. 

One of my points is that whether this problem arises depends on your formulation of empowerment, so you have to be very careful about how you mathematically formulate and implement it. With a naive implementation I think you are very likely to get undesirable behaviour (which is why I linked the AvE paper as an example of what can happen).
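To make the "formulation matters" point concrete, here is a minimal sketch of the usual n-step empowerment quantity, restricted to deterministic dynamics, where the channel-capacity maximization reduces to log2 of the number of distinct reachable states. The toy dynamics (`free_step`, `clamped_step`) and the horizon `n` are illustrative stand-ins, not taken from the post or the AvE paper:

```python
import math
from itertools import product

def n_step_empowerment(state, actions, step, n):
    """n-step empowerment under deterministic dynamics.

    With deterministic transitions, the channel capacity
    max_{p(a^n)} I(A^n; S_{t+n}) reduces to log2 of the number of
    distinct states reachable by some length-n action sequence.
    """
    reachable = set()
    for seq in product(actions, repeat=n):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

# Toy 1-D world: the agent can move left, stay, or move right.
def free_step(pos, a):
    return pos + {"left": -1, "stay": 0, "right": 1}[a]

# Same agent, but its actions have been made ineffective
# (e.g. an 'assistant' that overrides them): empowerment drops to zero.
def clamped_step(pos, a):
    return pos

acts = ["left", "stay", "right"]
print(n_step_empowerment(0, acts, free_step, n=3))     # log2(7) ~= 2.81
print(n_step_empowerment(0, acts, clamped_step, n=3))  # log2(1) = 0.0
```

The horizon n here is the same "reasonable future time cutoff" issue raised below: which behaviour maximizes this quantity can change qualitatively with n, which is part of why a naive implementation can produce unwanted behaviour.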

Also related is that it's tricky to define what the "reasonable future time cutoff" is. I don't think... (read more)

5 jacob_cannell
Thanks, I partially agree, so I'm going to start with the most probable crux: I am somewhat confident that any fully successful alignment technique (one resulting in a fully aligned CEV-style sovereign) will prevent suicide; that this is a necessarily convergent result; and that the fact that maximizing human empowerment agrees with the ideal alignment solution on suicide is actually a key litmus-test success result. In other words, I fully agree with you on the importance of the suicide case, but this evidence is in favor of human empowerment converging to CEV.

I have a few somewhat independent arguments for why CEV necessarily converges to suicide prevention:

1. The simple counterfactual argument: Consider the example of happy, well-adjusted but unlucky Bob, whose brain is struck by a cosmic ray which happens to cause some benign tumor in just the correct spot to make him completely suicidal. Clearly pre-accident Bob would not choose this future, and strongly desires interventions to prevent the cosmic ray. Any agent successfully aligned to pre-accident Bob0 would agree. It also should not matter when the cosmic ray struck - the desire of Bob0 to live outweighs the desire of Bob1 to die. Furthermore, if Bob1 had the option of removing all effects of the cosmic-ray-induced depression, they would probably take that option. Suicidal thinking is caused by suffering - via depression, physical pain, etc - and most people (nearly all people?) would take an option to eliminate their suffering without dying, if only said option existed (and they believed it would work).

2. Counterfactual intra-personal CEV coherence: A suicidal agent is one - by definition - that assigns a higher utility ranking to future worlds where they no longer exist than to future worlds where they do exist. Now consider the multiverse of all possible versions of Bob. The suicidal versions of Bob rank their worlds as lower utility than other worlds without them, and the non-suicidal versions of Bob rank their... (read more)

I think JenniferRM's comment regarding suicide raises a critical issue with human empowerment, one that I thought of before and talked with a few people about but never published. I figure I may as well write out my thoughts here since I'm probably not going to do a human empowerment research project (I almost did; this issue is one reason I didn't). 

The biggest problem I see with human empowerment is that humans do not always want to be maximally empowered at every point in time. The suicide case is a great example, but not the only one. Other exampl... (read more)

7 jacob_cannell
Yes - often we face decisions between short-term hedonic rewards and long-term empowerment (spending $100 on a nice meal, or your examples of submarine trips), and an agent optimizing purely for our empowerment would always choose long-term empowerment over any short-term gain (which can be thought of as 'spending' empowerment).

This was discussed in some other comments and I think mentioned somewhere in the article, but it should be more prominent: empowerment is only a good bound on the long-term component of utility functions, for some reasonable future time cutoff defining 'long term'. But I think modelling just the short-term component of human utility is not nearly as difficult as accurately modelling the long term, so it's still an important win. I didn't investigate that much in the article, but that is why the title is now "Empowerment is (almost) all we need".

Thanks for the link to the "Assistance via Empowerment" study, I hadn't seen that before. Based on skimming the paper, I agree there are settings of the hyperparams where the empowerment copilot doesn't help, but that is hardly surprising and doesn't tell us much - that is nearly always the case with ML systems.

On a more general note, I think the lunar lander game has far too short a planning horizon to be in the regime where you get full convergence to empowerment. Hovering in the air only maximizes myopic empowerment. If you imagine a more complex real-world scenario - the lander has limited fuel, you crash if you run out of fuel, crashing results in death, you can continue to live on a mission for years after landing, etc. - then it becomes more obvious that the optimal plan for empowerment converges to landing successfully and safely.
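For what it's worth, one way to write down the decomposition being described here (my notation, a sketch rather than anything from the article): treat the value of a state as a directly modelled short-term reward sum up to a cutoff T, plus an empowerment term standing in for the hard-to-model long-term component:

$$
V(s_0) \;\approx\; \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, a_t)\right] \;+\; \beta\,\mathfrak{E}_n(s_T),
\qquad
\mathfrak{E}_n(s) \;=\; \max_{p(a_{1:n})} I\big(A_{1:n};\, S_{+n} \mid s\big).
$$

On this reading, the lunar lander point is that the horizons T and n in that experiment are far too short: with a horizon long enough to include running out of fuel, crashing, and the post-landing mission, the empowerment term favours landing safely rather than hovering.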