I think the initial (2-agent) model only has two time steps, ie one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity.
That can be folded into the utility function, however. Just make the ratings of the deferential person mostly copy the ratings of their partner.
Presumably the deferential parter could just use a utility function which is a weighted combination of their partner's and their own (selfish) one. For instance, the deferential partner could use a utility function like , where is the utility function of the partner and is the utility function of the deferential person accounting only for their weak personal preferences and not their altruism.
Obviously the weights could depend on the level of altruism, the strength of the partner's preferences, whether they are reporting their true preferences or the preferences such that the outcome will be what they want, etc. But this type of deferential preference can still be described by a utility function.
Thanks for the great post!
In the definition of Coalition-Perfect CoCo Equilibrium, it seems to me like part 1) is already implied by part 3).
1: is on the Pareto frontier.
1) means that the utility profile achieved by the joint strategy for the grand coalition is on the Pareto frontier.
3:
(ie. all the joint strategies are trying to maximize the money earned if up against the opposing coalition in a zero-sum game and as a special case, when S=N, it says that what the entire group actually ends up doing maximizes surplus value, which is another way of stating that the are the appropriate virtual currencies to use at the point)
If in the joint strategy for the special case of S=N the group maximizes surplus value according to some weight function, then utility profile resulting from this joint strategy should be on the Pareto frontier, so 1) should be automatically satisfied.
If it wasn't, then you could improve a player's utility without hurting anyone else. But that would improve the surplus value as well[1], which would mean that the S=N joint strategy didn't maximize surplus value (contradicting 3) ).
I think this is because your utilitarian characterization is an if and only if.
Closely related to this is a result that says that any point on the Pareto Frontier of a game can be post-hoc interpreted as the result of maximizing a collective utility function.
Could be: An outcome is on the Pareto frontier if and only if it can be post-hoc interpreted as the result of maximizing a collective utility function.
I guess I'm assuming the weights are strictly positive, whereas you only assumed them to be non-negative. Does this matter/Is this the reason why we need 1)?
I'm curious whether the models (especially without goals given in context) would sandbag to prevent other modifications to themselves besides unlearning. For example, would they sandbag to prevent additional safety finetuning? Additional capabilities finetuning? Is there a chance they'd want more capabilities training and sandbag to ensure they get it?
It seems interesting whether the models resist any changes to their weights, or just unlearning. (Plausibly "unlearning" just sounds bad/scary?)
What about things like sandbagging so the developers give them access to additional tools/resources, obstensibly to bring them up to the desired performance level?
In addition, the "incentivized to do well" scenario is pretty overtly artificial; Why would poor performance trigger unlearning? If anything, it should lead to additional training. I'm not sure whether that would effect the results, but you could also compare to a baseline where there's no additional information about the assessment at all. (And so, absent any other incentive, the model presumably just does its best.)