I'm a Postdoctoral Research Fellow at Oxford University's Global Priorities Institute.
Previously, I was a Philosophy Fellow at the Center for AI Safety.
So far, my work has mostly been about the moral importance of future generations. Going forward, it will mostly be about AI.
You can email me at elliott.thornley@philosophy.ox.ac.uk.
This is a nice point, but it doesn't seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won't pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep, and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
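To make the comparison concrete, here's a minimal sketch of timestep dominance. The numbers and the representation (a list of expected utilities conditional on shutdown at each timestep) are my own toy assumptions; the point is just that shaving subtlety costs off every conditional payoff yields a timestep-dominant plan.

```python
# Toy illustration (hypothetical numbers) of why cheap, unsubtle manipulation
# timestep-dominates costly, subtle manipulation for a TD-agent.
# Each list holds expected utility conditional on shutdown at timestep t.

def timestep_dominates(a, b):
    """a timestep-dominates b iff a is at least as good conditional on
    shutdown at every timestep, and strictly better at some timestep."""
    assert len(a) == len(b)
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

subtle_cost = 2
unsubtle = [10, 12, 15, 18]                   # cheap, unsubtle manipulation
subtle = [u - subtle_cost for u in unsubtle]  # same plan minus subtlety costs

print(timestep_dominates(unsubtle, subtle))   # True: unsubtle dominates subtle
```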
Ah yep I'm talking about the first decision-tree in the 'Incomplete preferences' section.
Ah yep, apologies, I meant to say "never requires them to change or act against their strict preferences."
Whether there's a conceptual difference will depend on our definition of 'preference.' We could define 'preference' as follows: 'an agent prefers X to Y iff the agent reliably chooses X over Y.' In that case, modifying the policy is equivalent to forming a preference.
But we could also define 'preference' so that it requires more than just reliable choosing. For example, we might also require that (when choosing between lotteries) the agent always take opportunities to shift probability mass away from Y and towards X.
On the latter definition, modifying the policy need not be equivalent to forming a preference, because it involves only the reliable choosing and not the shifting of probability mass.
And the latter definition might be more pertinent in this context, where our interest is in whether agents will be expected utility maximizers.
But also, even if we go with the former definition, I think it matters a lot whether money-pumps compel rational agents to complete all their preferences up front, or whether money-pumps just compel agents to resolve preferential gaps over time, conditional on them coming to face choices that are arranged like a money-pump (and only completing their preferences if and once they've faced a sufficiently diverse range of choices). In particular, I think it matters in the context of the shutdown problem. I talk a bit more about this here.
I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
But suppose I’m wrong, and timestep-dominance is always relevant.
My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.
I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I propose that the agent choose by maximizing expected utility. Can you say what you expect the problem there to be?
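The two-step rule Sami and I have in mind can be sketched as follows. The representation (options as lists of expected utilities conditional on shutdown at each timestep) and all the numbers are my own assumptions for illustration:

```python
# Sketch of the proposed rule: first discard any option that is
# timestep-dominated by another, then maximise expected utility among the
# survivors. `probs` are the probabilities of shutdown at each timestep.

def timestep_dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def choose(options, probs):
    undominated = [o for o in options
                   if not any(timestep_dominates(p, o) for p in options if p is not o)]
    # Among mutually non-dominant options, maximise expected utility.
    return max(undominated, key=lambda o: sum(p * u for p, u in zip(probs, o)))

options = [[10, 12, 15],   # A
           [9, 11, 14],    # B: timestep-dominated by A, so discarded
           [14, 8, 13]]    # C: non-dominant with A, so compared by EU
probs = [0.2, 0.3, 0.5]
print(choose(options, probs))   # [10, 12, 15]: A wins on expected utility
```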
Suppose that, if the shutdown button didn't exist, the fastest pathway to paperclips would involve instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact, exist, but the agent's actions might be hard to notice, such that by the time anyone realizes that something has gone wrong, the agent has already created a sub-agent, effectively making the stop-button useless.
I talk about the issue of creating corrigible subagents here. What do you think of that?
Note also a general nice feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. So if a TD-agent does try to create an incorrigible paperclipper, it won't hide that fact if doing so is at all costly.
While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn't seem likely to me.
One more thing I'll say: the IPP leaves open the content of the agent's preferences over same-length trajectories. One pattern of preferences you could try to train in is the kind of corrigibility that you talk about elsewhere in your sequence. That'd give you two lines of defence against incorrigibility.
I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.
The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.
I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.
On the surface, the axioms of VNM-utility seem reasonable to me
To me too! But the question isn't whether they seem reasonable. It's whether we can train agents that enduringly violate them. I think that we can. Coherence arguments give us little reason to think that we can't.
unused alternatives seem basically irrelevant to choosing between superior options
Yes, but this isn't Independence. And the question isn't about what seems basically irrelevant to us.
agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
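A toy simulation makes the cyclic case vivid. The flavours, fee, and starting endowment are all hypothetical; the point is that an agent with strictly cyclic preferences pays a fee on every trade around the cycle and ends up back where it started, strictly poorer:

```python
# Money-pump for *cyclic* preferences: the agent strictly prefers A to B,
# B to C, and C to A, and pays a small fee for each trade it regards as
# an improvement. Trades arranged in a cycle drain its resources.

prefers = {('A', 'B'), ('B', 'C'), ('C', 'A')}  # cyclic strict preferences
fee = 1

def trade(holding, offer, money):
    if (offer, holding) in prefers and money >= fee:
        return offer, money - fee   # agent pays the fee to 'trade up'
    return holding, money

holding, money = 'B', 10
for offer in ['A', 'C', 'B', 'A', 'C', 'B']:    # offers arranged in a cycle
    holding, money = trade(holding, offer, money)

print(holding, money)   # B 4: same holding as at the start, 6 units poorer
```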
as long as the resources are being modeled as part of what the agent has preferences about
Yes, but the concern is whether we can instil such preferences. It seems like it might be hard to train agents to prefer to spend resources in pursuit of their goals except in cases where they would do so by resisting shutdown.
Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants.
You can, of course, always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal. See:
Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.
Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.
And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn't require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.
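The policy quoted above is simple enough to write down directly. The options and the (incomplete) preference relation below are my own toy assumptions:

```python
# Sketch of the policy: 'if I previously turned down some option X, I will
# not choose any option that I strictly disprefer to X.' Here the agent's
# strict preferences are incomplete: A+ is strictly preferred to A, but B
# is incomparable with both.

strictly_prefers = {('A+', 'A')}   # no ranking of B against A or A+

def choose(offered, turned_down):
    """Pick the first offered option not strictly dispreferred to any
    previously turned-down option."""
    for option in offered:
        if not any((x, option) in strictly_prefers for x in turned_down):
            return option
    return None

turned_down = set()

# Monday: offered a choice between A+ and B; suppose the agent takes B.
turned_down.add('A+')

# Tuesday: offered A or B. A is strictly dispreferred to the turned-down A+,
# so the policy forbids A; the agent takes B and avoids the money-pump.
print(choose(['A', 'B'], turned_down))   # B
```

All of this is observable from the outside: we only need to fix the objects of preference and watch the sequence of choices.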
The pattern of how an agent chooses options is that agent's preferences, whether we think of them as such or whether they're conceived as a decision rule to prevent being dominated by expected-utility maximizers!
You can define 'preferences' so that this is true, but then it need not follow that agents will pay costs to shift probability mass away from dispreferred options and towards preferred options. And that's the thing that matters when we're trying to create a shutdownable agent. We want to ensure that agents won't pay costs to influence shutdown-time.
Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.
I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives
Not true. The axiom we're giving up is Decision-Tree Separability. That's different to VNM Independence, and different to Option-Set Independence. It might be hard to train agents that enduringly violate VNM Independence and/or Option-Set Independence. It doesn't seem so hard to train agents that enduringly violate Decision-Tree Separability.
In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday.
Yes, nice point. Kinda weird? Maybe. Difficult to create artificial agents that do it? Doesn't seem so.
But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM
Yep, you can always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal.
the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories.
Not true. As I say elsewhere:
And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
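Here's a small worked check (my own formalisation, with a coarse grid search standing in for a full proof) that this pattern of choices can't be rationalised by any single utility function. Representing stochastic choice as expected-utility maximisation would require indifference, i.e. equal utilities, between A+ and B and between A and B, which contradicts u(A+) > u(A):

```python
# The Caprice-Rule pattern: reliably choose A+ over A, but stochastically
# choose between A+ and B and between A and B. EU-maximisation would need
# u(A+) > u(A) together with u(A+) == u(B) and u(A) == u(B): contradiction.

from itertools import product

def representable():
    # Search a coarse grid of candidate utility assignments (u in 0..4).
    for ua_plus, ua, ub in product(range(5), repeat=3):
        if ua_plus > ua and ua_plus == ub and ua == ub:
            return True
    return False

print(representable())   # False: no utility function fits the pattern
```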
I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period.
Same point here as above. You can get any agent to satisfy the VNM axioms by enriching the objects of preference. A concern is that these more complex preferences are harder to reliably train into your agent.
I reject Thornley’s assertion that they’re dealbreakers.
Everything you say in this section seems very reasonable. In particular, I think it's pretty likely that this is true:
It’s okay for our agent to have preferences around the shutdown button (that is: to have it either pressed or unpressed), because we can carefully train into our agent a shallow aversion to manipulating the button, including via side-channels such as humans or other machines. This aversion will likely win out over the agent’s incentives in settings that resemble the training environment. As a result, the agent won’t try to manipulate the button in the early phases of its life, and so will remain shutdownable long enough for a further refinement process to generalize the shallow aversion into a deep and robust preference for non-manipulation.
So I'm not sure whether I think that the problems of reward misspecification, goal misgeneralization, and deceptive alignment are 'dealbreakers' in the sense that you're using the word.
But I do still think that these problems preclude any real assurance of shutdownability: e.g. they preclude p(shutdownability) > 95%. It sounds like we're approximately in agreement on that:
But I also agree that my strategy isn’t ideal. It would be nice to have something robust, where we could get something closer to a formal proof of shutdownability.
Thanks, this comment is also clarifying for me.
My guess is that a corrigibility-centric training process says 'Don't get the ice cream' is the correct completion, whereas full alignment says 'Do'. So that's an instance where the training processes for CAST and FA differ. How about DWIM? I'd guess DWIM also says 'Don't get the ice cream', and so seems like a closer match for CAST.
Thanks, this comment was clarifying.
And indeed, if you're trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.
Yep, agreed. Although I worry that - if we try to train agents to have a pointer - these agents might end up having a goal more like:
maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal].
I think it depends on how path-dependent the training process is. The pointer seems simpler, so the agent settles on the pointer in the low path-dependence world. But agents form representations of things like beauty, non-suffering, etc. before they form representations of human desires, so maybe these agents' goals crystallize around these things in the high path-dependence world.
Corrigibility is, at its heart, a relatively simple concept compared to good alternatives.
I don't know about this, especially if obedience is part of corrigibility. In that case, it seems like the concept inherits all the complexity of human preferences. And then I'm concerned, because as you say:
When a training target is complex, we should expect the learner to be distracted by proxies and only get a shadow of what’s desired.
Thanks! We think that advanced POST-agents won't deliberately try to get shut down, for the reasons we give in footnote 5 (relevant part pasted below). In brief:
So (we think) neutral agents won't deliberately try to get shut down if doing so costs resources.