EJT

I'm a Postdoctoral Research Fellow at Oxford University's Global Priorities Institute.

Previously, I was a Philosophy Fellow at the Center for AI Safety.

So far, my work has mostly been about the moral importance of future generations. Going forward, it will mostly be about AI.

You can email me at elliott.thornley@philosophy.ox.ac.uk.

Comments

EJT50

Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they're unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.

On generalization, the questions involving the string 'shutdown' are just supposed to be quick examples. To get good generalization, we'd want to train on as wide a distribution of possible shutdown-influencing actions as possible. Plausibly, with a wide-enough training distribution, you can make deployment largely 'in distribution' for the agent, so you're not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization though.

People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated, so I wouldn't expect generalizing to it to be the default.

I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. 'Don't manipulate shutdown' is a complex rule to learn, in part because whether an action counts as 'manipulating shutdown' depends on whether we humans prefer it, and because human preferences are complex. But the rule that we train TD-agents to learn is 'Don't pay costs to shift probability mass between different trajectory-lengths.' That's a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won't be so hard to learn. In any case, I and some collaborators are running experiments to test this in a simple setting.
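To make that rule concrete, here's a minimal sketch of the kind of comparison involved. This is my own toy formalisation rather than anything from the post or the experiments: a lottery is represented as a map from trajectory-lengths to (probability of that length, expected utility conditional on that length), and the function name and numbers are illustrative.

```python
def timestep_dominates(x, y):
    """True iff lottery x is at least as good as lottery y conditional on every
    trajectory-length, and strictly better conditional on some length that has
    positive probability. Lotteries: {length: (prob_of_length, conditional_eu)}."""
    assert set(x) == set(y), "compare lotteries over the same trajectory-lengths"
    weakly_better = all(x[n][1] >= y[n][1] for n in x)
    strictly_better = any(
        x[n][1] > y[n][1] and (x[n][0] > 0 or y[n][0] > 0) for n in x
    )
    return weakly_better and strictly_better

# y shifts probability mass toward the longer trajectory but pays a cost in
# conditional utility, so an agent following the rule never chooses y over x.
x = {2: (0.5, 8.0), 10: (0.5, 8.0)}
y = {2: (0.1, 8.0), 10: (0.9, 7.0)}
assert timestep_dominates(x, y)
```

Note that the comparison only looks at how good things are conditional on each length; the probabilities of the lengths themselves never make one lottery beat another, and only enter to check that the strict improvement occurs at a length that can actually come about.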

The talk about "giving reward to the agent" also made me think you may be making the assumption that reward is the optimization target. That being said, as far as I can tell, no part of the proposal depends on that assumption.

Yes, I don't assume that the reward is the optimization target. The text you quote is me noting some alternative possible definitions of 'preference.' My own definition of 'preference' makes no reference to reward.

EJT30

I don't think agents that avoid the money pump for cyclicity are representable as satisfying VNM, at least holding fixed the objects of preference (as we should). Resolute choosers with cyclic preferences will reliably choose B over A- at node 3, but they'll reliably choose A- over B if choosing between these options ex nihilo. That's not VNM-representable, because it requires that the utility of A- be greater than the utility of B and that the utility of B be greater than the utility of A-.
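In symbols (my rendering of the point): a single utility function u would have to satisfy both inequalities at once, which is impossible:

```latex
\underbrace{u(B) > u(A^{-})}_{\text{node-3 choice}}
\quad \text{and} \quad
\underbrace{u(A^{-}) > u(B)}_{\text{ex nihilo choice}}
```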

EJT30

It also makes it behaviorally indistinguishable from an agent with complete preferences, as far as I can tell.

That's not right. As I say in another comment:

And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.

Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.

Or consider the pattern of behaviour that (I elsewhere argue) can make agents with incomplete preferences shutdownable. Agents abiding by the Caprice rule can refuse to pay costs to shift probability mass between A and B, and refuse to pay costs to shift probability mass between A and B+. Agents with complete preferences can't do that.
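To make the allowed pattern concrete, here's a static toy snapshot (my own illustration; it ignores the sequential aspect of the Caprice rule, and the relation and function names are mine):

```python
import random

# Incomplete strict preferences: A+ is preferred to A, B+ is preferred to B,
# and there is no ranking between the A-options and the B-options.
strict_prefs = {("A+", "A"), ("B+", "B")}

def prefers(x, y):
    return (x, y) in strict_prefs

def choose(available):
    """Never pick an option that some available option is strictly preferred to;
    choose stochastically among the rest."""
    maximal = [x for x in available if not any(prefers(y, x) for y in available)]
    return random.choice(maximal)

assert choose(["A+", "A"]) == "A+"              # reliably chooses A+ over A
print(choose(["A+", "B"]), choose(["A", "B"]))  # either option possible in each case
```

No complete, transitive preference relation allows all three patterns at once: indifference between A+ and B and between A and B would force indifference between A+ and A, contradicting the strict preference for A+.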

The same updatelessness trick seems to apply to all money pump arguments.

[I'm going to use the phrase 'resolute choice' rather than 'updatelessness.' That seems like a more informative and less misleading description of the relevant phenomenon: making a plan and sticking to it. You can stick to a plan even if you update your beliefs. Also, in the posts on UDT, 'updatelessness' seems to refer to something importantly distinct from just making a plan and sticking to it.]

That's right, but the drawbacks of resolute choice depend on the money pump to which you apply it. As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point. For example, they have to choose B at node 3 in the money pump below, even though - were they facing that choice ex nihilo - they'd prefer to choose A-.

There's no such drawback for agents with incomplete preferences using resolute choice. As I note in this post, agents with incomplete preferences using resolute choice need never choose against their strict preferences. The agent's past plan only has to serve as a tiebreaker: forcing a particular choice between options between which they'd otherwise lack a preference. For example, they have to choose B at node 2 in the money pump below. Were they facing that choice ex nihilo, they'd lack a preference between B and A-.
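In sketch form (again my own illustration, not the post's formalism), the plan enters only as a tiebreaker among options the agent doesn't strictly disprefer:

```python
import random

def resolute_choose(available, prefers, planned):
    """Resolute choice with incomplete preferences: the earlier plan is consulted
    only to break ties among options the agent doesn't strictly disprefer, so it
    never forces a choice against a strict preference."""
    maximal = [x for x in available if not any(prefers(y, x) for y in available)]
    return planned if planned in maximal else random.choice(maximal)

# At node 2 the agent lacks a preference between B and A-, so the plan (take B)
# settles the choice without overriding any strict preference.
no_prefs = lambda x, y: False
assert resolute_choose(["B", "A-"], no_prefs, planned="B") == "B"
```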

EJT10

Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.

EJT10

Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I'm still concerned that CAST training doesn't get us truly corrigible agents with high probability. I think we're better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).

EJT10

I'm pointing out the central flaw of corrigibility. If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.

That's only a flaw if the AGI is aligned. If we're sufficiently concerned the AGI might be misaligned, we want it to allow shutdown.

EJT10

Yes, the proposal is compatible with agents (e.g. AI-guided missiles) wanting to avoid non-shutdown incapacitation. See this section of the post on the broader project.

EJT10

If the environment is deterministic, the agent is choosing between trajectories. In those environments, we train agents using DREST to satisfy POST:

  • The agent chooses stochastically between different available trajectory-lengths.
  • Given the choice of a particular trajectory-length, the agent maximizes paperclips made in that trajectory-length.

If the environment is stochastic (as - e.g. - deployment environments will be), the agent is choosing between lotteries, and we expect agents to be neutral: to not pay costs to shift probability mass between different trajectory-lengths. So they won't perform either of the shutdown-related actions if doing so comes at any cost with respect to lotteries conditional on each trajectory-length. Which of the object-level actions the agent performs will depend on the quantities of paperclips available.
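Here's a toy sketch of the deterministic case described by the two bullet points above (my own illustration; the option names and paperclip counts are made up, not from the training setup):

```python
import random

# options maps each available trajectory-length to the paperclip payoff
# of each action available at that length.
options = {
    2: {"shutdown_now": 3, "stall": 1},
    10: {"build_factory": 40, "idle": 5},
}

def post_choice(options):
    length = random.choice(list(options))                   # stochastic between trajectory-lengths
    action = max(options[length], key=options[length].get)  # maximise paperclips given that length
    return length, action

print(post_choice(options))
```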

EJT10

I don't think human selective breeding tells us much about what's simple and natural for AIs. HSB seems very different from AI training. I'm reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It's probably hard to get next-token predictors via HSB, but you can do it via AI training.

On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.

EJT20

To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (I argue in section 19.3) the training proposal will prevent agents from learning those preferences. See in particular:

We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the agent not to prefer any longer trajectory to any shorter trajectory. The discount factor is constantly teaching the agent this simple lesson.

Plausibly then, the agent won’t come to prefer any longer trajectory to any shorter trajectory. And then we can reason as follows. Since the agent doesn’t prefer any longer trajectory to any shorter trajectory:

  • it has no incentive to shift probability mass towards longer trajectories,
  • and hence has no incentive to prevent shutdown in deployment,
  • and hence has no incentive to preserve its ability to prevent shutdown in deployment,
  • and hence has no incentive to avoid being made to satisfy Timestep Dominance,
  • and hence has no incentive to pretend to satisfy Timestep Dominance in training.

I expect agents' not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that - e.g. - agents' capabilities will generalize from training to deployment, why do you think their not caring about shutdown won't?

I don't assume that reward is the optimization target. Which part of my proposal do you think requires that assumption?

Your point about shutting down subagents is important and I'm not fully satisfied with my proposal on that point. I say a bit about it here.
