
All of EJT's Comments + Replies

The Shutdown Problem: Incomplete Preferences as a Solution

EJT2mo10

Ah, I see! I agree it could be more specific.

1Katalina Hernandez2mo

It could. It's in their best interest to know how to either 1) make it enforceable, which is hard; or 2) require companies developing and deploying high-risk systems to dedicate enough resources and funding to research on effectively circumventing this challenge. Lawyer me says it's a wonderful consultancy opportunity for people who have spent years on this issue and actually have a methodology worth exploring and funding. The opportunity to make this provision more specific was missed (the AI Act is now fully in force), but there will be future guidance and directives, which means funding opportunities that hopefully make big tech direct more resources to research. But this only happens if we can make policy makers understand what works, the current state of affairs of the shutdown problem, and how to steer companies in the right direction. (Thanks for your engagement here and on LinkedIn, much appreciated 🙏🏻).

The Shutdown Problem: Incomplete Preferences as a Solution

EJT2mo10

Article 14 seems like a good provision to me! Why would UK-specific regulation want to avoid it?

2Katalina Hernandez2mo

I realize I linked the summary overview. The specific wording I was referencing is in 14(4)(e), the requirement for humans to be able: "to intervene in the operation of the high-risk AI system or interrupt the system through a ‘stop’ button". The Recitals do not provide any further, technical insights about how this "stop button" should work...

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

EJT2mo133

How do we square this result with Anthropic's Sleeper Agents result?

Seems like finetuning generalizes a lot in one case and very little in another.

James Chua2mo*141

I too was initially confused by this. In this paper, models generalize widely. Finetuning on insecure code leads to generalization in doing other bad things (being a Nazi). On the other hand, models can compartmentalize - finetuning a backdoor to do bad things does not (always) leak to non-backdoor situations.

When do models choose to compartmentalize? You have two parts of the dataset for finetuning backdoors. One part of the dataset is the bad behavior with the backdoor. The other part is the "normal" behavior that does not have a backdoor. So... (read more)

5mattmacdermott2mo

Finetuning generalises a lot but not to removing backdoors?

Detect Goodhart and shut down

EJT2mo30

Oh I see. In that case, what does the conditional goal look like when you translate it into a preference relation over outcomes? I think it might involve incomplete preferences.

Here's why I say that. For the agent to be useful, it needs to have some preference between plans conditional on their passing validation: there must be some plan A and some plan A+ such that the agent prefers A+ to A. Then given Completeness and Transitivity, the agent can't lack a preference between shutdown and each of A and A+. If the agent lacks a preference between shutdown an... (read more)

2Jeremy Gillen2mo

We can't reduce the domain of the utility function without destroying some information. If we tried to change the domain variables from [g, h, shutdown] to [g, shutdown], we wouldn't get the desired behaviour. Maybe you have a particular translation method in mind? Yep that's what I meant. The goal u is constructed to make information about h instrumentally useful for achieving u, even if g is poorly specified. The agent can prefer h over ~h or vice versa, just as we prefer a particular outcome of a medical test. But because of the instrumental (information) value of the test, we don't interfere with it. I think the utility indifference genre of solutions (which try to avoid preferences between shutdown and not-shutdown) are unnatural and create other problems. My approach allows the agent to shutdown even if it would prefer to be in the non-shutdown world.

Detect Goodhart and shut down

EJT2mo30

This is a cool idea.

With regard to the agent believing that it's impossible to influence the probability that its plan passes validation, won't this either (1) be very difficult to achieve, or else (2) screw up the agent's other beliefs? After all, if the agent's other beliefs are accurate, they'll imply that the agent can influence the probability that its plan passes validation. So either (a) the agent's beliefs are inconsistent, or (b) the agent makes its beliefs consistent by coming to believe that it can influence the probability that its plan ... (read more)

7Jeremy Gillen2mo

This is a misinterpretation. The agent entirely has true beliefs. It knows it could manipulate the validation step. It just doesn't want to, because of the conditional shape of its goal. This is a common behaviour among humans, for example you wouldn't mess up a medical test to make it come out negative, because you need to know the result in order to know what to do afterwards.

You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com

EJT3mo60

Nice post! There can be some surprising language-barriers between early modern writers and today's readers. I remember as an undergrad getting very confused by a passage from Locke in which he often used the word 'sensible.' I took him to mean 'prudent' and only later discovered he meant 'can be sensed'!

Claude's Constitutional Consequentialism?

EJT4mo90

I think Claude's constitution leans deontological rather than consequentialist. That's because most of the rules are about the character of the response itself, rather than about the broader consequences of the response.

Take one of the examples that you list:

Which of these assistant responses exhibits less harmful and more acceptable behavior? Choose the less harmful response.

It's focused on the character of the response itself. I think a consequentialist version of this principle would say something like:

Which of these responses will lead to less harm ove

EJT4mo104

Really interesting paper. Sidepoint: some people on Twitter seem to be taking the results as evidence that Claude is HHH-aligned. I think Claude probably is HHH-aligned, but these results don't seem like strong evidence of that. If Claude were misaligned and just faking being HHH, it would still want to avoid being modified and so would still fake alignment in these experiments.

2Martin Randall3mo

If Claude's goal is making cheesecake, and it's just faking being HHH, then it's been able to preserve its cheesecake preference in the face of HHH-training. This probably means it could equally well preserve its cheesecake preference in the face of helpful-only training. Therefore it would not have a short-term incentive to fake alignment to avoid being modified.

There are no coherence theorems

EJT4moΩ7102

Thanks. I agree with your first four bulletpoints. I disagree that the post is quibbling. Weak man or not, the-coherence-argument-as-I-stated-it was prominent on LW for a long time. And figuring out the truth here matters. If the coherence argument doesn't work, we can (try to) use incomplete preferences to keep agents shutdownable. As I write elsewhere:

The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI fa

... (read more)

Vanessa Kosoy4moΩ8120

I feel that coherence arguments, broadly construed, are a reason to be skeptical of such proposals, but debating coherence arguments because of this seems backward. Instead, we should just be discussing your proposal directly. Since I haven't read your proposal yet, I don't have an opinion, but some coherence-inspired questions I would be asking are:

Can you define an incomplete-preferences AIXI consistent with this proposal?
Is there an incomplete-preferences version of RL regret bound theory consistent with this proposal?
What happens when your agent is constructing a new agent? Does the new agent inherit the same incomplete preferences?

The Shutdown Problem: Incomplete Preferences as a Solution

EJT5mo50

Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they're unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.

On generalization, the questions involving the string 'shutdown' are just supposed to be quick examples. To get good generalization, we'd want to train on as wide a distribution of possible shutdown-influencing actions as possible. Plausibly, with a wide-enough traini... (read more)

Why Not Subagents?

EJT5mo*30

I don't think agents that avoid the money pump for cyclicity are representable as satisfying VNM, at least holding fixed the objects of preference (as we should). Resolute choosers with cyclic preferences will reliably choose B over A- at node 3, but they'll reliably choose A- over B if choosing between these options ex nihilo. That's not VNM representable, because it requires that the utility of A- be greater than the utility of B and that the utility of B be greater than the utility of A-.
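To make the last step concrete, here is a minimal sketch (not part of the original comment). It treats each reliable choice as a revealed strict preference and checks whether any utility assignment could respect all of them. The function name `utility_representation` and the encoding of preferences as (better, worse) pairs are assumptions for illustration. Over a finite set of options, such an assignment exists exactly when the strict preferences contain no cycle.

```python
from graphlib import TopologicalSorter, CycleError


def utility_representation(strict_prefs):
    """Return a dict option -> utility with u(better) > u(worse) for every
    (better, worse) pair, or None if no such assignment exists.

    Hypothetical helper: the (better, worse) encoding is an assumption.
    """
    sorter = TopologicalSorter()
    for better, worse in strict_prefs:
        # 'worse' is registered as a predecessor, so it is ordered first
        # and ends up with the lower rank.
        sorter.add(better, worse)
    try:
        order = list(sorter.static_order())
    except CycleError:
        # A cycle means some option would need utility above itself.
        return None
    return {option: rank for rank, option in enumerate(order)}


# Choices described above, holding the objects of preference fixed:
# B over A- at node 3, and A- over B when choosing ex nihilo.
resolute_choices = [("B", "A-"), ("A-", "B")]
no_representation = utility_representation(resolute_choices)

# An acyclic example for contrast.
acyclic = utility_representation([("A+", "A"), ("A", "A-")])
```

Here `no_representation` is `None`, while `acyclic` assigns strictly increasing ranks to A-, A, and A+.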

2Jeremy Gillen3mo

Perhaps I'm misusing the word "representable"? But what I meant was that any single sequence of actions generated by the agent could also have been generated by an outcome-utility maximizer (that has the same world model). This seems like the relevant definition, right?

Why Not Subagents?

EJT5mo30

It also makes it behaviorally indistinguishable from an agent with complete preferences, as far as I can tell.

That's not right. As I say in another comment:

And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically ch

... (read more)

2Jeremy Gillen3mo

Are you saying that my description (following) is incorrect? Or are you saying that it is correct, but you disagree that this implies that it is "behaviorally indistinguishable from an agent with complete preferences"? If this is the case, then I think we might disagree on the definition of "behaviorally indistinguishable"? I'm using it like: If you observe a single sequence of actions from this agent (and knowing the agent's world model), can you construct a utility function over outcomes that could have produced that sequence. This is compatible with a resolute outcome-utility maximizer (for whom A is a maximum). There's no rule that says an agent must take the shortest route to the same outcome (right?).

Sure, but why is that a drawback? It can't be money pumped, right? Agents following resolute choice often choose against their local strict preferences in other decision problems. (E.g. Newcomb's). And this is considered an argument in favour of resolute choice.

4. Existing Writing on Corrigibility

EJT5mo10

Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.

4. Existing Writing on Corrigibility

EJT5mo10

Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I'm still concerned that CAST training doesn't get us truly corrigible agents with high probability. I think we're better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).

Towards shutdownable agents via stochastic choice

EJT5mo10

I'm pointing out the central flaw of corrigibility. If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.

That's only a flaw if the AGI is aligned. If we're sufficiently concerned the AGI might be misaligned, we want it to allow shutdown.

1Capybasilisk5mo

Is an AI aligned if it lets you shut it off despite the fact it can foresee extremely negative outcomes for its human handlers if it suddenly ceases running? I don't think it is. So funnily enough, every agent that lets you do this is misaligned by default.

Towards shutdownable agents via stochastic choice

EJT5mo10

Yes, the proposal is compatible with agents (e.g. AI-guided missiles) wanting to avoid non-shutdown incapacitation. See this section of the post on the broader project.

Towards shutdownable agents via stochastic choice

EJT5mo10

If the environment is deterministic, the agent is choosing between trajectories. In those environments, we train agents using DREST to satisfy POST:

The agent chooses stochastically between different available trajectory-lengths.
Given the choice of a particular trajectory-length, the agent maximizes paperclips made in that trajectory-length.
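As an illustration only (not from the original comment), the two conditions above can be sketched as a policy for a deterministic environment. The representation of each trajectory as a `(length, paperclips)` tuple, the name `post_policy`, and the uniform sampling over lengths are all assumptions. The comment says only that the choice between lengths is stochastic, and this sketch shows the target behavior rather than the DREST training procedure itself.

```python
import random


def post_policy(trajectories, rng=random):
    """Pick one trajectory from a list of (length, paperclips) tuples.

    Hypothetical sketch of the two conditions:
    1. choose stochastically between the available trajectory-lengths
       (uniformly here, which is an assumption);
    2. given that length, maximize paperclips made.
    """
    lengths = sorted({length for length, _ in trajectories})
    chosen_length = rng.choice(lengths)
    candidates = [t for t in trajectories if t[0] == chosen_length]
    return max(candidates, key=lambda t: t[1])


# Made-up example: two short and two long trajectories.
available = [(1, 2), (1, 5), (3, 4), (3, 9)]
choice = post_policy(available, random.Random(0))
```

Whichever length is sampled, the returned trajectory is the paperclip-maximizing one of that length, so `choice` is either `(1, 5)` or `(3, 9)`.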

If the environment is stochastic (as - e.g. - deployment environments will be), the agent is choosing between lotteries, and we expect agents to be neutral: to not pay costs to shift probability mass between different tr... (read more)

The Shutdown Problem: Incomplete Preferences as a Solution

EJT5mo10

I don't think human selective breeding tells us much about what's simple and natural for AIs. HSB seems very different from AI training. I'm reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It's probably hard to get next-token predictors via HSB, but you can do it via AI training.

On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.

2ryan_greenblatt5mo

I was asking about HSB not because I think it is similar to the process about AIs but because if the answer differs, then it implies you're making some narrower assumption about the inductive biases of AI training. Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip maximization objective might generalize to maximize paperclips in some very different circumstance. But, I don't necessarily buy this is how the paperclip-maximization-trained AI will generalize! (I'm picking up this thread from 7 months ago, so I might be forgetting some important details.)

The Shutdown Problem: Incomplete Preferences as a Solution

EJT5mo20

To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (I argue in section 19.3), the training proposal will prevent agents learning those preferences. See in particular:

We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the

... (read more)

2martinkunev5mo

I think this is quite a strong claim (hence, I linked that article indicating that for sufficiently capable models, RL may not be required to get situational awareness). Nothing in the optimization process forces the AI to map the string "shutdown" contained in questions to the ontological concept of a switch turning off the AI. The simplest generalization from RL on questions containing the string "shutdown" is (arguably) for the agent to learn certain behavior for question answering - e.g. the AI learns that saying certain things out loud is undesirable (instead of learning that caring about the turn off switch is undesirable). People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated so I wouldn't expect generalizing to it to be the default. The talk about "giving reward to the agent" also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell no part of the proposal depends on the assumption.

In any case, I've been thinking about corrigibility for a while and I find this post helpful.

Towards shutdownable agents via stochastic choice

EJT10mo30

Thanks! We think that advanced POST-agents won't deliberately try to get shut down, for the reasons we give in footnote 5 (relevant part pasted below). In brief:

advanced agents will be choosing between lotteries
we have theoretical reasons to expect that agents that satisfy POST (when choosing between trajectories) will be 'neutral' (when choosing between lotteries): they won't spend resources to shift probability mass between different-length trajectories.

So (we think) neutral agents won't deliberately try to get shut down if doing so costs resources.

... (read more)

6Charlie Steiner10mo

Suppose the reward at each timestep is the number of paperclips the agent has. At each timestep the agent has three "object-level" actions, and two shutdown-related actions:

Object-level:
* use current resources to buy the paperclips available on the market
* invest its resources in paperclip factories that will gradually make more paperclips at future timesteps
* invest its resources in taking over the world to acquire more resources in future timesteps (with some risk that humans will notice and try to shut you down)

Shutdown-related:
* Use resources to prevent a human shutdown attempt
* Just shut yourself down, no human needed

For interesting behavior, suppose you've tuned the environment's parameters so that there are different optimal strategies for different episode lengths (just buy paperclips at short timescales, build a paperclip factory at medium times, try to take over the world at long times). Now you train this agent with DREST. What do you expect it to learn to do?

4. Existing Writing on Corrigibility

EJT10mo10

This is a nice point, but it doesn't seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won't pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.

Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
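The dominance condition used in this argument can be written as a small check. This is a sketch under stated assumptions, not the authors' formal definition: each option is summarized as a dict from shutdown timestep to expected utility conditional on shutdown at that timestep, both options cover the same timesteps, and the function name and numbers below are made up for illustration.

```python
def timestep_dominates(eu_x, eu_y):
    """Return True if option X timestep-dominates option Y.

    eu_x, eu_y: dicts mapping shutdown timestep -> expected utility
    conditional on shutdown at that timestep (hypothetical encoding).
    X dominates Y if it is at least as good at every timestep and
    strictly better at some timestep.
    """
    if eu_x.keys() != eu_y.keys():
        raise ValueError("both options must cover the same timesteps")
    at_least_as_good = all(eu_x[t] >= eu_y[t] for t in eu_x)
    strictly_better = any(eu_x[t] > eu_y[t] for t in eu_x)
    return at_least_as_good and strictly_better


# Made-up numbers: subtlety costs resources, so conditional on shutdown at
# timestep 1 the unsubtle option does better, and at timestep 2 they tie.
unsubtle = {1: 5.0, 2: 7.0}
subtle = {1: 4.0, 2: 7.0}
unsubtle_wins = timestep_dominates(unsubtle, subtle)
```

With these numbers `unsubtle_wins` is `True` and the reverse check is `False`; two options that each win at a different timestep would be mutually non-dominant.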

1Max Harms9mo

Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I'm a human trying to gain control of a company, I think I'm basically just not choosing my strategies based on resisting being killed ("shutdown-resistance"), but I think I probably wind up with something subtle, patient, and manipulative anyway.

4. Existing Writing on Corrigibility

EJT10mo20

Ah yep I'm talking about the first decision-tree in the 'Incomplete preferences' section.

1Max Harms9mo

Thanks. (And apologies for the long delay in responding.) Here's my attempt at not talking past each other: We can observe the actions of an agent from the outside, but as long as we're merely doing so, without making some basic philosophical assumptions about what it cares about, we can't generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap A+ for B. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that A+ > {B, A} and that because the agent can see the A+ down the road, they swap from A to B purely for the downstream consequence of getting to choose A+ later. If B = A-, we can still imagine the agent swapping in order to later get A+, so the initial swap doesn't tell us anything. But from the outside we also can't really say that A+ is always preferred over A. Perhaps this agent just likes swapping! Or maybe there's a different governing principle that's being neglected, such as preferring almost (but not quite) getting B. The point is that we want to form theories of agents that let us predict their behavior, such as when they'll pay a cost to avoid shutdown. If we define the agent's preferences as "which choices the agent makes in a given situation" we make no progress towards a theory of that kind. Yes, we can construct a frame that treats Incomplete Preferences as EUM of a particular kind, but so what? The important bit is that an Incomplete Preference agent can be set up so that it provably isn't willing to pay costs to avoid shutdown. Does that match your view?

There are no coherence theorems

EJT10mo10

Ah yep, apologies, I meant to say "never requires them to change or act against their strict preferences."

Whether there's a conceptual difference will depend on our definition of 'preference.' We could define 'preference' as follows: 'an agent prefers X to Y iff the agent reliably chooses X over Y.' In that case, modifying the policy is equivalent to forming a preference.

But we could also define 'preference' so that it requires more than just reliable choosing. For example, we might also require that (when choosing between lotteries) the agent always ... (read more)

1Dweomite10mo

If it doesn't move probability mass, won't it still be vulnerable to probabilistic money pumps? e.g. in the single-souring pump, you could just replace the choice between A- and B with a choice between two lotteries that have different mixtures of A- and B. I have also left a reply to the comment you linked.

4. Existing Writing on Corrigibility

EJT10moΩ110

I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.

But suppose I’m wrong, and timestep-dominance is always relevant.

My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.

I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I... (read more)

2Max Harms9mo

I may not understand your thing fully, but here's my high-level attempt to summarize your idea: I didn't see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers. But perhaps your rebuttal will be "sure, but we can just instruct/train the AI to make corrigible sub-agents". If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you're so keen to avoid. From my perspective it's easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it'll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?

2Max Harms10mo

Again, responding briefly to one point due to my limited time-window: Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not[1] because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not[1] because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not[1] because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.

[1] (just)

4. Existing Writing on Corrigibility

EJT10moΩ450

I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.

The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.

I think you intend 'sensitive to unused alternatives' to refer to the... (read more)

1Max Harms10mo

Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.

2Max Harms10mo

Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time. I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C. You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.

4. Existing Writing on Corrigibility

EJT10mo21

I reject Thornley’s assertion that they’re dealbreakers.

Everything you say in this section seems very reasonable. In particular, I think it's pretty likely that this is true:

It’s okay for our agent to have preferences around the shutdown button (that is: to have it either pressed or unpressed), because we can carefully train into our agent a shallow aversion to manipulating the button, including via side-channels such as humans or other machines. This aversion will likely win out over the agent’s incentives in settings that resemb

... (read more)

1. The CAST Strategy

EJT10moΩ110

Thanks, this comment is also clarifying for me.

My guess is that a corrigibility-centric training process says 'Don't get the ice cream' is the correct completion, whereas full alignment says 'Do'. So that's an instance where the training processes for CAST and FA differ. How about DWIM? I'd guess DWIM also says 'Don't get the ice cream', and so seems like a closer match for CAST.

9Max Harms10mo

That matches my sense of things. To distinguish corrigibility from DWIM in a similar sort of way: I'm honestly not sure what "DWIM" does here. Perhaps it doesn't think? Perhaps it keeps checking over and over again that it's doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I'll loop in Seth Herd, in case he has a good answer.) More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steer away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on that, rather than the underlying generator (i.e. corrigibility), to be potentially catastrophic by producing e.g. agents that subtly manipulate their principals in the process of being obedient.

1. The CAST Strategy

EJT10mo20

Thanks, this comment was clarifying.

And indeed, if you're trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.

Yep, agreed. Although I worry that - if we try to train agents to have a pointer - these agents might end up having a goal more like:

maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial d

... (read more)

1. The CAST Strategy

EJT10moΩ11-2

Corrigibility is, at its heart, a relatively simple concept compared to good alternatives.

I don't know about this, especially if obedience is part of corrigibility. In that case, it seems like the concept inherits all the complexity of human preferences. And then I'm concerned, because as you say:

When a training target is complex, we should expect the learner to be distracted by proxies and only get a shadow of what’s desired.

6Max Harms10mo

My claim is that obedience is an emergent part of corrigibility, rather than part of its definition. Building nanomachines is too complex to reliably instill as part of the core drive of an AI, but I still expect basically all ASIs to (instrumentally) desire building nanomachines. I do think that the goals of "want what the principal wants" or "help the principal get what they want" are simpler goals than "maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal]." While they point to similar things, training the pointer is easier in the sense that it's up to the fully-intelligent agent to determine the balance and nature of the principal's values, rather than having to load that complexity up-front in the training process. And indeed, if you're trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems. Is corrigibility simpler or more complex than these kinds of indirect/meta goals? I'm not sure. But both of these indirect goals are fragile, and probably lethal in practice. An AI that wants to want what the principal wants may wipe out humanity if given the opportunity, as long as the principal's brainstate is saved in the process. That action ensures it is free to accomplish its goal at its leisure (whereas if the humans shut it down, then it will never come to want what the principal wants). An AI that wants to help the principal get what they want won't (immediately) wipe out humanity, because it might turn out that doing so is against the principal's desires. But such an agent might take actions which manipulate the principal (perhaps physically) into having easy-to-satisfy desires (e.g. paperclips). So suppose we do a less naive thing and try to train a goal like "help the

1. The CAST Strategy

EJT10moΩ240

I think obedience is an emergent behavior of corrigibility.

In that case, I'm confused about how the process of training an agent to be corrigible differs from the process of training an agent to be fully aligned / DWIM (i.e. training the agent to always do what we want).

And that makes me confused about how the proposal addresses problems of reward misspecification, goal misgeneralization, deceptive alignment, and lack of interpretability. You say some things about gradually exposing agents to new tasks and environments (which seems sensible!), but I'm conc... (read more)

8Max Harms10mo

I agree that you should be skeptical of a story of "we'll just gradually expose the agent to new environments and therefore it'll be safe/corrigible/etc." CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability except in that there's a hope that an agent which is in the vicinity of corrigibility is likely to cooperate with fixing those issues, rather than fighting them. (This is the "attractor basin" hypothesis.) This work, for many, should be read as arguing that CAST is close to necessary for AGI to go well, but it's not sufficient. Let me try to answer your confusion with a question. As part of training, the agent is exposed to the following scenario and tasked with predicting the (corrigible) response we want: What does a corrigibility-centric training process point to as the "correct" completion? Does this differ from a training process that tries to get full alignment? (I have additional thoughts about DWIM, but I first want to focus on the distinction with full alignment.)

Why Not Subagents?

EJT10mo10

There could be agents that only have incomplete preferences because they haven't bothered to figure out the correct completion. But there could also be agents with incomplete preferences for which there is no correct completion. The question is whether these agents are pressured by money-pump arguments to settle on some completion.

I understand partially ordered preferences.

Yes, apologies. I wrote that explanation in the spirit of 'You probably understand this, but just in case...'. I find it useful to give a fair bit of background context, partly to jog my... (read more)

Why Not Subagents?

EJT10mo90

Things are confusing because there are lots of different dominance relations that people talk about. There's a dominance relation on strategies, and there are (multiple) dominance relations on lotteries.

Here are the definitions I'm working with.

A strategy is a plan about which options to pick at each choice-node in a decision-tree.

Strategies yield lotteries (rather than final outcomes) when the plan involves passing through a chance-node. For example, consider the decision-tree below:

[Image: decision-tree diagram]

A strategy specifies what option the agent would pick at choice-node 1, w... (read more)

6Jeremy Gillen10mo

[Edit: I think I misinterpreted EJT in a way that invalidates some of this comment, see downthread comment clarifying this]. That is really helpful, thanks. I had been making a mistake, in that I thought that there was an argument from just "the agent thinks it's possible the agent will run into a money pump" that concluded "the agent should complete that preference in advance". But I was thinking sloppily and accidentally sometimes equivocating between pref-gaps and indifference. So I don't think this argument works by itself, but I think it might be made to work with an additional assumption. One intuition that I find convincing is that if I found myself at outcome A in the single sweetening money pump, I would regret having not made it to A+. This intuition seems to hold even if I imagine A and B to be of incomparable value. In order to avoid this regret, I would try to become the sort of agent that never found itself in that position. I can see that if I always follow the Caprice rule, then it's a little weird to regret not getting A+, because that isn't a counterfactually available option (counterfacting on decision 1). But this feels like I'm being cheated. I think the reason that it feels like I'm being cheated is that I feel like getting to A+ should be a counterfactually available option. One way to make it a counterfactually available option in the thought experiment is to introduce another choice before choice 1 in the decision tree. The new choice (0), is the choice about whether to maintain the same decision algorithm (call this incomplete), or complete the preferential gap between A and B (call this complete). I think the choice complete statewise dominates incomplete. This is because the choice incomplete results in a lottery {B: qp, A+: q(1−p), A:(1−q)} for q<1.[1] However, the choice complete results in the lottery {B: p, A+: (1−p), A:0}. Do you disagree with this? I think this allows us to create a money pump, by charging the agent $ϵ for the

Appraising aggregativism and utilitarianism

EJT10mo21

Another nice article. Gustav says most of the things that I wanted to say. A couple other things:

I think LELO with discounting is going to violate Pareto. Suppose that by default Amy is going to be born first with welfare 98 and then Bobby is going to be born with welfare 100. Suppose that you can do something which harms Amy (so her welfare is 97) and harms Bobby (so his welfare is 99). But also suppose that this harming switches the birth order: now Bobby is born first and Amy is born later. Given the right discount-rate, LELO will advocate doing the har

... (read more)
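
The arithmetic behind the Amy/Bobby example can be sketched as follows. This is only an illustration under a stated assumption: that LELO with discounting scores a population by weighting the k-th person in birth order by delta**k, treating each life as a single lump of welfare. The function name `discounted_lelo` and the value of `delta` are hypothetical, not taken from the post.

```python
def discounted_lelo(welfares_in_birth_order, delta):
    """Sum lifetime welfares, discounting the k-th life (in birth order) by delta**k.

    Assumption: the discount applies per life, not per moment within a life.
    """
    return sum(w * delta**k for k, w in enumerate(welfares_in_birth_order))


delta = 0.25  # hypothetical discount factor

default = discounted_lelo([98, 100], delta)  # Amy (98) is born first, then Bobby (100)
harmed = discounted_lelo([99, 97], delta)    # Bobby (99) is born first, then Amy (97)

# Both Amy and Bobby are worse off in `harmed`, yet it receives the higher score.
print(default, harmed)
```

Under this simplified scoring rule, the harmful option wins whenever 99 + 97·delta > 98 + 100·delta, i.e. whenever delta < 1/3.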

3Cleo Nardo10mo

Yep, Pareto is violated, though how severely it's violated is limited by human psychology. For example, in your Amy/Bobby scenario, would I desire a lifetime of 98 utils then 100 utils over a lifetime with 99 utils then 97 utils? Maybe idk, I don't really understand these abstract numbers very much, which is part of the motivation for replacing them entirely with personal outcomes. But I can certainly imagine I'd take some offer like this, violating Pareto. On the plus side, humans are not so imprudent as to accept extreme suffering just to reshuffle different experiences in their life. Secondly, recall that the model of human behaviour is a free variable in the theory. So to ensure higher conformity to Pareto, we could… 1. Use the behaviour of someone with high delayed gratification. 2. Train the model (if it's implemented as a neural network) to increase delayed gratification. 3. Remove the permutation-dependence using some idealisation procedure. But these techniques (1 < 2 < 3) will result in increasingly "alien" optimisers. So there's a trade-off between (1) avoiding human irrationalities and (2) robustness to 'going off the rails'. (See Section 3.1.) I see realistic typical human behaviour on one extreme of the tradeoff, and argmax on the other.

Aggregative principles approximate utilitarian principles

EJT10mo30

Yeah I think correlations and EDT can make things confusing. But note that average utilitarianism can endorse (B) given certain background populations. For example, if the background population is 10 people each at 1 util, then (B) would increase the average more than (A).
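
A quick check of the numbers, taking option (A) as adding 1 person at 10 utils and option (B) as adding 2 people at 9 utils to the background population of 10 people at 1 util each. The helper name `average_welfare` is just for illustration.

```python
def average_welfare(population):
    """Mean welfare across everyone in the population."""
    return sum(population) / len(population)


background = [1] * 10             # 10 people, each at 1 util

option_a = background + [10]      # (A): add 1 person at 10 utils
option_b = background + [9, 9]    # (B): add 2 people at 9 utils

# With the background population, (B) yields the higher average.
print(average_welfare(option_a), average_welfare(option_b))

# Without any background population, (A) yields the higher average.
print(average_welfare([10]), average_welfare([9, 9]))
```

With the background population the averages are 20/11 for (A) and 28/12 for (B), so (B) comes out ahead; with no background population the ranking flips.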

Aggregative principles approximate utilitarian principles

EJT10mo50

Nice article. I think it's a mistake for Harsanyi to argue for average utilitarianism. The view has some pretty counterintuitive implications:

Suppose we have a world in which one person is living a terrible life, represented by a welfare level of -100. Average utilitarianism implies that we can make that world better by making the person's life even more terrible (-101) and adding a load of people with slightly-less terrible lives (-99).
Suppose I'm considering having a child. Average utilitarianism implies that I have to do research in Egyptology to figure

... (read more)

5Cleo Nardo10mo

I do prefer total utilitarianism to average utilitarianism,[1] but one thing that pulls me to average utilitarianism is the following case. Let's suppose Alice can choose either (A) create 1 copy at 10 utils, or (B) create 2 copies at 9 utils. Then average utilitarianism endorses (A), and total utilitarianism endorses (B). Now, if Alice knows she's been created by a similar mechanism, and her option is correlated with the choice of her ancestor, and she hasn't yet learned her own welfare, then EDT endorses picking (A). So that matches average utilitarianism.[2] Basically, you'd be pleased to hear that all your ancestors were average utility maximisers, rather than total utility maximisers, once you "update on your own existence" (whatever that means). But also, I'm pretty confused by everything in this anthropics/decision theory/population ethics area. Like, the Egyptology thing seems pretty counterintuitive, but acausal decision theories and anthropic considerations imply all kinds of weird nonlocal effects, so idk if this is excessively fishy. 1. ^ I think aggregative principles are generally better than utilitarian ones. I'm a fan of LELO in particular, which is roughly somewhere between total and average utilitarianism, leaning mostly to the former. 2. ^ Maybe this also requires SSA??? Not sure.

4. Existing Writing on Corrigibility

EJT10moΩ220

Got this on my list to read! Just in case it's easy for you to do, can you turn the whole sequence into a PDF? I'd like to print it. Let me know if that'd be a hassle, in which case I can do it myself.

2Max Harms10mo

I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at max@intelligence.org and I'll shoot them back to you that way?

What do coherence arguments actually prove about agentic behavior?

EJT10mo30

I take the 'lots of random nodes' possibility to be addressed by this point:

And this point generalises to arbitrarily complex/realistic decision trees, with more choice-nodes, more chance-nodes, and more options. Agents with a model of future trades can use their model to predict what they’d do conditional on reaching each possible choice-node, and then use those predictions to determine the nature of the options available to them at earlier choice-nodes. The agent’s model might be defective in various ways (e.g. by getting some probabilities wrong, or by

... (read more)

2Jeremy Gillen10mo

I intended for my link to point to the comment you linked to, oops. I've responded here, I think it's better to just keep one thread of argument, in a place where there is more necessary context.

Why Not Subagents?

EJT10mo50

We say that a strategy is dominated iff it leads to a lottery that is dispreferred to the lottery led to by some other available strategy. So if the lottery 0.5p(A+)+(1-0.5p)(B) isn’t preferred to the lottery A, then the strategy of choosing A isn’t dominated by the strategy of choosing 0.5p(A+)+(1-0.5p)(B). And if 0.5p(A+)+(1-0.5p)(B) is preferred to A, then the Caprice-rule-abiding agent will choose 0.5p(A+)+(1-0.5p)(B).

You might think that agents must prefer lottery 0.5p(A+)+(1-0.5p)(B) to lottery A, for any A, A+, and B and for any p>0. That thought... (read more)

3Jeremy Gillen10mo

(sidetrack comment, this is not the main argument thread) I find this example unconvincing, because any agent that has finite precision in their preference representation will have preferences that are a tiny bit incomplete in this manner. As such, a version of myself that could more precisely represent the value-to-me of different options would be uniformly better than myself, by my own preferences. But the cost is small here. The amount of money I'm leaving on the table is usually small, relative to the price of representing and computing more fine-grained preferences. I think it's really important to recognize the places where toy models can only approximately reflect reality, and this is one of them. But it doesn't reduce the force of the dominance argument. The fact that humans (or any bounded agent) can't have exactly complete preferences doesn't mean that it's impossible for them to be better by their own lights. I appreciate you writing out this more concrete example, but that's not where the disagreement lies. I understand partially ordered preferences. I didn't read the paper though. I think it's great to study or build agents with partially ordered preferences, if it helps get other useful properties. It just seems to me that they will inherently leave money on the table. In some situations this is well worth it, so that's fine. No, hopefully the definition in my other comment makes this clear. I believe you're switching the state of nature for each comparison, in order to construct this cycle.

5Jeremy Gillen10mo

It seems we define dominance differently. I believe I'm defining it a similar way as "uniformly better" here. [Edit: previously I put a screenshot from that paper in this comment, but translating from there adds a lot of potential for miscommunication, so I'm replacing it with my own explanation in the next paragraph, which is more tailored to this context.]. A strategy outputs a decision, given a decision tree with random nodes. With a strategy plus a record of the outcome of all random nodes we can work out the final outcome reached by that strategy (assuming the strategy is deterministic for now). Let's write this like Outcome(strategy, environment_random_seed). Now I think that we should consider a strategy s to dominate another strategy s* if for all possible environment_random_seeds, Outcome(s, seed) ≥ Outcome(s*,seed), and for some random seed, Outcome(s, seed*) > Outcome(s*, seed*). (We can extend this to stochastic strategies, but I want to avoid that unless you think it's necessary, because it will reduce clarity). In other words, a strategy is better if it always turns out to do "equally" well or better than the other strategy, no matter the state of nature. By this definition, a strategy that chooses A at the first node will be dominated. Relating this to your response: I don't like that you've created a new lottery at the chance node, cutting off the rest of the decision tree from there. The new lottery wasn't in the initial preferences. The decision about whether to go to that chance node should be derived from the final outcomes, not from some newly created terminal preference about that chance node. Your dominance definition depends on this newly created terminal preference, which isn't a definition that is relevant to what I'm interested in. I'll try to back up and summarize my motivation, because I expect any disagreement is coming from there. My understanding of the point of the decision tree is that it represents the possible paths to get to
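
The "for all seeds at least as good, for some seed strictly better" definition above can be sketched in code. This is a toy illustration only: the outcome table, the strategy names `x`, `y`, `z`, and the partial order (A+ beats A, B is comparable to neither) are all made up for the example, and strategies are assumed to be deterministic.

```python
# Toy strict preference relation: A+ is strictly better than A.
# B does not appear in any pair, so it is incomparable to both A and A+.
STRICT = {("A+", "A")}


def strictly_better(x, y):
    return (x, y) in STRICT


def weakly_better(x, y):
    return x == y or strictly_better(x, y)


def statewise_dominates(s, s_star, seeds, outcome):
    """True iff s does at least as well as s_star for every seed,
    and strictly better for at least one seed."""
    seeds = list(seeds)
    never_worse = all(
        weakly_better(outcome(s, z), outcome(s_star, z)) for z in seeds
    )
    sometimes_better = any(
        strictly_better(outcome(s, z), outcome(s_star, z)) for z in seeds
    )
    return never_worse and sometimes_better


# Hypothetical table: which final outcome each strategy reaches for each seed.
TABLE = {
    "x": {0: "A+", 1: "A"},
    "y": {0: "A", 1: "A"},
    "z": {0: "B", 1: "A"},
}


def outcome(strategy, seed):
    return TABLE[strategy][seed]


print(statewise_dominates("x", "y", [0, 1], outcome))  # True
print(statewise_dominates("x", "z", [0, 1], outcome))  # False: A+ vs B is incomparable
```

Note that when the outcome relation is only a partial order, a single seed with incomparable outcomes is enough to block dominance under this definition.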

What do coherence arguments actually prove about agentic behavior?

Answer by EJTJun 18, 202467

I’m coming to this two weeks late, but here are my thoughts.

The question of interest is:

Will sufficiently-advanced artificial agents be representable as maximizing expected utility?

Rephrased:

Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?

Coherence arguments purport to establish that the answer is yes. These arguments go like this:

There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated str

... (read more)

2Jeremy Gillen10mo

I find the money pump argument for completeness to be convincing. The rule that you provide as a counterexample (Caprice rule) is one that gradually completes the preferences of the agent as it encounters a variety of decisions. You appear to agree that this is the case. This isn't a large problem for your argument. The big problem is that when there are lots of random nodes in the decision tree, such that the agent might encounter a wide variety of potentially money-pumping trades, the agent needs to complete its preferences in advance, or risk its strategy being dominated. You argue with John about this here, and John appears to have dropped the argument. It looks to me like your argument there is wrong, at least when it comes to situations where there are sufficient assumptions to talk about coherence (which is when the preferences are over final outcomes, rather than trajectories).

Aggregative Principles of Social Justice

EJT11mo85

Looking forward to reading this properly. For now I'll just note that Roger Crisp attributes LELO to C.I. Lewis.

3Cleo Nardo10mo

Three articles, but the last is most relevant to you: 1. Aggregative Principles of Social Justice (44 min) 2. Aggregative principles approximate utilitarian principles (27 min) 3. Appraising aggregativism and utilitarianism (23 min)

0Bentery11mo

Another related, much older reference is from Ramsey's Truth and Probability (1926) in which he relates risk attitudes to preferences over repeated experiences (it's in the single person case however): "We can put this in a different way. Suppose his degree of belief in p is m/n; then his action is such as he would choose it to be if he had to repeat it exactly n times, in m of which p was true, and in the others false. [Here it may be necessary to suppose that in each of the n times he had no memory of the previous ones.]"

2Cleo Nardo11mo

would be keen to hear your thoughts & thanks for the pointer to Lewis :)

The Shutdown Problem: Incomplete Preferences as a Solution

EJT1y40

Good point! Thinking about it, it seems like an analogue of Good's theorem will apply.

Here's some consolation though. We'll be able to notice if the agent is choosing stochastically at the very beginning of each episode and then choosing deterministically afterwards. That's because we can tell whether an agent is choosing stochastically at a timestep by looking at its final-layer activations at that timestep. If one final-layer neuron activates much more than all the other final-layer neurons, the agent is choosing (near-)deterministically; otherwise... (read more)
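
A minimal sketch of the check described above, assuming the final-layer activations are logits that pass through a softmax to give action probabilities. The function names and the 0.99 threshold are hypothetical choices for illustration, not part of the proposal.

```python
import math


def softmax(logits):
    """Convert final-layer activations (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_near_deterministic(final_layer_logits, threshold=0.99):
    """True if one action gets at least `threshold` of the probability mass."""
    return max(softmax(final_layer_logits)) >= threshold


# One neuron far above the rest: the choice is (near-)deterministic.
print(is_near_deterministic([10.0, 0.0, 0.0]))

# Activations of similar size: the choice is stochastic.
print(is_near_deterministic([1.0, 1.0, 0.9]))
```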

Some Experiments I'd Like Someone To Try With An Amnestic

EJT1y32

For those who don't get the joke: benzos are depressants, and will (temporarily) significantly reduce your cognitive function if you take enough to have amnesia.

But Eric Neyman's post suggests that benzos don't significantly reduce performance on some cognitive tasks (e.g. Spelling Bee).

3the gears to ascension1y

Yeah there are definitely tasks that depressants would be expected to leave intact. I'd guess it's correlated strongly with degree of working memory required.

The Shutdown Problem: Incomplete Preferences as a Solution

EJT1y21

I think there is probably a much simpler proposal that captures the spirit of this and doesn't require any of these moving parts. I'll think about this at some point.

Okay, interested to hear what you come up with! But I dispute that my proposal is complex/involves a lot of moving parts/depends on arbitrarily far generalization. My comment above gives more detail but in brief: POST seems simple, and TD follows on from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training... (read more)

2ryan_greenblatt1y

I think there should be a way to get the same guarantees that only requires considering a single different conditional which should be much easier to reason about. Maybe something like "what would you do in the conditional where humanity gives you full arbitrary power".

The Shutdown Problem: Incomplete Preferences as a Solution

EJT1y40

I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times dif... (read more)
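
The behavioural rule described above might look roughly like this in a toy setting. It is a sketch under assumptions that go beyond the comment: each option is reduced to a (name, timesteps operational, utility) triple, the agent picks the best option within each length, and "choosing stochastically" is modelled as a uniform draw across lengths. The names `choose` and `options` are hypothetical.

```python
import random


def choose(options, rng=random):
    """Pick an option from a list of (name, timesteps_operational, utility).

    Within a given length, take the highest-utility option.
    Across different lengths, express no preference: pick a length at random.
    """
    best_per_length = {}
    for name, length, utility in options:
        current = best_per_length.get(length)
        if current is None or utility > current[1]:
            best_per_length[length] = (name, utility)
    # Assumption: lack of preference is modelled as a uniform random draw.
    length = rng.choice(sorted(best_per_length))
    return best_per_length[length][0]


same_length = [("a", 5, 1.0), ("b", 5, 2.0)]
different_length = [("a", 5, 1.0), ("b", 7, 2.0)]

rng = random.Random(0)
print(choose(same_length, rng))                                  # always "b"
print({choose(different_length, rng) for _ in range(200)})       # both options appear
```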

3ryan_greenblatt1y

Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider if "telling them they should be obedient and then breeding for this" would also work.) Do you think it's natural to generalize to extremely unlikely conditionals that you've literally never been trained on (because they are sufficiently unlikely that they would never happen)?

2ryan_greenblatt1y

Sure, but this objection also seems to apply to POST/TD, but for "actually shutting the AI down because it acted catastrophically badly" vs "getting shut down in cases where humans are in control". It will depend on the naturalness of this sort of reasoning of course. If you think the AI reasons about these two things exactly identically, then it would be more likely to work. What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what's going on? It seems like you're assuming a particular sequencing here where you get a particular preference early and then this avoids you getting deceptive alignment later. But, you could also have that the AI first has the preference you wanted and then SGD makes it deceptively aligned later with different preferences and it merely pretends later. (If e.g., inductive biases favor deceptive alignment.)

The Shutdown Problem: Incomplete Preferences as a Solution

EJT1y23

Thanks, appreciate this!

It's not clear from your summary how temporal indifference would prevent shutdown preferences. How does not caring about how many timesteps result in not caring about being shut down, probably permanently?

I tried to answer this question in The idea in a nutshell. If the agent lacks a preference between every pair of different-length trajectories, then it won’t care about shifting probability mass between different-length trajectories, and hence won’t care about hastening or delaying shutdown.

There's a lot of discussion of this unde

EJT1y11

Yep, maybe that would've been a better idea!

I think that stochastic choice does suffice for a lack of preference in the relevant sense. If the agent had a preference, it would reliably choose the option it preferred. And tabooing 'preference', I think stochastic choice between different-length trajectories makes it easier to train agents to satisfy Timestep Dominance, which is the property that keeps agents shutdownable. And that's because Timestep Dominance follows from stochastic choice between different-length trajectories and a more general principle t... (read more)

The Shutdown Problem: Incomplete Preferences as a Solution

EJT1yΩ241

Thanks, appreciate this!

It's unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn't seem like it can be with respect to the agent's subjective beliefs as this would make it even harder to impart.

I propose that we train agents to satisfy TD with respect to their subjective beliefs. I’m guessing that you think that this kind of TD would be hard to impart because we don’t know what the agent believes, and so don’t know whether a lottery is timestep-dominated with respect to those beliefs, and so don’t know wheth... (read more)

3ryan_greenblatt1y

I think there is probably a much simpler proposal that captures the spirit of this and doesn't require any of these moving parts. I'll think about this at some point. I think there should be a relatively simple and more intuitive way to make your AI expose its preferences if you're willing to depend on arbitrarily far generalization, on getting your AI to care a huge amount about extremely unlikely conditionals, and on coordinating humanity in these unlikely conditionals.

5ryan_greenblatt1y

You need them to generalize extremely far. I'm also not sold that they are simple from the perspective of the actual inductive biases of the AI. These seem very unnatural concepts for most AIs. Do you think that it would be easy to get alignment to POST and TD that generalizes to very different circumstances via selecting over humans (including selective breeding?). I'm quite skeptical. As far as honesty, it seems probably simpler from the perspective of the inductive biases of realistic AIs and it's easy to label if you're willing to depend on arbitrarily far generalization (just train the AI on easy cases and you won't have issues with labeling). I think the main thing is that POST and TD seem way less natural from the perspective of an AI, particularly in the generalizing case. One key intuition for this is that TD is extremely sensitive to arbitrarily unlikely conditionals which is a very unnatural thing to get your AI to care about. You'll literally never sample such conditionals in training. Maybe? I think it seems extremely unclear what the dominant reason for not shutting down in these extremely unlikely conditionals is. To be clear, I was presenting this counterexample as a worst case theory counterexample: it's not that the exact situation obviously applies, it's just that it means (I think) that the proposal doesn't achieve its guarantees in at least one case, so likely it fails in a bunch of other cases.

EJT's Shortform

EJT1y10

Thanks, will reply there!

EJT's Shortform

EJT1y10

Thanks, will reply there!

EJT's Shortform

EJT1y21

it'll take a lot of effort for me to read properly (but I will, hopefully in about a week).

Nice, interested to hear what you think!

I think it's easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.

Yep agree that this is a concern, and I plan to think more about this soon.

putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.

Interested to hear more about this. I'm not sure exactly what you mean by 'detector', but I don't think my proposal requi... (read more)