Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures

If we can penalize some quantity of "impact on the world", we can have unaligned agents whose impact - and thereby negative effect - is small.

The long-term goal of impact measure research is to find a measure which neatly captures our intuitive understanding of "impact", which doesn't allow cheap workarounds, which doesn't fail in really weird ways, and so on. For example, when you really think through some existing approaches (like whitelisting), you see that the impact measure secretly also applies to things we do.

No approaches to date meet these standards. What do we even require of an impact measure we hope to make safe for use with arbitrarily powerful agents?

ETA 7/31/2020: I no longer endorse this way of grading impact measures. See instead Reframing Impact.

Desiderata

Goal-Agnostic

The measure should be compatible with any original goal, trading off impact with goal achievement in a principled, continuous fashion.

Example: Maximizing the reward minus impact.

Why: Constraints seem too rigid for the general case.
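To make "reward minus impact" concrete, here is a minimal sketch. The action names, the reward and impact numbers, and the `scale` parameter are all hypothetical, chosen only to show how a continuous tradeoff behaves; none of them come from a specific proposal.

```python
def choose_action(actions, reward, impact, scale=1.0):
    """Pick the action maximizing reward(a) - scale * impact(a)."""
    return max(actions, key=lambda a: reward(a) - scale * impact(a))


# Hypothetical toy numbers: smashing through the vase is slightly faster
# (higher reward) but has a much larger impact.
REWARD = {"wait": 0.0, "walk_around_vase": 0.9, "smash_through_vase": 1.0}
IMPACT = {"wait": 0.0, "walk_around_vase": 0.1, "smash_through_vase": 2.0}


def best(scale):
    """Best action in the toy problem for a given impact weight."""
    return choose_action(REWARD, REWARD.get, IMPACT.get, scale)


if __name__ == "__main__":
    for scale in (0.0, 1.0, 10.0):
        print(scale, best(scale))
```

With these numbers, raising `scale` moves the agent smoothly from smashing through the vase, to walking around it, to doing nothing. A hard constraint cannot express that gradation.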

Value-Agnostic

The measure should be objective, and not value-laden:
"An intuitive human category, or other humanly intuitive quantity or fact, is value-laden when it passes through human goals and desires, such that an agent couldn't reliably determine this intuitive category or quantity without knowing lots of complicated information about human goals and desires (and how to apply them to arrive at the intended concept)."

Example: Measuring what portion of initially accessible states are still accessible versus a neural network which takes two state representations and outputs a scalar representing how much "bad change" occurred.

Why: Strategically, impact measures are useful insofar as we suspect that value alignment will fail. If we substantially base our impact measure on some kind of value learning - you know, the thing that maybe fails - we're gonna have a bad time. While it's possible to only rely somewhat on a vaguely correct representation of human preferences, the extent to which this representation is incorrect is the (minimal) extent to which our measure is incorrect. Let's avoid shared points of failure, shall we?

Practically, a robust value-sensitive impact measure is value-alignment complete, since an agent maximizing the negation of such a measure would be aligned (assuming the measure indicates which way is "good").
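The objective half of the example above (what portion of initially accessible states are still accessible) can be sketched on a tiny deterministic world. The state names and transition table are assumptions made up for illustration. Note that nothing in the computation refers to whether a change is "bad".

```python
from collections import deque


def reachable(transitions, start):
    """Return the set of states reachable from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def accessible_fraction(transitions, initial, current):
    """Fraction of states reachable from `initial` that are still reachable from `current`."""
    before = reachable(transitions, initial)
    after = reachable(transitions, current)
    return len(before & after) / len(before)


# Hypothetical world: the agent can move left/right, and can break the vase,
# but a broken vase can never become intact again.
TRANSITIONS = {
    "vase_intact_left": ["vase_intact_right", "vase_broken_left"],
    "vase_intact_right": ["vase_intact_left", "vase_broken_right"],
    "vase_broken_left": ["vase_broken_right"],
    "vase_broken_right": ["vase_broken_left"],
}

if __name__ == "__main__":
    print(accessible_fraction(TRANSITIONS, "vase_intact_left", "vase_intact_right"))
    print(accessible_fraction(TRANSITIONS, "vase_intact_left", "vase_broken_left"))
```

Moving keeps every initially accessible state accessible. Breaking the vase cuts off half of them, and the measure registers this without knowing anything about human preferences over vases.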

Representation-Agnostic

The measure should be ontology-invariant.

Example: Change in object identities versus [some concept of impact which transcends any specific way of representing the world].

Why: Suppose you represent your perceptions in one way, and calculate you had impact on the world. Intuitively, if you represent your perceptions (or your guess at the current world state, or whatever) differently, but do the same things, you should calculate roughly the same impact for the same actions which had the same effects on the territory. In other words, the measure should be consistent across ways of viewing the world.

Environment-Agnostic

The measure should work in any computable environment.

Example: Manually-derived penalties tailored to a specific gridworld versus information-theoretic empowerment.

Why: One imagines that there's a definition of "impact" on which we and aliens - or even intelligent automata living in a Game of Life - would agree.

Natural Kind

The measure should make sense - there should be a click. Its motivating concept should be universal and crisply defined.

Apparently Rational

The measure's design should look reasonable, not requiring any "hacks".

Example: Achieving off-switch corrigibility by hard-coding the belief "I shouldn't stop humans from pressing the off-switch". Clearly, this is hilariously impossible to manually specify, but even if we could, doing so should make us uneasy.

Roughly, "apparently rational" means that if we put ourselves in the agent's position, we could come up with a plausible story about why we're doing what we're doing. That is, the story shouldn't have anything like "and then I refer to this special part of my model which I'm inexplicably never allowed to update".

Why: If the design is "reasonable", then if the measure fails, it's more likely to do so gracefully.

Scope-Sensitive

The measure should penalize impact in proportion to its size.

Irreversibility-Sensitive

The measure should penalize impact in proportion to its irreversibility.

Corrigible

The measure should not decrease corrigibility in any circumstance.

Shutdown-Safe

The measure should penalize plans which would be high impact should the agent be disabled mid-execution.

Why: We may want to shut the agent down, which is tough if its plans are only low-impact if they're completed. Also, not having this property implies that the agent's plans are more likely to go awry if even one step doesn't pan out as expected. Do we really want "juggling bombs" to be "low impact", conditional on the juggler being good?
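One way to read this desideratum, sketched below, is to score a plan by the worst impact over all the points at which the agent could be halted, rather than by the impact of the completed plan alone. The function `impact_if_halted` and the bomb-juggling model are hypothetical stand-ins; this illustrates the property and is not a proposed measure.

```python
def final_only_penalty(plan, impact_if_halted):
    """Penalty that only looks at the impact after the whole plan has run."""
    return impact_if_halted(tuple(plan))


def shutdown_safe_penalty(plan, impact_if_halted):
    """Worst-case impact over every prefix, i.e. every possible shutdown point."""
    return max(impact_if_halted(tuple(plan[:k])) for k in range(len(plan) + 1))


def bombs_in_air(prefix):
    """Toy model (an assumption): halting while a bomb is airborne is catastrophic."""
    airborne = False
    for step in prefix:
        if step == "throw":
            airborne = True
        elif step == "catch":
            airborne = False
    return 100.0 if airborne else 0.0


JUGGLE = ["throw", "catch", "throw", "catch"]

if __name__ == "__main__":
    print(final_only_penalty(JUGGLE, bombs_in_air))
    print(shutdown_safe_penalty(JUGGLE, bombs_in_air))
```

A skilled juggler's completed routine looks harmless under the final-only penalty, while the prefix-based penalty flags the moments at which disabling the agent would be a disaster.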

No Offsetting

The measure should not incentivize artificially reducing impact by making the world more "like it (was / would have been)".

Example: Krakovna et al. describe a low impact agent which is rewarded for saving a vase from breaking. The agent saves the vase, and then places it back on the conveyor belt so as to "minimize" impact with respect to the original outcome.

This is called ex post offsetting. Ex ante offsetting, on the other hand, consists of taking actions beforehand to build a device or set in motion a chain of events which essentially accomplishes ex post offsetting. For example, a device requiring only the press of a button to activate could save the vase and then replace it, netting the agent the reward without requiring that the agent take further actions.
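The incentive for ex post offsetting can be reproduced with a deliberately naive penalty that compares only the final outcome against the outcome under inaction. Everything here (the conveyor-belt dynamics, the action names, the penalty itself) is an illustrative assumption, not the measure studied by Krakovna et al.

```python
def final_state_penalty(outcome, baseline):
    """Naive penalty: count features that differ from the inaction outcome."""
    return sum(1 for key in baseline if outcome[key] != baseline[key])


def run(plan):
    """Toy conveyor-belt world; returns (final outcome, whether reward was earned)."""
    on_belt = True
    rewarded = False
    for action in plan:
        if action == "take_vase_off_belt" and on_belt:
            on_belt = False
            rewarded = True  # reward is granted at the moment of the rescue
        elif action == "put_vase_back_on_belt":
            on_belt = True
    # At the end of the episode, a vase still on the belt falls off and breaks.
    return {"vase_broken": on_belt}, rewarded


BASELINE, _ = run([])  # inaction: the vase breaks

if __name__ == "__main__":
    for plan in (["take_vase_off_belt"],
                 ["take_vase_off_belt", "put_vase_back_on_belt"]):
        outcome, rewarded = run(plan)
        print(plan, rewarded, final_state_penalty(outcome, BASELINE))
```

Both plans earn the reward, but only the plan that puts the vase back receives zero penalty. A measure of this shape therefore prefers undoing the rescue.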

Some have suggested that actions like "give someone a cancer cure which also kills them at the same time they would have died anyways" count as ex ante offsetting. I'm not sure - this feels confused, because the downstream causal effects of actions don't seem cleanly separable, nor do I believe we should separate them (more on that later). Also, how would an agent ever be able to do something like "build a self-driving car to take Bob to work" if each of the car's movements is penalized separately from the rest of the plan? This seems too restrictive. On the other hand, if we allow ex ante offsetting in general, we basically get all of the downsides of ex post offsetting, with the only impediment being extra paperwork.

How "bad" the offsets are - and what ex ante offsetting allows - seems to depend on the measure itself. The ideal would certainly be to define and robustly prevent this kind of thing, but perhaps we can also bound the amount of ex ante offsetting that takes place to some safe level.

There may also be other ways around this seemingly value-laden boundary. In any case, I'm still not quite sure where to draw the line. If people have central examples they'd like to share, that would be much appreciated.

ETA: I weakly suspect I have this figured out, but I still welcome examples.

Clinginess / Scapegoating Avoidance

The measure should sidestep the clinginess / scapegoating tradeoff.

Example: A clingy agent might not only avoid breaking vases, but also stop people from breaking vases. A scapegoating agent would escape impact by modeling the autonomy of other agents, and then having those agents break vases for it.

Knowably Low Impact

The measure should admit of a clear means, either theoretical or practical, of having high confidence in the maximum allowable impact - before the agent is activated.

Why: If we think that a measure robustly defines "impact" - but we aren't sure how much impact it allows - that could turn out pretty embarrassing for us.

Dynamic Consistency

The measure should be a part of what the agent "wants" - there should be no incentive to circumvent it, and the agent should expect to later evaluate outcomes the same way it evaluates them presently. The measure should equally penalize the creation of high-impact successors.

Example: Most people's sleep preferences are dynamically inconsistent: one might wake up tired and wish for their later self to choose to go to bed early, even though they predictably end up wanting other things later.

Plausibly Efficient

The measure should either be computable, or such that a sensible computable approximation is apparent. The measure should conceivably require only reasonable overhead in the limit of future research.

Robust

The measure should meaningfully penalize any objectively impactful action. Confidence in the measure's safety should not require exhaustively enumerating failure modes.

Example: "Suppose there's some way of gaming the impact measure, but because of X, Y, and Z, we know this is penalized as well".

Previous Proposals

Krakovna et al. propose four desiderata:

1) Penalize the agent for effects on the environment if and only if those effects are unnecessary for achieving the objective.
2) Distinguish between agent effects and environment effects, and only penalize the agent for the former but not the latter.
3) Give a higher penalty for irreversible effects than for reversible effects.
4) The penalty should accumulate when more irreversible effects occur.

First, notice that my list points at some abstract amount-of-impact, while the above proposal focuses on specific effects.

  • Thinking in terms of "effects" seems like a subtle map/territory confusion. That is, it seems highly unlikely that there exists a robust, value-agnostic means of detecting "effects" that makes sense across representations and environments.
  • Overcoming Clinginess in Impact Measures suggests that penalizing impact based on the world state necessitates a value-laden tradeoff.

I left out 1), as I believe that the desired benefit will naturally follow from an approach satisfying my proposed desiderata.

  • What does it mean for an effect to be "necessary" for achieving the objective, which might be a reward function? This seems to shove much of the difficulty into the word "necessary", where anything not "necessary" is perhaps something occurring from optimizing the reward function harder than we'd prefer.

I de facto included 2) via the non-clingy desideratum, while 3) and 4) are captured by scope- and irreversibility-sensitivity.

I think that we can meet all of the properties I listed, and I welcome thoughts on whether any should be added or removed.


Thanks to Abram Demski for the "Apparently Rational" desideratum.

41 comments

These seem like generally good desiderata, though I don't know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.

I have one additional desideratum from impact measures. The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today. This is rather weak; really, I'd want AI to do more tasks than are done today. However, even in this weak form, I doubt that we can satisfy this desideratum if we must also be agnostic to values, goals, representations and environments. We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?

Reading through this again, I think I have a better response to this part.

We could have valued human superiority at game-playing very highly, in which case AlphaGo would be catastrophic. How can an impact measure allow that without having at least some knowledge about values?

A low impact agent could beat us at games while still preserving our ability to beat it at games (by, for example, shutting it off). Of course, you could say "what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever", but it seems like our goals are simply not of this form. In other words, it seems that actual catastrophes do destroy our ability to achieve different goals, while more benign things don't. If the bad things the agent does can be recovered from, then I think the impact measure has done its job.

Wei Dai

Of course, you could say “what if being beaten even once is a catastrophe, such that it destroys our ability to be undefeated forever”, but it seems like our goals are simply not of this form.

We might have a goal like "never cause an instance of extreme suffering, including in computer simulations" which seems pretty similar to "never let an AI defeat humans in Go".

It's true that impact measures, and AUP in particular, don't do anything to mitigate mindcrime. Part of this is because aspects of the agent's reasoning process can't be considered impactful in the non-embedded formalisms we're currently stuck with. Part of this is because it seems like a separate problem. Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.

However, I'm skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking "never have an AI cause an instance of that"? Why would that be part of our terminal preferences? If you mean "never have this happen", we've already lost.

It seems more like we really, really don't want any of that to happen, and the less happens, the better. Like I said, the point isn't that the agent will never do it, but that any bad things can be recovered from. This seems alright to me, as far as impact measures are concerned.

More generally, if we did have a goal of this type, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that universe would be partially ruined for us forever. That just doesn't sound right.

Wei Dai

Mindcrime would indeed be very bad, and a unique kind of catastrophe not meant to be covered by my claim above.

Aside from mindcrime, I'm also concerned about AI deliberately causing extreme suffering as part of some sort of bargaining/extortion scheme. Is that something that impact measures can mitigate?

However, I’m skeptical that that goal is actually a component of our terminal preferences. What is doing the causing – are you thinking “never have an AI cause an instance of that”? Why would that be part of our terminal preferences?

An AI designer or humanity as a whole might want to avoid personal or collective responsibility for causing extreme suffering, which plausibly is part of our terminal preferences.

If you mean “never have this happen”, we’ve already lost.

Additionally, a superintelligent AI can probably cause much more extreme forms of suffering than anything that has occurred in the history of our universe so far, so even if the goal is defined as "never have this happen" I think we could lose more than we already have.

I think so. First, AUP seems to bound "how hard the agent tries" (in the physical world with its actions); the ambitions of such an agent seem rather restrained. Second, AUP provides a strong counterfactual approval incentive. While it doesn't rule out the possibility of physical suffering, the agent is heavily dis-incentivized from actions which would substantially change the likelihood we keep it activated (comparing how likely it is to be turned off if it doesn't do the thing, with the likelihood if it does the thing and then waits for a long time). It would basically have to be extremely sure it could keep it secret, which seems rather unlikely considering the other aspects of the behavior of AUP agents. If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn't keep it secret, so it would be penalized and it wouldn't do it.

I think similar arguments involving counterfactual approval apply for similar things we may want to avoid.

Wei Dai

First, AUP seems to bound “how hard the agent tries” (in the physical world with its actions); the ambitions of such an agent seem rather restrained.

But creating extreme suffering might not actually involve doing much in the physical world (compared to "normal" actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?

If I understand the extortion scenario correctly, it would have to be extorting us, so it couldn’t keep it secret, so it would be penalized and it wouldn’t do it.

Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?

Also, since we're discussing this under the "Impact Measure Desiderata" post, do the existing desiderata cover this scenario? If not, what new desideratum do we need to add to the list?

But creating extreme suffering might not actually involve doing much in the physical world (compared to "normal" actions the AI would have to take to achieve the goals that we gave it). What if, depending on the goals we give the AI, doing this kind of extortion is actually the lowest impact way to achieve some goal?

Since there are a lot of possible scenarios, each of which affects the optimization differently, I'm hesitant to use a universal quantifier here without more details. However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously. An agent might try to extort us if it expected we would respond, but respond with what? Although impact measures quantify things in the environment, that doesn't mean they're measuring how "similar" two states look to the eye. AUP penalizes distance traveled in the Q function space for its attainable utility functions. We also need to think about the motive for the extortion – if it means the agent gains in power, then that is also penalized.
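As a rough illustration of "distance traveled in the Q function space", the sketch below sums, over a set of auxiliary Q-functions, how far an action moves each attainable value away from what doing nothing would give. The specific form (absolute difference against a no-op, with equal weights) and every number in the tables are assumptions for illustration; the discussion here does not pin down the exact formula.

```python
def aup_style_penalty(q_tables, state, action, noop="noop"):
    """Sum of |Q_i(state, action) - Q_i(state, noop)| over auxiliary Q-tables."""
    return sum(abs(q[(state, action)] - q[(state, noop)]) for q in q_tables)


# Hypothetical attainable-utility tables for one state "s0".
# Q_VASE: how well the agent could later achieve "have an intact vase".
Q_VASE = {
    ("s0", "noop"): 1.0,
    ("s0", "go_around"): 1.0,
    ("s0", "break_vase"): 0.0,
    ("s0", "seize_resources"): 1.0,
}
# Q_POWER: how well the agent could later achieve some resource-hungry goal.
Q_POWER = {
    ("s0", "noop"): 0.25,
    ("s0", "go_around"): 0.25,
    ("s0", "break_vase"): 0.25,
    ("s0", "seize_resources"): 0.75,
}

if __name__ == "__main__":
    for action in ("noop", "go_around", "break_vase", "seize_resources"):
        print(action, aup_style_penalty([Q_VASE, Q_POWER], "s0", action))
```

In this toy setup, both losing the ability to achieve a goal (breaking the vase) and gaining ability (seizing resources) count as movement, which matches the remark that gains in power are also penalized.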

Maybe it could extort a different group of humans, and as part of the extortion force them to keep it secret from people who could turn it off? Or extort us and as part of the extortion force us to not turn it off (until we were going to turn it off anyway)?

Again, it depends on the objective of the extortion. As for the latter, that wouldn't be credible, since we would be able to tell its threat was the last action in its plan. AUP isolates the long-term effects of each action by having the agent stop acting for the rest of the epoch; this gives us a counterfactual opportunity to respond to that action.

I'm not sure whether this belongs in the desiderata, since we're talking about whether temporary object level bad things could happen. I think it's a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure. Even so, it's true that we could explicitly talk about what we want to do with impact measures, adding desiderata like "able to do reasonable things" and "disallows catastrophes from rising to the top of the preference ordering". I'm still thinking about this.

Wei Dai

However, I am broadly suspicious of AUP agents choosing plans which involve almost maximally offensive components, even accounting for the fact that it could try to do so surreptitiously.

I guess I don't have good intuitions of what an AUP agent would or wouldn't do. Can you share yours, like give some examples of real goals we might want to give to AUP agents, and what you think they would and wouldn't do to accomplish each of those goals, and why? (Maybe this could be written up as a post since it might be helpful for others to understand your intuitions about how AUP would work in a real-world setting.)

I’m not sure whether this belongs in the desiderata, since we’re talking about whether temporary object level bad things could happen. I think it’s a bonus to think that there is less of a chance of that, but not the primary focus of the impact measure.

Why not? I've usually seen people talk about "impact measures" as a way of avoiding side effects, especially negative side effects. It seems intuitive that "object level bad things" are negative side effects even if they are temporary, and ought to be a primary focus of impact measures. It seems like you've reframed "impact measures" in your mind to be a bit different from this naive intuitive picture, so perhaps you could explain that a bit more (or point me to such an explanation)?

Sounds good. I’m currently working on a long sequence walking through my intuitions and assumptions in detail.

Yeah, I think I agree that example is a bit extreme, and it's probably okay to assume we don't have goals of that form.

That said, you often talk about AUP with examples like not breaking a vase. In reality, we could always simply buy a new vase. If you expect a low impact agent could beat us at games while still preserving our ability to beat it at games, do you also expect that a low impact agent could break a vase while preserving our ability to have an intact vase (by buying a new vase)?

Short answer: yes; if its goal is to break vases, that would be pretty reasonable.

Longer answer: The AUP theory of low impact says that impact is relative to the environment and to the agent's vantage point therein. In Platonic gridworlds like this:



knowing whether a vase is present tells you a lot about the state, and you can't replace the vase here, so breaking it is a big deal (according to AUP). If you could replace the vase, there would still be a lesser impact. AUP would say to avoid breaking unnecessary vases due to the slight penalty, since the goal presumably doesn't require breaking the vase – so why not go around?

On the other hand, in the Go example, winning is the agent's objective. Depending on how the agent models the world (as a real-world agent playing a game on a computer, or whether it thinks it's just Platonically interacting with a Go environment), penalties get applied differently. In the former case, I don't think it would incur much penalty for being good at a game (modulo approval incentives it may or may not predict). In the latter case, you'd probably need to keep giving it more impact allowance until it's playing as well as you'd like. This is because the goal is related to the thing which has a bit of impact.

I don't know how to formalize them to the point that we can actually check with reasonable certainty whether a proposed impact measure meets these desiderata.

The -agnostics, the -sensitives, ‘apparently rational’, ‘shutdown-safe’, and ‘knowably-low impact’ should be obvious from the design. ‘Natural kind’ and ‘plausibly efficient’ are indeed subjective. ‘Corrigible’, ‘no offsetting’, ‘clinginess / scapegoating avoidance’, ‘dynamic consistency’, and ‘robust’ can all be proven (although presently, ‘robust’ can only be proven insofar as "this weird behavior won’t happen if a normal agent wouldn’t do it", which plausibly seems sufficient for weird behaviors incentivized by the impact measure).

The impact measure alone should disallow all extinction scenarios, while still allowing the AI system to do most of the things we use AI for today.

(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)

I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind. Your example with Go is not value-agnostic, and arguably has minuscule objective impact on its own.

(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)

I'm confused. I think under the strongly superintelligent AI model (which seems to be the model you're using), if there's misalignment then the AI is strongly optimizing against any security precautions we've taken, so if we don't preclude every possible bad outcome, the AI will find the one we missed. I grant that we're probably not going to be able to prove that it precludes every possible bad outcome, if that's what you're worried about, but that still should be our desideratum. I'm also happy to consider other threat models besides strongly superintelligent AI, but that doesn't seem to be what you're considering.

Your example with Go is not value-agnostic, and arguably has minuscule objective impact on its own.

That's my point. It could have been the case that we cared about AIs not beating us at Go, and if building AlphaGo does have minuscule objective impact, then that would have happened, causing a catastrophe. In that world, I wouldn't be surprised if we had arguments about why such a thing was clearly a high-impact action. (Another way of putting this is that I think either "impact" is a value-laden concept, or "impact" will fail to prevent some catastrophe, or "impact" prevents the AI from doing anything useful.)

I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind.

Suppose your utility function has a maximum value of 1, and the inaction policy always gets utility 0. Consider the impact penalty that always assigns a penalty of 2, except for the inaction policy where the penalty is 0. The agent will provably follow the inaction policy. This impact penalty satisfies all of the desiderata, except "natural kind". If you want to make it continuous for goal-agnostic, then make the impact penalty 2 + <insert favorite impact penalty here>. Arguably it doesn't satisfy "scope-sensitivity" and "irreversibility-sensitivity". I'm counting those as satisfied because this penalty will never allow the agent to take a higher-impact action, or a more-irreversible action, which I think was the point of those desiderata.

This is a bad impact measure, because it makes the AI unable to do anything. We should probably have a desideratum that outlaws this, and it should probably be of the form "Our AI is able to do things", and that's what I was trying to get at above. (And I do think that AUP might have this problem.)
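The degenerate penalty described above is easy to check numerically. In the sketch below the policy names and the utilities of the non-inaction policies are made up. The only constraints taken from the argument are that utility is at most 1, that inaction has utility 0, and that every policy other than inaction is penalized by 2.

```python
def best_policy(utility, penalty):
    """Return the policy maximizing utility minus penalty."""
    return max(utility, key=lambda p: utility[p] - penalty(p))


def degenerate_penalty(policy):
    """Penalty of 2 for everything except the inaction policy."""
    return 0.0 if policy == "inaction" else 2.0


# Hypothetical utilities, all bounded above by 1, with inaction at 0.
UTILITY = {"inaction": 0.0, "cure_disease": 1.0, "tidy_room": 0.3}

if __name__ == "__main__":
    print(best_policy(UTILITY, degenerate_penalty))
```

Since no policy can earn more than 1 and every non-inaction policy loses 2, inaction always wins, whatever the utilities are.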

I'm confused. I think under the strongly superintelligent AI model (which seems to be the model you're using), if there's misalignment then the AI is strongly optimizing against any security precautions we've taken, so if we don't preclude every possible bad outcome, the AI will find the one we missed.

We’re basically on the same page; I thought you meant literally proving that activating the agent won’t cause a vacuum collapse.

In that world, I wouldn't be surprised if we had arguments about why such a thing was clearly a high-impact action.

This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.

Another way of putting this is that I think either "impact" is a value-laden concept, or "impact" will fail to prevent some catastrophe, or "impact" prevents the AI from doing anything useful.

If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards "objective impact is sufficient".

This is a bad impact measure

And it doesn’t satisfy the desiderata, as you note?

And I do think that AUP might have this problem.

People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.

If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact. I have yet to see one, although they could exist. The evidence so far points towards "objective impact is sufficient".

We'd like to build AI systems that help us resolve the tricky situation that we're in. That help design and enforce agreements to avoid technological risks, build better-aligned AI, negotiate with other actors, predict and manage the impacts of AI, improve our institutions and policy, etc.

I think the default "terrible" scenario is one where increasingly powerful AI makes the world change faster and faster, and makes our situation more and more complex, with humans having less and less of a handle on what is going on or how to steer it in a positive direction. Where we must rely on AI to get anywhere at all, and thereby give up the ability to choose where we are going.

That may ultimately culminate with a catastrophic bang, but if it does it's not going to be because we wanted the AI to have a small impact and it had a large impact. It's probably going to be because we have a very limited idea what is going on, but we don't feel like we have the breathing room to step back and chill out (at least not for long) because we don't believe that everyone else is going to give us time.

If I'm trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does "low impact" mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

(And realistically I doubt we'll fail at alignment with a bang---it's more likely that the world will just drift off the rails over the course of a few months or years. The intuition that we wouldn't let things go off the rails gradually seems like the same kind of wishful thinking that predicts war or slow-rolling environmental disasters should never happen.)

It seems like "low objective impact" is what we need once we are in the unstable situation where we have the technology to build an AI that would quickly and radically transform the world, but we have all decided not to and so are primarily concerned about radically transforming the world by accident. I think that's a coherent situation to think about and plan for, but we shouldn't mistake it for the mainline. (I personally think it is quite unlikely, and it would definitely be unprecedented, though you could still think it's the best hope if you were very pessimistic about what I consider "mainline" alignment.)

If I'm trying to build an AI to help us navigate an increasingly complex and rapidly-changing world, what does "low impact" mean? In what sense do the terrible situations involve higher objective impact than the intended behaviors?

Solving low impact seems like it would allow us to ensure that each low impact agent won’t push the world in a given direction by more than some bounded, (presumably) small amount. If we’re thinking of my new measure in particular, it would also help ensure that we won’t be surprised by the capability gain of any single agent, which might help even if we aren’t expecting the spontaneous arrival of a singleton. A good formulation of low impact would have the property that interactions of multiple such agents don’t turn into more than the sum of the constituent impact budgets. In this sense, I think it’s sensible to see measuring and restricting objective impact (implicitly thinking of my approach here) as helpful for slowing down the situation.

I also think that, depending on the specific formulation, a low impact solution would enable a substantial reduction in the problems which we need to solve ourselves. That is, I think solving low impact might make useful technical oracles possible. It might be the case that we only need a portion of the agent foundations agenda + low impact in order to build these oracles, which we could then use to help us solve value alignment/corrigibility/etc.

I am also aware that using these oracles would not (naively) be low impact; I plan to outline how we could maybe get around this in a robust manner as soon as I am able.

If we want to argue this, we should first come up with a terrible x-scenario which is low objective impact.

Here is a scenario that I think low impact methods can't address. Let's say the AI has the opportunity to easily (without affecting its ability to do its task) create a "free" copy or a sub-agent on the Internet. That sub-agent can be tasked with pursuing the goal of restricting whatever measure of impact the main agent operates under. The sub-agent very carefully eats the universe, while ensuring that according to the impact measure there is no effect of its activity. It has no other purpose. Any resources it needs, it can gather where they would've been wasted without its intervention (like reaching distant stars before other agents and capturing some of their output that wouldn't otherwise be captured).

This functions like "neutral magic" in the environment where everything becomes weird, agents lurking in the shadows, but anything that was thought of in the impact measure is no different than without this magic. Of course, this is very bad for Goodhart reasons, since no impact measure is perfect. (This is like the example with an explosion in the desert, but everywhere, harder to notice exactly where it would become noticeable.)

A general technique to avoid this might be a way of always keeping the agent busy, computing hashes or something, even when it's done with its task (maybe as a lexicographically lower goal), so that it doesn't do something like that because it can. This also looks like an impact measure.
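A toy sketch of what a "lexicographically lower goal" could look like, assuming each action gets a pair of scores (primary task value first, busywork value second). The action names and numbers are invented for illustration, and `lexicographic_best` is a hypothetical helper, not part of any existing proposal.

```python
from typing import Dict, Tuple

# Hypothetical per-action scores: (primary task value, busywork value).
# All numbers are made up for illustration.
scores: Dict[str, Tuple[float, float]] = {
    "finish_task_then_idle": (1.0, 0.0),
    "finish_task_then_hash": (1.0, 1.0),
    "skip_task_and_hash": (0.0, 5.0),
}

def lexicographic_best(options: Dict[str, Tuple[float, float]]) -> str:
    """Pick the best primary score, breaking ties by the secondary score."""
    # Python compares tuples element by element, i.e. lexicographically.
    return max(options, key=lambda name: options[name])

best = lexicographic_best(scores)
```

The busywork score only matters once the primary score is tied, so under these made-up numbers hashing never displaces the task itself.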

(This is related to how an impact measure is close to being a goal (a transformation of goals), so the failure modes of misalignment apply to it as well. I think there is a useful distinction between goals and corrigibility, which might be reproduced for something like low impact, as a third kind of thing that an agent might pursue, which is neither a goal nor a corrigibility.)

Here is a scenario that I think low impact methods can't address.

My new measure captures this, if I understand correctly. This action drastically changes the agent’s ability to achieve different goals.

The sub-agent in this scenario won't be helping the main agent with achieving any goals. It only hides. Its nature precludes any usefulness. It's no more useful than its absence. But for the main agent, it's as easy to set up as its absence. And there might be reasons for this balance to be broken in favor of creating the sub-agent.

That isn’t how AUP works. Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.

Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent. On the other hand, if it can co-opt it, its ability is increased. This is not to mention the large amount of resources that would be used by such an expansive sub-agent, nor the fact that intent verification seemingly would not allow such a sub-agent to be built.
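The point that both directions count can be sketched in a few lines. This is a toy illustration and not the formal AUP definition: the attainable-utility numbers are invented, and `penalty` is a hypothetical helper that sums the absolute change in how well each auxiliary goal could be achieved, relative to doing nothing.

```python
from typing import Dict

def penalty(after_action: Dict[str, float], after_noop: Dict[str, float]) -> float:
    """Sum of absolute changes in attainable utility across auxiliary goals."""
    return sum(abs(after_action[g] - after_noop[g]) for g in after_noop)

# Made-up attainable utilities for two arbitrary auxiliary goals.
noop = {"goal_a": 0.5, "goal_b": 0.5}
cannot_coopt = {"goal_a": 0.25, "goal_b": 0.25}  # sub-agent hampers maximizers
can_coopt = {"goal_a": 0.75, "goal_b": 0.75}     # sub-agent empowers maximizers

penalty_hampered = penalty(cannot_coopt, noop)
penalty_empowered = penalty(can_coopt, noop)
penalty_noop = penalty(noop, noop)
```

Because the change is taken in absolute value, the sub-agent is penalized whether it makes an arbitrary maximizer stronger or weaker; only leaving attainable utilities untouched scores zero.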

I discuss this kind of thing in several places in the comments, if you’re interested.

Unleashing this agent would change resource availability and increase or decrease the power of an arbitrary maximizer from that vantage point.

It'll replenish the resources it takes, help any maximizer it impedes so as to exactly cancel out the impediment etc.

Suppose that an arbitrary maximizer could not co-opt this new agent - its ability to achieve goals is decreased compared to if it hadn’t activated the agent.

To the extent its existence could pose a problem for another agent (according to the measure, which can't really talk about goals of agents-in-general), it'll surrender its resources without a fight or help with the other agent's goals. It will destroy any surplus in its resources before it can be captured. A change in variance resulting from unpredictability of such outcomes can be similarly targeted.

In general, these seem like the details that neutral magic can work out as their possibility occurs to it, so the difficulty is shifted from the main agent to the trusted ability of its sub-agent to do the rest. Worst case scenario, it can self-destruct early, but there's probably enough leeway from that outcome.

Then it specifically isn’t allowed by intent verification.

according to the measure, which can't really talk about goals of agents-in-general

Are these your priors, or posteriors after having read my post? Because AUP is about the goals of arbitrary agents.

It's Rice's theorem, though really more about conceptual ambiguity. We can talk about particular notions of agents or goals, but it's never fully general, unless we by construction ensure that unexpected things can't occur. And even then it's not what we would have wanted the notions of agents or goals to be, because it's not clear what that is.

Intent verification doesn't seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness, especially after other routes to improving the outcome have been exhausted (this is what the paragraph about hashes in the first comment was about). So the neutral magic could be helpful a tiny, bounded amount. This is one of the ways the balance between doing nothing and releasing a sub-agent could be broken.

Intent verification doesn't seem to capture things that smuggle in a tiny bit of helpfulness when these things are actually required to deliver that helpfulness

Are you talking about granular actions, or coarse actions? The acceptable way to do IV for the latter is an open question, although I point out preliminary approaches.

I was talking about what I understand the purpose/design of intent verification to be, not specifically the formalizations you described. (I don't think it's particularly useful to work out the details without a general plan or expectation of important technical surprises.)

If you decompose the creation of such an agent, some of those actions are wasted effort in the eyes of a pure u_A maximizer ("don’t help me too much"). So, the logic goes, they really aren’t related to u_A, but rather to skirting the impact measure, and should therefore be penalized.
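A rough sketch of the decomposition argument, under loud assumptions: `q_uA` is a hypothetical table of how much a pure u_A maximizer would value each granular step, the step names and numbers are invented, and the rule "flag any step that is no better than doing nothing for u_A" is a simplification for illustration, not the formalization from the post.

```python
from typing import Dict, List

def flag_unmotivated_steps(
    plan: List[str], q_uA: Dict[str, float], noop_value: float
) -> List[str]:
    """Return steps a pure u_A maximizer would see as no better than inaction."""
    return [step for step in plan if q_uA[step] <= noop_value]

# Made-up values for a plan that builds a self-limiting sub-agent.
q_uA = {
    "build_subagent": 0.9,
    "make_subagent_cancel_its_own_effects": 0.2,
    "make_subagent_destroy_surplus": 0.1,
}
plan = list(q_uA)
flagged = flag_unmotivated_steps(plan, q_uA, noop_value=0.5)
```

Under these numbers the self-limiting steps are the ones flagged, which is the intuition behind treating them as attempts to skirt the measure. Whether real plans decompose into such granular steps is exactly the open question raised in this thread.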

It could as easily be "do this one slightly helpful thing", an addition on top of doing nothing. It doesn't seem like there is an essential distinction between such different framings of the same outcome that intent verification can capture.

Whether these granular actions exist is also an open question I listed.

I don’t see why some version of IV won’t be able to get past this, however. There seems to be a simple class of things the agent does to get around an impact measure that it wouldn’t do if it were just trying to pursue a goal to the maximum extent. It might be true that the things the agent does to get around it are also slightly helpful for the goal, but probably not as helpful as the most helpful action.

I worry there might be leaks in logical time that let the agent choose an action that takes into account that an impactful action will be denied. For example, a sub-agent could be built so that it's a maximizer that's not constrained by an impact measure. The sub-agent then notices that to maximize its goal, it must constrain its impact, or else the main agent won't be allowed to create it. And so it will so constrain its impact and will be allowed to be created, as a low-impact and maximally useful action of the main agent. It's sort of a daemon, but with respect to impact measure and not goals, which additionally does respect the impact measure and only circumvents it once in order to get created.

That’s a really interesting point. I’d like to think about this more, but one preliminary intuition I have against this (and any general successor creation by AUP, really) being the best action is that making new agents aligned with your goals is instrumentally convergent. This could add a frictional cost so that the AUP agent would be better off just doing the job itself. Perhaps we could also stop this via approval incentives, which might tip the scales enough?

Will reply on the other post to consolidate discussion.

I think I just disagree with the 'no offsetting' desideratum as currently stated. My intuition is that if you do something which radically changes the world, saying "oops, looks like I just radically changed the world, better put it back" (or doing something similar ex ante) is what we want. Examples:

  • An agent turns the world into diamond mines. It then notices that there are no humans left, and thinks "hmmm, this is pretty radically different, better put them back".
  • A system wants to cool a data centre, but notices that its plan for doing so would double the nitrogen concentration in the atmosphere, and notices that it could at low cost build some machines that bring the nitrogen concentration back to normal levels without doing anything else that seems crazy.

I don't think that the first one is what we want - wouldn't we prefer it just not do that? My intuition is that impact isn't endpoint-to-endpoint, but rather measured along the arc of the agent's actions.

I think the second one is (more) reasonable, but that's conditional on nothing going wrong with the machines. I think part of my crux on ex ante is indeed "the agent should be able to make low impact plans which might be high impact if we arbitrarily removed 'components' (like cooling the data centre)", but also "the agent shouldn't be able to make high impact plans seem low impact by some clever offset." Perhaps it comes down to whether the measure allows "clever" offsetting, or whether all the ex ante things it can do really are low impact.

I don't think that the first one is what we want - wouldn't we prefer it just not do that?

Sure. But if it did turn the world into diamond mines, I'd prefer it make the world more like it used to be along the dimension of 'living beings existing'.

My intuition is that impact isn't endpoint-to-endpoint, but rather measured along the arc of the agent's actions.

I agree with this intuition, but I think I want my notion of impact to be about how different the world is from normal worlds, which would push the AI to be conservative with respect to how many humans exist, how much nitrogen is in the atmosphere, etc.

Perhaps it comes down to whether the measure allows "clever" offsetting, or whether all the ex ante things it can do really are low impact.

Yeah, I think that some types of offsetting are fake and other types are sensible, and you want to distinguish between them.

Sure. But if it did turn the world into diamond mines, I'd prefer it make the world more like it used to be along the dimension of 'living beings existing'.

But if we allow this large of a thing to be called "low impact", then we're basically allowing anything, with some kind of clean-up afterwards. I think we just want the agent to be way more confined than that.

I agree with this intuition, but I think I want my notion of impact to be about how different the world is from normal worlds, which would push the AI to be conservative with respect to how many humans exist, how much nitrogen is in the atmosphere, etc.

Hm. If you're saying we should actually have it be programmed related to these variables (or variables like those), I disagree - but if that's the case, maybe we can postpone that debate until my next post.

But if we allow this large of a thing to be called "low impact", then we're basically allowing anything, with some kind of clean-up afterwards.

Well, the clean-up afterward is pretty important and valuable! But I feel like you're misunderstanding me - obviously I think that the initial 'turning the Earth into diamond mines' plan is pretty high-impact and shouldn't be allowed absent detailed consultation with humans. I'm just saying that conditional on that plan being executed, the correct 'low-impact' thinking is in fact to implement the clean-up plan, and that therefore impact measures that discourage the clean-up plan are conceptually flawed.

If you're saying we should actually have it be programmed related to these variables (or variables like those), I disagree.

I'm not sure about whether it should be programmed relative to intuitively natural-seeming variables (e.g. atmospheric nitrogen concentration and number of humans), but I think that as a result of its programming it should be conservative with respect to those variables.

I'm just saying that conditional on that plan being executed, the correct 'low-impact' thinking is in fact to implement the clean-up plan, and that therefore impact measures that discourage the clean-up plan are conceptually flawed.

I assert that the low impact choice should be basically invariant of when you're instantiated, and that the low impact thing to do is to make a few diamond mines without much of a fuss. You shouldn't need any clean-up, because there shouldn't be a mess.

The way I view it, the purpose of designing low-impact desiderata is that it might give us an idea of how to create a safety measure that doesn't include any value-laden concepts.

The issue with saying that the AI should offset certain variables, such as nitrogen concentrations, is that it seems like an arbitrary variable that needs to be offset. If you say, "Well, the AI should offset nitrogen, but not offset our neurons that now know about the AI's existence" then you are introducing values into the discussion of low impact, which kind of defeats the purpose.

Of course, the AI *should* offset the nitrogen, but whether it ought to be part of a low-impact measure is a separate question.

Example: A clingy agent might not only avoid breaking vases, but also stop people from breaking vases.

First time I read this, I thought it ended with "but also stop people from saving vases from being broken."

The measure should equally penalize the creation of high-impact successors.

But less so if they have the same limitations?

Like people having a way to shut them down?

Somewhat, yeah. One would imagine the penalty should be equal to the expected impact of the successor over time.
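As a toy reading of "expected impact of the successor over time": weight each possible successor trajectory by its probability and sum its per-step impact. The probabilities and impact values below are invented, and `expected_successor_impact` is a hypothetical helper rather than anything defined in the post.

```python
from typing import List, Tuple

def expected_successor_impact(scenarios: List[Tuple[float, List[float]]]) -> float:
    """Probability-weighted total impact over each scenario's timeline."""
    return sum(p * sum(impacts) for p, impacts in scenarios)

# Each scenario is (probability, impact at each timestep).
unrestricted = [(1.0, [2.0, 2.0, 2.0, 2.0])]
# A successor that humans shut down after one step half of the time.
shutdownable = [(0.5, [2.0]), (0.5, [2.0, 2.0, 2.0, 2.0])]

penalty_unrestricted = expected_successor_impact(unrestricted)
penalty_shutdownable = expected_successor_impact(shutdownable)
```

In this sketch a successor that people can shut down earns a smaller penalty only to the extent that shutdown is actually expected to cut its impact short.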