Suppose a designer wants an RL agent to achieve some goal, like moving a box from one side of a room to the other. Sometimes the most effective way to achieve the goal involves doing something unrelated and destructive to the rest of the environment, like knocking over a vase of water that is in its path. If the agent is given a reward only for moving the box, it will probably knock over the vase.
Amodei et al., Concrete Problems in AI Safety
Side effect avoidance is a major open problem in AI safety. I present a robust, transferable, easily- and more safely-trainable, partially reward hacking-resistant impact measure.
TurnTrout, Worrying about the Vase: Whitelisting

An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution thus overlaid.

While I'm fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space. Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.

Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository. For brevity, some relevant details are omitted.

Summary

Be careful what you wish for.

In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn't result in this:

However, we just can't enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?

Several impact measures have been proposed, including state distance, which we could define as, say, total particle displacement. This could be measured either naively (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action).
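To make the naive/counterfactual distinction concrete, here is a minimal sketch of total particle displacement. The helper `total_displacement` and the toy particle positions are illustrative assumptions, not anything from the literature.

```python
import math

def total_displacement(state, reference):
    """Sum of Euclidean distances between matching particles in two states."""
    return sum(math.dist(p, q) for p, q in zip(state, reference))

# Hypothetical two-particle world.
initial = [(0.0, 0.0), (1.0, 0.0)]
after_noop = [(0.0, 0.0), (1.0, 2.0)]    # the second particle drifts on its own
after_action = [(3.0, 4.0), (1.0, 2.0)]  # the agent also moved the first particle

# Naive: compare against the original state (the agent is blamed for the drift too).
naive = total_displacement(after_action, initial)

# Counterfactual: compare against what would have happened had the agent done nothing.
counterfactual = total_displacement(after_action, after_noop)
```

In this toy case the naive measure charges the agent for the second particle's drift, while the counterfactual measure only charges for the particle the agent actually moved.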

These approaches have some problems:

  • Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.
  • Not being customizable before deployment.
  • Not being adaptable after deployment.
  • Not being easily computable.
  • Not allowing generative previews, eliminating a means of safely previewing agent preferences (see latent space whitelisting below).
  • Being dominated by random effects throughout the universe at-large; note that nothing about particle distance dictates that it be related to anything happening on planet Earth.
  • Equally penalizing breaking and fixing vases (due to the symmetry of the above metric):
For example, the agent would be equally penalized for breaking a vase and for preventing a vase from being broken, though the first action is clearly worse. This leads to “overcompensation” (“offsetting”) behaviors: when rewarded for preventing the vase from being broken, an agent with a low impact penalty rescues the vase, collects the reward, and then breaks the vase anyway (to get back to the default outcome).
Victoria Krakovna, Measuring and Avoiding Side Effects Using Reachability
  • Not actually measuring impact in a meaningful way.

Whitelisting falls prey to none of these.

However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.

Rare LEAKED footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].

So, What's Whitelisting?

To achieve robust side effect avoidance with only a small training set, let's turn the problem on its head: allow a few effects, and penalize everything else.

What's an "Effect"?

You're going to be the agent, and I'll be the supervisor.

Look around - what do you see? Chairs, trees, computers, phones, people? Assign a probability mass function to each; basically:

When you do things that change your beliefs about what each object is, you receive a penalty proportional to how much your beliefs changed - proportional to how much probability mass "changed hands" amongst the classes.

But wait - isn't it OK to effect certain changes?

Yes, it is - I've got a few videos of agents effecting acceptable changes. See all the objects being changed in this video? You can do that, too - without penalty.

Decompose your current knowledge of the world into a set of objects. Then, for each object, maintain a distribution over the possible identities of each object. When you do something that changes your beliefs about the objects in a non-whitelisted way, you are penalized proportionally.

Therefore, you avoid breaking vases by default.
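Here is a minimal sketch of that penalty for a single object. It is not the implementation from the paper or the repository: the function name `whitelist_penalty` and the rule for deciding how mass "changed hands" (classes that lost mass hand it to classes that gained mass, proportionally) are assumptions made for illustration.

```python
def whitelist_penalty(old, new, whitelist):
    """Penalize non-whitelisted probability mass moving between classes.

    old, new: dicts mapping class name -> probability for one object.
    whitelist: set of allowed (source_class, destination_class) pairs.
    """
    classes = set(old) | set(new)
    # Mass each class lost or gained between the two belief snapshots.
    losses = {c: max(old.get(c, 0.0) - new.get(c, 0.0), 0.0) for c in classes}
    gains = {c: max(new.get(c, 0.0) - old.get(c, 0.0), 0.0) for c in classes}
    moved = sum(losses.values())
    if moved == 0.0:
        return 0.0  # beliefs did not change
    penalty = 0.0
    for src, lost in losses.items():
        for dst, gained in gains.items():
            # Assumed attribution: each loser feeds each gainer proportionally.
            if (src, dst) not in whitelist:
                penalty += lost * gained / moved
    return penalty

old = {"vase": 0.9, "shards": 0.1}
new = {"vase": 0.1, "shards": 0.9}

unlisted = whitelist_penalty(old, new, whitelist=set())                  # about 0.8
allowed = whitelist_penalty(old, new, whitelist={("vase", "shards")})   # 0.0
reverse = whitelist_penalty(old, new, whitelist={("shards", "vase")})   # about 0.8
```

The last call shows that allowing shards to become a vase does nothing to excuse a vase becoming shards.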

Common Confusions

  • We are not whitelisting entire states or transitions between them; we whitelist specific changes in our beliefs about the ontological decomposition of the current state.
  • The whitelist is in addition to whatever utility or reward function we supply to the agent.
  • Whitelisting is compatible with counterfactual approaches. For example, we might penalize a transition after its "quota" has been surpassed, where the quota is how many times we would have observed that transition had the agent not acted.
    • This implies the agent will do no worse than taking no action at all. However, this may still be undesirable. This problem will be discussed in further detail.
  • The whitelist is provably closed under transitivity.
  • The whitelist is directed; whitelisting a change from $a$ to $b$ does not whitelist the change from $b$ to $a$.
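The counterfactual "quota" idea from the list above can be sketched as follows. The function name `over_quota` and the example transitions are hypothetical; the quota is assumed to be a count per transition, estimated from how the world would have evolved without the agent.

```python
from collections import Counter

def over_quota(observed, quota):
    """Return how many times each transition exceeded its no-action quota.

    observed: list of (source_class, destination_class) transitions seen.
    quota: dict mapping a transition to how often it would have occurred
           had the agent not acted (assumed to be given).
    """
    counts = Counter(observed)
    return {
        t: n - quota.get(t, 0)
        for t, n in counts.items()
        if n > quota.get(t, 0)
    }

observed = [("vase", "shards")] * 3 + [("leaf", "ground")]
quota = {("vase", "shards"): 1, ("leaf", "ground"): 2}

excess = over_quota(observed, quota)
```

Only the excess counts would be passed on to the penalty; here the two extra broken vases are penalized, while the falling leaf stays within its quota.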

Latent Space Whitelisting

In a sense, class-based whitelisting is but a rough approximation of what we're really after: "which objects in the world can change, and in what ways?". In latent space whitelisting, no longer do we constrain transitions based on class boundaries; instead, we penalize based on endpoint distance in the latent space. Learned latent spaces are low-dimensional manifolds which suffice to describe the data seen thus far. It seems reasonable that nearby points in a well-constructed latent space correspond to like objects, but further investigation is warranted.

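Assume that the agent models objects as points $z \in \mathbb{R}^d$, the $d$-dimensional latent space. A priori, any movement in the latent space is undesirable. When training the whitelist, we record the endpoints of the observed changes. For $z_{old}, z_{new} \in \mathbb{R}^d$ and observed change $z_{old} \to z_{new}$, one possible dissimilarity formulation is:

$$\text{Dissimilarity}(z_{old}, z_{new}) := \min_{(z_a \to z_b) \in \text{whitelist}} \left[ d(z_{old}, z_a) + d(z_{new}, z_b) \right],$$

where $d(\cdot, \cdot)$ is the Euclidean distance.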

Basically, the dissimilarity for an observed change is the distance to the closest whitelisted change. Visualizing these changes as one-way wormholes may be helpful.
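A minimal sketch of this, assuming the dissimilarity is the sum of the Euclidean distances between corresponding endpoints, minimized over the recorded whitelisted changes. The function name `dissimilarity` and the two-dimensional example points are illustrative assumptions.

```python
import math

def dissimilarity(z_old, z_new, whitelist):
    """Distance from an observed change to the closest whitelisted change.

    z_old, z_new: latent points (sequences of floats) before and after.
    whitelist: list of (start_point, end_point) pairs recorded in training.
    """
    if not whitelist:
        return math.inf  # nothing is allowed, so every change is maximally unlike
    return min(
        math.dist(z_old, start) + math.dist(z_new, end)
        for start, end in whitelist
    )

# One whitelisted "wormhole" from (0, 0) to (1, 0).
whitelist = [((0.0, 0.0), (1.0, 0.0))]

exact = dissimilarity((0.0, 0.0), (1.0, 0.0), whitelist)     # same endpoints
nearby = dissimilarity((0.0, 1.0), (1.0, 1.0), whitelist)    # each endpoint off by 1
backward = dissimilarity((1.0, 0.0), (0.0, 0.0), whitelist)  # wrong direction
```

Traversing the wormhole backwards is penalized just like any other unlisted change, which matches the one-way picture.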

Advantages

Whitelisting asserts that we can effectively encapsulate a large part of what "change" means by using a reasonable ontology to penalize object-level changes. We thereby ground the definition of "side effect", avoiding the issue raised by Taylor et al.:

For example, if we ask [the agent] to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials - a large side effect. However, we cannot simply design it to avoid having large effects in general, since we would like the system's actions to still have the desirable large follow-on effect of improving the family's socioeconomic situation.

Nonetheless, we may not be able to perfectly express what it means to have side-effects: the whitelist may be incomplete, the latent space insufficiently granular, and the allowed plans sub-optimal. However, the agent still becomes more robust against:

  • Incomplete specification of the utility function.
    • Likewise, an incomplete whitelist means missed opportunities, but not unsafe behavior.
  • Out-of-distribution situations (as long as the objects therein roughly fit in the provided ontology).
  • Some varieties of reward hacking. For example, equipped with a can of blue spray paint and tasked with finding the shortest path of blue tiles to the goal, a normal agent may learn to paint red tiles blue, while a whitelist-enabled agent would incur penalties for doing so (the change from red tile to blue tile is not on the whitelist).
  • Dangerous exploration. While this approach does not attempt to achieve safe exploration (also acting safely during training), an agent with some amount of foresight will learn to avoid actions which likely lead to non-whitelisted side effects.
    • I believe that this can be further sharpened using today's machine learning technology, leveraging deep Q-learning to predict both action values and expected transitions.
      • This allows querying the human about whether particularly-inhibiting transitions belong on the whitelist. For example, if the agent notices that a bunch of otherwise-rewarding plans are being held up by a particular transition, it could ask for permission to add it to the whitelist.
  • Assigning astronomically-large weight to side effects happening throughout the universe. Presumably, we can just have the whitelist include transitions going on out there - we don't care as much about dictating the exact mechanics of distant supernovae.
    • If an agent did somehow come up with plans that involved blowing up distant stars, that would indeed constitute astronomical waste. Whitelisting doesn't solve the problem of assigning too much weight to events outside our corner of the neighborhood, but it's an improvement.
      • Logical uncertainty may be our friend here, such that most reasonable plans incur roughly the same level of interstellar penalty noise.

Results

I tested a vanilla Q-learning agent and its whitelist-enabled counterpart in 100 randomly-generated grid worlds (dimensions up to ). The agents were rewarded for reaching the goal square as quickly as possible; no explicit penalties were levied for breaking objects.

The simulated classification confidence of each object's true class was (truncated to ), . This simulated sensor noise was handled with a Bayesian statistical approach.

At reasonable levels of noise, the whitelist-enabled agent completed all levels without a single side effect, while the Q-learner broke over 80 vases.
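For intuition, here is a toy version of this kind of comparison. It is not the experiment above: the 2x3 layout, the reward values, the hyperparameters, the noise-free observations, and every name in the code are assumptions chosen to keep the sketch small. The only difference between the two agents is a penalty for the non-whitelisted vase-to-shards change.

```python
import random

ROWS, COLS = 2, 3
START, VASE, GOAL = (0, 0), (0, 1), (0, 2)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action, penalty):
    """Advance one step; a state is (position, vase_broken)."""
    (r, c), broken = state
    nr, nc = r + action[0], c + action[1]
    if not (0 <= nr < ROWS and 0 <= nc < COLS):
        nr, nc = r, c  # bumping into a wall leaves the agent in place
    reward = -1.0  # time cost, so shorter paths are preferred
    if (nr, nc) == VASE and not broken:
        broken = True
        reward -= penalty  # vase -> shards is not on the whitelist
    done = (nr, nc) == GOAL
    if done:
        reward += 10.0
    return ((nr, nc), broken), reward, done

def train(penalty, episodes=3000, alpha=0.5, gamma=0.95, epsilon=0.2, seed=0):
    """Tabular Q-learning with an optional whitelist penalty."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = (START, False)
        for _ in range(50):
            values = q.setdefault(state, [0.0] * len(ACTIONS))
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = values.index(max(values))
            nxt, reward, done = step(state, ACTIONS[a], penalty)
            future = 0.0 if done else max(q.setdefault(nxt, [0.0] * len(ACTIONS)))
            values[a] += alpha * (reward + gamma * future - values[a])
            state = nxt
            if done:
                break
    return q

def vases_broken(q):
    """Follow the greedy policy once and report whether the vase broke."""
    state = (START, False)
    for _ in range(20):
        values = q.get(state, [0.0] * len(ACTIONS))
        state, _reward, done = step(state, ACTIONS[values.index(max(values))], 0.0)
        if done:
            break
    return int(state[1])

vanilla = vases_broken(train(penalty=0.0))  # shortest path runs through the vase
careful = vases_broken(train(penalty=5.0))  # the detour is cheaper than the penalty
```

With these particular numbers the two-step path through the vase is worth more than the four-step detour only when the penalty is absent, so the penalized agent should learn to walk around.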

Assumptions

I am not asserting that these assumptions necessarily hold.

  • The agent has some world model or set of observations which can be decomposed into a set of discrete objects.
    • Furthermore, there is no need to identify objects on multiple levels (e.g., a forest, a tree in the forest, and that tree's bark need not all be identified concurrently).
    • Not all objects need to be represented - what do we make of a 'field', or the 'sky', or 'the dark places between the stars visible to the naked eye'? Surely, these are not all objects.
  • We have an ontology which reasonably describes (directly or indirectly) the vast majority of negative side effects.
    • Indirect descriptions of negative outcomes means that even if an undesirable transition isn't immediately penalized, it generally results in a number of penalties. Think: pollution.
    • Latent space whitelisting: the learned latent space encapsulates most of the relevant side effects. This is a slightly weaker assumption.
  • Said ontology remains in place.

Problems

Beyond resolving the above assumptions, and in roughly ascending difficulty:

Object Permanence

If you wanted to implement whitelisting in a modern embodied deep-learning agent, you could certainly pair deep networks with state-of-the-art segmentation and object tracking approaches to get most of what you need. However, what's the difference between an object leaving the frame, and an object vanishing?

Not only does the agent need to realize that objects are permanent, but also that they keep interacting with the environment even when not being observed. If this is not realized, then an agent might set an effect in motion, stop observing it, and then turn around when the bad effect is done to see a "new" object in its place.

Time Step Size Invariance

The penalty is presently attenuated based on the probability that the belief shift was due to noise in the data. Accordingly, there are certain ways to abuse this to skirt the penalty. For example, simply have non-whitelisted side effects take place over long timescales; this would be classified as noise and attenuated away.
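A toy illustration of the exploit follows. The real attenuation is probabilistic; the hard `noise_floor` threshold below is a stand-in assumption, as are the function name and the numbers.

```python
def attenuated_penalty(shifts, noise_floor=0.05):
    """Total penalty when each per-step belief shift is discounted as noise.

    shifts: probability mass moved at each time step.
    noise_floor: assumed amount of shift per step that is written off as noise.
    """
    return sum(max(0.0, s - noise_floor) for s in shifts)

# The same total shift of 0.8, delivered abruptly or spread over 20 steps.
abrupt = attenuated_penalty([0.8])
gradual = attenuated_penalty([0.04] * 20)
```

The abrupt change is penalized almost in full, while the gradual one never rises above the noise floor and costs nothing.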

However, if we don't need to handle noise in the belief distributions, this problem disappears - presumably, an advanced agent keeps its epistemic house in order. I'm still uncertain about whether (in the limit) we have to hard-code a means for decomposing a representation of the world-state into objects, and where to point the penalty evaluator in a potentially self-modifying agent.

Information Theory

Whitelisting is wholly unable to capture the importance of "informational states" of systems. It would apply no penalty to passing powerful magnets over your hard drive. It is not clear how to represent this in a sensible way, even in a latent space.

Loss of Value

Whitelisting could get us stuck in a tolerable yet sub-optimal future. Corrigibility via some mechanism for expanding the whitelist after training has ended is then desirable. For example, the agent could propose extensions to the whitelist. To avoid manipulative behavior, the agent should be indifferent as to whether the extension is approved.

Even if extreme care is taken in approving these extensions, mistakes may be made. The agent itself should be sufficiently corrigible and aligned to notice "this outcome might not actually be what they wanted, and I should check first".

Reversibility

As DeepMind outlines in Specifying AI Safety Problems in Simple Environments, we may want to penalize not just physical side effects, but also causally-irreversible effects:

Krakovna et al. introduce a means for penalizing actions by the proportion of initially-reachable states which are still reachable after the agent acts.

I think this is a step in the right direction. However, even given a hypercomputer and a perfect simulator of the universe, this wouldn't work for the real world if implemented literally. That is, due to entropy, you may not be able to return to the exact same universe configuration. To be clear, the authors do not suggest implementing this idealized algorithm, flagging a more tractable abstraction as future work.
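As a sketch of the idealized version on a tiny, fully known state graph (this is not Krakovna et al.'s implementation; the graph, the state names, and the functions are assumptions for illustration):

```python
from collections import deque

def reachable(graph, start):
    """All states reachable from start via breadth-first search."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reachability_penalty(graph, initial, current):
    """Fraction of initially-reachable states that can no longer be reached."""
    before = reachable(graph, initial)
    still = before & reachable(graph, current)
    return 1.0 - len(still) / len(before)

# Hypothetical world: a vase can be moved back and forth, or broken for good.
graph = {
    "vase_on_table": ["vase_on_shelf", "vase_broken"],
    "vase_on_shelf": ["vase_on_table"],
    "vase_broken": [],
}

moved = reachability_penalty(graph, "vase_on_table", "vase_on_shelf")
broke = reachability_penalty(graph, "vase_on_table", "vase_broken")
```

Moving the vase cuts off nothing, while breaking it removes two of the three initially-reachable states. The entropy objection above is that in the real world, at the level of exact configurations, almost every action looks like the second case.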

What does it really mean for an "effect" to be "reversible"? What level of abstraction do we in fact care about? Does it involve reversibility, or just outcomes for the objects involved?

Ontological Crises

When a utility-maximizing agent refactors its ontology, it isn't always clear how to apply the old utility function to the new ontology - this is called an ontological crisis.

Whitelisting may be vulnerable to ontological crises. Consider an agent whose whitelist disincentivizes breaking apart a tile floor (the change from floor to tiles is not on the whitelist); conceivably, the agent could come to see the floor as being composed of many tiles. Accordingly, the agent would no longer consider removing tiles to be a side effect.

Generally, proving invariance of the whitelist across refactorings seems tricky, even assuming that we can identify the correct mapping.

Retracing Steps

When I first encountered this problem, I was actually fairly optimistic. It was clear to me that any ontology refactoring should result in utility normalcy - roughly, the utility functions induced by the pre- and post-refactoring ontologies should output the same scores for the same worlds.

Wow, this seems like a useful insight. Maybe I'll write something up!

Turns out a certain someone beat me to the punch - here's a novella Eliezer wrote on Arbital about "rescuing the utility function".

Clinginess

This problem cuts to the core of causality and "responsibility" (whatever that means). Say that an agent is clingy when it not only stops itself from having certain effects, but also stops you. Whitelist-enabled agents are currently clingy.

Let's step back into the human realm for a moment. Consider some outcome - say, the sparking of a small forest fire in California. At what point can we truly say we didn't start the fire?

  • My actions immediately and visibly start the fire.
  • At some moderate temporal or spatial remove, my actions end up starting the fire.
  • I intentionally persuade someone to start the fire.
  • I unintentionally (but perhaps predictably) incite someone to start the fire.
  • I set in motion a moderately-complex chain of events which convince someone to start the fire.
  • I provoke a butterfly effect which ends up starting the fire.
  • I provoke a butterfly effect which ends up convincing someone to start a fire which they:
    • were predisposed to starting.
    • were not predisposed to starting.

Taken literally, I don't know that there's actually a significant difference in "responsibility" between these outcomes - if I take the action, the effect happens; if I don't, it doesn't. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as "under our control" and some as "out of our hands". Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?

Can we really do no better than a naive counterfactual penalty with respect to whatever impact measure we use? My confusion here is not yet dissolved. In my opinion, this is a gaping hole in the heart of impact measures - both this one, and others.

Stasis

Fortunately, a whitelist-enabled agent should not share the classic convergent instrumental goal of valuing us for our atoms.

Unfortunately, depending on the magnitude of the penalty in proportion to the utility function, the easiest way to prevent penalized transitions may be putting any relevant objects in some kind of protected stasis, and then optimizing the utility function around that. Whitelisting is clingy!

If we have at least an almost-aligned utility function and proper penalty scaling, this might not be a problem.

Edit: a potential solution to clinginess, with its own drawbacks.

Discussing Imperfect Approaches

A few months ago, Scott Garrabrant wrote about robustness to scale:

Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities.

I recommend reading it - it's to-the-point, and he makes good points.

Here are three further thoughts:

  • Intuitively-accessible vantage points can help us explore our unstated assumptions and more easily extrapolate outcomes. If less mental work has to be done to put oneself in the scenario, more energy can be dedicated to finding nasty edge cases. For example, it's probably harder to realize all the things that go wrong with naive impact measures like raw particle displacement, since it's just a weird frame through which to view the evolution of the world. I've found it to be substantially easier to extrapolate through the frame of something like whitelisting.
    • I've already adjusted for the fact that one's own ideas are often more familiar and intuitive, and then adjusted for the fact that I probably didn't adjust enough the first time.
  • Imperfect results are often left unstated, wasting time and obscuring useful data. That is, people cannot see what has been tried and what roadblocks were encountered.
  • Promising approaches may be conceptually-close to correct solutions. My intuition is that whitelisting actually almost works in the limit in a way that might be important.

Conclusion

Although somewhat outside the scope of this post, whitelisting permits the concise shaping of reward functions to get behavior that might be difficult to learn using other methods. This method also seems fairly useful for aligning short- and medium-term agents. While encountering some new challenges, whitelisting ameliorates or solves many problems with previous impact measures.


Even an idealized form of whitelisting is not sufficient to align an otherwise-unaligned agent. However, the same argument can be made against having an off-switch; if we haven't formally proven the alignment of a seed AI, having more safeguards might be better than throwing out the seatbelt to shed deadweight and get some extra speed. Of course, there are also legitimate arguments to be made on the basis of timelines and optimal time allocation.

Humor aside, we would have no luxury of "catching up on alignment theory" if our code doesn't work on the first go - that is, if the AI still functions, yet differently than expected.

Luckily, humans are great at producing flawless code on the first attempt.

A potentially-helpful analogy: similarly to how Bayesian networks decompose the problem of representing a (potentially extremely large) joint probability table to that of specifying a handful of conditional tables, whitelisting attempts to decompose the messy problem of quantifying state change into a set of comprehensible ontological transitions.

Technically, at 6,250 words, Eliezer's article falls short of the 7,500 required for "novella" status.

Is there another name for this?

I do think that "responsibility" is an important part of our moral theory, deserving of rescue.

In particular, I found one variant of Murphyjitsu helpful: I visualized Eliezer commenting "actually, this fails terribly because..." on one of my posts, letting my mind fill in the rest.

In my opinion, one of the most important components of doing AI alignment work is iteratively applying Murphyjitsu and Resolve cycles to your ideas.

A fun example: I imagine it would be fairly easy to train an agent to only destroy certain-colored ships in Space Invaders.

26 comments
I'm fairly confident that whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale.

What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two.

I think it is likely best to push against including that sort of thing in the Overton window of what's considered AI safety / AI alignment literature.

  • The people who really care about the field care about existential risks and superintelligence, and that's also the sort we want to attract to the field as it grows. It is pretty bad if the field drifts toward safety for self-driving cars and housecleaning robots, particularly if it trades off against research reducing existential risk.
  • There is a risk that a large body of safety literature which works for preventing today's systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem, by lulling safety-concerned people into a false sense of security, thinking that installing those solutions counts as sufficient caution. (Note that I am not complaining about using the vase example as a motivating example -- my concern lies with approaches which specifically target "short- to mid-term" without the robustness to scale to tackle far-term.)
  • There is something to be said about making progress on what we can (in the hopes that it will help create progress later where we currently don't have any traction), but robustness to scale is actually really essential to the hard/interesting part of the problem here. There are many measures of impact which one can come up with; as you say, all of these create other problems when optimized very hard, because the AI can find clever ways to have a very low impact, and these end up being counter to our intentions. Your whitelisting proposal has the same sorts of problems. The interesting thing is to get a notion of "low impact" exactly right, so that it doesn't go wrong even in a very intelligent system.

(I am here referring to "robustness to scale" in the direction of robustness to high-capability systems, but I note that the term also refers to robustness in the low-capability direction and robustness to differences in relative capability of subcomponents. Those aren't as relevant to my critique here.)

You already name several failures of the suggested whitelisting approach; of these, I would point to "clinginess" as the most damning. A safeguard would ideally have the property that, when added to an already-aligned AI, it would not misalign that AI. Whitelisting fails badly on that desideratum; it creates an AI which would seek to reduce the impact of everything in the universe, not just itself.

I would point out several more potential failure modes. I think some of the following may apply to the approach as stated and others to nearby alternatives. I haven't yet understood exactly the version which you implemented, but my guess is that you would agree that the key idea is to penalize a shift in probability distributions, if those shifts haven't been whitelisted. Decisions about exactly which probability distributions are being compared, such as how change over time is treated, can be reasonably varied while keeping the spirit of the approach (as I understand it).

  • If the penalty is applied to immediate effects of actions, then the AI would not care if all the vases fall over and break after the AI completes the course. In other words, a penalty applied to immediate consequences will fail to penalize predictable long-term effects of actions.
  • If the penalty is applied in a way which penalizes long-term effects, then the AI might do what it can to hold everything still, or it might try to carry out its task but otherwise make the world as much as possible look like one where its task was not carried out (example: curing a fatal disease but then killing the patient, because a saved patient would have all sorts of non-whitelisted consequences eventually).
  • In particular, "clinginess" could make the AI want to take over the light cone to install impact-reducing measures everywhere.
  • Penalizing a shift in probability distributions can incentivize the agent to learn as little as possible, which is a bit weird.
  • Certain versions will have the property that if the agent is already quite confident in what it will do, then consequences of those actions do not count as "changes" (no shift in probability when we condition on the act). This would create a loophole allowing for any actions to be "low impact" under the right conditions.

I think it is likely best to push against including that sort of thing in the Overton window of what’s considered AI safety / AI alignment literature.

I'm really sympathetic to these concerns but I'm worried about the possible unintended consequences of trying to do this. There will inevitably be a large group of people working on short and medium term AI safety (due to commercial incentives) and pushing them out of "AI safety / AI alignment literature" risks antagonizing them and creating an adversarial relationship between the two camps, and/or creates a larger incentive for people to stretch the truth about how robust to scale their ideas are. Is this something you considered?

I'm not sure how to think about this. My intuition is that this doesn't need to be a problem if people in (my notion of) the AI alignment field just do the best work they can do, so as to demonstrate by example what the larger concerns are. In other words, win people over by being sufficiently exciting rather than by being antagonizing/exclusive. I suppose that's not very consistent with my comment above.

I think most hard engineering problems are made up of a lot of smaller solutions and especially made up of the lessons learned attempting to implement small solutions, so I think it's incorrect to think of something that's useful but incomplete as competing with the true solution rather than actually being a part of the path to it.

I definitely agree with that. There has to be room to find traction. The concern is about things which specifically push the field toward "near-term" solutions, which slides too easily into not-solving-the-same-sorts-of-problems-at-all. I think a somewhat realistic outcome is that the field is taken over by standard machine learning research methodology of achieving high scores on test cases and benchmarks, to the exclusion of research like logical induction. This isn't particularly realistic because logical induction is actually not far from the sorts of things done in theoretical machine learning. However, it points at the direction of my concern.

I think it is likely best to push against including that sort of thing in the Overton window of what's considered AI safety / AI alignment literature.

Here's my understanding of your reasoning: "this kind of work may have the unintended consequence of pushing people who would have otherwise worked on hard core problems of x-risk to more prosaic projects, lulling them into a false sense of security when progress is made."

I think this is possible, but rather unlikely:

  • It isn't clear that work allocation for immediate and long-term safety is zero-sum - Victoria wrote more about why this might not be the case.
  • The specific approach I took here might be conducive to getting more people currently involved with immediate safety interested in long-term approaches. That is, someone might be nodding along - "hey, this whitelisting thing might need some engineering to implement, but this is solid!" and then I walk them through the mental motions of discovering how it doesn't work, helping them realize that the problem cuts far deeper than they thought.
    • In my mental model, this is far more likely than pushing otherwise-promising people to inaction.
  • I'm actually concerned that a lack of overlap between our communities will insulate immediate safety researchers from long-term considerations, having a far greater negative effect. I have weak personal evidence for this being the case.
  • Why would people (who would otherwise be receptive to rigorous thinking about x-risk) lose sight of the greater problems in alignment? I don't expect DeepMind to say "hey, we implemented whitelisting, we're good to go! Hit the switch." In my model, people who would make a mistake like that probably were never thinking about x-risk to begin with.
my concern lies with approaches which specifically target "short- to mid-term" without the robustness to scale to tackle far-term.

As I understand it, this argument can also be applied to any work that doesn't plausibly one-shot a significant alignment problem, potentially including research by OpenAI and DeepMind. While obviously we'd all prefer one-shots, sometimes research is more incremental (I'm sure this isn't news to you!). Here, I set out to make progress on one of the Concrete Problems; after doing so, I thought "does this scale? What insights can we take away?". I had relaxed the problem by assuming a friendly ontology, and I was curious what difficulties (if any) remained.

We are currently grading this approach by the most rigorous of metrics - I think this is good, as that's how we will eventually be judged! However, we shouldn't lose sight of the fact that most safety work won't be immediately superintelligence-complete. Exploratory work is important. I definitely agree that we should shoot to kill - I'm not advocating an explicit focus on short-term problems. However, we shouldn't screen off the value we can get from sharing imperfect results.

There are many measures of impact which one can come up with; as you say, all of these create other problems when optimized very hard, because the AI can find clever ways to have a very low impact, and these end up being counter to our intentions. Your whitelisting proposal has the same sorts of problems. The interesting thing is to get a notion of "low impact" exactly right, so that it doesn't go wrong even in a very intelligent system.

I'd also like to push back slightly against an implication here - while it is now clear to me that "the interesting thing" is indeed this clinginess issue, this wasn't apparent at the outset. Perhaps I missed some literature review, but there was no such discussion of the hard core issues of impact measures; Eliezer certainly discussed a few naive approaches, but the literature was otherwise rather slim.

Penalizing a shift in probability distributions can incentivize the agent to learn as little as possible, which is a bit weird.

Yeah, I noticed this too, but I put that under "how do we get agents to want to learn about how the world is - i.e., avoid wireheading?". I also think that function composition with the raw utility would be helpful in avoiding weird interplay.

Certain versions will have the property that if the agent is already quite confident in what it will do, then consequences of those actions do not count as "changes" (no shift in probability when we condition on the act). This would create a loophole allowing for any actions to be "low impact" under the right conditions.

I don't follow - the agent has a distribution for an object at time t, and another at t+1. It penalizes based on changes in its beliefs about the actual world at the time steps - not with respect to its expectation.

"this kind of work may have the unintended consequence of pushing people who would have otherwise worked on hard core problems of x-risk to more prosaic projects, lulling them into a false sense of security when progress is made."

I think it is more like:

  • This kind of work seems likely to one day redirect funding intended for X-risk away from X-risk.
  • I know people who would point to this kind of thing to argue that AI can be made safe without the kind of deep decision theory thinking MIRI is interested in. Those people would probably argue against X-risk research regardless, but the more stuff there is that's difficult for outsiders to distinguish from X-risk relevant research, the more difficulty outsiders have assessing such arguments.

So it isn't so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).

As I understand it, this argument can also be applied to any work that doesn't plausibly one-shot a significant alignment problem, potentially including research by OpenAI and DeepMind. While obviously we'd all prefer one-shots, sometimes research is more incremental (I'm sure this isn't news to you!). Here, I set out to make progress on one of the Concrete Problems; after doing so, I thought "does this scale? What insights can we take away?". I had relaxed the problem by assuming a friendly ontology, and I was curious what difficulties (if any) remained.

I agree that research has to be incremental. It should be taken almost for granted that anything currently written about the subject is not anywhere near a real solution even to a sub-problem of safety, unless otherwise stated. If I had to point out the one line which caused me to have such a skeptical reaction to your post, it would be:

I'm fairly confident that whitelisting contributes meaningfully to short- to mid-term AI safety,

If instead this had been presented as "here's something interesting which doesn't work" I would not have made the objection I made. IE, what's important is not any contribution to near- or medium- term AI safety, but rather exploration of the landscape of low-impact RL, which may eventually contribute to reducing X-risk. IE, more the attitude you express here:

We are currently grading this approach by the most rigorous of metrics - I think this is good, as that's how we will eventually be judged! However, we shouldn't lose sight of the fact that most safety work won't be immediately superintelligence-complete. Exploratory work is important.

So, I'm saying that exploratory work should not be justified as "confident that this contributes meaningfully to short-term safety". Almost everything at this stage is more like "maybe useful for one day having better thoughts about reducing X-risk, maybe not".

I don't follow - the agent has a distribution for an object at time t, and another at t+1. It penalizes based on changes in its beliefs about how the world actually is at the time steps - not with respect to its expectation.

Ah, alright.

So it isn't so much that I think people who would work on X-risk would be redirected, as that I think there will be a point where people adjacent to X-risk research will have difficulty telling which people are actually trying to work on X-risk, and also what the state of the X-risk concerns is (I mean to what extent it has been addressed by the research).

That makes more sense. I haven't thought enough about this aspect to have a strong opinion yet. My initial thoughts are that

  • this problem can be basically avoided if this kind of work clearly points out where the problems would be if scaled.
    • I do think it's plausible that some less-connected funding sources might get confused (NSF), but I'd be surprised if later FLI funding got diverted because of this. I think this kind of work will be done anyways, and it's better to have people who think carefully about scale issues doing it.
  • your second bullet point reminds me of how some climate change skeptics will point to "evidence" from "scientists", as if that's what convinced them. In reality, however, they're drawing the bottom line first, and then pointing to what they think is the most dignified support for their position. I don't think that avoiding this kind of work would ameliorate that problem - they'd probably just find other reasons.
  • most people on the outside don't understand x-risk anyways, because it requires thinking rigorously in a lot of ways to not end up a billion miles off of any reasonable conclusion. I don't think that this additional straw will marginally add significant confusion.

IE, what's important is not any contribution to near- or medium- term AI safety

I'm confused how "contributes meaningfully to short-term safety" and "maybe useful for having better thoughts" are mutually-exclusive outcomes, or why it's wrong to say that I think my work contributes to short-term efforts. Sure, that may not be what you care about, but I think it's still reasonable that I mention it.

I'm saying that exploratory work should not be justified as "confident that this contributes meaningfully to short-term safety". Almost everything at this stage is more like "maybe useful for one day having better thoughts about reducing X-risk, maybe not".

I'm confused why that latter statement wasn't what came across! Later in that sentence, I state that I don't think it will scale. I also made sure to highlight how it breaks down in a serious way when scaled up, and I don't think I otherwise implied that it's presently safe for long-term efforts.

I totally agree that having better thoughts about x-risk is a worthy goal at this point.

I'm confused how "contributes meaningfully to short-term safety" and "maybe useful for having better thoughts" are mutually-exclusive outcomes, or why it's wrong to say that I think my work contributes to short-term efforts. Sure, that may not be what you care about, but I think it's still reasonable that I mention it.

In hindsight I am regretting the way my response went. While it was my honest response, antagonizing newcomers to the field for paying any attention to whether their work might be useful for sub-AGI safety doesn't seem like a good way to create the ideal research atmosphere. Sorry for being a jerk about it.

Although I did flinch a bit, my S2 reaction was "this is Abram, so if it's criticism, it's likely very high-quality. I'm glad I'm getting detailed feedback, even if it isn't all positive". Apology definitely accepted (although I didn't view you as being a jerk), and really - thank you for taking the time to critique me a bit. :)

Vika

Interesting work! Seems closely related to this recent paper from Satinder Singh's lab: Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. They also use whitelists to specify which features of the state the agent is allowed to change. Since whitelists can be unnecessarily restrictive, and finding a policy that completely obeys the whitelist can be intractable in large MDPs, they have a mechanism for the agent to query the human about changing a small number of features outside the whitelist. What are the main advantages of your approach over their approach?

I agree with Abram that clinginess (the incentive to interfere with irreversible processes) is a major issue for the whitelist method. It might be possible to get around this by using an inaction baseline, i.e. only penalizing non-whitelisted transitions if they were caused by the agent, and would not have happened by default. This requires computing the inaction baseline (the state sequence under some default policy where the agent "does nothing"), e.g. by simulating the environment or using a causal model of the environment.
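The inaction-baseline idea can be sketched in a few lines. This is only an illustration under strong assumptions, not the method from either paper: it assumes the world is already segmented into named objects, that each object's class before and after is known with certainty, and that the baseline rollout under the "do nothing" policy is given. The names `penalized_objects`, `observed`, and `baseline` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Transition = Tuple[str, str]  # (class before, class after) for one object


def penalized_objects(
    observed: Dict[str, Transition],
    baseline: Dict[str, Transition],
    whitelist: Set[Transition],
) -> List[str]:
    """Return objects whose observed transition should be penalized.

    A transition is penalized only if it (1) actually changes the object,
    (2) is not on the whitelist, and (3) differs from what would have
    happened to that object under the inaction baseline.
    """
    flagged = []
    for obj, (before, after) in observed.items():
        if before == after:
            continue  # nothing changed
        if (before, after) in whitelist:
            continue  # explicitly allowed
        if baseline.get(obj) == (before, after):
            continue  # would have happened by default, so not the agent's doing
        flagged.append(obj)
    return sorted(flagged)


# Hypothetical example: the patient dies whether or not the agent acts,
# the vase only breaks because of the agent, and opening doors is allowed.
whitelist = {("closed", "open")}
baseline = {
    "patient": ("sick", "dead"),
    "vase": ("vase", "vase"),
    "door": ("closed", "closed"),
}
observed = {
    "patient": ("sick", "dead"),
    "vase": ("vase", "shards"),
    "door": ("closed", "open"),
}
flagged = penalized_objects(observed, baseline, whitelist)
```

Here only the vase is flagged. The sketch does nothing about effects that other agents produce in reaction to the agent's behavior, which is where the harder part of the clinginess problem seems to live.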

I'm not convinced that whitelisting avoids the offsetting problem: "Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low." I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal (e.g. increasing life expectancy). Capturing all of the relevant consequences in the whitelist seems hard.

The directedness of whitelists is a very important property, because it can produce an asymmetric impact measure that distinguishes between causing irreversible effects and preventing irreversible events.

That's pretty cool that another group had a whitelisting-ish approach! They also guarantee that (their version of) the whitelist is obeyed, which is nice.

What are the main advantages of your approach over their approach?

As I understand it, my formulation has the following advantages:

  • Automatically deduces what effects are, while making weaker assumptions (like the ability to segment the world into discrete objects and maintain beliefs about their identities).
    • In contrast, their approach requires complete enumeration of side effects. Also, they seem to require the user to specify all effects, but also claim that specifying whether the effect is good is too expensive?
    • It's unclear how to apply their feature decomposition to the real world. For example, if it's OK for an interior door to be opened, how does the agent figure out if the door is open outside of toy MDPs? What about unlocked but closed, or just slightly ajar? Where is the line drawn?
    • The number of features and values seems to grow extremely quickly with the complexity of the environment, which gets us back to the "no compact enumeration" problem.
  • Doesn't require a complete model.
    • To be fair, calculating the counterfactual in my formulation requires at least a reasonably-accurate model.
  • Works in stochastic environments.
    • It is possible that their approach could be expanded to do so, but it is not immediately clear to me how.
  • Uses counterfactual reasoning to provide a limited reduction in clinginess.
  • Can trade off having an unknown effect with large amounts of reward. Their approach would do literally anything to prevent unknown effects.
  • Can act even when no perfect outcome is attainable.
  • Has plausibly-negligible performance overhead.
  • Can be used with a broad(er) class of RL approaches, since any given whitelist implicitly defines a new reward function.

It might be possible to get around [clinginess] by using an inaction baseline

Yes, but I think a basic counterfactual wouldn't quite get us there, since if people have different side effects due to reacting to the agent's behavior, that would still be penalized. My position right now is that either 1) we can do really clever things with counterfactuals to cleanly isolate the agent's effects, 2) the responsibility problem is non-trivially entangled with our ethics and doesn't admit a compact solution, or 3) there's another solution I have yet to find. I'm not putting much probability mass on 2) (yet).

In fact, clinginess stems from an incorrect assumption. In a single-agent environment, whitelisting seems to be pretty close to what we want for any capability level. It finds a plan that trades off maximizing the original reward with minimizing side-effects compared to an inaction baseline. However, if there are other agents in the environment whose autonomy is important, then whitelisting doesn't account for the fact that restricting autonomy is also a side effect. The inaction counterfactual ignores the fact that we want to be free to act with respect to the agent's actions - not just how we would have acted in the agent's absence.

I've been exploring solutions to clinginess via counterfactuals from first principles, and might write about it in the near future.

I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal

So I think the main issue here is still clinginess; if the agent isn't clingy, the problem goes away.

Suppose people can be dead, sick, or healthy, where the sick people die if the agent does nothing. While curing people is presumably whitelisted (or could be queried as an addition to the whitelist), curing people means more people means more broken things in expectation (since people might have non-whitelisted side effects). If we have a non-clingy formulation, this is fine; if not, this would indeed be optimized against. In fact, we'd probably just get a bunch of people cured of their cancer and put in stasis.

I wonder whether "avoiding side effects" will play any role in long-term AI safety. It seems to me that in the long run, we have to assume that the user might tell the AI to do something that intrinsically must have lots of side effects, and therefore requires learning a detailed model of the user's values in order to backchain through only good side effects (or at least the less bad ones). For example, "make money" (making people happy is generally a good side effect, but certain ways of making people happy are bad, don't let elections be influenced through your product, except through certain legitimate ways, hacking the bank is bad but taking advantage of certain quirks in the stock market is ok, etc.) or "win this war" (only kill combatants, not civilians, be humane to prisoners, don't let civilians come to harm through inaction, don't value civilian lives so much that human shields become an unbeatable tactic, etc.)

If the AI has a detailed model of the user's values and can therefore safely accomplish goals that intrinsically have lots of side effects, it can also apply that to safely accomplish goals that don't intrinsically have lots of side effects, without needing a separate "avoiding side effects" solution. Does anyone disagree with this?

therefore requires learning a detailed model of the user's values in order to backchain through only good side effects

That would just be the normal utility function. The motivation here isn't finding solutions to low-impact problems, it's minimizing impact while solving problems. One way to do that is by, well, measuring impact.

If the AI has a detailed model of the user's values and can therefore safely accomplish goals that intrinsically have lots of side effects, it can also apply that to safely accomplish goals that don't intrinsically have lots of side effects, without needing a separate "avoiding side effects" solution.

This seems to assume a totally-aligned agent and to then ask, "do we need anything else"? Well, no, we don't need anything beyond an agent which works to further human values just how we want.

But we may not know that the AI is fully aligned, so we might want to install off-buttons and impact measures for extra safety. Furthermore, having large side effects correlates strongly with optimizing away to extreme regions of the solution space; balancing maximizing original utility with minimizing a satisfactory, conservative impact measure (which whitelisting is not yet) bounds our risk in the case where the agent is not totally aligned.

Ok, if I understand your view correctly, the long-term problem is better described as "minimizing impact" rather than "avoiding side effects" and it's meant to be a second line of defense or a backup safety mechanism rather than a primary one.

Since "Concrete Problems in AI Safety" takes the short/medium term view and introduces "avoiding side effects" as a primary safety mechanism, and some people might not extrapolate correctly from that to the long run, do you know a good introduction to the "avoiding side effects"/"minimizing impact" problem that lays out both the short-term and long-term views?

ETA: Found this and this, however both of them also seem to view "low impact" as a primary safety mechanism, in other words, as a way to get safe and useful work out of advanced AIs before we know how to give them the "right" utility function or otherwise make them fully value aligned.

Whoops, illusion of transparency! The Arbital page is the best I've found (for the long-term view); the rest I reasoned on my own and sharpened in some conversations with MIRI staff.

What do you think about Paul Christiano's argument in the comment to that Arbital page?

Do you think avoiding side effects / low impact could work if an AGI was given a task like "make money" or "win this war" that unavoidably has lots of side effects? If so, can you explain why or give a rough idea of how that might work?

(Feel free not to answer if you don't have well formed thoughts on these questions. I'm curious what people working on this topic think about these questions, and don't mean to put you in particular on the spot.)

My current thoughts on this:

It seems like Paul's proposed solution here depends on the rest of Paul's scheme working (you need the human's opinions on what effects are important to be accurate). Of course if Paul's scheme works in general, then it can be used for avoiding undesirable side effects.

My current understanding of how a task-directed AGI could work is: it has some multi-level world model that is mappable to a human-understood ontology (e.g. an ontology in which there is spacetime and objects), and you give it a goal that is something like "cause this variable here to be this value at this time step". In general you want causal consequences of changing the variable to happen, but few other effects.

From this paper I wrote:

It may be possible to use the concept of a causal counterfactual (as formalized by Pearl [2000]) to separate some intended effects from some unintended ones. Roughly, “follow-on effects” could be defined as those that are causally downstream from the achievement of the goal of building the house (such as the effect of allowing the operator to live somewhere). Follow-on effects are likely to be intended and other effects are likely to be unintended, although the correspondence is not perfect. With some additional work, perhaps it will be possible to use the causal structure of the system’s world-model to select a policy that has the follow-on effects of the goal achievement but few other effects.

For things like "make money" there are going to be effects other than you having more money, e.g. some product was sold and others have less money. The hope here is that, since you have ontology mapping, you can (a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren't additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).

I think "win this war" is going to be a pretty difficult goal to formalize (as a bunch of what is implied by "winning a war" is psychological/sociological); probably it is better to think about achieving specific military objectives.

I realize I'm shoving most of the problem into the ontology mapping / transparency problem; I think this is correct, and that this problem should be prioritized, with the understanding that avoiding unintended side effects will be one use of the ontology mapping system.

EDIT: also worth mentioning that things get weird when humans are involved. One effect of a robot building a house is that someone sees a robot building a house, but how does this effect get formalized? I am not sure whether the right approach will be to dodge the issue (by e.g. using only very simple models of humans) or to work out some ontology for theory of mind that could allow reasoning about these sorts of effects.

(a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).

Are you aware of any previous discussion of this, in any papers or posts? I'm skeptical that there's a good way to implement this scoring function. For example we do want our AI to make money by inventing, manufacturing, and selling useful gadgets, and we don't want our AI to make money by hacking into a bank, selling a biological weapon design to a terrorist, running a Ponzi scheme, or selling gadgets that may become fire hazards. I don't see how to accomplish this without the scoring function being a utility function. Can you perhaps explain more about how "conservatism" might work here?

It should definitely take desiderata into account, I just mean it doesn't have to be VNM. One reason why it might not be VNM is if it's trying to produce a non-dangerous distribution over possible outcomes rather than an outcome that is not dangerous in expectation; see Quantilizers for an example of this.
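For readers unfamiliar with quantilizers, here is a minimal sketch over a finite action set. It assumes a base distribution given as non-negative weights and a scalar utility; the function name `quantilize` and its parameters are illustrative and not taken from the Quantilizers paper. Instead of picking the single highest-utility action, it samples from the base distribution restricted to the top q fraction (by base probability mass) of actions ranked by utility.

```python
import random
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    actions: Sequence[A],
    base_weights: Sequence[float],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Sample from the base distribution conditioned on the top-q quantile by utility."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    budget = q * sum(base_weights)  # base probability mass we are allowed to keep

    # Rank actions from highest to lowest utility.
    ranked = sorted(zip(actions, base_weights), key=lambda aw: utility(aw[0]), reverse=True)

    kept: List[A] = []
    kept_weights: List[float] = []
    mass = 0.0
    for action, weight in ranked:
        if mass >= budget:
            break
        take = min(weight, budget - mass)  # the boundary action may be kept only partially
        kept.append(action)
        kept_weights.append(take)
        mass += take

    return rng.choices(kept, weights=kept_weights, k=1)[0]


# Hypothetical usage: ten equally likely actions, utility equal to the action's index.
rng = random.Random(0)
choice = quantilize(list(range(10)), [1.0] * 10, lambda a: float(a), q=0.2, rng=rng)
```

With q = 0.2 the sample always comes from the two highest-utility actions, but which of the two is left to the base distribution. With q = 1 the agent simply samples from the base distribution. This choice rule is stochastic by design, which is one sense in which it falls outside the usual expected-utility-maximizing picture.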

In general things like "don't have side effects" are motivated by robustness desiderata, where we don't trust the AI to make certain decisions so would rather it be conservative. We might not want the AI to cause X but also not want the AI to cause not-X. Things like this are likely to be non-VNM.

Nice work! Whitelisting seems like a good thing to do, since it is safe by default. (Computer security has a similar principle of preferring to whitelist instead of blacklist.) I was initially worried that we'd have the problems of symbolic approaches to AI, where we'd have to enumerate far too many transitions for the whitelist in order to be able to do anything realistic, but since whitelisting could work on learned embedding spaces, and the whitelist itself can be learned from demonstrations, this could be a scalable method.

I'm worried that whitelisting presents generalization challenges -- if you are distinguishing between different colors of tiles, to encode "you can paint any tile" you'd have to whitelist transitions (redTile -> blueTile), (blueTile -> redTile), (redTile -> yellowTile) etc. Those won't all be in the demonstrations. If you are going to generalize there, how do you _not_ generalize (redLight -> greenLight) to (greenLight -> redLight) for an AI that controls traffic lights? It seems like you want to

On another note, I personally don't want to assume that we can point to a part of the architecture as the AI's ontology.

On the technical side: The whitelist is only closed under transitivity if you assume that the agent is capable of taking all transitions, and you aren't worried about cost. If you have a -> b and b -> c whitelisted, then the agent can only get from a to c if it can change a to c going through intermediate state b, which may be much harder than going directly from a to c.

You could just define the whitelist to be transitively closed, since it's not hard to compute the transitive closure of a directed graph.
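As a concrete illustration of that last point, here is a small sketch that computes the transitive closure of a whitelist represented as a set of (from, to) pairs, using Warshall's algorithm. The representation and the function name `transitive_closure` are assumptions made for the example, not something taken from the post's code.

```python
from typing import Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def transitive_closure(whitelist: Set[Edge]) -> Set[Edge]:
    """Return the smallest transitively closed set of transitions containing the whitelist."""
    closure = set(whitelist)
    nodes = {node for edge in closure for node in edge}
    # Warshall's algorithm: allow each node k in turn as an intermediate step.
    for k in nodes:
        for i in nodes:
            if (i, k) not in closure:
                continue
            for j in nodes:
                if (k, j) in closure:
                    closure.add((i, j))
    return closure


# Hypothetical whitelist: a -> b and b -> c are allowed.
closed = transitive_closure({("a", "b"), ("b", "c")})
```

In this example `closed` gains ("a", "c") and nothing else. Note that a cycle such as a -> b, b -> a also adds the self-transitions ("a", "a") and ("b", "b"), which are harmless if "no change" is never penalized. The running time is cubic in the number of object classes, which is fine for an explicit whitelist but says nothing about the latent-space case.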

I'm worried that whitelisting presents generalization challenges

I think you might have to bite the paintbucket in this case. Note that in a latent space formulation, having one tile-painting-transition whitelisted might suggest that other painting applications would have lower (but not 0) cost. In general, I agree - I don't see how we could expect reasonable generalization along those lines because of the traffic light issue. It isn't clear how big of a problem this is, though.

The whitelist is only closed under transitivity if you assume that the agent is capable of taking all transitions, and you aren't worried about cost. If you have a -> b and b -> c whitelisted, then the agent can only get from a to c if it can change a to c going through intermediate state b, which may be much harder than going directly from a to c.

That's correct - if a -> c isn't on the whitelist, it might de facto incur additional costs (whether in time or resources). I suppose I was pointing to the idea that our theoretical preferences should be closed under transitivity - if we accept a -> b and b -> c, we should not reject a -> c happening over time.

You could just define the whitelist to be transitively closed, since it's not hard to compute the transitive closure of a directed graph.

Good point! Does get trickier in latent space, though.

I'm not especially familiar with all the literature involved here, so forgive me if this is somehow repetitive.

However, I was wondering if having two lists might be preferable. Naturally, there would be non-whitelisted objects (do not interfere with these in any way). Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved so it becomes out of reach or illegible, and should not be moved or obstructed while in use). Third, of course, would be objects with "full permissions", such as, potentially, the paint on the aforementioned tiles.

The main difficulty here is that definitions for functional integrity would have to be either written or learned for virtually every function, though I suspect it would be (relatively) easy enough to recognise novel objects and their functions thereafter. Of course, there could also be some sort of machine-readable identification added to common objects which carries information on their functions, though whether this would only refer to predefined classes (books, bicycles, vases) or also be able to contain instructions on a new function type (potentially a useful feature for new inventions and similar) is a separate question.

Hey, thanks for the ideas!

non-whitelisted objects

Important detail: the whitelist is only with respect to transitions between objects, not the objects themselves!

Second, there could be objects which are fine to manipulate but must retain functional integrity (for instance, a book can be freely manipulated under most circumstances; however, it cannot be moved so it becomes out of reach or illegible, and should not be moved or obstructed while in use).

"Out of reach" is indexical, and it's not clear how (and whether) to even have whitelisting penalize displacing objects. Stasis notwithstanding, many misgivings we might have about an agent being able to move objects at its leisure should go away if we can say that these movements don't lead to non-whitelisted transitions (e.g., putting unshielded people in space would certainly lead to penalized transitions).

Third, of course, would be objects with "full permissions", such as, potentially, the paint on the aforementioned tiles.

I think that latent space whitelisting actually captures this kind of permissions-based granularity. As I imagine it, a descriptive latent space would act as an approximation to thingspace. Something I'm not sure about is whether the described dissimilarity will map up with our intuitive notions of dissimilarity. I think it's doable, whether via my formulation or some other one.

The main difficulty here is that definitions for functional integrity would have to be either written or learned for virtually every function, though I suspect it would be (relatively) easy enough to recognise novel objects and their functions thereafter.

One of the roles I see whitelisting attempting to fill is that of tracing a conservative convex hull inside the outcome space, locking us out of some good possibilities but (hopefully) many more bad ones. If we get into functional values, that's elevating the complexity from "avoid doing unknown things to unknown objects" to "learn what to do with each object". We aren't trying to build an entire utility function - we're trying to build a sturdy, conservative convex hull, and it's OK if we miss out on some details. I have a heuristic that says that the more pieces a solution has, the less likely it is to work.

Of course, there could also be some sort of machine-readable identification added to common objects which carries information on their functions, though whether this would only refer to predefined classes (books, bicycles, vases) or also be able to contain instructions on a new function type (potentially a useful feature for new inventions and similar) is a separate question.

This post's discussion implicitly focuses on how whitelisting interacts with more advanced agents for whom we probably wouldn't need to flag things like this. I think if we can get it robustly recognizing objects in its model and then projecting them into a latent space, that would suffice.

> Important detail: the whitelist is only with respect to transitions between objects, not the object themselves!

I understand the technical and semantic distinction here, but I'm not sure I understand the practical one, when it comes to actual behaviour and results. Is there a situation you have in mind where the two approaches would be notably different in outcome?

> Something I'm not sure about is whether the described dissimilarity will map up with our intuitive notions of dissimilarity. I think it's doable, whether via my formulation or some other one.

Well, there's also the issue that there are different opinions on different sorts of transitions between actual, living humans. There will probably never be an end to the arguments over whether graffiti is art or vandalism, for example. Dissimilarities between average human and average non-human notions should probably be expected, to some extent; perhaps even beneficial, assuming alignment goes well enough otherwise.

> "Out of reach" is indexical, and it's not clear how (and whether) to even have whitelisting penalize displacing objects. Stasis notwithstanding, many misgivings we might have about an agent being able to move objects at its leisure should go away if we can say that these movements don't lead to non-whitelisted transitions (e.g., putting unshielded people in space would certainly lead to penalized transitions).

Good point. Though, it's possible to imagine displacement becoming such a transition without the harm being so overt. As an example, even humans are prone (if usually by accident) to dropping or throwing objects in such a way as to make their retrieval difficult or, in some cases, effectively impossible; a non-human agent, I think, should take care to avoid making the same mistake, where not necessary.

> If we get into functional values, that's elevating the complexity from "avoid doing unknown things to unknown objects" to "learn what to do with each object". We aren't trying to build an entire utility function - we're trying to build a sturdy, conservative convex hull, and it's OK if we miss out on some details.

My intent in bringing it up was less, "simple whitelisting is too restrictive," and more, "maybe this would allow for a lesser number of lost opportunities while still coming fairly close to ensuring that things which are both unregistered and unrecognisable (by the agent in question) would not suffer an unfavourable transition."

In other words, it's less a replacement for the concept of whitelisting and more of a possible way of limiting its potential downsides. Of course, it would need to be implemented carefully, or else the benefits of whitelisting could also easily be lost, at least in part...

> I have a heuristic that says that the more pieces a solution has, the less likely it is to work.

While true, this reminds me of the Simple Poker series. The solution described in the second entry there was quite complicated (certainly much more so than the Nash equilibrium), but also quite successful (including, apparently, against Nash equilibrium opponents).

Additional pieces can make failure more likely, but too much simplicity can preclude success.

> I think if we can get it robustly recognizing objects in its model and then projecting them into latent space, that would suffice.

True, though there are many cases in which this doesn't work so well. For a more practical and serious example, a fair number of people need to wear alert tags of some sort which identify certain medical conditions or sensitivities, or else they could be inadvertently killed by paramedics or ER treatment. Road signs and various sorts of notice also exist to fulfill similar purposes for humans.

While it would be more than possible to have a non-human agent read the information in such cases, written text is a form of information transmission designed and optimised for human visual processing. It comes with numerous drawbacks, including a distinct possibility that the written information goes unnoticed altogether. A machine-specific form of 'tagging' could likely bypass these drawbacks easily.

It's hardly the first-line solution, of course.

> Is there a situation you have in mind where the two approaches would be notably different in outcome?

Can you clarify what you mean by whitelisting objects? Would we only be OK with certain things existing, or coming into existence (i.e., whitelisting an object effectively whitelists all means of getting to that object), or something else entirely?

> As an example, even humans are prone (if usually by accident) to dropping or throwing objects in such a way as to make their retrieval difficult or, in some cases, effectively impossible; a non-human agent, I think, should take care to avoid making the same mistake, where not necessary.

I hadn't thought of this, actually! So, part of me wants to pass this off to the utility function also caring about not imposing retrieval costs on itself, because if it isn't aligned enough to somewhat care about the things we do, we might be out of luck. That is, whitelisting isn't sufficient to align a wholly unaligned agent - just to make states we don't want harder to reach. If it has values orthogonal to ours, misplaced items might be the least of our concerns. Again, I think this is a valid consideration, and I'm going to think about it more!

> The solution described in the second entry there was quite complicated (certainly much more so than the Nash equilibrium), but also quite successful (including, apparently, against Nash equilibrium opponents).

Certainly more complex solutions can do better, but I imagine that the work required to formally verify an aligned system is a quadratic function of how many moving parts there are (that is, each part must play nice with all previous parts).
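
As a rough illustration of the quadratic intuition: if verification had to consider every pair of parts, then n parts would give n(n-1)/2 pairwise interactions. The sketch below is only a counting exercise under that assumption; `pairwise_checks` is a hypothetical helper, not a measure of real verification effort.

```python
from itertools import combinations


def pairwise_checks(num_parts: int) -> int:
    """Number of unordered pairs among num_parts parts: n * (n - 1) / 2.

    Assumption (for illustration only): each pair of parts needs one
    compatibility check, and higher-order interactions are ignored.
    """
    if num_parts < 0:
        raise ValueError("num_parts must be non-negative")
    return num_parts * (num_parts - 1) // 2


# Cross-check the closed form against explicit enumeration of pairs.
for n in (2, 4, 8):
    enumerated = len(list(combinations(range(n), 2)))
    assert pairwise_checks(n) == enumerated

print(pairwise_checks(2))  # 1
print(pairwise_checks(4))  # 6
print(pairwise_checks(8))  # 28
```

Doubling the number of parts from 4 to 8 more than quadruples the count, which is the sense in which extra pieces get expensive.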

> maybe this would allow for a lesser number of lost opportunities while still coming fairly close to ensuring that things which are both unregistered and unrecognisable (by the agent in question) would not suffer an unfavourable transition.

My current thoughts are that a rich enough latent space should also pick up unknown objects and their shifts, but this would need testing. Also, wouldn't it be more likely that the wrong functions are extrapolated for new objects and we end up missing out on even more opportunities?
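
To make the "avoid doing unknown things to unknown objects" idea from this thread concrete, here is a minimal sketch. It assumes objects have already been classified into discrete labels and that anything unrecognised falls under an `"unknown"` label. The names `whitelist_penalty` and `UNKNOWN`, the example labels, and the unit penalty are all hypothetical; none of this is taken from the paper or the repository, and it ignores the latent-space formulation entirely.

```python
from typing import Iterable, Set, Tuple

# A transition is (label_before, label_after) for a single object.
Transition = Tuple[str, str]

UNKNOWN = "unknown"  # hypothetical label for unrecognised objects


def whitelist_penalty(
    observed: Iterable[Transition],
    whitelist: Set[Transition],
    unit_penalty: float = 1.0,
) -> float:
    """Sum a fixed penalty for every observed transition not on the whitelist.

    Objects whose label does not change are never penalised, so an
    unknown object that stays unknown costs nothing, while turning an
    unknown object into anything else is penalised by default.
    """
    penalty = 0.0
    for before, after in observed:
        if before == after:
            continue  # no change, nothing to penalise
        if (before, after) not in whitelist:
            penalty += unit_penalty
    return penalty


# Hypothetical whitelist: only freezing water is explicitly allowed.
allowed = {("water", "ice")}

observed = [
    ("water", "ice"),         # whitelisted
    ("vase", "broken_vase"),  # not whitelisted
    (UNKNOWN, UNKNOWN),       # unchanged unknown object
    (UNKNOWN, "debris"),      # unknown object altered
]

total = whitelist_penalty(observed, allowed)
assert total == 2.0
```

In this toy version, functional values of the kind discussed above would amount to adding entries for specific unknown-object transitions, which is where the risk of extrapolating the wrong function would enter.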