Epistemic Status

Discussion question.

 



Robust Values Hypothesis

Consider the following hypothesis:

  1. There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
    1. The larger the basin, the more robust values are
    2. Example operationalisations[2] of "privileged subset" that gesture in the right direction:
      1. Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
      2. The "minimal latents" of "benevolent"/"universal" human values
    3. Example operationalisations of "broad basin of attraction" that gesture in the right direction:
      1. A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense of claim 3 below; a toy formalisation is sketched just after this list)
        1. Larger neighbourhood → larger basin
  2. Said subset is a "naturalish" abstraction
    1. The more natural the abstraction, the more robust values are
    2. Example operationalisations of "naturalish abstraction"
      1. The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
        • More privileged → more natural
      2. Most efficient representations of our universe contain a simple embedding of the subset
        • Simpler embeddings → more natural
  3. Points within this basin are suitable targets for optimisation
    1. The stronger the optimisation pressure under which the target remains suitable, the more robust values are.
    2. Example operationalisations of "suitable targets for optimisation":
      1. Optimisation of this target is existentially safe[4]
      2. More strongly, we would be "happy" (were we fully informed) for the system to optimise for these points
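
A toy formalisation of how the three claims fit together, offered strictly in the gestural sense of footnote 2 (the symbols below, a value-space V with privileged subset V*, its ε-neighbourhoods, and a suitability set S(p), are introduced here purely for illustration):

```latex
% Toy formalisation (gestural only; see footnote 2).
% (\mathcal{V}, d): a metric space of values;  V^* \subset \mathcal{V}: the privileged subset ("ideal values").
% B_\varepsilon(V^*) = \{ v \in \mathcal{V} : d(v, V^*) < \varepsilon \}: the \varepsilon-neighbourhood of V^* (the "basin").
% S(p) \subseteq \mathcal{V}: the targets that remain suitable under optimisation pressure p.
\varepsilon^*(p) \;=\; \sup \{\, \varepsilon \ge 0 \;:\; B_\varepsilon(V^*) \subseteq S(p) \,\}
% Claims 1 and 3: values are more robust the larger \varepsilon^*(p) stays as p grows.
% Claim 2: values are more robust the more inductive-bias weight typical efficient learners
% place on V^* (equivalently, the simpler V^*'s embedding in efficient world-models).
```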

 

The above claims specify different dimensions of "robustness". Questions about robustness should be understood as asking about all of them.


Why Does it Matter?

The degree to which values are robust seems to be very relevant from an AI existential safety perspective.

  • The more robust values are, the more likely we are to get alignment by default (and vice versa).
  • The more robust values are, the easier it is to target AI systems at ideal values (and vice versa).
    • Such targeting is one approach to solving the alignment problem[5]
    • If values are insufficiently robust, then value learning may not be viable at all
      • Including approaches like RLHF, CIRL/DIRL, etc.
      • It may not be feasible to train a system to optimise for suitable targets

Questions

A. What's the best/most compelling evidence/arguments in favour of robust values?

B. What's the best/most compelling evidence/arguments against robust values?

C. To what degree do you think values are robust?


I am explicitly soliciting opinions, so do please answer even if you do not believe your opinion to be particularly informed.

  1. ^

    Using the shard theory conception of "value" as "contextual influence on decision making".

  2. ^

    To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisation. The example operationalisations aren't even necessarily correct/accurate/sensible; they are simply meant to gesture in the right direction for what those terms might actually cash out to.

  3. ^

    "Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for.

    "Universal": roughly the subset of human values that we are happy for other humans to optimise for.

  4. ^

    Including "astronomical waste" as an existential catastrophe.

  5. ^

    The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed.

    Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.

3 Answers

Jonathan Stray


I think there might be a broad set of values that emerge around group survival, essentially game-theoretic or evolutionary pressures that lead to cooperation. But I think the details beyond that are likely to be incredibly specific. I'd point to the "preference construction" literature as a more realistic account of how humans make choices, without assuming an underlying consistent preference structure.

Charlie Steiner


My best guess is that if we pretend we knew how to define a space where AIs that are similar under self-modification are close together, there would indeed be basins of attraction around most good points (AIs that do good things with the galaxy). However, I see no particular reason why there should only be one such basin of attraction, at least not without defining your space in an unnatural way. And of course there are going to be plenty of other basins of attraction; you don't ever get alignment by default by just throwing a dart into AI-space.
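
A toy picture of the many-basins point, purely illustrative (the potential, the parameters, and the uniform "dart" distribution below are made up; 1-D gradient descent is not a model of AI training): random starting points settle into whichever local minimum's basin they happen to land in, and no single basin captures more than a modest share of the darts.

```python
import numpy as np

# Illustrative only: gradient descent on a multi-well potential
# V(x) = cos(3x) + 0.05 * x^2, which has several local minima and
# hence several basins of attraction over the interval [-10, 10].

def grad_V(x):
    # d/dx [cos(3x) + 0.05 x^2]
    return -3.0 * np.sin(3.0 * x) + 0.1 * x

def descend(x, steps=2000, lr=0.01):
    # Plain gradient descent; converges to the local minimum of
    # whichever basin x starts in.
    for _ in range(steps):
        x = x - lr * grad_V(x)
    return x

rng = np.random.default_rng(0)
darts = rng.uniform(-10.0, 10.0, size=1000)        # random "darts" into the space
minima = np.round([descend(x) for x in darts], 2)  # basin label = which minimum is reached
unique, counts = np.unique(minima, return_counts=True)
for m, c in zip(unique, counts):
    print(f"basin around x ≈ {m:+.2f}: {c / len(darts):.1%} of darts")
```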

A load-bearing claim of the robust values hypothesis for "alignment by default" is:

  1. Said subset is a "naturalish" abstraction
    1. The more natural the abstraction, the more robust values are
    2. Example operationalisations of "naturalish abstraction"
      1. The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
        • More privileged → more natural
      2. Most efficient representations of our universe contain a simple embedding of the subset
        • Simpler embeddings → more natural

 

The safety comes... (read more)

Charlie Steiner
Sure. Though see Take 4.
DragonGod
Claim #1 (about a "privileged subset") is a claim that there aren't multiple such natural abstractions (e.g. any other subset of human values that satisfies #3 would be a superset of the privileged subset, or a subset of the basin of attraction around the privileged subset). [But I haven't yet fully read that post or your other linked posts.]

Mark Neyer


Hi! I've been an outsider in this community for a while, effectively for arguing exactly this: yes, values are robust. Before I set off all the 'quack' filters, I did manage to persuade Richard Ngo that an AGI wouldn't want to kill humans right away.

I think that for embodied agents, convergent instrumental subgoals may very well lead to alignment.

I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when we make the agent an embodied agent in our own universe.


Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal will become increasingly difficult to compute beyond some time horizon, and the threat of death is going to be present for any agent of any size. If you can't reliably predict something like 'the position of the moon 3,000 years from tomorrow', because numerical error gets worse over time, I don't see how it's possible to compute far more complicated queries about possible futures involving billions of agents.
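
A minimal sketch of that error-growth point, using the logistic map as a stand-in for a chaotic system (chosen for brevity, not realism; the 1e-12 initial discrepancy is arbitrary): no matter how precisely you compute, a tiny error in the initial condition swamps the forecast within a few dozen steps.

```python
# Illustrative only: two trajectories of the chaotic logistic map,
# started 1e-12 apart, diverge until the "model" is uninformative
# about the "true" state.

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

x_true, x_model = 0.4, 0.4 + 1e-12   # "reality" vs. a model with a tiny initial error
for step in range(1, 61):
    x_true, x_model = logistic(x_true), logistic(x_model)
    if step % 10 == 0:
        print(f"step {step:2d}: |error| = {abs(x_true - x_model):.3e}")
```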

Hence I suspect that the best way to maximize long-term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The easiest, simplest way to do this is with a strategy of identifying agents whose goals are roughly compatible with yours, identifying the convergent instrumental subgoals of those agents, and helping those agents on their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.

There is also a longer argument which says, 'instrumental rationality, once you expand the scope, turns into something like religion'.

If your future doesn't have billions of agents, you don't need to predict them.

Mark Neyer
Fine, replace the agents with rocks. The problem still holds. There's no closed-form solution for the 3-body problem; you can only numerically approximate the future, with decreasing accuracy as time goes on. There are far more than 3 bodies in the universe relevant to the long-term survival of an AGI that could die in any number of ways, because it's made of many complex pieces that can all break or fail.

The reason we're so concerned with instrumental convergence is that we're usually thinking of an AGI that can recursively self-improve until it can outmaneuver all of humanity and do whatever it wants. If it's a lot smarter than us, any benefit we could give it is small compared to the risk that we'll try to kill it or create more AGIs that will.

The future is hard to predict; that's why it's safest to eliminate any hard-to-predict parts that might actively try to kill you. If you can. If an AGI isn't that capable, we're not that concerned. But AGI will hav... (read more)