Epistemic Status
Related Posts
See also:
- Complexity of Value
- Value is Fragile
- The Hidden Complexity of Wishes
- But exactly how complex and fragile?
Robust Values Hypothesis
Consider the following hypothesis:
- There exists a "broad basin of attraction" around a privileged subset of human values[1] (henceforth "ideal values")
- The larger the basin the more robust values are
- Example operationalisations[2] of "privileged subset" that gesture in the right direction:
- Minimal set that encompasses most of the informational content of "benevolent"/"universal"[3] human values
- The "minimal latents" of "benevolent"/"universal" human values
- Example operationalisations of "broad basin of attraction" that gesture in the right direction:
- A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in )
- Larger neighbourhood larger basin
- A neighbourhood of the privileged subset with the property that all points in the neighbourhood are suitable targets for optimisation (in the sense used in )
- Said subset is a "naturalish" abstraction
- The more natural the abstraction, the more robust values are
- Example operationalisations of "naturalish abstraction"
- The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
- More privileged more natural
- Most efficient representations of our universe contain a simple embedding of the subset
- Simpler embeddings more natural
- The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
- Points within this basin are suitable targets for optimisation
- The stronger the optimisation pressure applied for which the target is still suitable, the more robust values are.
- Example operationalisations of "suitable targets for optimisation":
- Optimisation of this target is existentially safe[4]
- More strongly, we would be "happy" (where we fully informed) for the system to optimise for these points
The above claims specify different dimensions of "robustness". Questions about robustness should be understood as asking about all of them.
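To gesture at how the "broad basin" and "suitable targets" claims might fit together formally, here is a toy sketch in the same non-canonical spirit as the example operationalisations above; the value space V, the metric d, and the pressure parameter p are illustrative assumptions rather than part of the hypothesis as stated:

```latex
% Toy formalisation sketch -- illustrative only, not canonical.
% V              : space of possible value specifications
% V*             : the privileged subset ("ideal values")
% d              : some metric on V (an assumption; the hypothesis does not fix one)
% Suitable(v, p) : "v is a suitable optimisation target under optimisation pressure p"

% The basin of radius r around the ideal values:
\[
  B_r(V^{\ast}) = \{\, v \in V : d(v, V^{\ast}) < r \,\}
\]

% The hypothesis, parameterised by basin size r and optimisation pressure p:
\[
  \mathrm{Robust}(r, p) \iff \forall v \in B_r(V^{\ast}).\ \mathrm{Suitable}(v, p)
\]

% "Values are robust" then says Robust(r, p) holds for large r (a broad basin)
% and large p (suitability survives strong optimisation pressure).
```

On this reading, the questions below are roughly asking how large r and p can plausibly be.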
Why Does it Matter?
The degree to which values are robust seems to be very relevant from an AI existential safety perspective.
- The more robust values are, the more likely we are to get alignment by default (and vice versa).
- The more robust values are, the easier it is to target AI systems at ideal values (and vice versa).
  - Such targeting is one approach to solving the alignment problem[5]
- If values are insufficiently robust, then value learning may not be viable at all
  - Including approaches like RLHF, CIRL/DIRL, etc.
  - It may not be feasible to train a system to optimise for suitable targets
Questions
A. What's the best/most compelling evidence/arguments in favour of robust values?
B. What's the best/most compelling evidence/arguments against robust values?
C. To what degree do you think values are robust?
I am explicitly soliciting opinions, so do please answer even if you do not believe your opinion to be particularly informed.
- ^
Using the shard theory conception of "value" as "contextual influence on decision making".
- ^
To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisations. The example operationalisations aren't even necessarily correct/accurate/sensible. They are meant to simply gesture in the right direction for what those terms might actually cash out to.
- ^
"Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for.
"Universal": roughly the subset of human values that we are happy for other humans to optimise for.
- ^
Including "astronomical waste" as an existential catastrophe.
- ^
The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed.
Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.
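Of the safeguarding approaches listed above, quantilisation is perhaps the easiest to sketch: instead of taking the argmax of a proxy utility (which invites Goodharting), a quantiliser samples from the top fraction of actions drawn from some base distribution. The snippet below is a rough, hypothetical illustration; the function name, the uniform selection among the top fraction, and the Gaussian stand-in for a base distribution are all assumptions for the example:

```python
import random

def quantilize(candidate_actions, proxy_utility, q=0.1, rng=random):
    """Rough quantiliser sketch: rather than picking the proxy-utility argmax,
    sample uniformly from the top q fraction of candidate actions drawn from
    some base distribution. Real proposals are more careful about the base
    distribution and the choice of q; this is only illustrative."""
    ranked = sorted(candidate_actions, key=proxy_utility, reverse=True)
    top_k = max(1, int(q * len(ranked)))
    return rng.choice(ranked[:top_k])

# Hypothetical usage: candidate actions drawn from a Gaussian base distribution.
actions = [random.gauss(0.0, 1.0) for _ in range(1000)]
chosen = quantilize(actions, proxy_utility=lambda a: a, q=0.05)
print(f"chosen action value: {chosen:.3f}")
```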
Hi! I've been an outsider in this community for a while, effectively for arguing exactly this: yes, values are robust. Before I set off all the 'quack' filters, I did manage to persuade Richard Ngo that an AGI wouldn't want to kill humans right away.
I think that, for embodied agents, convergent instrumental subgoals quite likely lead to alignment.
I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when we make the agent an embodied agent in our own universe.
Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal will become increasingly difficult to compute beyond some time horizon, and the threat of death is going to be present for any agent of any size. If you can't reliably predict something like 'the position of the moon 3,000 years from tomorrow' because numerical error compounds over time, I don't see how it's possible to compute far more complicated queries about possible futures involving billions of agents.
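As a minimal numerical illustration of this kind of error growth (the logistic map here is just a convenient stand-in for a chaotic system): two trajectories that start 1e-12 apart become macroscopically different within a few dozen steps.

```python
# Sensitive dependence on initial conditions in the chaotic logistic map (r = 4).
# Two trajectories starting 1e-12 apart diverge to order-1 separation within
# roughly 40 steps, so tiny state uncertainty swamps long-horizon prediction.
def logistic(x, r=4.0):
    return r * x * (1.0 - x)

x, y = 0.4, 0.4 + 1e-12  # nearly identical initial conditions
for step in range(1, 61):
    x, y = logistic(x), logistic(y)
    if step % 10 == 0:
        print(f"step {step:2d}: |x - y| = {abs(x - y):.3e}")
```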
Hence I suspect that the best way to maximize long term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The easiest, simplest way to do this is with a strategy of identifying agents whose goals are roughly compatible with yours, identifying the convergent instrumental subgoals of those agents, and helping those agents on their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.
There is also a longer argument which says, 'instrumental rationality, once you expand the scope, turns into something like religion'.
The reason we're so concerned with instrumental convergence is that we're usually thinking of an AGI that can recursively self-improve until it can outmaneuver all of humanity and do whatever it wants. If it's a lot smarter than us, any benefit we could give it is small compared to the risk that we'll try to kill it or create more AGIs that will.
The future is hard to predict; that's why it's safest to eliminate any hard-to-predict parts that might actively try to kill you. If you can. If an AGI isn't that capable, we're not that concerned. But AGI will hav...