I think there might be a broad set of values that emerge around group survival, essentially game-theoretic or evolutionary pressures that lead to cooperation. But I think the details beyond that are likely to be incredibly specific. I'd point to the "preference construction" literature as a more realistic account of how humans make choices, without assuming an underlying consistent preference structure.
My best guess is that if we pretend we knew how to define a space where AIs that are similar under self-modification are close together, there would indeed be basins of attraction around most good points (AIs that do good things with the galaxy). However, I see no particular reason why there should only be one such basin of attraction, at least not without defining your space in an unnatural way. And of course there are going to be plenty of other basins of attraction; you don't ever get alignment by default just by throwing a dart into AI-space.
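To make the multiple-basins picture concrete, here is a toy sketch (the update map and all numbers are invented purely for illustration): a one-dimensional "value parameter" is repeatedly updated by a map standing in for self-modification, and different starting points settle into different stable fixed points, i.e. different basins of attraction.

```python
import math

def self_modification_step(v):
    """Toy update rule standing in for one round of self-modification:
    a small gradient step on a potential with several local maxima,
    so the dynamics have several distinct basins of attraction."""
    return v + 0.1 * math.cos(3 * v)

def attractor(v0, steps=500):
    """Iterate the update and report where the trajectory settles."""
    v = v0
    for _ in range(steps):
        v = self_modification_step(v)
    return v

# Different starting "values" settle at different stable points
# (approximately -1.571, 0.524, and 2.618), i.e. different basins.
for v0 in (-1.5, -0.6, 0.5, 1.7, 2.5):
    print(f"start at {v0:+.1f} -> settles near {attractor(v0):+.3f}")
```

Nothing in the sketch says one basin is "the good one"; the point is only that an iterated improvement process can have many stable endpoints, not a single privileged attractor.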
A load-bearing claim of the robust values hypothesis for "alignment by default" is:
The safety comes...
Hi! I've been an outsider in this community for a while, effectively for arguing exactly this: yes, values are robust. Before I set off all the 'quack' filters, I did manage to persuade Richard Ngo that an AGI wouldn't want to kill humans right away.
I think that for embodied agents, convergent instrumental subgoals are quite likely to lead to alignment.
I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when the agent is embodied in our own universe.
Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal will become increasingly difficult to compute beyond some time horizon, and the threat of death is going to be present for any agent of any size. If you can't reliably predict something like 'the position of the moon 3,000 years from tomorrow', because numerical error compounds over time, I don't see how it's possible to compute far more complicated queries about possible futures involving billions of agents.
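For a concrete (if stylized) illustration of how fast such prediction errors blow up, here is a minimal sketch using the logistic map, a standard toy chaotic system; the map and the 1e-10 starting error are illustrative choices only, not a model of any real planning problem.

```python
# Sensitive dependence on initial conditions: two trajectories of the
# logistic map x -> r*x*(1-x) in its chaotic regime (r = 4.0), started
# a mere 1e-10 apart, diverge to order-one differences within ~40 steps.

def logistic(x, r=4.0):
    return r * x * (1.0 - x)

x = 0.2            # the "true" state
y = 0.2 + 1e-10    # the state as measured, with a tiny error
for step in range(1, 61):
    x, y = logistic(x), logistic(y)
    if step % 10 == 0:
        print(f"step {step:2d}: |prediction error| = {abs(x - y):.3e}")
```

The error roughly doubles each step until it saturates at the size of the system itself, which is the qualitative reason long-horizon forecasts of chaotic systems wash out.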
Hence I suspect that the best way to maximize long-term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The easiest, simplest way to do this is with a strategy of identifying agents whose goals are roughly compatible with yours, identifying the convergent instrumental subgoals of those agents, and helping those agents on their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.
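The first step of that strategy ("identify agents whose goals are roughly compatible with yours") can be cartooned as a similarity filter over goal representations. The vectors, names, and threshold below are entirely made up for illustration; this is not a claim about how real agents represent goals.

```python
import math

def cosine(u, v):
    """Cosine similarity between two goal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def roughly_compatible(my_goal, agents, threshold=0.5):
    """Toy version of step one of the strategy above: keep the agents
    whose goal vectors point in roughly the same direction as ours."""
    return [name for name, goal in agents.items()
            if cosine(my_goal, goal) >= threshold]

# Hypothetical goal vectors in an abstract "goal space".
my_goal = (1.0, 0.2, 0.0)
agents = {
    "farmer":   (0.9, 0.1, 0.1),    # roughly compatible
    "rival":    (-1.0, 0.0, 0.3),   # opposed
    "engineer": (0.7, 0.5, 0.0),    # roughly compatible
}
print(roughly_compatible(my_goal, agents))  # ['farmer', 'engineer']
```

The later steps, finding those agents' convergent instrumental subgoals and actually helping with them, are the substantive part of the proposal and don't reduce to a few lines; the sketch only makes the "roughly compatible" filter concrete.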
There is also a longer argument which says, 'instrumental rationality, once you expand the scope, turns into something like religion'.
The reason we're so concerned with instrumental convergence is that we're usually thinking of an AGI that can recursively self-improve until it can outmaneuver all of humanity and do whatever it wants. If it's a lot smarter than us, any benefit we could give it is small compared to the risk that we'll try to kill it or create more AGIs that will.
The future is hard to predict; that's why it's safest to eliminate any hard-to-predict parts that might actively try to kill you. If you can. If an AGI isn't that capable, we're not that concerned. But AGI will hav...
Epistemic Status
Discussion question.
Related Posts
See also:
Robust Values Hypothesis
Consider the following hypothesis:
The above claims specify different dimensions of "robustness". Questions about robustness should be understood as asking about all of them.
Why Does it Matter?
The degree to which values are robust seems to be very relevant from an AI existential safety perspective.
Questions
A. What's the best/most compelling evidence/arguments in favour of robust values?
B. What's the best/most compelling evidence/arguments against robust values?
C. To what degree do you think values are robust?
I am explicitly soliciting opinions, so do please answer even if you do not believe your opinion to be particularly informed.
Using the shard theory conception of "value" as "contextual influence on decision making".
To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisations. The example operationalisations aren't even necessarily correct/accurate/sensible. They are meant to simply gesture in the right direction for what those terms might actually cash out to.
"Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for.
"Universal": roughly the subset of human values that we are happy for other humans to optimise for.
Including "astronomical waste" as an existential catastrophe.
The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed.
Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.