Followup to: Morality is Scary, AI design as opportunity and obligation to address human safety problems
In Corrigibility, Paul Christiano argued that in contrast with ambitious value learning, an act-based corrigible agent is safer because there is a broad basin of attraction around corrigibility:
In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).
But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.
But it occurs to me that the overseer, or the system composed of the overseer and the corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences (assuming a metaethics in which this makes sense, i.e., one where it is possible to be wrong about one's own values). Some possible examples of a human overseer's distorted preferences, in case it's not clear what I have in mind:
- Wrong object-level preferences, such as overweighting values from a contemporary religion or ideology and underweighting other plausible or likely moral concerns.
- Wrong meta-level preferences (preferences that directly or indirectly influence one's future preferences), such as lack of interest in finding or listening to arguments against one's current moral beliefs, willingness to use "cancel culture" and other coercive persuasion methods against people with different moral beliefs, awarding social status for moral certainty instead of uncertainty, and the revealed preference of many powerful people for advisors who reinforce their existing beliefs rather than critical or neutral advisors.
- Ignorance / innocent mistakes / insufficiently cautious meta-level preferences in the face of dangerous new situations. For example: what kinds of experiences (especially exotic experiences enabled by powerful AI) are safe or benign to have, what kinds of self-modifications to make, what kinds of people/AI to surround oneself with, and how to deal with messages that are potentially AI-optimized for persuasion.
In order to conclude that a corrigible AI is safe, one seemingly has to argue or assume that there is a broad basin of attraction around the overseer's true/actual values (in addition to the one around corrigibility), which allows the human-AI system to converge to correct values despite starting with distorted ones. But if there actually were a broad basin of attraction around human values, then "we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on" could apply to other alignment approaches besides corrigibility / intent alignment, such as ambitious value learning, thus undermining Paul's argument in "Corrigibility". One immediate upshot seems to be that I, and others who were persuaded by that argument, should perhaps pay a bit more attention to other approaches.
I'll leave you with two further lines of thought:
- Is there actually a broad basin of attraction around human values? How do we know or how can we find out?
- How sure do AI builders need to be about this, before they can be said to have done the right thing, or have adequately discharged their moral obligations (or whatever the right way to think about this might be)?
When training a neural network, is there a broad basin of attraction around cat classifiers? Yes. There is a gigantic number of functions that perfectly match the observed data and yet are discarded by the simplicity (and other) biases in our training algorithm in favor of well-behaved cat classifiers. Around any low-Kolmogorov-complexity object there is an immense neighborhood of high-complexity ones.
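To make that concrete, here is a minimal numpy sketch (my own toy illustration, not from the original discussion): a small regression problem stands in for the cat classifier, `f_simple` stands in for the kind of well-behaved hypothesis a simplicity-biased training procedure would select, and `f_wild` is one of the astronomically many alternative functions that fit the training data exactly as well but misbehave everywhere else.

```python
import numpy as np

# Toy training set: 8 points with binary labels.
x_train = np.linspace(-1.0, 1.0, 8)
y_train = np.sign(x_train)

# "Simple" hypothesis: the unique degree-7 polynomial through the points.
coeffs = np.polyfit(x_train, y_train, deg=len(x_train) - 1)

def f_simple(x):
    return np.polyval(coeffs, np.atleast_1d(x))

def f_wild(x):
    # Add a term that is exactly zero at every training point but huge in
    # between, so this function fits the data just as perfectly.
    bump = 1e6 * np.prod(np.subtract.outer(np.atleast_1d(x), x_train), axis=1)
    return f_simple(x) + bump

x_test = np.linspace(-1.0, 1.0, 201)
print("max train error, simple:", np.abs(f_simple(x_train) - y_train).max())
print("max train error, wild:  ", np.abs(f_wild(x_train) - y_train).max())
print("max |f| between points, simple:", np.abs(f_simple(x_test)).max())
print("max |f| between points, wild:  ", np.abs(f_wild(x_test)).max())
```

Both functions agree on the training data; only a bias toward simpler hypotheses (here, the fixed low-dimensional polynomial family; in a neural network, the implicit biases of the architecture and training algorithm) rules out the wild one.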
The only way I can see this making sense is if you again have a simplicity bias for values; otherwise you are claiming that there is some value function that is more complex than the agents' current value function and yet is privileged over the current one - but then, to arrive at this function, you would have to conjure information out of nowhere. If you took the information from other places, like averaging the values of many agents, then what you actually want is to align with the values of those many agents, or whatever else you used.
In fact, it seems to be the case with your examples that you are favoring simplicity: if the agents were smarter, they would realize their values were misbehaving. But that *is* looking for simpler values. If, through reasoning, you discovered that some parts of your values contradict others, you have just arrived at a simpler value function, since the contradicting parts needed extra specification, i.e. were noise, and you weren't smart enough to see that.