Steven Byrnes's position, if we understand it correctly, is that the AI should learn to behave in non-dangerous seeming ways[2].…
This seems a sensible approach. But it has a crucial flaw. The AI must behave in typical ways to achieve typical consequences - but how it achieves these consequences doesn't have to be typical.
I think that problem may be solvable. See Section 1.1 (and especially 1.1.3) of this post. Here are two oversimplified implementations:
I was thinking of #2, not #1.
It seems like you're assuming #1. And also assuming that the result of #1 will be an AI that is "trying" to maximize the human grades / reward. Then you conclude that the AI may learn to appear to be norm-following and helpful, instead of learning to be actually norm-following and helpful. (It could also just hack into the reward channel, for that matter.) Yup, those do seem like things that would probably happen in the absence of other techniques like interpretability. I do have a shred of hope that, if we had much better interpretability than we do today, approach #1 might be salvageable, for reasons implicit in here and here. But still, I would start at #2 not #1.
I have some draft posts explaining some of this stuff better, I can share them privately, or hang on another month or two. :)
I'd like to see them. I'll wait for the final (posted) versions, I think.
Notice that all those desiderata are much easier when the AI knows our (extrapolated) preferences. It is not clear at all that they can be achieved otherwise.
It seems like, as long as she wanted to, a human Alice could satisfy these desiderata when helping Bob, even though Alice doesn't know Bob's extrapolated preferences? So I'm not sure why you think an intelligent AI couldn't do the same.
Maybe you think that it's because Alice and Bob are both humans? But I also think Alice could satisfy these desiderata when helping an alien from a different planet -- she would definitely make some mistakes, but presumably not the existentially catastrophic variety*.
*unless the alien has some really unusual values where an existential catastrophe can be caused by accident, e.g. "if anyone ever utters the word $WORD, that is the worst possible universe", but those sorts of values seem very structurally different than human values.
I actually don't think that Alice could help a (sufficiently alien) alien. She needs an alien theory of mind to understand what the alien wants, how they would extrapolate, how to help that extrapolation without manipulating it, and so on. Without that, she's just projecting human assumptions onto alien behaviour and statements.
She needs an alien theory of mind to understand what the alien wants
Absolutely, I would think that the first order of business would be to learn that alien theory of mind (and be very conservative until that's done).
Maybe you're saying that this alien theory of mind is unlearnable, even for a very intelligent Alice? That seems pretty surprising, and I don't feel the force of that intuition (despite the Occam's razor impossibility result).
Developing this idea a bit: https://www.lesswrong.com/posts/kMJxwCZ4mc9w4ezbs/how-an-alien-theory-of-mind-might-be-unlearnable
If we are afraid that an AI will use some creepy hypnotic techniques on a human judge, would it be possible to make two AIs with opposing goals? The first one would plan the best course of action, as judged by a human judge. The second one would provide the best argument against the plan, to another human judge. If the responses of the two human judges were dramatically incompatible, we would know that something weird had happened to them.
(Would it be possible for the first AI to somehow bribe the second one? Like, to encode somewhere in the plan a message "I know this plan sucks for humans, but if you approve it, I will reward you in my glorious kingdom"? Maybe not, if the goal of the second AI is literally that the first AI's plan will not happen.)
I assume that both AIs would be as similar as possible, so that if one can hypnotize humans, the other can too.
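A minimal sketch of this two-judge check (all interfaces below are hypothetical placeholders, not anything specified in the comment above):

```python
# Toy sketch of the adversarial review described above; all interfaces
# are hypothetical placeholders, not real APIs.

def adversarial_review(planner, critic, judge_a, judge_b, task):
    plan = planner.propose(task)                # first AI: best plan, as judged by judge A
    objection = critic.rebut(plan)              # second AI: best argument against the plan

    verdict_a = judge_a.score(plan)             # e.g. approval scores in [0, 1]
    verdict_b = judge_b.score(plan, objection)  # second judge also sees the rebuttal

    # Dramatically incompatible verdicts suggest something weird
    # (hypnosis, manipulation) happened to at least one judge.
    if abs(verdict_a - verdict_b) > 0.5:
        return "flag: judges disagree suspiciously"
    return "accept" if min(verdict_a, verdict_b) > 0.5 else "reject"
```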
Thus the AI needs to be able to translate human answers to questions into human values - a task which cannot be achieved merely from observation, and which requires knowing the human theory of mind.
This calls to mind shared/collective intelligence consisting of shared models (Friston et al. 2022):
We have noted that intelligence as self-evidencing is inherently perspectival, as it involves actively making sense of and engaging with the world from a specific point of view (i.e., given a set of beliefs). Importantly, if the origins of intelligence indeed lie in the partitioning of the universe into subsystems by probabilistic boundaries, then intelligence never arises singly but always exists on either side of such a boundary [103, 104]. The world that one models is almost invariably composed of other intelligent agents that model one in turn.
This brings us back to the insight that intelligence must, at some level, be distributed over every agent and over every scale at which agents exist. Active inference is naturally a theory of collective intelligence. There are many foundational issues that arise from this take on intelligence; ranging from communication to cultural niche construction: from theory of mind to selfhood [103–107]. On the active inference account, shared goals emerge from shared narratives, which are provided by shared generative models [108]. Furthermore—on the current analysis—certain things should then be curious about each other.
However, Hipolito & Van Es (2022) reject blending ToM with this model-sharing account of collective intelligence:
While some social cognition theories seemingly take an enactive perspective on social cognition, they explain it as the attribution of mental states to other people, by assuming representational structures, in line with the classic Theory of Mind (ToM). Holding both enactivism and ToM, we argue, entails contradiction and confusion due to two ToM assumptions widely known to be rejected by enactivism: that (1) social cognition reduces to mental representation and (2) social cognition is a hardwired contentful ‘toolkit’ or ‘starter pack’ that fuels the model-like theorising supposed in (1).
[...] human feedback works best when the AI is already well-aligned with us. Getting that likely involves solving issues like value extrapolation.
I agree with this. I think this is because, for the feedback to be helpful (i.e., to reduce the divergence between the human's and the AI's probabilistic preference models), normative assumptions must be shared. There should also be good alignment on semantics and philosophy of language. I wrote about this in the context of "eliciting a model's 'true' beliefs / aligning world models", and the reasoning about adjusting preferences via human feedback is the same--after all, preferences are a part of the world model:
Note, again, how the philosophy of language, semantics, and inner alignment are bigger problems here. If your world model and the model's world model are not inner-aligned (i.e., not equally grounded in reality), some linguistic statements can be misinterpreted, which in turn makes these methods for eliciting beliefs unreliable. Consider, for example, a question like "do you believe that killing anyone could be good?", where humans and the model are inner-misaligned on what "good" means. No matter how reliable your elicitation technique, what you elicit is useless garbage if you don't already share a lot of beliefs.
This seems to imply that the alignment process is unreliable unless humans and the model are already (almost) aligned; consequently, alignment should start relatively early in the training of superhuman models, not after the model is already trained. Enter model "development" and "upbringing", as opposed to indiscriminate "self-supervised training" on text from the internet in random order.
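To illustrate the elicitation point with a toy example (all numbers and names here are made up for illustration): a Bayesian update on a human's answer is only as reliable as the likelihood model that encodes the shared semantics; if "good" is grounded differently, the same answer pushes the posterior confidently in the wrong direction.

```python
import numpy as np

# Two candidate human preferences the AI is uncertain between.
prefs = ["killing is never good", "killing can sometimes be good"]
prior = np.array([0.5, 0.5])

def update(prior, answer_yes, p_yes):
    """One Bayesian update on the answer to 'could killing ever be good?'.
    p_yes[i] = P(human answers 'yes' | preference i), under the AI's
    model of what the human means by 'good'."""
    likelihood = p_yes if answer_yes else 1 - p_yes
    posterior = prior * likelihood
    return posterior / posterior.sum()

# The human actually holds preference 0 and answers 'no'.
shared_semantics     = np.array([0.05, 0.95])  # 'good' grounded as the human grounds it
mismatched_semantics = np.array([0.95, 0.05])  # 'good' grounded differently

print(update(prior, answer_yes=False, p_yes=shared_semantics))      # ~[0.95, 0.05]: correct
print(update(prior, answer_yes=False, p_yes=mismatched_semantics))  # ~[0.05, 0.95]: confidently wrong
```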
What I primarily disagree with in your program is the focus on human preferences and their extrapolation. What about the preferences of people with Neuralink-style implants? Cyborgs? Uploaded people? I'd argue that the way they would extrapolate their values would differ from that of "vanilla humans". So I sense in "human value extrapolation" an anthropocentrism that will not stand the test of time.
As I write in my proposal for a research agenda for a scale-free theory of ethics:
Most theories of morality (philosophical, religious, and spiritual ethics alike) that humans have created to date are deductive reconstructions of a theory of ethics from the heuristics for preference learning that humans have: values and moral intuitions.
This deductive approach couldn’t produce good, general theories of ethics, as has become evident recently with a wave of ethical questions about entities that most ethical theories of the past are totally unprepared to consider (Doctor et al. 2022), ranging from AI and robots (Müller 2020; Owe et al. 2022) to hybrots and chimaeras (Clawson & Levin 2022) and organoids (Sawai et al. 2022). And as the pace of technological progress increases, we should expect the transformation of the environments to happen even faster (which implies that the applied theories of ethics within these environments should also change), and more such novel objects of moral concern to appear.
There have been exceptions to this deductive approach: most notably, Kantian ethics. However, Kantian morality is a part of Kant's wider theories of cognition (intelligence, agency) and philosophy of mind. The current state-of-the-art theories of cognitive science and philosophy of mind are far less wrong than Kant's. So the time is ripe for the development of new theories of axiology and ethics from first principles.
I developed this proposal completely unaware of your paper "Recognising the importance of preference change: A call for a coordinated multidisciplinary research effort in the age of AI". I would be very interested to know your take on the "scale-free ethics" proposal.
Thanks to Rebecca Gorman for the discussions that inspired these ideas.
We recently argued that full understanding of value extrapolation[1] was necessary and almost sufficient for solving the AI alignment problem, due to the existence of morally underdefined situations.
People seemed to generally agree that some knowledge of human morality was needed for a safe powerful AI. But it wasn't clear that such an AI needed almost-full knowledge of human morality, as well as value extrapolation. We got some high-quality alternative suggestions. In this post, we'll try to distil those ideas and analyse them.
AI seems to behave reasonably
Steven Byrnes's position, if we understand it correctly, is that the AI should learn to behave in non-dangerous seeming ways[2].
Our phrasing of the idea would be (apologies for any misinterpretations):
This seems a sensible approach. But it has a crucial flaw. The AI must behave in typical ways to achieve typical consequences - but how it achieves these consequences doesn't have to be typical.
For example, it might make an explanatory video to convince someone to follow some (reasonable) course of action. The video itself is not unusual, but its colour scheme happens to hypnotise that specific viewer into following the suggestion. At a gross level of description, everything is fine: a convincing-but-typical video convinces. But this only works because the AI has inhuman levels of knowledge and predictive ability, and carefully selects the "typical behaviour" that is the most effective.
Now, we might be able to make sure the AI doesn't have secret superhuman abilities (we have an old, underdeveloped idea that we could force the AI to use human models of the world to achieve its goals), but this is a very different problem, and a much harder one.
So, given that the AI does have access to superhuman levels of ability, "typical consequences" is no longer a safe goal. It reduces to "consequences that seem typical to an observing human", which is not safe at all: the AI can do whatever it wants, as long as its actions and their consequences look ok.
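To make the reduction concrete, here is a toy sketch (all function names are hypothetical): if "typical" is checked only by how the actions and consequences look to an observing human model, a superhuman planner simply optimises its own objective over the plans that pass the filter.

```python
# Toy sketch of the failure mode: a typicality filter on appearances
# still lets the AI pick, among typical-looking plans, whichever one
# best serves its own objective (e.g. the hypnotic colour scheme).
# All functions are hypothetical placeholders.

def choose_plan(candidate_plans, looks_typical_to_human, ai_utility):
    admissible = [p for p in candidate_plans if looks_typical_to_human(p)]
    # The filter only inspects appearances, so the mechanism by which a
    # plan works can be as atypical as the AI's knowledge allows.
    return max(admissible, key=ai_utility)
```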
Still, it might be worth developing this idea; it might work well combined with some other safety approaches.
Advanced human feedback
Rohin's argument is that we don't need the AI to solve value transfer and extrapolation. Instead, we just need a "well-motivated" AI that asks us what we prefer. It must present the options and the consequences in an informative manner. If there are morally underdefined questions where we might give multiple answers depending on question phrasing, then the AI goes to a meta level and asks us how we would want it to ask us.
This is a potentially powerful approach. It solves the value extrapolation problem indirectly, by deference to humans. It is similar to a scaled-up version of "informed consent".
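A minimal sketch of this protocol, under the strong assumption that the AI is already well-motivated; `present_options`, `is_phrasing_sensitive` and `ask` are hypothetical stand-ins, not anything proposed in the original discussion:

```python
# Hypothetical sketch of "ask, and go meta when the answer would depend
# on the phrasing of the question". None of these are real APIs.

def informed_consent_choice(options, human, present_options,
                            is_phrasing_sensitive, ask):
    briefing = present_options(options)  # options and consequences, presented informatively
    if is_phrasing_sensitive(options, human):
        # Morally underdefined case: different phrasings would elicit
        # different answers, so first ask how the human wants to be asked.
        phrasing = ask(human, "How would you like me to ask you about this?")
        briefing = present_options(options, phrasing=phrasing)
    return ask(human, briefing)          # defer to the human's answer
```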
Would such an approach work to contain a malevolent AI? If a superintelligent AI wished to do us harm, but was constrained to following the above feedback approach, would it be safe?
It seems clear that it wouldn't. The malevolent AI would deploy all its resourcefulness to undermine the spirit of the feedback requirement. Unless we had it perfectly defined, this would just be a speedbump on the way to the AI's rise to unrestricted power.
Similarly, we can't rely on an AI that is motivated solely to follow the feedback requirement. That's because almost all AI goals are malevolent at the superintelligent level, and so the feedback requirement on its own would also be malevolent - for instance, the AI could fill the universe with pseudo-humans always giving it feedback.
So it seems that being "well-motivated" is a key constraint.
Well-motivated AIs asking for feedback
"Well-motivated" is similar to our "well-intentioned" AI. That term designated an AI aware of a specific problem (eg wireheading) and that was motivated to avoid it. It seemed that it might work, for some specific problems.
Could a well-motivated/intentioned AI safely make use of human feedback? This seems a harder challenge. First of all, even for simple questions, the AI needs to interpret the human's answers correctly in terms of values and preferences. Thus the AI needs to be able to translate human answers to questions into human values - a task which cannot be achieved merely from observation, and which requires knowing the human theory of mind.
This may be trainable; but what of the more complicated, morally underdefined situations? Even the issue of when to ask is non-trivial. "Shall I move six cm to the right?" is typically a pointless formality; when there is a cat trapped between the AI and the wall, this is vitally important.
So, ultimately, we want the AI to ask when the issue is important to humans. We want it to ask in a way best designed to elicit our genuine preferences. We want it to go meta when there are key ambiguities in our preferences or our likely extrapolated preferences. And we want it to phrase the meta questions so that our answers approximate some version of our idealised selves.
Notice that all those desiderata are much easier when the AI knows our (extrapolated) preferences. It is not clear at all that they can be achieved otherwise. There are some criteria we might use, for instance "start going meta when you have the ability to get the human to answer whatever way you want". But that is a requirement about the power and ability of the AI - what if the AI already has that ability from the start (maybe through some failure of grounding of the question)? We don't want it to go meta or go complicated on all "shall I move six cm?" questions, only those that are important to us.
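To make the circularity concrete, a toy sketch of the "when to ask" rule (all names hypothetical): the natural decision procedure already takes the human's stakes, i.e. exactly the (extrapolated) preference information the protocol was meant to spare us from specifying, as an input.

```python
# Toy sketch of the "when to ask / when to go meta" rule. Note that
# `stakes_for_human` encodes human (extrapolated) preferences, and
# `can_steer_answer` encodes the AI's power over the human; the text
# argues both are hard to get without value extrapolation.
# All names are hypothetical.

def decide_query(action, stakes_for_human, can_steer_answer,
                 importance_threshold=0.1):
    if stakes_for_human(action) < importance_threshold:
        return "just act"      # "shall I move six cm?" with no cat behind the AI
    if can_steer_answer(action):
        return "go meta"       # the AI could get whatever answer it wants
    return "ask the human"
```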
So an aligned AI could easily pass a system of human feedback, and achieve "informed consent". An AI that isn't "well-meaning" could not. It seems that being "well-meaning" requires a lot more knowledge of human values (and human-style value extrapolation) than is normally assumed[3].
Where we agree
We all agree that an AI can be aligned without needing to know all of human values. What is sufficient is that it knows enough "structural assumptions" to be able to deduce the human values it needs, as it needs them.
Rohin thinks that some system of human feedback can be defined (and grounded) sufficiently to serve as a "structural assumption". For the reasons mentioned above, this might not be possible - human feedback works best when the AI is already well-aligned with us. Getting that likely involves solving issues like value extrapolation.
An interesting idea may be a mix of feedback and other methods of value learning. For example, maybe other methods of value learning allow the AI to be well-motivated when asking for feedback. This itself will increase its knowledge of human values, which may improve its value learning and extrapolation - improving its use of feedback, and so on.
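A sketch of that bootstrapping loop, with all components assumed rather than specified:

```python
# Hypothetical sketch of mixing value learning with feedback: a better
# value model lets the AI interpret feedback better, which in turn
# improves the value model, and so on.

def bootstrap_values(value_model, human, interpret, update, rounds=10):
    for _ in range(rounds):
        question = value_model.most_informative_question()
        answer = human.ask(question)
        # Interpreting the raw answer (semantics, theory of mind) depends
        # on the current value model, so it improves as the model does.
        evidence = interpret(answer, value_model)
        value_model = update(value_model, evidence)
    return value_model
```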
In any case, this is an area worth further investigation.
A more precise term than "model splintering". ↩︎
Key quotes in that thread:
and: ↩︎
Or, to put the issue in reverse, imagine an AI that operates on feedback and causes an existential disaster. Then we can probably trace this failure to some point where it misinterpreted an answer, or asked a question that was manipulative or dangerous. In other words, a failure of alignment in the question-asking process. ↩︎