The following are some non-technical ideas about AI alignment based on human beliefs, rather than our true reward function. My impression is that the role of beliefs is often implied in passing, but I haven't found any elaborations on the topic. I'd be grateful for relevant references, if someone knows any.
Do as I say, not as I do
I believe we should not aim to make AIs imitate our behaviour or even just our preferences in every regard. An AI with the same reward function as an average human would want what we want, e.g. to be in control of its own life, to make lots of money, to eat ice cream, etc. At best, such human goals would make it inefficient; at worst, dangerous; but mostly, they would make no sense. Why should a robot – or even worse, a disembodied AI – want to eat ice cream? Yet most humans like ice cream and delight in eating it sometimes, and there is no logical reason why an AI could not want the same, even if it made no practical sense. An AI with a copy of a human reward function would also be undesirable because that function is adapted to human biases, selfishness, laziness, and the general limitations of humans, which are not easily overcome (by humans) even with ample knowledge and time to think.
Still, we want the behaviour of an AI to match what we want, and simply spelling it out is unfeasible, since we are unable to take into account every potentially relevant variable in every possible situation. I believe a solution is to use human beliefs. These include everything that we think we know about our own values and preferences and everything that we want an AI to do. They are also something that we can communicate, and an Artificial General Intelligence would by definition be capable of understanding us. Communicating all our beliefs would still be unfeasible, but an AGI should also be capable of reasoning about our beliefs, filling in the gaps, and, in case of doubt, asking us to clarify. Once it has a clear understanding of our moral beliefs, and provided it is intelligent enough, it should be able to surmise our theoretical viewpoint about the moral aspects of any situation as well as we could ourselves.
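To make the "fill in the gaps, ask in case of doubt" step a little more concrete, here is a minimal sketch, entirely my own and purely illustrative: the AI keeps a confidence estimate for its model of a given human belief and only queries the human when that confidence is too low. The names `belief_model`, `ask_human`, and the threshold value are hypothetical placeholders, not anything specified above.

```python
# Minimal sketch (hypothetical, not a spec): infer a belief when confident,
# otherwise ask the human to clarify.

def resolve_belief(topic, belief_model, ask_human, confidence_threshold=0.9):
    """belief_model(topic) -> (inferred_belief, confidence in that inference)."""
    inferred_belief, confidence = belief_model(topic)
    if confidence >= confidence_threshold:
        return inferred_belief          # fill in the gap by inference
    return ask_human(topic)             # in case of doubt, ask for clarification
```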
Modelling the beliefs of a human might seem more far-fetched than learning their true reward function, as the latter can in theory be deduced from observing an agent's behaviour, whereas the former presumes to intrude into the black box that is a human mind. But I would argue that the beliefs held by an intelligent agent capable of learning are actually easier to model, because the behaviour of that same agent implicitly depends upon those beliefs. If the agent's decisions were always the same, no matter what it believed about the current situation and about the reward that its decision might bring, then its intelligence would be useless. Consequently, it is impossible to adequately infer the reward function without a "theory of mind" that takes those beliefs into account.
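To illustrate why, here is a toy example (my own construction, not an established result): an agent that picks the action with the highest expected reward under its beliefs. Two quite different reward functions can produce exactly the same observed behaviour, so the reward cannot be identified from behaviour alone without a model of what the agent believes. All names and the toy scenario are hypothetical.

```python
# Toy illustration: observed behaviour underdetermines the reward function
# unless the observer also models the agent's beliefs.

def choose_action(belief, reward, actions):
    """Pick the action with the highest expected reward, where `belief` maps
    possible states of the world to the agent's subjective probabilities."""
    def expected_reward(action):
        return sum(p * reward(state, action) for state, p in belief.items())
    return max(actions, key=expected_reward)

actions = ["carry_umbrella", "leave_umbrella"]

# Reward A: the agent only cares about not getting wet.
reward_a = lambda state, action: 1.0 if (state == "rain") == (action == "carry_umbrella") else 0.0
# Reward B: the agent likes carrying an umbrella regardless of the weather.
reward_b = lambda state, action: 1.0 if action == "carry_umbrella" else 0.0

belief_rain_likely = {"rain": 0.9, "sun": 0.1}

# Both reward functions lead to the same observable choice under this belief,
# so the choice alone cannot tell us which reward the agent actually has.
print(choose_action(belief_rain_likely, reward_a, actions))  # carry_umbrella
print(choose_action(belief_rain_likely, reward_b, actions))  # carry_umbrella
```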
Well I never
Our consciously accessible beliefs about our own values may not, and often obviously do not, match our actions. This does not mean that our values are wrong; they are only idealised compared to reality, as they should be (if there were no conflicting interests that could make us act against a specific value, that value would have no normative utility; it would just be a fact). We also tend to have an idealised image of ourselves, believing that we (will) adhere to these values more steadfastly than we actually do. This inconsistency is of course an issue that needs to be dealt with for an AI to make decisions based upon our beliefs, but we certainly do not want the AI to mimic our actual behaviour where it diverges from what we actually think is right. If anything, we want the AI to be even more ideal than how we imagine ourselves.
In any case, our beliefs would be the benchmark by which we determine whether the AI is well-aligned or not. After all, these beliefs encompass everything that we know about our values. Wherever the reward function that actually governs our behaviour deviates from them, the result is something that we regret, ignore, or try to justify as if it didn't deviate at all. Naturally, we do not want the AI to do something in the first place that it needs to make excuses for afterwards.
And lead us not into temptation
If an AI were to observe and understand our behaviour better than we understand it ourselves, and if it tried to act in accordance with our "true" reward function, then those actions might go against our consciously accessible values. This in and of itself would not necessarily be bad. It is conceivable that there is a fundamental flaw in our values that we haven't realised. If the AI's actions are successful, we might even find the results entirely agreeable. At the same time, they have a chance of leading us down the proverbial slippery slope to something that is very different from what we originally wanted, because we lacked the foresight and/or moral steadfastness to stop at the right time, while the AI only faithfully fulfilled our actual desires. An AI like that would be extremely friendly, yet dangerous, simply by exacerbating problematic tendencies that we normally just barely keep under control by playing the role of someone who is not controlled by them.
Imprinting
So, how can we get AIs to learn and value our beliefs? I imagine something like the filial imprinting in animals and humans might work. Imprinting requires very few cues. In animals, simply being the first moving object that a young animal sees may suffice to be treated as its mother. This does not mean that any moving object later on will be treated the same way, because by that time, the young will have developed a more complex concept of its presumed caretaker. What is important is that the young will pay attention to this first object and value it and the sensory input it provides as highly important, and will thus be motivated to stay close to it and learn from it. In humans, the learning part works especially well and ultimately includes highly abstract moral concepts, which still mostly center around other humans. We consider humans important because we are predisposed to pay attention to them and because, later on, we learn that other humans have a high impact on our lives in many ways. This importance is transferred to whatever is connected to those humans (from our perspective), such as houses, money, fashion, or beliefs and values.

Something similar should be possible in an Artificial General Intelligence (and I expect it to be simple enough to ensure compliance in subsequently modified versions or sub-agents). This of course implies that a fledgling AI requires at least one caretaker who takes a lot of time to interact with it, so that it can get a grasp of the human's beliefs and values. Such an AI would not be "safe" in principle upon initialization, but it would be harmless, since it would know nothing about what it could or should do. If everything is done right, then it should preferentially learn facts related to the human caretaker(s), including their beliefs about right and wrong. As the AI gradually becomes aware of its potential and thus starts to present a theoretical risk, it would at the same time become more and more aware of what it should rightly do with that potential. Contrary to common fears, such an AI would become safer as its capabilities grow. If it has a perfect grasp of the humans' wishes for it, then it will try to fulfill them as best as it can. Naturally, what is "best" is still debatable.
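As a rough sketch of how such imprinting could be operationalised (again, my own hypothetical framing, not a worked-out mechanism): the identity of the first caretaker is fixed early, and observations associated with that caretaker then receive a much higher learning weight than observations from other sources, so the caretaker's beliefs and values come to dominate what the young AI learns and cares about. All names and weights below are placeholders.

```python
# Hypothetical sketch of imprinting-like learning: observations tied to the
# imprinted caretaker accumulate importance much faster than anything else.

from collections import defaultdict

class ImprintingLearner:
    def __init__(self, caretaker_id, base_weight=0.1, caretaker_weight=1.0):
        self.caretaker_id = caretaker_id        # fixed at "first contact"
        self.base_weight = base_weight
        self.caretaker_weight = caretaker_weight
        self.importance = defaultdict(float)    # learned importance of facts

    def observe(self, fact, source_id):
        """Weight a new observation by whether it comes from the caretaker."""
        weight = self.caretaker_weight if source_id == self.caretaker_id else self.base_weight
        self.importance[fact] += weight

learner = ImprintingLearner(caretaker_id="caretaker_1")
learner.observe("honesty matters to my caretaker", source_id="caretaker_1")
learner.observe("some unrelated background fact", source_id="environment")
# Facts connected to the caretaker end up with much higher importance.
```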
I am a human, and I reserve the right to be inconsistent
If an AI can learn the beliefs of a human or a group of humans and values these highly, how does this translate into what decisions it should make? I imagine there are different possible ways. One simple solution might be to treat all beliefs held by the AI and/or the humans (as far as the AI knows) as normative (as in: everything should be as it is believed, including beliefs about how the world should be), to some degree, according to their perceived importance, and to treat deviations as something to be rectified if possible, with corresponding priority. If what is perceived as most important is the human caretaker, then this implies that the human should be preserved as they are. This would of course be a disaster if the AI's concept of the human were static instead of dynamic. A static human would be dead. A dynamic concept of a human includes facts such as the potential to grow, learn, and change one's mind all the time. Moreover, our beliefs are often false or implicitly contradictory, and our values can come into conflict. However, that is not something we normally want to be the case, and under most circumstances, we would welcome a clarification from the AI. Under the aforementioned assumption, the AI has to reconcile these counterfactual beliefs with reality. Naturally, it cannot change reality itself, so if it is aware with certainty of a fact that we are not, then the path of least resistance is to make us understand our mistake.
On the other hand, we often have wishes that are not part of reality yet, although we think they should be, and that are not impossible. Our wishes are important because we say so; the fact that they have not come true yet is potentially less so. Now, if the AI is driven to minimize the future conflicts between the different facts that it knows, all of which represent an aspect of "how things should be" but are not equally important, then it should act in such a way that the world is shaped in accordance with the humans' imagination as much as possible. And since it will have to take into account everything it knows, this will be done in a way that also does as little collateral damage as possible, relative to what the human would consider damage, as far as the AI knows. However, some other method is still required to determine how the potentially qualitatively different costs of various possible actions are weighed against one another and against the corresponding expected gains, in situations where humans would struggle without a clear preference and might decide on the spot depending on their mood.
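One way to picture the previous two paragraphs (a sketch of my own, under the strong assumption that beliefs can be represented with explicit importance weights and satisfaction measures, which is not something I have specified): every belief describes how the world "should be" and carries an importance weight, and the AI chooses the action whose predicted outcome minimizes the importance-weighted mismatch with all of those beliefs, which automatically trades small "collateral" violations against the more important wishes. The open question about qualitatively different costs corresponds to where these weights and satisfaction measures would come from.

```python
# Hypothetical sketch: beliefs as weighted norms; pick the action whose
# predicted outcome deviates least from them, weighted by importance.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Belief:
    description: str
    importance: float                       # how much the humans care about it
    satisfied: Callable[[Dict], float]      # 0.0 (violated) .. 1.0 (fully satisfied)

def mismatch(world_state: Dict, beliefs: List[Belief]) -> float:
    """Importance-weighted deviation between a predicted world state and the beliefs."""
    return sum(b.importance * (1.0 - b.satisfied(world_state)) for b in beliefs)

def choose_action(candidate_actions: List, predict_outcome: Callable[[object], Dict],
                  beliefs: List[Belief]):
    """Choose the action whose predicted outcome conflicts least with the beliefs."""
    return min(candidate_actions, key=lambda a: mismatch(predict_outcome(a), beliefs))
```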
Earn your happy ending
This approach would only fully work in an actual AGI of at least human-level intelligence, but it would scale upwards. That is, if the details can be worked out, if I didn't overlook a fundamental flaw, and if an AGI with its other necessities can be developed at all. Even then, the real work would have to be done by the human caretaker(s) after switching it on, and that is not something that should be rushed. In fact, it might be prudent to keep the learning rate low at the start so that any problems are identified early. If, on the other hand, the AI develops to a point where it is perfectly aligned under this approach, then any further development should not infringe upon this state.