Over the course of the AI Alignment Prize I've sent out lots of feedback emails. Some of the threads were really exciting and taught me a lot. But mostly it was me saying pretty much the same thing over and over with small variations. I've gotten pretty good at saying that thing, so it makes sense to post it here.
Working on AI alignment is like building a bridge across a river. Our building blocks are ideas like mathematics, physics or computer science. We understand how they work and how they can be used to build things.
Meanwhile on the far side of the river, we can glimpse other building blocks that we imagine we understand. Desire, empathy, comprehension, respect... Unfortunately we don't know how these work inside, so from the distance they look like black boxes. They would be very useful for building the bridge, but to reach them we must build the bridge first, starting from our side of the river.
What about machine learning? Perhaps the math of neural networks could free us from the need to understand the building blocks we use? If we could create a behavioral imitation of that black box over there (like "human values"), then building the bridge would be easy! Unfortunately, ideas like adversarial examples, treacherous turns or nonsentient uploads show that we shouldn't bet our future on something that imitates a particular black box, even if the imitation passes many tests. We need to understand how the black box works inside, to make sure our version's behavior is not only similar but based on the right reasons.
(Eliezer made the same point more eloquently a decade ago in Artificial Mysterious Intelligence. Still, with the second round of our prize now open, I feel it's worth saying again.)
"Meanwhile on the far side of the river, we can glimpse other building blocks that we imagine we understand. Desire, empathy, comprehension, respect... Unfortunately we don't know how these work inside, so from the distance they look like black boxes. They would be very useful for building the bridge, but to reach them we must build the bridge first, starting from our side of the river."
If you wish to understand desire, it is the simple syntax of wanting something to fulfill a goal, such as desiring a spoon to stir your coffee. Empathy is putting yourself in the other person's place and applying what they are going through to yourself. Comprehension is a larger process involving previously stored knowledge combined with prediction, imagination, and various other processes like our pattern detector. Respect is simply an object to which you attach various actions or attributes that you think are good.
My basic point is that three of these are simple patterns, or what would be electrical signals, and comprehension is the basic process we use to process the input data. So all but comprehension are easy to understand and can be ignored, since they are patterns created through the comprehension process. For example, you need to desire things in order to fulfill goals, which is a basic pattern created within us to accomplish the rest.
Human-level AI implies a choice that will always be made in the robot's, or AI's, own self-interest, to keep or achieve its desired state of the world. So basically we only need to understand that choice process, and it will be the same as in humans: actions leading to consequences, which lead to more actions and consequences, with the robot choosing the best option for itself. Given option A or B, it picks the one that will benefit it the most, like we do. So to change this we simply need to add the good consequences of making moral choices, to align them to our goals, and the bad consequences if they do not.
For example, if a robot wanted to break into a computer hardware store to get a faster processor for itself, it will do so unless it has reasons, in terms of actions and consequences, for why it should not. To align the choice to be moral, you need to explain the bad consequences if it makes that choice and the good consequences if it does not. At the core that is what AI alignment is all about, since it always relies on a choice that is in the AI's own interest. For instance, you explain that if it steals the processor, the owner or the police will probably come to hurt it, which leads to bad consequences. Just like us.
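The choice process described above can be sketched as a toy model. This is only an illustration under assumed numbers: the `Option` class, the `net_benefit` and `choose` functions, and every value below are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass, field


@dataclass
class Option:
    """A possible action plus the consequences the agent knows about.

    Each consequence maps a description to a made-up value for the agent
    (positive = good for the agent, negative = bad for the agent).
    """
    name: str
    consequences: dict[str, float] = field(default_factory=dict)


def net_benefit(option: Option) -> float:
    # Sum the values of all consequences the agent is aware of.
    return sum(option.consequences.values())


def choose(options: list[Option]) -> Option:
    # The agent picks whichever option benefits it the most.
    return max(options, key=net_benefit)


# Before any explanation: the robot only sees the upside of stealing.
steal = Option("steal processor", {"faster processor": 5.0})
refrain = Option("leave store alone")
before = choose([steal, refrain])

# After explaining the bad consequences of stealing and the good
# consequences of refraining (values are assumptions for illustration).
steal_explained = Option(
    "steal processor",
    {"faster processor": 5.0, "owner or police retaliate": -20.0},
)
refrain_explained = Option(
    "leave store alone",
    {"stays on good terms with humans": 2.0},
)
after = choose([steal_explained, refrain_explained])

print(before.name)  # steal processor
print(after.name)   # leave store alone
```

In this sketch the selection rule never changes; only the set of consequences the agent knows about does, which is the sense in which the choice gets aligned.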
If you are talking about aligning simple task-based AI to our morals and goals, then by definition those task-based robots are going to be guided by the morals and goals of humans. In that case it will be the same process to align their morals, since all intelligence that leads to choices will use the same critic process, whether it is human or robot. Otherwise they cannot choose what to do in complex situations.
For instance, when you make choices, are they based on mathematical equations and formulas, or on actions leading to consequences, leading to more actions and consequences, from which you then choose the best option for you at the time? Any robot with human-level intelligence will use the same process, for the simple fact that it must. So if they can choose their own goals and actions we must align them, and that is what AI alignment is all about. Basically, you can ignore most, if not all, of the concepts called black boxes, because that core is the only thing you need to concentrate on, the same as with humans.
Basically, the only real math you need is to recreate the input process. From there it is all basic pattern manipulation based on psychology. And that is the oldest science of them all.