Over the course of the AI Alignment Prize I've sent out lots of feedback emails. Some of the threads were really exciting and taught me a lot. But mostly it was me saying pretty much the same thing over and over with small variations. I've gotten pretty good at saying that thing, so it makes sense to post it here.
Working on AI alignment is like building a bridge across a river. Our building blocks are ideas like mathematics, physics or computer science. We understand how they work and how they can be used to build things.
Meanwhile on the far side of the river, we can glimpse other building blocks that we imagine we understand. Desire, empathy, comprehension, respect... Unfortunately we don't know how these work inside, so from the distance they look like black boxes. They would be very useful for building the bridge, but to reach them we must build the bridge first, starting from our side of the river.
What about machine learning? Perhaps the math of neural networks could free us from the need to understand the building blocks we use? If we could create a behavioral imitation of that black box over there (like "human values"), then building the bridge would be easy! Unfortunately, ideas like adversarial examples, treacherous turns or nonsentient uploads show that we shouldn't bet our future on something that imitates a particular black box, even if the imitation passes many tests. We need to understand how the black box works inside, to make sure our version's behavior is not only similar but based on the right reasons.
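To make the "passes many tests" failure mode concrete, here is a minimal sketch (not from the original post; it assumes numpy and scikit-learn, and the synthetic dataset and perturbation size are arbitrary illustrative choices). A classifier does well on held-out tests, yet fails on inputs crafted to exploit how it works inside rather than how it was tested:

```python
# Minimal sketch: good held-out accuracy does not imply robustness to
# inputs chosen adversarially against the model's internals.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))  # usually high

# Perturb each test point slightly in the direction that most changes the
# model's decision: along the sign of the weight vector, flipped per label.
eps = 0.5
w = clf.coef_[0]
X_adv = X_test - eps * np.sign(w) * np.where(y_test == 1, 1, -1)[:, None]
print("accuracy on perturbed inputs:", clf.score(X_adv, y_test))  # typically much lower
```

The point is not that logistic regression is especially fragile, but that behavioral test coverage by itself says little about how an imitation will behave off the distribution it was tested on.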
(Eliezer made the same point more eloquently a decade ago in Artificial Mysterious Intelligence. Still, with the second round of our prize now open, I feel it's worth saying again.)
I think "black box" is being used here to refer to two different things: concepts in philosophy or science that we do not fully understand yet, and machine learning models like neural networks, which seem to capture their knowledge in ways that are uninterpretable to humans.
We will almost certainly require the use of machine learning or AI to model systems that are beyond our capabilities to understand. This may include physics, complex economic systems, the invention of new technology, or yes, even human values. There is no guarantee that a theory that describes our own values can be written down and understood fully by us.
Have you ruled out any kind of theory that would let you know for certain that a "black-box" model is learning what you want it to learn, without understanding exactly what it has learned? I might not be able to formally verify that my neural network has learned exactly what I want (i.e. by extracting the knowledge out of it and comparing it to what I already know), but maybe I have formal proofs about the algorithm it is using, so I know the knowledge it acquires will be fairly robust under certain conditions. It's basically that latter kind of guarantee we need to be aiming for.
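As one illustration of the kind of guarantee I mean (a standard textbook result, not anything specific to alignment), a PAC-style generalization bound constrains the output of the learning algorithm without saying anything about what the model has learned internally. For a finite hypothesis class $\mathcal{H}$, true risk $R(h)$, and empirical risk $\hat{R}_m(h)$ on $m$ i.i.d. samples:

$$
\Pr\Big[\,\forall h \in \mathcal{H}:\; R(h) \;\le\; \hat{R}_m(h) + \sqrt{\tfrac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}}\,\Big] \;\ge\; 1-\delta
$$

Under its assumptions (i.i.d. data, finite hypothesis class, bounded loss), this says the model's real error cannot stray far from its measured training error, no matter which hypothesis the learner ends up with. That's the flavor of statement I have in mind: a proof about the algorithm and the conditions it runs under, not an inspection of the learned content.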
I think what you're describing is possible, but very hard. Any progress in that direction would be much appreciated, of course.