Previous Posts:
Formal Metaethics and Metasemantics for AI Alignment
New MetaEthical.AI Summary and Q&A at UC Berkeley
This time I tried to focus less on the technical details and more on providing the intuition behind the principles guiding the project. I'm grateful for questions and comments from Stuart Armstrong and the AI Safety Reading Group. I've posted the slides on Twitter.
Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into "merely" engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.
Given such a model, the AI attributes beliefs and values to a brain in two stages. First, it identifies the syntax of a brain's mental content by selecting a decision algorithm which is i) isomorphic to the brain's causal processes and ii) best compresses its behavior while iii) maximizing charity. The semantics of that content then consists first in sense data that primitively refer to their own occurrence and then in logical and causal structural combinations of such content.
The resulting decision algorithm can capture how we decide what to do, but it can also identify the ethical factors that we seek to determine when we decide what to value or even how to decide. Unfolding the implications of those factors, we arrive at what we should do. Altogether, this allows us to imbue the AI with the necessary concepts to determine and do what we should program it to do.
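To give a rough, concrete flavor of the first stage, here is a toy sketch (my own illustration, with hand-picked scores and an arbitrary weighted sum standing in for the formal criteria, which are not actually defined this way) of choosing among candidate decision algorithms by isomorphism, compression, and charity:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    """A candidate decision algorithm to attribute to the brain."""
    name: str
    # Hypothetical scores in [0, 1]; a real system would derive these from
    # the low-level causal model rather than supplying them by hand.
    isomorphism: float   # how faithfully its steps map onto the brain's causal processes
    compression: float   # how compactly it predicts the brain's behavior
    charity: float       # how rational and coherent it makes the agent come out

def score(c: Candidate) -> float:
    # Purely illustrative aggregation; the project's actual criteria are
    # formally defined and ordered, not a hand-picked weighted sum.
    return 0.4 * c.isomorphism + 0.4 * c.compression + 0.2 * c.charity

candidates: List[Candidate] = [
    Candidate("giant lookup table", isomorphism=0.9, compression=0.1, charity=0.5),
    Candidate("noisy expected-utility maximizer", isomorphism=0.7, compression=0.8, charity=0.9),
]

best = max(candidates, key=score)
print(f"Attributed decision algorithm: {best.name}")
```

And a similarly loose sketch of "unfolding": iterating the agent's own higher-order criteria for revising values (here a made-up means-ends criterion) until the valued set stops changing:

```python
def unfold(values: frozenset, revise) -> frozenset:
    """Apply the agent's own higher-order revision criteria until a fixed point."""
    while True:
        revised = revise(values)
        if revised == values:
            return values
        values = revised

# Hypothetical higher-order criterion: whatever the agent values as an end
# should pull its necessary means into the valued set as well.
NECESSARY_MEANS = {"friendship": {"honesty"}, "honesty": {"accurate self-knowledge"}}

def endorse_means(values: frozenset) -> frozenset:
    extra = set()
    for v in values:
        extra |= NECESSARY_MEANS.get(v, set())
    return values | frozenset(extra)

print(unfold(frozenset({"friendship"}), endorse_means))
```

Both sketches collapse a great deal of structure; they are only meant to make the shape of the two moves concrete.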
Very interesting! More interesting to me than the last time I looked through your proposal, partly because of some small changes I think you've made, but primarily because I'm a lot more amenable to this "genre" than I was.
I'd like to encourage a shift in perspective from having to read preferences from the brain, to being able to infer human preferences from all sorts of human-related data. This is related to another shift from trying to use preferences to predict human behavior in perfect detail, to being content to merely predict "human-scale" facts about humans using an agential model.
These two shifts are related by the conceptual change from thinking about human preferences as "in the human," and thus inextricably linked to understanding humans on a microscopic level, to thinking about human preferences as "in our model of the human" - as components that need to be understood as elements of an intentional-stance story we're telling about the world.
This of course isn't to say that brains have no mutual information with values. But rather than having two separate steps in your plan like "first, figure out human values" and "later, fit those human values into the AI's model of the world," I wonder if you've explored how it could work for the AI to try to figure out human values while simultaneously locating them within a way (or ways) of modeling the world.
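As a very rough illustration of what I mean by reading preferences off an agential model rather than off the brain, here's a toy sketch (entirely made-up data, with a one-parameter softmax choice model standing in for whatever model class you'd actually want): fit a noisily-rational agent to observed human-scale choices, and take the "preferences" to be parameters of that fitted model.

```python
import math

# Made-up "human-scale" data: observed choices between two options.
observed_choices = ["coffee", "coffee", "tea", "coffee"]

def choice_probs(utilities, beta=1.0):
    """Boltzmann-rational (softmax) choice model: noisy maximization of utility."""
    exps = {option: math.exp(beta * u) for option, u in utilities.items()}
    total = sum(exps.values())
    return {option: e / total for option, e in exps.items()}

def log_likelihood(coffee_utility):
    """How well a given preference parameter explains the observed choices."""
    probs = choice_probs({"tea": 0.0, "coffee": coffee_utility})
    return sum(math.log(probs[c]) for c in observed_choices)

# Crude grid search over the single preference parameter.
best_u = max((u / 10 for u in range(-30, 31)), key=log_likelihood)
print(f"Inferred relative preference for coffee over tea: {best_u:+.1f}")
```

The point of the toy is just that the inferred preference lives in the fitted model, not in any particular microscopic fact about the brain that generated the choices.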
No, I'm definitely being more descriptivist than causal-ist here. The point I want to get at is on a different axis.
Suppose you were Laplace's demon, and had perfect knowledge of a human's brain (it's not strictly necessary to pretend determinism, but it sure makes the argument simpler). You would have no need to track the human's "wants" or "beliefs"; you would just predict based on the laws of physics. Not only could you do a better job than some human psychologist on human-scale tasks (like predicting in advance which button the human will press), you w...