Previous Posts:
Formal Metaethics and Metasemantics for AI Alignment
New MetaEthical.AI Summary and Q&A at UC Berkeley
This time I tried to focus less on the technical details and more on providing the intuition behind the principles guiding the project. I'm grateful for questions and comments from Stuart Armstrong and the AI Safety Reading Group. I've posted the slides on Twitter.
Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into "merely" engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.
Given such a model, the AI attributes beliefs and values to a brain in two stages. First, it identifies the syntax of a brain's mental content by selecting a decision algorithm which is i) isomorphic to the brain's causal processes and ii) best compresses its behavior while iii) maximizing charity. The semantics of that content then consists first in sense data that primitively refer to their own occurrence and then in logical and causal structural combinations of such content.
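To give a rough feel for that first stage, here is a minimal, purely illustrative sketch in Python. It is not the project's actual formalism; the class, the scoring rule, and the boolean isomorphism check below are all hypothetical stand-ins for the real criteria (i)–(iii).

```python
# A minimal, purely illustrative sketch -- not the project's actual formalism.
# CandidateAlgorithm, description_length, charity_score and
# matches_causal_structure are hypothetical stand-ins.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateAlgorithm:
    name: str
    description_length: float       # bits to specify the algorithm: smaller = better compression (ii)
    charity_score: float            # how rational/coherent it makes the agent look (iii)
    matches_causal_structure: bool  # stand-in for the isomorphism check against the brain model (i)


def select_decision_algorithm(candidates: List[CandidateAlgorithm]) -> Optional[CandidateAlgorithm]:
    """Keep only candidates isomorphic to the brain's causal processes, then
    trade off compression against charity with a (hypothetical) linear score."""
    viable = [c for c in candidates if c.matches_causal_structure]
    if not viable:
        return None
    return max(viable, key=lambda c: c.charity_score - c.description_length)


if __name__ == "__main__":
    candidates = [
        CandidateAlgorithm("giant lookup table", 1e6, 0.0, True),
        CandidateAlgorithm("noisy expected-utility planner", 2e3, 0.9, True),
        CandidateAlgorithm("elegant model of a different brain", 1e2, 0.95, False),
    ]
    print(select_decision_algorithm(candidates).name)  # -> noisy expected-utility planner
```

The point of the toy scoring is only that compression and charity pull in different directions and both must be respected, subject to the hard isomorphism constraint; how to trade them off in a principled way is part of what the real account has to settle.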
The resulting decision algorithm can capture how we decide what to do, but it can also identify the ethical factors that we seek to determine when we decide what to value or even how to decide. Unfolding the implications of those factors, we arrive at what we should do. Altogether, this allows us to imbue the AI with the necessary concepts to determine and do what we should program it to do.
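And a similarly hypothetical toy sketch of "unfolding the implications": apply the agent's own higher-order norms to revise its first-order values until a fixed point is reached. The real account would operate over far richer structures; every name here is made up for illustration.

```python
# Hypothetical toy model of "unfolding the implications" of an agent's
# higher-order norms: revise the first-order values by the agent's own
# standards until a fixed point is reached.
from typing import Callable, FrozenSet


def unfold(values: FrozenSet[str],
           revise: Callable[[FrozenSet[str]], FrozenSet[str]],
           max_steps: int = 1000) -> FrozenSet[str]:
    """Iterate the agent's own revision procedure to a fixed point."""
    for _ in range(max_steps):
        revised = revise(values)
        if revised == values:   # no further norm-guided revision remains
            return values
        values = revised
    raise RuntimeError("no fixed point within max_steps")


# Toy higher-order norm: drop values that would not be endorsed on reflection,
# and add whatever the endorsed values jointly imply.
ENDORSED = frozenset({"honesty", "fairness", "keeping promises"})
IMPLICATIONS = {frozenset({"honesty", "fairness"}): frozenset({"keeping promises"})}


def toy_revise(values: FrozenSet[str]) -> FrozenSet[str]:
    kept = set(v for v in values if v in ENDORSED)
    for premises, conclusions in IMPLICATIONS.items():
        if premises <= kept:
            kept |= conclusions
    return frozenset(kept)


print(sorted(unfold(frozenset({"honesty", "fairness", "status-seeking"}), toy_revise)))
# -> ['fairness', 'honesty', 'keeping promises']
```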
What do you see as advantages and disadvantages of this design compared to something like Paul Christiano's 2012 formalization of indirect normativity? (One thing I personally like about Paul's design is that it's more agnostic about meta-ethics, and I worry about your stronger meta-ethical assumptions, which I'm not very convinced about. See metaethical policing for my general views on this.)
How worried are you about this kind of observation? People's actual moral views seem at best very under-determined by their "fundamental norms", with their environment and specifically what status games they're embedded in playing a big role. If many people are currently embedded in games that cause them to want to freeze their morally relevant views against further change and reflection, how will your algorithm handle that?
I agree that people's actual moral views don't track correct reasoning from their fundamental norms all that well. Normative reasoning is just one causal influence on our views; there are plenty of biases, such as those from status games, that also play a causal role. That's no problem for my theory. It carefully sets aside those distortions and focuses on the paths of correct reasoning to determine the normative truths. In general, our conscious desires and first-order views don't matter that much on my view unless they are endorsed by the standards we imp…
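To illustrate that filtering move in toy form (every name below is a hypothetical stand-in, not the actual formalism): among the causal influences on a moral view, only those that constitute correct reasoning from the agent's fundamental norms contribute to the normative output, while bias channels are traced but discarded.

```python
# Purely illustrative: count only the influences that constitute correct
# reasoning from the agent's fundamental norms. "is_correct_reasoning" is a
# hypothetical stand-in for that check.
from dataclasses import dataclass
from typing import List


@dataclass
class Influence:
    source: str
    conclusion: str
    is_correct_reasoning: bool  # would the agent's own fundamental norms license this step?


def normative_output(influences: List[Influence]) -> List[str]:
    """Keep conclusions reached via correct reasoning; discard bias channels."""
    return [i.conclusion for i in influences if i.is_correct_reasoning]


influences = [
    Influence("status game in peer group", "never question the in-group", False),
    Influence("reflection from fairness norm", "weigh strangers' interests too", True),
]
print(normative_output(influences))  # -> ["weigh strangers' interests too"]
```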