Previous Posts:
Formal Metaethics and Metasemantics for AI Alignment
New MetaEthical.AI Summary and Q&A at UC Berkeley
This time I tried to focus less on the technical details and more on providing the intuition behind the principles guiding the project. I'm grateful for questions and comments from Stuart Armstrong and the AI Safety Reading Group. I've posted the slides on Twitter.
Abstract: We construct a fully technical ethical goal function for AI by directly tackling the philosophical problems of metaethics and mental content. To simplify our reduction of these philosophical challenges into "merely" engineering ones, we suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available.
Given such a model, the AI attributes beliefs and values to a brain in two stages. First, it identifies the syntax of a brain's mental content by selecting a decision algorithm which i) is isomorphic to the brain's causal processes, ii) best compresses its behavior, and iii) maximizes charity. The semantics of that content then consists first in sense data that primitively refer to their own occurrence, and then in logical and causal structural combinations of such content.
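To make the shape of that selection concrete, here is a minimal runnable sketch. It is my own toy illustration, not the project's actual set-theoretic formalism: the candidate names, precomputed scores, and the trade-off weight are all hypothetical. Candidates failing the isomorphism constraint (i) are filtered out; the survivors trade off compression (ii) against charity (iii).

```python
from dataclasses import dataclass

# Hypothetical stand-ins: in the real construction these criteria are
# defined over the brain's causal model, not given as precomputed numbers.

@dataclass
class Candidate:
    name: str
    isomorphic_to_brain: bool   # criterion (i)
    description_length: float   # criterion (ii): shorter = better compression
    charity: float              # criterion (iii): fraction of transitions deemed rational

CHARITY_WEIGHT = 10.0  # arbitrary trade-off between (ii) and (iii)

def select_decision_algorithm(candidates):
    # (i) is a hard constraint; (ii) and (iii) are traded off in the score.
    viable = [c for c in candidates if c.isomorphic_to_brain]
    return max(viable, key=lambda c: -c.description_length + CHARITY_WEIGHT * c.charity)

candidates = [
    Candidate("lookup-table", True, description_length=1e6, charity=0.5),
    Candidate("noisy-EU-maximizer", True, description_length=2e3, charity=0.9),
    Candidate("EU-maximizer", False, description_length=1e3, charity=1.0),  # fails (i)
]
print(select_decision_algorithm(candidates).name)  # -> "noisy-EU-maximizer"
```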
The resulting decision algorithm can capture how we decide what to do, but it can also identify the ethical factors we seek to determine when we decide what to value, or even how to decide. Unfolding the implications of those factors, we arrive at what we should do. Altogether, this allows us to imbue the AI with the concepts necessary to determine and do what we should program it to do.
I think the simplest intentional systems just refer to their own sensory states. It's true that we are able to refer to external things, but that's not because the external causes of our cognitive states somehow differ from those of such simple systems. External reference is earned by reasoning, e.g. in accordance with the axioms of Pearl's causal models, in such a way that attributing content like 'the cause of this and that sensory state ...' explains our brain's dynamics and behavior better than just 'this sensory state'. This applies to the content of both our beliefs and our desires.
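As a toy illustration of that explanatory comparison (entirely hypothetical, and far simpler than Pearl's actual formalism): consider a brain that keeps acting as if a light is on after its sensory channel is severed, i.e. under an intervention do(S = 0). A content hypothesis that mentions only the sensory state mispredicts this behavior, while one that posits 'the cause of this sensory state' fits it:

```python
# Toy comparison of two content hypotheses (all names hypothetical).
# World: an external light L causes sensory signal S, unless we intervene
# and sever the channel (Pearl-style do(S = 0)).

def sensory_signal(light_on: bool, channel_severed: bool) -> bool:
    return light_on and not channel_severed

# Observed behavior: having seen the light earlier, the brain still acts
# as if the light is on after the channel is severed.
def brain_acts_as_if_light(saw_light_before: bool, sensory_now: bool) -> bool:
    return saw_light_before or sensory_now

# Hypothesis A: the content is just 'this sensory state'.
def predict_a(saw_light_before: bool, sensory_now: bool) -> bool:
    return sensory_now

# Hypothesis B: the content is 'the cause of this sensory state',
# an inferred external light that persists through the intervention.
def predict_b(saw_light_before: bool, sensory_now: bool) -> bool:
    return saw_light_before or sensory_now

s = sensory_signal(light_on=True, channel_severed=True)  # do(S = 0)
actual = brain_acts_as_if_light(saw_light_before=True, sensory_now=s)
print("Hypothesis A matches behavior:", predict_a(True, s) == actual)  # False
print("Hypothesis B matches behavior:", predict_b(True, s) == actual)  # True
```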
In philosophical terms, you seem to be thinking in terms of a causal theory of reference, whereas I'm taking a neo-descriptivist approach. Both theories acknowledge that one aspect of meaning is what terms refer to, and that this obviously depends on the world. But if you consider cases like 'creature with a heart' and 'creature with a kidney', which may very well refer to the same things but clearly still differ in meaning, you can start to see that there's more to meaning than reference.
Neo-descriptivists would say there's an intension, which is roughly a function from possible worlds to the term's reference in that world. It explains how reference is determined and, unlike reference itself, does not depend on the external world. This makes it well suited to explaining cognition and behavior in terms of processes internal to the brain, which might otherwise look like spooky action at a distance if you tried to explain them in terms of external reference. In the context of my project, I define intension here. See also Chalmers on two-dimensional semantics.
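Here is a minimal sketch of that idea (my own toy illustration, not the formal definition linked above): an intension is just a function from a possible world to the term's extension in that world, so 'creature with a heart' and 'creature with a kidney' can coincide in the actual world while their intensions still differ.

```python
# Toy worlds and creatures (hypothetical): an intension maps each
# possible world to the term's extension (reference) in that world.

worlds = {
    "actual": [
        {"name": "dog", "has_heart": True, "has_kidney": True},
        {"name": "cat", "has_heart": True, "has_kidney": True},
    ],
    "counterfactual": [
        {"name": "dog", "has_heart": True, "has_kidney": False},
    ],
}

# Two intensions: world -> set of creatures satisfying the description.
heart_intension  = lambda w: {c["name"] for c in worlds[w] if c["has_heart"]}
kidney_intension = lambda w: {c["name"] for c in worlds[w] if c["has_kidney"]}

# Same extension in the actual world...
assert heart_intension("actual") == kidney_intension("actual")
# ...but the intensions differ, as the counterfactual world reveals.
assert heart_intension("counterfactual") != kidney_intension("counterfactual")
```

Note that neither intension had to consult anything outside the (internally represented) worlds, which is the sense in which intensions, unlike reference, don't depend on the external world.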