Previous Intro: Formal Metaethics and Metasemantics for AI Alignment
I’m nearing completion of what I hope is a much more readable version of the ideas previously released as set-theoretic code. It takes the form of a detailed outline, currently in WorkFlowy, in which you can easily expand and collapse the subsections that elaborate on their parents’ content. You can find the current draft here.
Although it’s not polished, I’m releasing it in preparation for a Q&A I’ll be holding at the University of California Berkeley AI and Philosophy working group, which I hope you will attend. I’ll likely make some brief introductory remarks but reserve most of the time for answering questions. The working group is part of the UC Berkeley Social Science Matrix and will be held at:
Barrows Hall, 8th Floor, Mezzanine Level
Wed, Dec 4th, 12:30–2:30pm
(only the first hour is reserved for this Q&A)
Here I’ve reproduced just the first few levels of the outline. Click here to see their elaboration (currently ~4,800 words).
- Given mathematical models of the world and the adult human brains in it, an ethical goal function for AI can be constructed by applying a social welfare function to the set of extensional rational utility functions of the brains.
- The mathematical model of a world or brain is to be given as a causal Markov model.
- A causal Markov model is a convenient way of generating a causal model.
- The notion of a causal model is taken directly from Judea Pearl.
- A causal model is composed of:
- A causal Markov model is composed of:
- A causal Markov model (cmm) generates a causal model (cm) as follows:
- A brain’s rational utility function is the utility function that would be arrived at by the brain’s decision algorithm if it were to make more optimal decisions while avoiding unrelated distortions of value.
- A brain’s decision algorithm is the one that best satisfies these desiderata:
- First, it must take the mathematical form of a decision algorithm, which is a tuple composed of:
- Next, there must be an implementation function which maps brain states to decision states such that these two routes from a brain state to a decision event always arrive at the same result:
- It achieves a high rate of compression of the brain’s causal transition function.
- It is probabilistically coherent, including with its represented causal models.
- It is instrumentally rational in both its first-order and higher-order utility functions.
- It is ambitious, trying to explain as much as possible with the decision algorithm.
- The final formulation specifying the rational utility function gets rather complicated, but we can build up to it with a couple of initial approximations:
- Final specification: Simulate all possible continuations of an agent and apply a social welfare function to their utility functions while weighting them by optimality of prescriptions, agential identity and likelihood.
- The advantages of this metaethics include:
- Extension: The rational utility function of a brain, as defined above, is couched in terms of the brain’s own represented expressions; for interpersonal comparisons, we first cash these out extensionally in terms of their referents in the world.
- The social welfare function might be thought of as choosing a center of gravity between the extensional rational utility functions.
- The details above form an initial prototype.
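As a rough illustration of the Pearl-style causal models the outline relies on (the class and variable names here are hypothetical, and the structures are heavily simplified relative to the outline), a causal model can be represented by exogenous variables, structural functions for the endogenous variables, and an intervention (`do`) operator:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class CausalModel:
    # U: exogenous background conditions, set from outside the model.
    exogenous: Dict[str, float]
    # F: one structural function per endogenous variable, reading its parents
    # from the current state. Assumed to be listed in topological order.
    functions: Dict[str, Callable[[Dict[str, float]], float]]

    def solve(self) -> Dict[str, float]:
        # Evaluate structural functions in order to obtain every variable's value.
        state = dict(self.exogenous)
        for var, f in self.functions.items():
            state[var] = f(state)
        return state

    def do(self, var: str, value: float) -> "CausalModel":
        # Pearl's do-operator: sever a variable from its causes by replacing
        # its structural function with a constant.
        new_fs = dict(self.functions)
        new_fs[var] = lambda _state, v=value: v
        return CausalModel(dict(self.exogenous), new_fs)

# Toy usage: u -> x -> y.
m = CausalModel(exogenous={"u": 1.0},
                functions={"x": lambda s: 2 * s["u"],
                           "y": lambda s: s["x"] + 1.0})
```

Here `m.solve()` computes the observational state, while `m.do("x", 0.0).solve()` recomputes `y` under the intervention that forces `x` to 0.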
Read the full version here.
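On the aggregation side, the “center of gravity” idea can be illustrated with a toy social welfare function: a weighted sum over utility functions. This is one simple choice among many, and the outline’s actual construction (with weighting by optimality of prescriptions, agential identity, and likelihood) is considerably more involved; all names below are hypothetical.

```python
from typing import Callable, Dict, List

Outcome = Dict[str, float]
Utility = Callable[[Outcome], float]

def social_welfare(utilities: List[Utility],
                   weights: List[float],
                   outcome: Outcome) -> float:
    # Weighted-sum welfare: a toy way of choosing a "center of gravity"
    # between the brains' extensional utility functions.
    return sum(w * u(outcome) for w, u in zip(weights, utilities))

def best_outcome(utilities: List[Utility],
                 weights: List[float],
                 candidates: List[Outcome]) -> Outcome:
    # An ethical goal function would then select the outcome
    # that maximizes aggregate welfare over the candidates.
    return max(candidates, key=lambda o: social_welfare(utilities, weights, o))
```

For example, with two agents whose utilities over `x` pull in opposite directions, the weights determine which way the chosen outcome tips.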
What constitutes utility here, then? For example, some might say utility is grounded in happiness or meaning; in economics we often measure utility in money; and I’ve been thinking along the lines of grounding utility (through value) in the minimization of prediction error. It’s fine that you are concerned with higher-order processes (I assume you mean processes about processes, so that higher-order outputs are outputs about outputs and higher-order utility is utility about utility), and maybe you are primarily concerned with abstractions that let you ignore these details. But those abstractions must still be embodiable in specifics at some point, or else they don’t describe reality well. After all, meta-values/preferences/utility functions are still values/preferences/utility functions.
How do you distinguish whether something is a distortion or not? You point to some things you consider distortions, but I’m still unclear on the criteria by which you tell distortions apart from the rational values you are looking for. One person’s bias may be another person’s taste. I realize some of this may depend on how you identify higher-order processes, but even so, the question still applies to those directly: is some particular higher-order decision criterion a distortion, or rational?
This seems strange to me, because much of what makes a person unique lies in their distortions (speaking loosely), not in their absence. Normally when we think of distortions, they take an agent away from a universal perfected norm, and that universal norm would ideally be the same for all agents if it weren’t for the distortions. What leads you to think there are some personal dispositions that are neither distortions nor universal, given that they would be caused by the shared rationality norm?