A brain’s rational utility function is the utility function that would be arrived at by the brain’s decision algorithm if it were to make more optimal decisions while avoiding unrelated distortions of value.
By what mechanism do you think we can assess how unrelated a distortion is and how much distortion of value is happening? Put another way, what are "values" in this model such that they are separate from the utility function, and how could you measure whether or not the utility function is better optimizing for those values?
Here, the optimal decisions would be the higher-order outputs which maximize higher-order utility. They are decisions about what to value or how to decide rather than about what to do.
To capture rational values, we are trying to focus on the changes to values that flow out of satisfying one’s higher-order decision criteria. By unrelated distortions of value, I pretty much mean changes in value from any other causes, e.g. from noise, biases, or mere associations.
In the code and outline I call the lack of distortion Agential Identity (similar to personal identity). I had previously tried to just extract the criteria out of the brain and directly operate on them. But now, I think the brain is sufficiently messy that we can only simulate many continuations and aggregate them. That opens up a lot of potential to stray far from the original state. This Agential Identity helps ensure we’re uncovering your dispositions rather than that of a stranger or a funhouse mirror distortion.
Here, the optimal decisions would be the higher-order outputs which maximize higher-order utility. They are decisions about what to value or how to decide rather than about what to do.
What constitutes utility here, then? For example, some might say utility is grounded in happiness or meaning; in economics we often measure utility in money; and I've been thinking along the lines of grounding utility (through value) in minimization of prediction error. It's fine that you are concerned with higher-order processes (I'm assuming by that you mean processes about processes: higher-order outputs are outputs about outputs, higher-order utility is utility about utility), and maybe you are primarily concerned with abstractions that let you ignore these details. But those abstractions must still be embodied in specifics at some point, or else they are abstractions that don't describe reality well. After all, meta-values/preferences/utility functions are still values/preferences/utility functions.
To capture rational values, we are trying to focus on the changes to values that flow out of satisfying one’s higher-order decision criteria. By unrelated distortions of value, I pretty much mean changes in value from any other causes, e.g. from noise, biases, or mere associations.
How do you distinguish whether something is a distortion or not? You point to some things that you consider distortions, but I'm still unclear on the criteria by which you know distortions from the rational values you are looking for. One person's bias may be another person's taste. I realize some of this may depend on how you identify higher-order processes, but even if that's the case we're still left with the question as it applies to those directly, i.e. is some particular higher-order decision criterion a distortion or rational?
In the code and outline I call the lack of distortion Agential Identity (similar to personal identity). I had previously tried to just extract the criteria out of the brain and directly operate on them. But now, I think the brain is sufficiently messy that we can only simulate many continuations and aggregate them. That opens up a lot of potential to stray far from the original state. This Agential Identity helps ensure we’re uncovering your dispositions rather than that of a stranger or a funhouse mirror distortion.
This seems strange to me, because much of what makes a person unique lies in their distortions (speaking loosely here), not in their lack. Normally when we think of distortions they are taking an agent away from a universal perfected norm, and that universal norm would ideally be the same for all agents if it weren't for distortions. What leads you to think there are some personal dispositions that are not distortions and not universal because they are caused by the shared rationality norm?
Officially, my research is metaethical. I tell the AI how to identify someone’s higher-order utility functions but remain neutral on what those actually are in humans. Unofficially, I suspect they amount to some specification of reflective equilibrium and prescribe changing one’s values to be more in line with that equilibrium.
On distortion, I’m not sure what else to say but repeat myself. Distortions are just changes in value not governed by satisfying higher-order decision criteria. The examples I gave are not part of the specification, they’re just things I expect to be included.
Distortion is also not meant to specify all irrationality or nonoptimality. It’s just a corrective to a necessary part of the parliamentary procedure. We must simulate the brain’s continuation in some specific circumstance or other, and that brings its own influences. So, I wouldn’t call a higher-order criterion a distortion even if it gets rejected. It’s more like a prima facie reason that gets overruled. In any case, we can evaluate such criteria as rational or not, but we’d be doing so by invoking some higher-order criteria (other ones, unless the criterion is reflective).
For the most part, I don’t believe in norms universal to all agents. Given our shared evolutionary history, I expect significant overlap among humans, but also some subtle differences from development and the environment. It may also be worth mentioning that even with the same norm, we can preserve uniqueness if, for instance, it takes one’s current state into consideration.
Your social welfare function produces a total preference ordering over outcomes, but not a mapping to real-numbered utilities. How do you decide between the median outcome and a 50% chance of either the best or worst? Do you plan to have the preference ordering be over probability distributions over outcomes?
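To make the gap concrete, here is a minimal sketch in Python. The outcome names and the two sets of numeric utilities are purely hypothetical and not taken from the outline. Both utility assignments agree with the same ordering over outcomes, yet they disagree about whether the sure median beats the 50/50 gamble, so the ordering alone cannot settle the question.

```python
from fractions import Fraction

# Ordinal information only: worst < median < best.
ranking = ["worst", "median", "best"]

def respects_ranking(u):
    """True if the cardinal utilities u preserve the ordinal ranking."""
    return all(u[a] < u[b] for a, b in zip(ranking, ranking[1:]))

def expected_utility(lottery, u):
    """Expected utility of a lottery given as {outcome: probability}."""
    return sum(p * u[o] for o, p in lottery.items())

sure_median = {"median": Fraction(1)}
coin_flip = {"worst": Fraction(1, 2), "best": Fraction(1, 2)}

# Two hypothetical cardinalizations consistent with the same ordering.
cautious = {"worst": 0, "median": 8, "best": 10}
bold = {"worst": 0, "median": 2, "best": 10}

def preferred(u):
    """Name the option with the higher expected utility under u."""
    a = expected_utility(sure_median, u)
    b = expected_utility(coin_flip, u)
    if a > b:
        return "sure_median"
    if b > a:
        return "coin_flip"
    return "indifferent"

for name, u in [("cautious", cautious), ("bold", bold)]:
    print(name, respects_ranking(u), preferred(u))
```

Any rule that picks between the two options has to draw on something beyond the ranking of outcomes, for example a ranking over lotteries themselves.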
The social welfare compromise varies with our ontology. If everyone agrees that two outcomes are the same, and therefore we represent them as the same outcome, it changes relative distances between preference orderings.
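A toy illustration of this ontology dependence, as a sketch only: it assumes the distance between preference orderings is the Kendall tau distance (the number of pairs ranked in opposite order), which is my assumption and may not be the measure the outline uses. The three rankings and the merged outcome name "CD" are hypothetical.

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Count outcome pairs that the two rankings order differently."""
    pos1 = {o: i for i, o in enumerate(r1)}
    pos2 = {o: i for i, o in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )

def merge(ranking, alias):
    """Rename outcomes via alias, keeping the first occurrence of each name."""
    merged = []
    for o in ranking:
        name = alias.get(o, o)
        if name not in merged:
            merged.append(name)
    return merged

# Hypothetical rankings, best first, over four outcomes.
p1 = ["A", "B", "C", "D"]
p2 = ["A", "B", "D", "C"]
p3 = ["B", "A", "C", "D"]

before = (kendall_tau(p1, p2), kendall_tau(p1, p3), kendall_tau(p2, p3))

# Everyone agrees C and D are the same outcome, so represent them as one.
alias = {"C": "CD", "D": "CD"}
m1, m2, m3 = (merge(p, alias) for p in (p1, p2, p3))

after = (kendall_tau(m1, m2), kendall_tau(m1, m3), kendall_tau(m2, m3))

print(before, after)
```

Before merging, the first ranking is equally far from the other two. After merging, it coincides with the second and stays at distance 1 from the third, so any compromise that depends on these distances can shift.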
Nice catch. Yes, I think I’ll have to change the ordinal utility functions to range over lotteries rather than simply outcomes.
In this initial version, I am just assuming the ontology of the world is given, perhaps from an oracle or from the world model the AI has inferred.
Previous Intro: Formal Metaethics and Metasemantics for AI Alignment
I’m nearing the completion of a hopefully much more readable version of the ideas previously released as set-theoretic code. This takes the form of a detailed outline, currently in WorkFlowy, in which you can easily expand/collapse subsections which elaborate on their parents’ content. You can find the current draft here.
Although it’s not polished, I’m releasing it in preparation for a Q&A I’ll be holding at the University of California, Berkeley AI and Philosophy working group, which I hope you will attend. I’ll likely make some brief introductory remarks but reserve most of the time for answering questions. The working group is part of the UC Berkeley Social Science Matrix and will be held at:
Barrows Hall, 8th Floor, Mezzanine Level
Wed, Dec 4th 12:30-2:30pm
(only the first hour is reserved for this Q&A)
Here I’ve reproduced just the first few levels of the outline. Click here to see their elaboration (currently ~4,800 words).