Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by Max Harms. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

 Here are my current thoughts on "human values." There are a decent number of confusions here, which I'll try to flag either explicitly or with a (?).


Let's start with a distribution over possible worlds, where we can split each world into a fixed past and a future function which takes an action.[1] We also need a policy, which is a sensors -> action function,[2] where the state of the sensors is drawn from the world's past.[3]
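To make the types concrete, here is a minimal sketch of the setup in Python. The weather example and all names (`World`, `simple_policy`, `realized_future`) are my own illustrative assumptions; for simplicity it assumes each world has one fixed sensor state and a deterministic future function.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Action = str
Sensors = str
Future = str

@dataclass(frozen=True)
class World:
    past: str                            # the fixed past of this world
    sensors: Sensors                     # sensor state drawn from that past (assumed fixed)
    future: Callable[[Action], Future]   # future function: action -> future

# A policy maps a sensor state to an action.
Policy = Callable[[Sensors], Action]

# Two hypothetical worlds, purely for illustration.
rainy = World(
    past="clouds gathered overnight",
    sensors="dark sky",
    future=lambda a: "dry" if a == "umbrella" else "wet",
)
sunny = World(
    past="clear morning",
    sensors="bright sky",
    future=lambda a: "encumbered" if a == "umbrella" else "comfortable",
)

# A distribution over possible worlds as (probability, world) pairs.
world_distribution: List[Tuple[float, World]] = [(0.3, rainy), (0.7, sunny)]

def simple_policy(sensors: Sensors) -> Action:
    """A toy sensors -> action function."""
    return "umbrella" if sensors == "dark sky" else "no umbrella"

def realized_future(world: World, policy: Policy) -> Future:
    """Feed the world's sensor state to the policy, then apply the future function."""
    return world.future(policy(world.sensors))
```

Calling `realized_future(rainy, simple_policy)` runs the whole pipeline for one world: sensors come from the past, the policy picks an action, and the future function turns that action into a future.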

Assume either that there exists an obvious channel in many worlds that serves as a source of neutral[4] information (i.e. helpful for identifying which world the sensor data was drawn from, but "otherwise unimportant in itself"(?)), or that we can modify the actual worlds/context to add this information pathway.

We can now see how the behavior of the policy changes as we increase how informed it is, including possibly at the limit of perfect information. For some policies we should be able to (:confused arm wiggles:) factor out a world modeling step from the policy, which builds a distribution over worlds by updating on the state of the sensors, and then feeds that distribution to a second sub-function with type world distribution -> action. (We can imagine an idealized policy that, in the limit of perfect information, is able to form a delta-spike on the specific world that its sensor-state was drawn from.) For any given delta-spike on a particular world, we can say that the action this sub-function chooses gives rise to an overall preference for the particular future[5] selected over the other possible futures. If the overall preferences conform to the VNM axioms we say that the sub-function is a utility function. Relevant features of the world that contribute to high utility scores are "values."
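The factoring described above can be sketched roughly as follows. This is a toy under loud assumptions: worlds are just labels, the world modeler is a plain Bayesian update over a finite set, and the second sub-function is assumed to be an expected-utility maximizer (which is what VNM-conforming preferences would license). The tables `PRIOR`, `LIKELIHOOD`, `FUTURES`, and `UTILITY` are made up for illustration, and utility here depends only on the future, ignoring the past-dependence flagged in footnote 5.

```python
from typing import Dict, List

# Hypothetical toy data; none of these numbers come from the post.
PRIOR: Dict[str, float] = {"rainy": 0.3, "sunny": 0.7}
LIKELIHOOD: Dict[str, Dict[str, float]] = {   # P(sensor state | world)
    "rainy": {"dark sky": 0.8, "bright sky": 0.2},
    "sunny": {"dark sky": 0.1, "bright sky": 0.9},
}
FUTURES: Dict[str, Dict[str, str]] = {        # future function per world
    "rainy": {"umbrella": "dry", "no umbrella": "wet"},
    "sunny": {"umbrella": "encumbered", "no umbrella": "comfortable"},
}
UTILITY: Dict[str, float] = {"dry": 1.0, "wet": -2.0, "encumbered": 0.5, "comfortable": 1.0}
ACTIONS: List[str] = ["umbrella", "no umbrella"]

def world_model(sensors: str) -> Dict[str, float]:
    """World modeling step: update the prior over worlds on the sensor state."""
    unnormalized = {w: p * LIKELIHOOD[w].get(sensors, 0.0) for w, p in PRIOR.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("sensor state has zero probability under every world")
    return {w: p / total for w, p in unnormalized.items()}

def choose_action(belief: Dict[str, float]) -> str:
    """Second sub-function (world distribution -> action), assumed to maximize expected utility."""
    def expected_utility(action: str) -> float:
        return sum(p * UTILITY[FUTURES[w][action]] for w, p in belief.items())
    return max(ACTIONS, key=expected_utility)

def policy(sensors: str) -> str:
    """The full policy is the composition of the two steps."""
    return choose_action(world_model(sensors))

# A delta-spike is a belief that puts all probability on one world.
delta_on_sunny = {"rainy": 0.0, "sunny": 1.0}
```

Feeding `delta_on_sunny` to `choose_action` shows which future the sub-function selects when it knows exactly which world it is in, which is the revealed preference the paragraph above is pointing at.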

I think it makes sense to use the word "agent" to refer to policies which can be decomposed into world modelers and utility functions. I also think it makes sense to be a bit less strict in conversation and say that policies which are "almost"(?) able to be decomposed in this way are basically still agents, albeit perhaps less centrally so.

Much of this semi-formalism comes from noticing a subjective division within myself and some of the AIs I've made where it seems natural to say that "this part of the agent is modeling the world" and "this part of the agent is optimizing X according to the world model." Even though the abstractions seem imperfect, they feel like a good way of gesturing at the structure of my messy sense of how individual humans work. I am almost certainly incoherent in some ways, and I am confused about how to rescue the notion of values/utility given that incoherence, but I have a sense that "he's mostly coherent" can give rise to "he more-or-less values X."


Two agents can either operate independently or cooperate for some surplus. Ideally there's a unique way to fairly split the surplus, perhaps using lotteries or some shared currency which they can use to establish units of utility. It seems obvious to me that there are many cooperative arrangements that are decidedly unfair, but I'm pretty confused about whether it's always possible to establish a fair split (even without lotteries? even without side-payments?) and whether there's an objective and unique Schelling point for cooperation.

If there is a unique solution, it seems reasonable to me to, given a group of agents, consider the meta-agent that would be formed if each agent committed fully to engaging in fair cooperation. This meta-agent's action would essentially be an element of the cartesian product of each agent's action space. In the human context, this story gives rise to a hypothetical set of "human values" which capture the kinds of things that humans optimize for when cooperating.
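As a small illustration of the meta-agent's action space, here is a sketch that builds the cartesian product of per-agent action spaces. The agents, actions, and payoffs are hypothetical, and `placeholder_score` is only a stand-in for whatever the fair-cooperation rule turns out to be; I am not claiming that summing payoffs is the fair rule.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

def joint_action_space(action_spaces: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Every element assigns one action to each agent (the cartesian product)."""
    names = sorted(action_spaces)
    return [
        dict(zip(names, combo))
        for combo in product(*(action_spaces[name] for name in names))
    ]

def meta_agent_action(
    action_spaces: Dict[str, List[str]],
    score: Callable[[Dict[str, str]], float],
) -> Dict[str, str]:
    """Pick the joint action that the supplied cooperation rule scores highest."""
    return max(joint_action_space(action_spaces), key=score)

# Hypothetical two-agent example.
ACTION_SPACES = {"alice": ["hunt", "gather"], "bob": ["cook", "build"]}
PAYOFFS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("hunt", "cook"): (3.0, 3.0),
    ("hunt", "build"): (1.0, 2.0),
    ("gather", "cook"): (2.0, 1.0),
    ("gather", "build"): (2.0, 2.0),
}

def placeholder_score(joint: Dict[str, str]) -> float:
    """Stand-in aggregation (sum of payoffs); NOT asserted to be the fair split."""
    return sum(PAYOFFS[(joint["alice"], joint["bob"])])

chosen = meta_agent_action(ACTION_SPACES, placeholder_score)
```

The point of the sketch is only the type: the meta-agent's action is one entry per agent, and everything interesting is hidden inside the scoring rule.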

This seems a bit limited, since it neglects things that real humans optimize for that are part of establishing cooperation (e.g. justice). Does it really make sense to say that justice isn't a value of human societies because in the fully-cooperative context it's unnecessary to take justice-affirming actions? (??)


Even when considering a single agent, we can consider the coalition of that agent's time-slices(?). Like, if we consider Max at t=0 and Max at t=1 as distinct agents, we can consider how they'd behave if they were cooperative with each other. This frame brings in the confusions and complications from group-action, but it also introduces issues such as the nature of future-instances being dependent on past-actions. I have a sense that I only need to cooperate with real-futures, and am free to ignore the desires of unreal-counterfactuals, even if my past/present actions are deciding which futures are real. This almost certainly introduces some fixed-point shenanigans where unrealizing a future is uncooperative with that future but cooperative with the future that becomes realized, and I feel quite uncertain here. More generally, there's the whole logical-connective stuff from FDT/TDT/UDT.

I currently suspect that if we get a good theory of how to handle partial-coherence, how to handle multi-agent aggregation, and how to handle intertemporal aggregation, then "human values" will shake out to be something like "the mostly-coherent aggregate of all humans that currently exist, and all intertemporal copies of that aggregate," but I might be deeply wrong. :confused wiggles:

  1. ^

    The future function either returns a single future state or a distribution over future states. It doesn't really matter since we can refactor the uncertainty from the distribution over futures into the distribution over worlds.

  2. ^

    "sensors" is meant to include things like working memories and other introspection.

  3. ^

    Similarly to the distribution over futures, we can either have a distribution over contexts given a past, or we can have a fixed context for a given past and pack the uncertainty into our world distribution. See also anthropics and "bridge laws" and related confusions.

  4. ^

    Confusion alert! Sometimes a source of information contains a bias where it's selected for steering someone who's listening. I don't know how to prove an information channel doesn't have this property, but I do have a sense that neutrality is the default, so I can assume it here without too much trouble.

  5. ^

    ...in the context of that particular past! Sometimes the future by itself doesn't have all the relevant info (e.g. optimizing for the future matching the past).
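The refactoring mentioned in footnote 1 (moving uncertainty out of the future function and into the distribution over worlds) can be sketched like this. It assumes finite action and outcome sets, and it assumes the outcomes for different actions are coupled independently; that coupling is my choice for illustration, and any coupling with the same per-action marginals would serve equally well when only one action is ever taken. The function name and the coin example are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

def determinize(
    world_prob: float,
    stochastic_future: Dict[str, Dict[str, float]],
) -> List[Tuple[float, Dict[str, str]]]:
    """Split one world with a stochastic future into weighted deterministic worlds.

    stochastic_future[action] is a distribution over futures. Each returned
    world fixes one future per action; its weight is the original world's
    probability times the probability of each fixed outcome.
    """
    actions = sorted(stochastic_future)
    per_action = [list(stochastic_future[a].items()) for a in actions]
    worlds = []
    for combo in product(*per_action):
        weight = world_prob
        future_table = {}
        for action, (future, prob) in zip(actions, combo):
            weight *= prob
            future_table[action] = future
        worlds.append((weight, future_table))
    return worlds

# Hypothetical example: a world with probability 0.5 where flipping a coin is random.
coin_world = {
    "flip": {"heads": 0.5, "tails": 0.5},
    "wait": {"nothing": 1.0},
}
deterministic_worlds = determinize(0.5, coin_world)
```

Each resulting world has an ordinary deterministic future function (here a lookup table), and the weights sum back to the original world's probability, so the per-action distribution over futures is preserved.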