In the Pointers Problem, John Wentworth points out that human values are a function of latent variables in the world models of humans.
My question is about a specific kind of latent variable, one that's restricted to be downstream of what we can observe in terms of causality. Suppose we take a world model in the form of a causal network and split its variables into ones that are upstream of observable variables (in the sense that there's some directed path going from the variable to something we can observe) and ones that aren't. Say that the variables that have no causal impact on what is observed are "latent" for the purposes of this post; in other words, latent variables are in some sense "epiphenomenal". This definition of "latent" is narrower than usual, but I think the distinction between causally relevant and causally irrelevant hidden variables is quite important, and I'll only be focusing on the causally irrelevant ones in this question.
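To make that split concrete, here's a minimal sketch in Python of classifying the nodes of a causal DAG this way; the DAG, the variable names ("genes", "behavior", "pain", ...), and the choice of observables are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy causal DAG; edges point from cause to effect.
# The variable names are made up purely for illustration.
edges = [
    ("genes", "neural_state"),
    ("neural_state", "behavior"),   # behavior is observable
    ("neural_state", "pain"),       # pain has no directed path to any observable
    ("weather", "behavior"),
]
observables = {"behavior"}

# Walk backwards from the observables: everything we reach is upstream of
# something observable, i.e. causally relevant to what we can see.
parents_of = defaultdict(set)
for cause, effect in edges:
    parents_of[effect].add(cause)

relevant = set(observables)
frontier = list(observables)
while frontier:
    node = frontier.pop()
    for parent in parents_of[node]:
        if parent not in relevant:
            relevant.add(parent)
            frontier.append(parent)

# Whatever is left over is "latent" in the narrow, epiphenomenal sense above.
all_nodes = {node for edge in edges for node in edge}
print(sorted(all_nodes - relevant))   # -> ['pain']
```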
In principle, we can always unroll any latent variable model into a path-dependent model with no hidden variables. For example, if we have an (inverted) hidden Markov model with one observable state $x_t$ and one hidden state $z_t$ (subscripts denote time), the causal graph of the "true model" of the world (not the human's world model, but the "correct model" which characterizes what happens to the variables the human can observe) looks like a chain $x_0 \to x_1 \to x_2 \to \cdots$ over the observables, with each latent $z_t$ receiving arrows from $x_t$ and from $z_{t-1}$.
Here the $z_t$ are causally irrelevant latent variables: they have no impact on the observable state of the world, but for some reason or another humans care about what they are. For example, if a sufficiently high-capacity model renders "pain" a causally obsolete concept, then pain would qualify as a latent variable in the context of this model.
The latent variable $z_t$ at time $t$ depends directly on both $x_t$ and $z_{t-1}$, so to accurately figure out the probability distribution of $z_t$ we need to know the whole trajectory of the world from the initial time: $P(z_t \mid x_0, x_1, \ldots, x_t)$.
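To make this path-dependence concrete, here is a toy numerical sketch of such an inverted HMM; all the probabilities are invented, and the point is just that two observed trajectories ending in the same state $x_T$ can give different distributions over $z_T$.

```python
import numpy as np

# Toy "inverted" HMM: binary observables x_t, binary latents z_t, where
# P(z_t | x_t, z_{t-1}) is given by the (invented) table below.
P_Z = np.zeros((2, 2, 2))   # indexed as [x_t, z_{t-1}, z_t]
P_Z[0, 0] = [0.95, 0.05]
P_Z[0, 1] = [0.30, 0.70]
P_Z[1, 0] = [0.60, 0.40]
P_Z[1, 1] = [0.05, 0.95]

def latent_dist(xs, z_prior=np.array([1.0, 0.0])):
    """P(z_T | x_0, ..., x_T): a forward recursion over the whole observed path."""
    dist = z_prior
    for x in xs:
        dist = dist @ P_Z[x]   # marginalize out z_{t-1}
    return dist

# Two observed trajectories that end in the same state x_T = 1:
path_a = [0, 0, 0, 0, 1]
path_b = [1, 1, 1, 1, 1]
print(latent_dist(path_a))   # ~[0.54, 0.46]
print(latent_dist(path_b))   # ~[0.16, 0.84] -- same final x_T, different z_T
```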
We can imagine, however, that even if human values depend on latent variables, these variables don't feed into each other (i.e. $z_t$ doesn't depend on $z_{t-1}$). In this case, how much we value some state of the world would just be a function of that state of the world itself: we'd only need to know $x_t$ to figure out what $z_t$ is. This naturally raises the question I ask in the title: empirically, what do we know about the role of path-dependence in human values?
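Continuing the toy sketch above: if each $z_t$ depends only on $x_t$, the forward recursion collapses and only the current observable matters (again, the numbers are invented).

```python
# Path-independent variant of the sketch above: z_t depends only on x_t.
P_Z_INDEP = np.array([[0.95, 0.05],   # P(z_t | x_t = 0)
                      [0.05, 0.95]])  # P(z_t | x_t = 1)

def latent_dist_indep(xs):
    return P_Z_INDEP[xs[-1]]          # only the final observable x_T matters

print(latent_dist_indep(path_a))      # [0.05, 0.95]
print(latent_dist_indep(path_b))      # [0.05, 0.95] -- identical, by construction
```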
I think path-dependence comes up often in how humans handle the problem of identity. For example, suppose it were possible to clone a person perfectly and then remove the original from existence through whatever means. Even if the resulting states of the world were identical, humans whose mental models contain different trajectories of how we got there could evaluate questions of identity in the present differently. Whether I'm me or not depends on more information than my current physical state, or even the world's current physical state.
This looks important for alignment purposes because there's a natural sense in which path-dependence is an undesirable property for a world model to have: you have to carry information about the entire history around rather than just the current state. If an AI doesn't have path-dependence as an internal concept, it could be simpler for it to learn a strategy of "trick the people who believe in path-dependence into thinking the history that got us here was good" rather than "actually try to optimize for whatever their values are".
With all that said, I'm interested in what other people think about this question. To what extent are human values path-dependent, and to what extent do you think they should be path-dependent? Both general thoughts & comments and concrete examples of situations where humans care about path-dependence are welcome.
Nitpick: this is not strictly correct. This would be the internal energy of a thermodynamic system, but "heat" in thermodynamics refers to energy that's exchanged between systems, not energy that's in a system.
Aside from the nitpick, however, point taken.
I think there is a general problem with these path-dependent concepts: the ideal version of the concept might be path-dependent, but in practice we can only keep track of what the path used to be using information stored in the current physical state. It's analogous to how an idealized version of personal identity might require a continuous stream of gradually changing agents and so on, but in practice all we have to go on is what memories people have about how things used to be.
For example, in Lockean property rights theory, "who is the rightful owner of a house" is a path-dependent question. You need to trace the entire history of the house in order to figure out who should own it right now. However, in practice we have to implement property rights by storing some information about the ownership of the house in the current physical state.
If you then train an AI to understand the ownership relation, and it learns the relation we have actually implemented rather than the idealized version we have in mind, it can come to think that what we really care about is who is "recorded" as the owner of a house in the current physical state rather than who is "legitimately" the owner of the house. In extreme cases that can lead it to take some bizarre actions when you ask it to optimize something that has to do with the concept of property rights.
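Here's a minimal sketch of that contrast, with invented names and nothing resembling a serious model of property law: one function reads ownership straight off the current state, the other replays the transfer history and ignores illegitimate transfers.

```python
from dataclasses import dataclass, field

@dataclass
class House:
    recorded_owner: str                                    # what the current state stores
    transfer_history: list = field(default_factory=list)   # (old, new, legitimate?) tuples

def recorded_owner_of(house: House) -> str:
    # Path-independent: read the answer straight off the current physical state.
    return house.recorded_owner

def lockean_owner_of(house: House, original_owner: str) -> str:
    # Path-dependent: replay the whole history, honoring only legitimate transfers.
    owner = original_owner
    for old, new, legitimate in house.transfer_history:
        if legitimate and old == owner:
            owner = new
    return owner

# A house whose record was changed by a fraudulent transfer:
house = House(recorded_owner="Bob",
              transfer_history=[("Alice", "Bob", False)])
print(recorded_owner_of(house))           # "Bob"
print(lockean_owner_of(house, "Alice"))   # "Alice"
```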
In the end, I think it comes down to which way of doing it takes up less complexity, or fewer bits of information, in whatever representation the AI is using to encode these relations. If path-dependent concepts are naturally more complicated for the AI to wrap its head around, SGD can find something that's path-independent and fits the training data perfectly, and then you could be in trouble. This is a general story about alignment failure, but if we decide we really care about path-dependence then it's also a concept we'll want to get the AI to care about somehow.