I've mentioned in two posts (and previously in several comments) that I'm excited about predictive coding, specifically the idea that the human brain either is or can be modeled as a hierarchical system of (negative feedback) control systems that try to minimize error in predicting their inputs, with some strong (possibly un-updatable) prediction set points (priors). I'm excited because I believe that this approach describes a wide range of human behavior, including subjective mental experience, better than any other theory of how the mind works, that it's compatible with many other theories of brain and mind, and that it may give us an adequate way to ground human values precisely enough to be useful in AI alignment.
A predictive coding theory of human values
My general theory of how to ground human values in minimization of prediction error is simple and straightforward (a toy code sketch follows the list):
- Neurons form hierarchical control systems.
- cf. a grounding of phenomenological idealism using control systems, and its implications
- cf. the hierarchy is recursive and reasserts itself at higher levels of organization
- Those control systems aim to minimize prediction error via negative feedback (homeostatic) loops.
- The positive signal of the control system occurs when prediction error is minimized; the negative signal of the control system occurs when prediction error is maximized.
- There is also a neutral signal when there is insufficient information to activate the positive or negative signal "circuitry".
- cf. feeling/sensation is when the mind makes a determination about sense data, and sensations are positive, negative, or neutral
- "Good", "bad", and "neutral" are then terms given to describe the experience of these positive, negative, and neutral control signals, respectively, as they move up the hierarchy.
I've thought about this for a while, so I have a fairly robust sense of how it works that lets me check it against a wide variety of situations, but I doubt I've conveyed that sense to you yet. I think it will help if I give some examples of what this theory predicts happens in various situations, and how that accounts for the behavior people observe and report in themselves and others.
- Mixed emotions/feelings are the result of a literal mix of different control systems under the same hierarchy receiving positive and negative signals as a result of producing less or more prediction error.
- Hard-to-predict people are perceived as creepy or, stated with less nuance, bad.
- Familiar things feel good by definition: they are easy to predict.
- Similarly, there's a feeling of loss (bad) when familiar things change.
- Mental illnesses result from failures of neurons to set good/bad thresholds appropriately, from failures to update set points at an appropriate rate to match current rather than old circumstances, and from sensory input issues that cause either prediction error or internally correct predictions that are poorly correlated with reality (where sensory input broadly includes sight, sound, smell, taste, and touch as well as mental inputs from long-term memory, short-term memory, and other neurons).
- Desire and aversion are what it feels like to notice prediction error is high and for the brain to take actions it predicts will lower it either by something happening (seeing sensory input) or not happening (not seeing sensory input), respectively.
- Good and bad feel like natural categories because they are, but ones that are the result of a brain interacting with the world rather than features of the externally observed world.
- Etc.
Further exploration of these kinds of cases will help in verifying the theory, by checking whether adequate and straightforward applications of it can explain various phenomena (I view it as being in a similar epistemic state to evolutionary psychology, including the threat of misleading ourselves with just-so stories). It does to some extent hinge on questions I'm not situated to evaluate experimentally myself, especially whether the brain actually implements hierarchical control systems of the type described, but I'm willing to move forward because, even if the brain is not literally made of hierarchical control systems, the theory appears to model what the brain does well enough that whatever theory replaces it will also have to be compatible with many of its predictions. Hence I think we can use it as a provisional grounding, even as we keep an eye out for ways in which it may turn out to be an abstraction we have to reconsider in light of future evidence, and work we do based on it should be amenable to translation into whatever new, more fundamental grounding we may discover in the future.
Relation to AI alignment
So that's the theory. How does it relate to AI alignment?
First note that this theory is naturally a foundation of axiology, or the study of values, and by extension a foundation for the study of ethics, to the extent that ethics is about reasoning about how agents, each with their own (possibly identical) values, interact. This is relevant for reasons I and more recently Stuart Armstrong have explored:
- What it would mean for an AI to be aligned is currently only defined in natural language. We don't know how to make a precise specification of what AI alignment means without presupposing a solution that makes many assumptions.
- Since alignment is alignment with human values, we must understand human values well enough that we can ground the definitions so they don't come apart under optimization, and well enough that we can verify whether we have achieved alignment using understanding that is not given to us by the AI (which would be suspect if we had in fact failed to achieve alignment).
- Further, since we can neither resolve value conflicts nor even learn human values without making normative assumptions, however weak they may be, we would do well to find assumptions that are rooted in things that are unlikely to create x-risks/s-risks because they are strongly correlated with what we observe at the most basic physical level and are unlikely to be the result of perceptual biases (anthropic bias, cultural bias, etc.) that may lock us into outcomes we would, under reflection, disprefer.
Stuart has been exploring one approach by grounding human values in an improvement on the abstraction for human values used in inverse reinforcement learning, which I think of as a behavioral economics theory of human values. My main objection to this approach is that it is behaviorist: it appears to me to be grounded in what other agents can observe of external human behavior, and it has to infer the internal states of agents across a large inferential gap, with true values being a kind of hidden, encapsulated variable that an agent learns about via observed behavior. To be fair, this has proven an extremely useful approach over the past 100 years or so in a variety of fields, but it also suffers from an epistemic problem: it requires lots of inference to determine values, and I believe this makes it a poor choice given the magnitude of the Goodharting effects we expect to be at risk from under superintelligence levels of optimization.
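To make the contrast concrete, here is a minimal sketch of the kind of inference the behaviorist framing commits us to: values are a hidden parameter estimated from observed choices under a noisy-rationality model. The foods, the Boltzmann likelihood, and the rationality parameter are my own illustrative assumptions, not Stuart's actual formalism; the point is only that every conclusion about values passes through an inference step, including a normative assumption (here the rationality parameter `beta`):

```python
# Toy sketch of the behaviorist framing: values are a hidden variable that an
# observer infers from behavior, assuming the agent is noisily (Boltzmann)
# rational. The data and model are illustrative, not actual IRL machinery.

import math

options = ["salad", "pizza"]
observed_choices = ["pizza", "pizza", "salad", "pizza"]

def likelihood(value_of_pizza: float, beta: float = 1.0) -> float:
    """P(observed choices | hypothesized value), with salad's value fixed at 0.

    beta encodes a rationality assumption -- a normative assumption baked
    into the inference.
    """
    values = {"salad": 0.0, "pizza": value_of_pizza}
    total = 1.0
    for choice in observed_choices:
        z = sum(math.exp(beta * values[o]) for o in options)
        total *= math.exp(beta * values[choice]) / z
    return total

# The observer never sees the value directly, only behavior, so it searches
# over hypotheses about the hidden value and keeps the most likely one.
hypotheses = [i / 10 for i in range(-20, 21)]
best = max(hypotheses, key=likelihood)
print(f"inferred value of pizza relative to salad: {best:+.1f}")
```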
In comparison, I view a predictive-coding-like theory of human values as offering a much better method of grounding human preferences. It:
- is parsimonious: the behavioral economics approach allows comparatively complicated value specifications and requires many modifications to reflect the wide variety of observed human behavior, whereas this theory lets values be specified in simple terms that become complex through recursive application of the same basic mechanism;
- requires little inference: if the theory is totally right, only the inference involved in measuring neuron activity creates room for epistemic error within the model;
- captures internal state: true values/internal state are assessed as directly as possible rather than inferred from behavior;
- is broad: it works for both rational and non-rational agents without modification;
- is flexible: even if the control theory model is wrong, the general "Bayesian brain" approach is probably right enough for us to make useful progress over what is possible with a behaviorist approach, such that we could translate work that assumes predictive coding into another, better model.
Thus I am quite excited about the possibility that a predictive coding approach may allow us to ground human values precisely enough to enable successfully aligning AI with human values.
This is a first attempt to explain what has been my "big idea" for the last year or so now that it has finally come together enough in my head that I'm confident presenting it, so I very much welcome feedback, questions, and comments that may help us move towards a more complete evaluation and exploration of this idea.
I agree that
However, I don't agree that we should think of values as being predictable from the concept of minimizing prediction error.
The tone of the following is a bit more adversarial than I'd like; sorry for that. My attitude toward predictive processing comes from repeated attempts to see why people like it, and all the reasons seeming to fall flat to me. If you respond, I'm curious about your reaction to these points, but it may be more useful for you to give the positive reasons why you think your position is true (or even just why it would be appealing), particularly if they're unrelated to what I'm about to say.
Evolved Agents Probably Don't Minimize Prediction Error
If we look at the field of reinforcement learning, it appears to be generally useful to add intrinsic motivation for exploration to an agent. This is the exact opposite of rewarding predictability: in one case we add reward for entering unpredictable states, whereas in the other case we add reward for entering predictable states. I've seen people try to defend minimizing prediction error by showing that the agent is still motivated to learn (in order to figure out how to avoid unpredictability). However, the fact remains: it is still motivated to learn strictly less than an unpredictability-loving agent. RL has, in practice, found it useful to add reward for unpredictability; this suggests that evolution might have done the same, and that it would not have done the exact opposite. Agents operating under a prediction-error penalty would likely under-explore.
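To make the sign difference concrete, here's a toy two-armed bandit in Python (the bonus form, parameters, and arm payoffs are all made up for illustration, not taken from any particular RL paper): one agent adds its prediction error to the reward, the way curiosity-style intrinsic motivation does, while the other subtracts it, the way a prediction-error-minimizing agent would. The first keeps visiting the noisy arm; the second avoids it and explores less:

```python
# Toy illustration of the sign difference: arm 0 pays a fixed 1.0, arm 1 pays
# a noisy (hard-to-predict) 1.0 on average. A "curiosity" agent adds its
# prediction error to the reward; a "predictability" agent subtracts it.
# Everything here is an illustrative assumption, not a standard algorithm.

import random

def pull(arm: int) -> float:
    return 1.0 if arm == 0 else random.gauss(1.0, 2.0)

def run(error_bonus_sign: float, steps: int = 2000) -> list[int]:
    random.seed(0)             # same noise stream for both agents
    predictions = [0.0, 0.0]   # per-arm reward predictions
    values = [0.0, 0.0]        # per-arm estimates of shaped (reward + bonus) value
    counts = [0, 0]
    for _ in range(steps):
        # epsilon-greedy on the shaped value estimates
        arm = random.randrange(2) if random.random() < 0.1 else values.index(max(values))
        reward = pull(arm)
        error = abs(reward - predictions[arm])
        shaped = reward + error_bonus_sign * error   # + rewards surprise, - penalizes it
        counts[arm] += 1
        predictions[arm] += (reward - predictions[arm]) / counts[arm]
        values[arm] += (shaped - values[arm]) / counts[arm]
    return counts

print("curiosity agent, visits per arm:      ", run(+1.0))
print("predictability agent, visits per arm: ", run(-1.0))
```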
It's Easy to Overestimate The Degree to which Agents Minimize Prediction Error
I often enjoy variety -- in food, television, etc -- and observe other humans doing so. Naively, it seems like humans sometimes prefer predictability and sometimes prefer variety.
However: any learning agent, almost no matter its values, will tend to look like it is seeking predictability once it has learned its environment well. It is taking actions it has taken before, and steering toward environmental states similar to the ones it always steers for. So one could understandably reach the conclusion that it is predictability itself which the agent likes.
In other words: if I seem to eat the same foods quite often (despite claiming to like variety), you might conclude that I like familiarity when it's actually just that I like what I like. I've found a set of foods which I particularly enjoy (which I can rotate between for the sake of variety). That doesn't mean it is familiarity itself which I enjoy.
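As a toy illustration of how easy the mistake is to make (the foods, tastes, and learning rule here are all made up): an agent whose reward depends only on fixed tastes, with familiarity nowhere in its objective, still ends up mostly re-eating the same few things once it has learned its environment, so an observer tallying its meals could easily conclude it values familiarity itself:

```python
# An agent with fixed tastes (familiarity plays no role in its reward) still
# looks familiarity-seeking once it has learned which foods it likes.
# Foods, tastes, and the learning rule are illustrative assumptions.

import random
from collections import Counter

random.seed(0)

tastes = {"ramen": 0.9, "tacos": 0.8, "gruel": 0.1, "plain rice": 0.2}
estimates = {food: 1.0 for food in tastes}   # optimistic initial taste estimates
history = []

for t in range(500):
    if random.random() < 0.02:                     # occasional random exploration
        food = random.choice(list(tastes))
    else:                                          # otherwise pick the best estimate
        food = max(estimates, key=estimates.get)
    reward = tastes[food] + random.gauss(0, 0.05)  # taste-based reward only
    estimates[food] += 0.1 * (reward - estimates[food])
    history.append(food)

# Early behavior looks variety-seeking; late behavior looks familiarity-seeking,
# even though familiarity never enters the reward.
print("first 100 meals:", Counter(history[:100]))
print("last 100 meals: ", Counter(history[-100:]))
```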
I'm not denying that mere familiarity has some positive valence for humans; I'm just saying that for arbitrary agents, it seems easy to over-estimate the importance of familiarity in their values, so we should be a bit suspicious about it for humans too. And I'm saying that it seems like humans enjoy surprises sometimes, and there's evolutionary/machine-learning reasoning to explain why this might be the case.
We Need To Explain Why Humans Differentiate Goals and Beliefs, Not Just Why We Conflate Them
You mention that good/bad seem like natural categories. I agree that people often seem to mix up "should" and "probably is", "good" and "normal", "bad" and "weird", etc. These observations in themselves speak in favor of the minimize-prediction-error theory of values.
However, we also differentiate these concepts at other times. Why is that? Is it some kind of mistake? Or is the conflation of the two the mistake?
I think the mix-up between the two is partly explained by the effect I mentioned earlier: common practice is optimized to be good, so there will be a tendency for commonality and goodness to correlate. So, it's sensible to cluster them together mentally, which can result in them getting confused. There's likely another aspect as well, which has something to do with social enforcement (i.e., people are strategically conflating the two some of the time?) -- but I'm not sure exactly how that works.
Ah, I guess I don't expect it to end up ignoring the parts of the network that can't learn, because I don't think error minimization, learning, or anything else is a top-level goal of the network. That is, there are only low-level control systems interacting, and parts of the network avoid being ignored by being more powerful in various ways, probably by being positioned in the network such that they have more influence on behavior than the parts that perform Bayesian learning.