I've mentioned in two posts (and previously in several comments) that I'm excited about predictive coding, specifically the idea that the human brain either is or can be modeled as a hierarchical system of (negative feedback) control systems that try to minimize error in predicting their inputs, relative to some strong (possibly un-updatable) prediction set points (priors). I'm excited because I believe this approach describes a wide range of human behavior, including subjective mental experience, better than any other theory of how the mind works; because it's compatible with many other theories of brain and mind; and because it may give us a way to ground human values precisely enough to be useful in AI alignment.
A predictive coding theory of human values
My general theory of how to ground human values in the minimization of prediction error is simple and straightforward (a toy code sketch follows the list):
- Neurons form hierarchical control systems.
- cf. a grounding of phenomenological idealism using control systems, and its implications
- cf. the hierarchy is recursive and reasserts itself at higher levels of organization
- Those control systems aim to minimize prediction error via negative feedback (homeostatic) loops.
- The positive signal of the control system occurs when prediction error is minimized; the negative signal of the control system occurs when prediction error is maximized.
- There is also a neutral signal when there is insufficient information to activate the positive or negative signal "circuitry".
- cf. feeling/sensation is when the mind makes a determination about sense data, and sensations are positive, negative, or neutral
- "Good", "bad", and "neutral" are then terms given to describe the experience of these positive, negative, and neutral control signals, respectively, as they move up the hierarchy.
I've thought about this for a while, so I have a fairly robust sense of how it works that lets me check it against a wide variety of situations, but I doubt I've conveyed that sense yet. I think it will help if I give some examples of what this theory predicts in various situations and how that accounts for the behavior people observe and report in themselves and others.
- Mixed emotions/feelings are a literal mix of different control systems under the same hierarchy sending positive and negative signals as they produce less or more prediction error.
- Hard-to-predict people are perceived as creepy or, stated with less nuance, bad.
- Familiar things feel good by definition: they are easy to predict.
- Similarly, there's a feeling of loss (bad) when familiar things change.
- Mental illnesses result from neurons failing to set good/bad thresholds appropriately, failing to update set points at an appropriate rate to match current rather than old circumstances, or from sensory-input issues that cause either prediction error or internally correct predictions poorly correlated with reality (where "input" broadly includes sight, sound, smell, taste, and touch as well as mental inputs from long-term memory, short-term memory, and other neurons).
- Desire and aversion are what it feels like to notice that prediction error is high and for the brain to take actions it predicts will lower it, either by making something happen (seeing the expected sensory input) or by keeping something from happening (not seeing that sensory input), respectively (see the toy numbers after this list).
- Good and bad feel like natural categories because they are, but ones that are the result of a brain interacting with the world rather than features of the externally observed world.
- Etc.
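A few of the bullets above can be given toy numbers in the same spirit as the earlier sketch (again, the thresholds, learning rate, and candidate actions are arbitrary choices of mine): repeated exposure makes an input familiar and hence "good", its later disappearance produces a fresh burst of error ("loss"), and desire shows up as choosing whichever available action is predicted to leave the system closest to its set point.

```python
# Toy numbers for the "familiar things feel good", "feeling of loss", and
# "desire/aversion" bullets above; nothing here is a claim about real neurons.
set_point, lr = 0.0, 0.3      # prior that slowly tracks what is observed
good, bad = 1.0, 3.0          # valence thresholds on prediction error

def valence(error: float) -> str:
    return "positive" if error <= good else "negative" if error >= bad else "neutral"

# Familiarity: repeated exposure drags the set point toward the input, so the
# same input stops generating error and starts feeling "good".
for obs in [10.0] * 8:
    error = abs(obs - set_point)
    set_point += lr * (obs - set_point)
print(f"after habituation: error={error:.2f} -> {valence(error)}")   # positive

# Loss: when the familiar input disappears, error (and "bad") comes back.
error = abs(0.0 - set_point)
print(f"after change:      error={error:.2f} -> {valence(error)}")   # negative

# Desire/aversion: prefer whichever action is predicted to minimize error.
candidate_outcomes = {"seek familiar thing": 10.0, "do nothing": 0.0}
action = min(candidate_outcomes, key=lambda a: abs(candidate_outcomes[a] - set_point))
print(f"chosen action:     {action}")                                 # seek familiar thing
```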
Further exploration of cases like these will help verify the theory by testing whether straightforward applications of it can explain various phenomena (I view it as being in a similar epistemic state to evolutionary psychology, including the threat of misleading ourselves with just-so stories). The theory does hinge on questions I'm not positioned to evaluate experimentally myself, especially whether the brain actually implements hierarchical control systems of the type described, but I'm willing to move forward because even if the brain is not literally made of hierarchical control systems, the theory appears to model what the brain does well enough that whatever theory replaces it will also have to be compatible with many of its predictions. Hence I think we can use it as a provisional grounding, keeping an eye out for ways it may turn out to be an abstraction we have to reconsider in light of future evidence, and trusting that work built on it will be amenable to translation into whatever new, more fundamental grounding we discover later.
Relation to AI alignment
So that's the theory. How does it relate to AI alignment?
First, note that this theory naturally provides a foundation for axiology, the study of values, and by extension a foundation for the study of ethics, to the extent that ethics is about reasoning about how agents, each with their own (possibly identical) values, interact. This is relevant for reasons that I, and more recently Stuart Armstrong, have explored:
- What it would mean for an AI to be aligned is currently only defined in natural language. We don't know how to make a precise specification of what AI alignment means without presupposing a solution that makes many assumptions.
- Since alignment is alignment with human values, we must understand human values well enough both to ground the definitions so they don't come apart under optimization and to verify whether we have achieved alignment using understanding that is not given to us by the AI, since AI-supplied understanding would be suspect in the case that we failed to achieve alignment.
- Further, since we can neither resolve value conflicts nor even learn human values without making normative assumptions, however weak, we would do well to find assumptions that are unlikely to create x-risks/s-risks: assumptions strongly correlated with what we observe at the most basic physical level and unlikely to be the result of perceptual biases (anthropic bias, cultural bias, etc.) that might lock us into outcomes we would, on reflection, disprefer.
Stuart has been exploring one approach that grounds human values in an improved version of the abstraction used in inverse reinforcement learning, which I think of as a behavioral economics theory of human values. My main objection to this approach is that it is behaviorist: it is grounded in what other agents can observe of external human behavior and must infer agents' internal states across a large inferential gap, with true values treated as a kind of hidden, encapsulated variable an agent learns about via observed behavior. To be fair, this has proven an extremely useful approach over the past 100 years or so in a variety of fields, but it suffers from an epistemic problem: it requires a lot of inference to determine values, and I believe that makes it a poor choice given the magnitude of the Goodharting effects we expect to be at risk from under superintelligence levels of optimization.
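To make the objection about inference concrete, here is a toy illustration (my framing, not anything taken from Stuart's posts): quite different hidden value functions can generate identical observed choices, so a behaviorist learner has to close that gap with inference, and whatever residual uncertainty remains is exactly where Goodharting can bite under strong optimization.

```python
# Two different candidate "true value" functions over the same options...
values_a = {"apple": 1.0, "bread": 0.5, "cake": 0.0}
values_b = {"apple": 9.0, "bread": 3.0, "cake": -5.0}

def choose(values: dict[str, float]) -> str:
    """Pick the highest-valued option, as an idealized behaving agent would."""
    return max(values, key=values.get)

# ...produce the same ranking and hence the same observable choices, so an
# observer inferring values from behavior alone cannot tell them apart.
assert choose(values_a) == choose(values_b) == "apple"
```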
In comparison, I view a predictive-coding-like theory of human values as offering a much better method of grounding human preferences:
- parsimonious: the behavioral economics approach to human values allows comparatively complicated value specifications and requires many modifications to reflect a wide variety of observed human behavior, whereas this theory specifies values in simple terms that become complex through recursive application of the same basic mechanism;
- requires little inference: if the theory is right, only the inference involved in measuring neuron activity creates room for epistemic error within the model;
- captures internal state: true values/internal state is assessed as directly as possible rather than inferred from behavior;
- broad: works for both rational and non-rational agents without modification;
- flexible: even if the control theory model is wrong, the general "Bayesian brain" approach is probably right enough that we can make useful progress beyond what is possible with a behaviorist approach, and later translate work that assumes predictive coding into another, better model.
Thus I am quite excited about the possibility that a predictive coding approach may allow us to ground human values precisely enough to successfully align AI with them.
This is a first attempt to explain what has been my "big idea" for the last year or so, now that it has finally come together enough in my head that I'm confident presenting it. I very much welcome feedback, questions, and comments that may help us move towards a more complete evaluation and exploration of this idea.
I'll reply to your points soon, since I think doing that is a helpful way for me and others to explore this idea, though it might take me a little time since this is not the only thing I have to do. First, though, I'll respond to this request for something I seemingly left out.
I have two main lines of evidence that come together to make me like this theory.
One is that it's elegant, simple, and parsimonious. Control systems are simple; they look to me to be the simplest thing we might reasonably call "alive" or "conscious" if we try to redefine those terms in ways that are not anchored on our experience here on Earth. I think the reason it's so hard to answer questions about what is alive and what is conscious is that the naive categories we form and give those names are ultimately rooted in simple phenomena of information "pumping" that locally reduce entropy, yet many things that do this lie outside the historical range of what we could observe generating information and so made more sense to think of as "dead" than "alive". In a certain sense this leads me to a position you might call "cybernetic panpsychism", but that's just fancy words for saying there's nothing more special going on in the universe that makes us different from rocks and stars than (increasingly complex) control systems creating information.
Another is that it fits with a lot of my understanding of human psychology. Western psychology doesn't really get down to a level where it has a solid theory of what's going on at the lowest levels of the mind, but the Buddhist psychology of the Abhidharma does: it says that right after "contact" (stuff interacting with neurons) comes "feeling/sensing", which is claimed to always carry a signal of positive, negative, or neutral judgement. My own experience with meditation showed me something similar, so when I learned about this theory it seemed like an obviously correct way of explaining what I was experiencing. This makes me strongly believe that any theory of value we want to develop should account for this experience of valence showing up attached to every experience.
In light of this second reason, I'll add to my first that if we are looking for an origin of valence, it seems maximally parsimonious for it to be something simple a control system could do, and the simplest thing a control system can do that doesn't simply ignore its input is test how far the observed input is from a set point. If something more complex is going on, I think we'd need an explanation for why sending a signal indicating distance from a set point is not enough.
I briefly referenced these points above, but left the details behind links.
I think there are also some other lines of evidence that are less compelling to me but seem worth mentioning: