Abstract: An attempt to map a best-guess model of how human values and motivations work to several more technical research questions. The mind-model is inspired by the predictive processing / active inference framework and multi-agent models of the mind.
The text has a slightly unusual epistemic structure:
1st part: my current best-guess model of how human minds work.
2nd part: explores various problems which such mind architecture would pose for some approaches to value learning. The argument is: if such a model seems at least plausible, we should probably extend the space of active research directions.
3rd part: a list of specific research agendas, sometimes specific research questions, motivated by the previous.
I put more credence in the usefulness of research questions suggested in the third part than in the specifics of the model described in the first part. Also, you should be warned I have no formal training in cognitive neuroscience and similar fields, and it is completely possible I’m making some basic mistakes. Still, my feeling is even if the model described in the first part is wrong, something from the broad class of “motivational systems not naturally described by utility functions” is close to reality, and understanding problems from the 3rd part can be useful.
How minds work
As noted, this is a “best guess model”. I have large uncertainty about how human minds actually work. But if I could place just one bet, I would bet on this.
The model has two prerequisite ideas: predictive processing and the active inference framework. I'll give only brief summaries here; fuller treatments are available elsewhere.
In the predictive processing / active inference framework, brains constantly predict sensory inputs, in a hierarchical generative way. As a dual, action is also “generated” by the same machinery (changing the environment to match “predicted” desirable inputs and generating action which can lead to them). The “currency” on which the whole system is running is prediction error (or something in the style of free energy, in that language).
Another important ingredient is bounded rationality, i.e. a limited amount of resources being available for cognition. Indeed, the specifics of hierarchical modelling, neural architectures, and the principle of reusing and repurposing everything all seem to be related to quite brutal optimization pressure, likely related to the brain’s enormous energy consumption. (It is unclear to me if this can also be reduced to the same “currency”. Karl Friston would probably answer "yes".)
Assuming all this, how do motivations and “values” arise? The guess is, in many cases something like a “subprogram” is modelling/tracking some variable, “predicting” its desirable state, and creating the need for action by “signalling” prediction error. Note that such subprograms can work on variables on very different hierarchical layers of modelling - e.g. tracking a simple variable like “feeling hungry” vs. tracking a variable like “social status”. Such sub-systems can be large: for example tracking “social status” seems to require a lot of computation.
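The “subprogram” idea above can be made concrete with a toy sketch. This is only an illustration, not a claim about neural implementation: a hypothetical `Subprogram` class tracks one variable, holds a “predicted” desirable value (`setpoint`), and reports the gap as prediction error. The rule that the sub-system with the largest error gets to act, the `gain` parameter, and all numbers are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subprogram:
    """Toy sub-system: tracks one variable and 'predicts' its desirable state."""
    name: str
    setpoint: float    # the "predicted" desirable value (illustrative)
    observed: float    # current estimate of the tracked variable
    gain: float = 0.5  # fraction of the error removed by one action (assumed)

    def prediction_error(self) -> float:
        return abs(self.setpoint - self.observed)

    def act(self) -> None:
        # Acting changes the world so the observation moves toward the prediction.
        self.observed += self.gain * (self.setpoint - self.observed)


def step(subprograms: List[Subprogram]) -> str:
    """Let the sub-system signalling the largest prediction error drive behaviour."""
    loudest = max(subprograms, key=lambda s: s.prediction_error())
    loudest.act()
    return loudest.name


hunger = Subprogram("satiety", setpoint=1.0, observed=0.2)
status = Subprogram("social status", setpoint=0.8, observed=0.5)
history = [step([hunger, status]) for _ in range(4)]
```

In this sketch control alternates between the two sub-systems as their errors shrink, which is one simple way to picture behaviour arising from several trackers rather than from a single objective.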
How does this relate to emotions? Emotions could be quite complex processes, where some higher-level modelling (“I see a lion”) leads to a response in lower levels connected to body states, some chemicals are released, and this interoceptive sensation is re-integrated in the higher levels in the form of emotional state, eventually reaching consciousness. Note that the emotional signal from the body is more similar to “sensory” data - the guess is that body/low-level responses are a way for genes to insert a reward signal into the whole system.
How does this relate to our conscious experience, and stuff like Kahneman's System 1/System 2? It seems for most people the light of consciousness is illuminating only a tiny part of the computation, and most stuff is happening in the background. Also, S1 has much larger computing power. On the other hand it seems relatively easy to “spawn background processes” from the conscious part, and it seems possible, through specialized techniques and efforts (for example, some meditation techniques), to illuminate a larger part of the background processing than is usually visible.
Another ingredient is the observation that a big part of what the conscious self is doing is interacting with other people, and rationalizing our behaviour. (Cf. press secretary theory, elephant in the brain.) It is also quite possible the relation between acting rationally and the ability to rationalize what we did is bidirectional, and a significant part of the motivation for some rational behaviour is that it is easy to rationalize it.
Also, it seems important to appreciate that the most important part of the human “environment” is other people, and what human minds are often doing is likely simulating other human minds (even simulating how other people would be simulating someone else!).
Problems with prevailing value learning approaches
While the above sketched picture is just a best guess, it seems to me at least compelling. At the same time, there are notable points of tension between it and at least some approaches to AI alignment.
No clear distinction between goals and beliefs
In this model, it is hardly possible to disentangle “beliefs” and “motivations” (or values). “Motivations” interface with the world only via a complex machinery of hierarchical generative models containing all other sorts of “beliefs”.
To appreciate the problems for the value learning program, consider the case of someone whose predictive/generative model strongly predicts failure and suffering. Such a person may take actions which actually lead to this outcome, minimizing the prediction error.
A less extreme but also important problem is that extrapolating “values” outside of the area of validity of generative models is problematic and could be fundamentally ill-defined. (This is related to “ontological crisis”.)
No clear self-alignment
It seems plausible the common formalism of agents with utility functions is more adequate for describing the individual “subsystems” than whole human minds. Decisions on the whole mind level are more like results of interactions between the sub-agents; results of multi-agent interaction are not in general an object which is naturally represented by a utility function. For example, consider the sequence of game outcomes in a repeated PD game. If you take the sequence of game outcomes (e.g. 1: defect-defect, 2: cooperate-defect, ...) as a sequence of actions, the actions do not represent some well-behaved preferences, and in general do not maximize some utility function.
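A different but related toy illustration of the same point (swapped in here; it is not the repeated PD example from the text) is a Condorcet cycle. Assume three hypothetical sub-agents, each with a perfectly coherent ranking, and assume the whole mind chooses between any two options by majority among sub-agents. The sketch below checks whether any single strict ordering, and hence any utility function over the three options, agrees with all the resulting pairwise choices. The sub-agent names, options, and the majority rule are all assumptions for illustration.

```python
from itertools import permutations

# Each hypothetical sub-agent has a coherent ranking (best option first).
SUBAGENT_RANKINGS = {
    "hunger":  ["eat", "rest", "socialize"],
    "fatigue": ["rest", "socialize", "eat"],
    "status":  ["socialize", "eat", "rest"],
}
OPTIONS = ["eat", "rest", "socialize"]


def majority_prefers(a: str, b: str) -> bool:
    """True if most sub-agents rank option a above option b."""
    votes = sum(r.index(a) < r.index(b) for r in SUBAGENT_RANKINGS.values())
    return votes > len(SUBAGENT_RANKINGS) / 2


def consistent_utility_orderings() -> list:
    """All strict orderings of OPTIONS that agree with every majority choice."""
    found = []
    for order in permutations(OPTIONS):
        rank = {opt: i for i, opt in enumerate(order)}  # lower index = higher utility
        if all(
            (rank[a] < rank[b]) == majority_prefers(a, b)
            for a in OPTIONS for b in OPTIONS if a != b
        ):
            found.append(order)
    return found
```

Here the aggregate prefers eating to resting, resting to socializing, and socializing to eating, so `consistent_utility_orderings()` comes back empty even though every individual sub-agent is well-behaved.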
Note: This is not to claim VNM rationality is useless - it still has the normative power - and some types of interaction lead humans to approximate SEU optimizing agents better.
One case is if mainly one specific subsystem (subagent) is in control, and the decision does not go via too complex generative modelling. So, we should expect more VNM-like behaviour in experiments in narrow domains than in cases where very different sub-agents are engaged and disagree.
Another case is if sub-agents are able to do some “social welfare function” style aggregation, bargain, or trade - the result could be more VNM-like, at least in specific points of time, with the caveat that such “point” aggregate function may not be preserved in time.
On the contrary, cases where the resulting behaviour is very different from VNM-like may be caused by sub-agents locked in some non-cooperative Nash equilibria.
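The last case can be sketched with the standard one-shot Prisoner's Dilemma. The payoff numbers below are conventional illustrative values, not taken from the text, and treating two sub-agents as the two players is an assumption. The code enumerates pure-strategy Nash equilibria and checks whether an outcome is Pareto-dominated.

```python
from itertools import product

ACTIONS = ["cooperate", "defect"]

# Illustrative Prisoner's Dilemma payoffs: (row player, column player).
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}


def is_nash(profile) -> bool:
    """No player can gain by unilaterally switching action."""
    row, col = profile
    row_ok = all(PAYOFFS[(row, col)][0] >= PAYOFFS[(alt, col)][0] for alt in ACTIONS)
    col_ok = all(PAYOFFS[(row, col)][1] >= PAYOFFS[(row, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok


def pareto_dominated(profile) -> bool:
    """Some other outcome is at least as good for both players and differs."""
    u = PAYOFFS[profile]
    return any(v[0] >= u[0] and v[1] >= u[1] and v != u for v in PAYOFFS.values())


equilibria = [p for p in product(ACTIONS, ACTIONS) if is_nash(p)]
```

With these payoffs the only equilibrium is mutual defection, and it is Pareto-dominated by mutual cooperation: each sub-agent acts sensibly on its own terms while the whole does worse than it could.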
What we are aligning AI with
Given this distinction between the whole mind and sub-agents, there are at least four somewhat different notions of what alignment can mean.
1. Alignment with the outputs of the generative models, without querying the human. This includes for example proposals centered around approval. In this case, generally only the output of the internal aggregation has some voice.
2. Alignment with the outputs of the generative models, with querying the human. This includes for example CIRL and similar approaches. The problematic part of this is that, by carefully crafted queries, it is possible to give voice to different sub-agenty systems (or with more nuance, give them very different power in the aggregation process). One problem with this is that if the internal human system is not self-aligned, the results could be quite arbitrary (and the AI agent has a lot of power to manipulate).
3. Alignment with the whole system, including the human aggregation process itself. This could include for example some deep NN based black-box trained on a large amount of human data, predicting what the human would want (or approve).
4. Adding layers of indirection to the question, such as defining alignment as a state where “A is trying to do what H wants it to do.”
In practice, options 1. and 2. can collapse into one, as far as there is some feedback loop between the AI agent actions and the human reward signal. (Even in case 1, the agent can take an action with the intention to elicit feedback from some subpart.)
We can construct a rich space of various meanings of "alignment" by combining basic directions.
Now, we can analyze how these options interact with various alignment research programs.
Probably the most interesting case is IDA. IDA-like schemes can probably carry forward arbitrary properties to more powerful systems, as long as we are able to construct the individual step preserving the property. (I.e. one full cycle of distillation and amplification, which can be arbitrarily small).
Distilling and amplifying the alignment in sense #1 (what the human will actually approve) is conceptually easiest, but, unfortunately, brings some of the problems of potentially super-human system optimizing for manipulating the human for approval.
Alignment in sense #3 creates a very different set of problems. One obvious risk is mind-crimes. A more subtle risk is related to the fact that as the implicit model of human “wants” scales (becomes less bounded), (i) the parts may scale at different rates, and (ii) the outcome equilibria may change even if the sub-parts scale at the same rate.
Alignment in sense #4 seems more vague, and moves the burden of understanding the problem in part to the side of the AI. We can imagine that at the end the AI will be aligned with some part of the human mind in a self-consistent way (the part will be a fixed point of the alignment structure). Unfortunately, it is a priori unclear if a unique fixed point exists. If not, the problems become similar to case #2. Also, it seems inevitable the AI will need to contain some structure representing what the human wants the AI to do, which may cause problems similar to #3.
Also, in comparison with other meanings, it is much less clear to me how to even establish some system has this property.
Rider-centric and meme-centric alignment
Many alignment proposals seem to focus on interacting just with the conscious, narrating and rationalizing part of mind. If this is just one part entangled in some complex interaction with other parts, there are specific reasons why this may be problematic.
One: if the “rider” (from the rider/elephant metaphor) is the part highly engaged with tracking societal rules, interactions and memes, it seems plausible the “values” learned from it will be mostly aligned with societal norms and interests of memeplexes, and not “fully human”.
This is worrisome: from a meme-centric perspective, humans are just a substrate, and not necessarily the best one. Also, a more speculative problem may be that schemes learning the human memetic landscape and “supercharging it” with superhuman performance may create some hard to predict evolutionary optimization processes.
Metapreferences and multi-agent alignment
Individual “preferences” can often in fact be mostly a meta-preference to have preferences compatible with other people, based on simulations of such people.
This may make it surprisingly hard to infer human values by trying to learn what individual humans want without the social context (necessitating inverting several layers of simulation). If this is the case, the whole approach of extracting individual preferences from a single human could be problematic. (This is probably more relevant to some “prosaic” alignment problems.)
Implications
Some of the above mentioned points of disagreement suggest specific ways in which some of the existing approaches to value alignment may fail. Several illustrative examples:
- Internal conflict may lead to inaction (also to not expressing approval or disapproval). While many existing approaches represent such a situation only by the outcome of the conflict, the internal experience of the human seems to be quite different with and without the conflict.
- Difficulty with splitting “beliefs” and “motivations”.
- Learning inadequate societal equilibria and optimizing on them.
Upside
On the positive side, it could be expected the sub-agents still easily agree on things like “it is better not to die a horrible death”.
Also, the mind-model with bounded sub-agents which interact only with their local neighborhood and do not actually care about the world may be a viable design from the safety perspective.
Suggested technical research directions
While the previous parts are more in backward-chaining mode, here I attempt to point toward more concrete research agendas and questions where we can plausibly improve our understanding either by developing theory, or experimenting with toy models based on current ML techniques.
Often it may be the case that some research was already done on the topic, just not with AI alignment in mind, and high-value work could be “importing the knowledge” into the safety community.
Understanding hierarchical modelling.
It seems plausible the human hierarchical models of the world optimize some "boundedly rational" function. (Remembering all details is too expensive; too much coarse-graining decreases usefulness. A good bounded rationality model can work as a principle for how to select models, in a similar way to the minimum description length principle, just taking some more “human” (energy?) costs as the cost function.)
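A minimal sketch of this trade-off, under loudly stated assumptions: the “model” is a piecewise-constant summary of a short signal, its usefulness is measured by squared reconstruction error, and the resource cost is a hypothetical `cost_per_parameter` charged for each stored segment mean. This is an MDL-flavoured toy, not the MDL principle itself, and the signal and numbers are invented for illustration.

```python
from statistics import mean


def reconstruction_error(data, n_segments):
    """Sum of squared errors when each equal-length segment is replaced by its mean."""
    size = len(data) // n_segments  # assumes len(data) is divisible by n_segments
    error = 0.0
    for start in range(0, len(data), size):
        segment = data[start:start + size]
        m = mean(segment)
        error += sum((x - m) ** 2 for x in segment)
    return error


def best_granularity(data, candidates, cost_per_parameter):
    """Pick the number of segments minimizing error plus resource cost."""
    return min(
        candidates,
        key=lambda k: reconstruction_error(data, k) + cost_per_parameter * k,
    )


signal = [0, 1, 0, 1, 10, 11, 10, 11]
candidates = [1, 2, 4, 8]

cheap = best_granularity(signal, candidates, cost_per_parameter=0.1)
moderate = best_granularity(signal, candidates, cost_per_parameter=1.0)
expensive = best_granularity(signal, candidates, cost_per_parameter=300.0)
```

When parameters are cheap the full-detail model wins, at moderate cost the two-level coarse-graining wins, and at very high cost a single average is all the system can afford. Swapping in a different cost function changes which level of abstraction is selected, which is the point of the suggestion above.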
Inverse Game Theory.
Inverting agent motivations in MDPs is a different problem from inverting motivations in multi-agent situations where game-theory style interactions occur. This leads to the inverse game theory problem: observe the interactions, learn the objectives.
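A very small sketch of what “observe the interactions, learn the objectives” could look like, under strong assumptions that are mine and not from the text: the candidate objectives form a small finite set of two-player games, and every observed outcome is assumed to be a pure-strategy Nash equilibrium of the true game. The game names and payoffs are hypothetical.

```python
ACTIONS = ["C", "D"]

# Hypothetical candidate objectives: payoffs[(row_action, col_action)] = (row, col).
CANDIDATE_GAMES = {
    "prisoners_dilemma": {("C", "C"): (3, 3), ("C", "D"): (0, 5),
                          ("D", "C"): (5, 0), ("D", "D"): (1, 1)},
    "stag_hunt":         {("C", "C"): (4, 4), ("C", "D"): (0, 3),
                          ("D", "C"): (3, 0), ("D", "D"): (3, 3)},
    "harmony":           {("C", "C"): (4, 4), ("C", "D"): (2, 3),
                          ("D", "C"): (3, 2), ("D", "D"): (1, 1)},
}


def is_nash(payoffs, profile) -> bool:
    """No player can gain by unilaterally switching action."""
    row, col = profile
    row_ok = all(payoffs[(row, col)][0] >= payoffs[(alt, col)][0] for alt in ACTIONS)
    col_ok = all(payoffs[(row, col)][1] >= payoffs[(row, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok


def consistent_games(observed_profiles):
    """Candidate games in which every observed outcome is a pure Nash equilibrium."""
    return sorted(
        name for name, payoffs in CANDIDATE_GAMES.items()
        if all(is_nash(payoffs, p) for p in observed_profiles)
    )
```

Observing only mutual defection leaves two candidates standing (`prisoners_dilemma` and `stag_hunt`), so the objectives are not identified; observing both mutual cooperation and mutual defection narrows it to `stag_hunt`. Even in this toy, identifiability depends on seeing the interaction in more than one regime.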
Learning from multiple agents.
Imagine a group of five closely interacting humans. Learning values just from person A may run into the problem that a big part of A’s motivation is based on A simulating B, C, D, E (on the same “human” hardware, just incorporating individual differences). In that case, learning the “values” just from A’s actions could be in principle more difficult than observing the whole group, trying to learn some “human universals” and some “human specifics”. A different way of thinking about this could be by making a parallel with meta-learning algorithms (e.g. REPTILE) but in an IRL frame.
What happens if you put a system composed of sub-agents under optimization pressure?
It is not clear to me what would happen if you, for example, successfully “learn” such a system of “motivations” from a human, and then put it inside of some optimization process selecting for VNM-like rational behaviour.
It seems plausible the somewhat messy system will be forced to get more internally aligned; for example, one way this can happen is that one of the sub-agent systems takes control and “wipes out the opposition”.
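One way to picture the “takes control” scenario is a replicator-style toy. The assumptions here are entirely mine: each sub-agent has a weight in the aggregate, the optimization process multiplies each weight by a fixed hypothetical `fitness` (how VNM-like that sub-agent's behaviour looks to the selector), and the weights are renormalized after every step.

```python
def apply_selection(weights, fitness, steps):
    """Multiplicative-weights / replicator-style update (an assumed toy model)."""
    w = list(weights)
    for _ in range(steps):
        w = [wi * fi for wi, fi in zip(w, fitness)]
        total = sum(w)
        w = [wi / total for wi in w]  # renormalize so weights stay a distribution
    return w


initial = [1 / 3, 1 / 3, 1 / 3]
fitness = [1.0, 1.1, 1.2]  # small, constant advantages (illustrative)
after = apply_selection(initial, fitness, steps=200)
```

Even a modest constant advantage compounds: after 200 steps nearly all weight sits on the third sub-agent. Whether real optimization pressure on a learned motivational system behaves anything like this is exactly the open question; the sketch only shows that near-total takeover does not require a large initial asymmetry.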
What happens if you make a system composed of sub-agents less computationally bounded?
It is not clear that the relative powers of sub-agents will scale in the same way as the whole system becomes less computationally bounded. (This is related to MIRI’s sub-agents agenda.)
Suggested non-technical research directions
Human self-alignment.
All other things being equal, it seems safer to try to align AI with humans who are self-aligned.
Notes & Discussion
Motivations
Part of my motivation for writing this was an annoyance: there are plenty of reasons to believe the view
- human mind is a unified whole,
- at first approximation optimizing some utility function,
- this utility is over world-states,
is neither a good model of humans, nor the best model for how to think about AI. Yet, it is the paradigm shaping a lot of thoughts and research. I hope if the annoyance surfaced in the text, it is not too distracting.
Multi-part minds in literature
There are dozens of schemes describing mind as some sort of multi-part system, so there is nothing original about this claim. Based on a very shallow review, it seems psychologists often conceptualize the sub-agents as subpersonalities, which are almost fully human. This seems to err on the side of sub-agents being too complex, and anthropomorphising instead of trying to describe formally. (Explaining humans as a composition of humans is not very useful for AI alignment.) On the other hand, Minsky’s “Society of Mind” has sub-agents which often seem to be too simple (e.g. similar in complexity to individual logic gates). If there is some literature getting sub-agent complexity right, and placing sub-agents inside predictive processing, I’d be really excited about it!
Discussion
When discussing the draft, several friends noted something along the lines of: “It is overdetermined that approaches like IRL are doomed. There are many reasons for that and the research community is aware of them”. To some extent, I agree this is the case; on the other hand, 1. the described model of mind may pose problems even for more sophisticated approaches, and 2. my impression is many people still have something like a utility-maximizing agent as the central example.
The complementary objection is that while interacting sub-agents may be a more precise model, in practice thinking about humans as unified agents is often good enough, and may be good enough even for the purpose of AI alignment. My intuition on this is based on the connection of rationality to exploitability: it seems humans are usually more rational and less exploitable when thinking about narrow domains, but can be quite bad when vastly different subsystems are in play. (Imagine on one side a person exchanging stock and money, and on the other side a person exchanging units of money, free time, friendship, etc. In the second case, many people are willing to trade at very different rates in different situations.)
I’d like to thank Linda Linsefors, Alexey Turchin, Tomáš Gavenčiak, Max Daniel, Ryan Carey, Rohin Shah, Owen Cotton-Barratt and others for helpful discussions. Part of this originated in the efforts of the “Hidden Assumptions” team at the 2nd AI safety camp, and my thoughts about how minds work are inspired by CFAR.
I would take this as introspective evidence in favor of something model-based, but it could look more like model-based RL rather than active inference. (I am not specifically advocating for model-based RL as the right model of human thinking.)
I believe this claim based on social dynamics -- among social creatures, it seems evolutionarily useful to try to prove your models right. An adaptation for doing this may influence your behavior even when you have no reason to believe anyone is looking or knows about the model you are confirming.
So, an experiment which would differentiate between socio-evolutionary causes and active inference would be to look for the effect in non-social animals. An experiment which comes to mind is that you somehow create a situation where an animal is trying to achieve some goal, but you give false feedback so that the animal momentarily thinks it is less successful than it is. Then, you suddenly replace the false feedback with real feedback. Does the animal try to correct to the previously believed (false) situation, in order to minimize predictive error, rather than continuing to optimize in a way consistent with the task reward?
There are a lot of confounders. For example, one version of the experiment would involve trying to put your paw as high in the air as possible, and (somehow) initially getting false feedback about how well you are doing. When you suddenly start getting good feedback, do you re-position the paw to restore the previous level of feedback (minimizing predictive error) before trying to get it higher again? A problem with the experiment is that you might re-position your paw just because the real feedback changes the cost-benefit ratio, so a rational agent would try less hard at the task if it found out it was doing better than it thought.
A second example: pushing an object to a target location on the floor. If (somehow) you initially get bad feedback about where you are on the floor, and suddenly the feedback gets corrected, do you go to the location you thought you were at before continuing to make progress toward the goal? A confounder here is that you may have learned a procedure for getting the object to the desired location, and you are more confident in the results of following the procedure than you are otherwise. So, you prefer to push the object to the target location along the familiar route rather than in the efficient route from the new location, but this is a consequence of expected utility maximization under uncertainty about the task rather than any special desire to increase familiarity.
Note that I don't think of this as a prediction made by active inference, since active inference broadly speaking may precisely replicate max-expected-utility, or do other things. However, it seems like a prediction made by your favored version of active inference.
I think we may be able to make some progress on the question of its theoretical beauty. I share a desire for unified principles of epistemic and instrumental reasoning. However, I have an intuition that active inference is just not the right way to go about it. The unification is too simplistic, and has too many degrees of freedom. It should have some initial points for its simplicity, but it should lose those points when the simplest versions don't seem right (eg, when you conclude that the picture is missing goals/motivation).
FWIW, I want to mention logical induction as a theory of bounded rationality. It isn't really bounded enough to be the picture of what's going on in humans, but it is certainly major progress on the question of what should happen to probability theory when you have bounded processing power.
I mention this not because it is directly relevant, but because I think people don't necessarily realize logical induction is in the "bounded rationality" arena (even though "logical uncertainty" is definitionally very very close to "bounded rationality", the type of person who tends to talk about logical uncertainty is usually pretty different from the type of person who talks about bounded rationality, I think).
---
Another thing I want to mention -- although not every version of active inference predicts that organisms actively seek out the familiar and avoid the unfamiliar, it does seem like one of the central intended predictions, and a prediction I would guess most advocates of active inference would argue matches reality. One of my reasons for not liking the theory much is that I don't think it is likely to capture curiosity well. Humans engage in both familiarity-seeking and novelty-seeking behavior, and both for a variety of reasons (both terminal-goal-ish and instrumental-goal-ish), but I think we are closer to novelty-seeking than active inference would predict.
In Delusion, Survival, and Intelligent Agents (Ring & Orseau), behavior of a knowledge-seeking agent and a predictive-accuracy seeking agent are compared. Note that the knowledge-seeking agent and predictive-accuracy seeking agent have exactly opposite utility functions: the knowledge-seeking agent likes to be surprised, whereas the accuracy-seeking agent dislikes surprises. The knowledge-seeking agent behaves in (what I see as) a much more human way than the accuracy-seeking agent. The accuracy-seeking agent will try to gain information to a limited extent, but will ultimately try to remove all sources of novel stimuli to the extent possible. The knowledge-seeking agent will try to do new things forever.
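The contrast can be shown in a one-step toy, which is my own simplification and not the formal construction from the paper. Assume the agent has a predictive model giving a distribution over observations for each action, and score an action by its expected surprisal under that model (Shannon entropy, in bits). The knowledge-seeking agent maximizes this quantity and the accuracy-seeking agent minimizes it. The action names and probabilities are hypothetical.

```python
import math

# Agent's predictive model: for each action, a distribution over observations.
# Both the actions and the probabilities are hypothetical.
MODEL = {
    "stare_at_wall": {"wall": 1.0},
    "flip_coin":     {"heads": 0.5, "tails": 0.5},
}


def expected_surprise(distribution) -> float:
    """Expected surprisal -log2 p(o) under the agent's own model (Shannon entropy)."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def knowledge_seeking_choice(model) -> str:
    """Likes surprise: pick the action whose outcome is least predictable."""
    return max(model, key=lambda a: expected_surprise(model[a]))


def accuracy_seeking_choice(model) -> str:
    """Dislikes surprise: pick the action whose outcome is most predictable."""
    return min(model, key=lambda a: expected_surprise(model[a]))
```

With opposite objectives the two agents pick opposite actions from the same model: one flips the coin, the other stares at the wall. The toy is myopic and omits learning, so it does not show the accuracy-seeking agent's limited early information gathering described above.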
I would also expect evolution to produce something more like the knowledge-seeking agent than the accuracy-seeking agent. In RL, curiosity is a major aid to learning. The basic idea is to augment agents with an intrinsic motive to gain information, in order to ultimately achieve better task performance. There are a wide variety of formulas for curiosity, but as far as I know they are all closer to valuing surprise than avoiding surprise, and this seems like what they should be. So, to the extent that evolution did something similar to designing a highly effective RL agent, it seems more likely that organisms seek novelty as opposed to avoid it.
So, I think the idea that organisms seek familiar experiences over unfamiliar is actually the opposite of what we should expect overall. It is true that for an organism which has learned a decent amount about its environment, we expect to see it steering toward states that are familiar to it. But this is just a consequence of the fact that it has optimized its policy quite a bit; so, it steers toward rewarding states, and it will have seen rewarding states frequently in the past for the same reason. However, in order to get organisms to this place as reliably as possible, it is more likely that evolution would have installed a decision procedure which steers disproportionately toward novelty (all else being equal) than one which steers disproportionately away from novelty (all else being equal).