An AI actively trying to figure out what I want might show me snapshots of different possible worlds and ask me to rank them. Of course, I do not have the processing power to examine entire worlds; all I can really do is look at some pictures or video or descriptions. The AI might show me a bunch of pictures from one world in which a genocide is quietly taking place in some obscure third-world nation, and another in which no such genocide takes place. Unless the AI already considers that distinction important enough to draw my attention to it, I probably won’t notice it from the pictures, and I’ll rank those worlds similarly - even though I’d prefer the one without the genocide. Even if the AI does happen to show me some mass graves (probably secondhand, e.g. in pictures of news broadcasts), and I rank them low, it may just learn that I prefer my genocides under-the-radar.
The obvious point of such an example is that an AI should optimize for the real-world things I value, not just my estimates of those things. I don't just want to think my values are satisfied, I want them to actually be satisfied. Unfortunately, this poses a conceptual difficulty: what if I value the happiness of ghosts? I don't just want to think ghosts are happy, I want ghosts to actually be happy. What, then, should the AI do if there are no ghosts?
Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model (i.e. in the real world). Trying to talk about my values "actually being satisfied" is a type error.
Some points to emphasize here:
- My values are not just a function of my sense data, they are a function of the state of the whole world, including parts I can't see - e.g. I value the happiness of people I will never meet.
- I cannot actually figure out or process the state of the whole world
- … therefore, my values are a function of things I do not know and will not ever know - e.g. whether someone I will never encounter is happy right now
- This isn’t just a limited processing problem; I do not have enough data to figure out all these things I value, even in principle.
- This isn’t just a problem of not enough data, it’s a problem of what kind of data. My values depend on what’s going on “inside” of things which look the same - e.g. whether a smiling face is actually a rictus grin
- This isn’t just a problem of needing sufficiently low-level data. The things I care about are still ultimately high-level things, like humans or trees or cars. While the things I value are in principle a function of low-level world state, I don’t directly care about molecules.
- Some of the things I value may not actually exist - I may simply be wrong about which high-level things inhabit our world.
- I care about the actual state of things in the world, not my own estimate of the state - i.e. if the AI tricks me into thinking things are great (whether intentional trickery or not), that does not make things great.
These features make it rather difficult to “point” to values - it’s not just hard to formally specify values, it’s hard to even give a way to learn values. It’s hard to say what it is we’re supposed to be learning at all. What, exactly, are the inputs to my value-function? It seems like:
- Inputs to values are not complete low-level world states (since people had values before we knew what quantum fields were, and still have values despite not knowing the full state of the world), but…
- I value the actual state of the world rather than my own estimate of the world-state (i.e. I want other people to actually be happy, not just look-to-me like they’re happy).
How can both of those intuitions seem true simultaneously? How can the inputs to my values-function be the actual state of the world, but also high-level objects which may not even exist? What things in the low-level physical world are those “high-level objects” pointing to?
If I want to talk about "actually satisfying my values" separate from my own estimate of my values, then I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world.
I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now. This includes alignment of both “principled” and “prosaic” AI. The one major exception is pure human-mimicking AI, which suffers from a mostly-unrelated set of problems (largely stemming from the shortcomings of humans, especially groups of humans).
I have yet to see this problem explained, by itself, in a way that I’m satisfied by. I’m stealing the name from some of Abram’s posts, and I think he’s pointing to the same thing I am, but I’m not 100% sure.
The goal of this post is to demonstrate what the problem looks like for a (relatively) simple Bayesian-utility-maximizing agent, and what challenges it leads to. This has the drawback of defining things only within one particular model, but the advantage of showing how a bunch of nominally-different failure modes all follow from the same root problem: utility is a function of latent variables. We’ll look at some specific alignment strategies, and see how and why they fail in this simple model.
One thing I hope people will take away from this: it’s not the “values” part that’s conceptually difficult, it’s the “pointers” part.
The Setup
We have a Bayesian expected-utility-maximizing agent, as a theoretical stand-in for a human. The agent’s world-model is a causal DAG over variables , and it chooses actions to maximize - i.e. it’s using standard causal decision theory. We will assume the agent has a full-blown Cartesian boundary, so we don’t need to worry about embeddedness and all that. In short, this is a textbook-standard causal-reasoning agent.
One catch: the agent’s world-model uses the sorts of tricks in Writing Causal Models Like We Write Programs, so the world-model can represent a very large world without ever explicitly evaluating probabilities of every variable in the world-model. Submodels are expanded lazily when they’re needed. You can still conceptually think of this as a standard causal DAG, it’s just that the model is lazily evaluated.
In particular, thinking of this agent as a human, this means that our human can value the happiness of someone they’ve never met, never thought about, and don’t know exists. The utility can be a function of variables which the agent will never compute, because the agent never needs to fully compute u in order to maximize it - it just needs to know how u changes as a function of the variables influenced by its actions.
Key assumption: most of the variables in the agent’s world-model are not observables. Drawing the analogy to humans: most of the things in our world-models are not raw photon counts in our eyes or raw vibration frequencies/intensities in our ears. Our world-models include things like trees and rocks and cars, objects whose existence and properties are inferred from the raw sense data. Even lower-level objects, like atoms and molecules, are latent variables; the raw data from our eyes and ears does not include the exact positions of atoms in a tree. The raw sense data itself is not sufficient to fully determine the values of the latent variables, in general; even a perfect Bayesian reasoner cannot deduce the true position of every atom in a tree from a video feed.
Now, the basic problem: our agent’s utility function is mostly a function of latent variables. Human values are mostly a function of rocks and trees and cars and other humans and the like, not the raw photon counts hitting our eyeballs. Human values are over inferred variables, not over sense data.
Furthermore, human values are over the “true” values of the latents, not our estimates - e.g. I want other people to actually be happy, not just to look-to-me like they’re happy. Ultimately, is the agent’s estimate of its own utility (thus the expectation), and the agent may not ever know the “true” value of its own utility - i.e. I may prefer that someone who went missing ten years ago lives out a happy life, but I may never find out whether that happened. On the other hand, it’s not clear that there’s a meaningful sense in which any “true” utility-value exists at all, since the agent’s latents may not correspond to anything physical - e.g. a human may value the happiness of ghosts, which is tricky if ghosts don’t exist in the real world.
On top of all that, some of those variables are implicit in the model’s lazy data structure and the agent will never think about them at all. I can value the happiness of people I do not know and will never encounter or even think about.
So, if an AI is to help optimize for , then it’s optimizing for something which is a function of latent variables in the agent’s model. Those latent variables:
- May not correspond to any particular variables in the AI’s world-model and/or the physical world
- May not be estimated by the agent at all (because lazy evaluation)
- May not be determined by the agent’s observed data
… and of course the agent’s model might just not be very good, in terms of predictive power.
As usual, neither we (the system’s designers) nor the AI will have direct access to the model; we/it will only see the agent’s behavior (i.e. input/output) and possibly a low-level system in which the agent is embedded. The agent itself may have some introspective access, but not full or perfectly reliable introspection.
Despite all that, we want to optimize for the agent’s utility, not just the agent’s estimate of its utility. Otherwise we run into wireheading-like problems, problems with the agent’s world model having poor predictive power, etc. But the agent’s utility is a function of latents which may not be well-defined at all outside the context of the agent’s estimator (a.k.a. world-model). How can we optimize for the agent’s “true” utility, not just an estimate, when the agent’s utility function is defined as a function of latents which may not correspond to anything outside of the agent’s estimator?
The Pointers Problem
We can now define the pointers problem - not only “pointers to values”, but the problem of pointers more generally. The problem: what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model? And what does that “correspondence” even mean - how do we turn it into an objective for the AI, or some other concrete thing outside the agent’s own head?
Why call this the “pointers” problem? Well, let’s take the agent’s perspective, and think about what its algorithm feels like from the inside. From inside the agent’s mind, it doesn’t feel like those latent variables are latent variables in a model. It feels like those latent variables are real things out in the world which the agent can learn about. The latent variables feel like “pointers” to real-world objects and their properties. But what are the referents of these pointers? What are the real-world things (if any) to which they’re pointing? That’s the pointers problem.
Is it even solvable? Definitely not always - there probably is no real-world referent for e.g. the human concept of a ghost. Similarly, I can have a concept of a perpetual motion machine, despite the likely-impossibility of any such thing existing. Between abstraction and lazy evaluation, latent variables in an agent’s world-model may not correspond to anything in the world.
That said, it sure seems like at least some latent variables do correspond to structures in the world. The concept of “tree” points to a pattern which occurs in many places on Earth. Even an alien or AI with radically different world-model could recognize that repeating pattern, realize that examining one tree probably yields information about other trees, etc. The pattern has predictive power, and predictive power is not just a figment of the agent’s world-model.
So we’d like to know both (a) when a latent variable corresponds to something in the world (or another world model) at all, and (b) what it corresponds to. We’d like to solve this in a way which (probably among other use-cases) lets the AI treat the things-corresponding-to-latents as the inputs to the utility function it’s supposed to learn and optimize.
To the extent that human values are a function of latent variables in humans’ world-models, this seems like a necessary step not only for an AI to learn human values, but even just to define what it means for an AI to learn human values. What does it mean to “learn” a function of some other agent’s latent variables, without necessarily adopting that agent’s world-model? If the AI doesn’t have some notion of what the other agent’s latent variables even “are”, then it’s not meaningful to learn a function of those variables. It would be like an AI “learning” to imitate grep, but without having any access to string or text data, and without the AI itself having any interface which would accept strings or text.
Pointer-Related Maladies
Let’s look at some example symptoms which can arise from failure to solve specific aspects of the pointers problem.
Genocide Under-The-Radar
Let’s go back to the opening example: an AI shows us pictures from different possible worlds and asks us to rank them. The AI doesn’t really understand yet what things we care about, so it doesn’t intentionally draw our attention to certain things a human might consider relevant - like mass graves. Maybe we see a few mass-grave pictures from some possible worlds (probably in pictures from news sources, since that’s how such information mostly spreads), and we rank those low, but there are many other worlds where we just don’t notice the problem from the pictures the AI shows us. In the end, the AI decides that we mostly care about avoiding worlds where mass graves appear in the news - i.e. we prefer that mass killings stay under the radar.
How does this failure fit in our utility-function-of-latents picture?
This is mainly a failure to distinguish between the agent’s estimate of its own utility , and the “real” value of the agent’s utility (insofar as such a thing exists). The AI optimizes for our estimate, but does not give us enough data to very accurately estimate our utility in each world - indeed, it’s unlikely that a human could even handle that much information. So, it ends up optimizing for factors which bias our estimate - e.g. the availability of information about bad things.
Note that this intuitive explanation assumes a solution to the pointers problem: it only makes sense to the extent that there’s a “real” value of from which the “estimate” can diverge.
Not-So-Easy Wireheading Problems
The under-the-radar genocide problem looks roughly like a typical wireheading problem, so we should try a roughly-typical wireheading solution: rather than the AI showing world-pictures, it should just tell us what actions it could take, and ask us to rank actions directly.
If we were ideal Bayesian reasoners with accurate world models and infinite compute, and knew exactly where the AI’s actions fit in our world model, then this might work. Unfortunately, the failure of any of those assumptions breaks the approach:
- We don’t have the processing power to predict all the impacts of the AI’s actions
- Our world models may not be accurate enough to correctly predict the impact of the AI’s actions, even if we had enough processing power
- The AI’s actions may not even fit neatly into our world model - e.g. even the idea of genetic engineering might not fit the world-model of premodern human thinkers
Mathematically, we’re trying to optimize , i.e. optimize expected utility given the AI’s actions. Note that this is necessarily an expectation under the human’s model, since that’s the only context in which is well-defined. In order for that to work out well, we need to be able to fully evaluate that estimate (sufficient processing power), we need the estimate to be accurate (sufficient predictive power), and we need to be defined within the model in the first place.
The question of whether our world-models are sufficiently accurate is particularly hairy here, since accuracy is usually only defined in terms of how well we estimate our sense-data. But the accuracy we care about here is how well we “estimate” the values of latent variables and . What does that even mean, when the latent variables may not correspond to anything in the world?
People I Will Never Meet
“Human values cannot be determined from human behavior” seems almost old-hat at this point, but it’s worth taking a moment to highlight just how underdetermined values are from behavior. It’s not just that humans have biases of one kind or another, or that revealed preferences diverge from stated preferences. Even in our perfect Bayesian utility-maximizer, utility is severely underdetermined from behavior, because the agent does not have perfect estimates of its latent variables. Behavior depends only on the agent’s estimate, so it cannot account for “error” in the agent’s estimates of latent variable values, nor can it tell us about how the agent values variables which are not coupled to its own choices.
The happiness of people I will never interact with is a good example of this. There may be people in the world whose happiness will not ever be significantly influenced by my choices. Presumably, then, my choices cannot tell us about how much I value such peoples’ happiness. And yet, I do value it.
“Misspecified” Models
In Latent Variables and Model Misspecification, jsteinhardt talks about “misspecification” of latent variables in the AI’s model. His argument is that things like the “value function” are latent variables in the AI’s world-model, and are therefore potentially very sensitive to misspecification of the AI’s model.
In fact, I think the problem is more severe than that.
The value function’s inputs are latent variables in the human’s model, and are therefore sensitive to misspecification in the human’s model. If the human’s model does not match reality well, then their latent variables will be something wonky and not correspond to anything in the world. And AI designers do not get to pick the human’s model. These wonky variables, not corresponding to anything in the world, are a baked-in part of the problem, unavoidable even in principle. Even if the AI’s world model were “perfectly specified”, it would either be a bad representation of the world (in which case predictive power becomes an issue) or a bad representation of the human’s model (in which case those wonky latents aren’t defined).
The AI can’t model the world well with the human’s model, but the latents on which human values depend aren’t well-defined outside the human’s model. Rock and a hard place.
Takeaway
Within the context of a Bayesian utility-maximizer (representing a human), utility/values are a function of latent variables in the agent’s model. That’s a problem, because those latent variables do not necessarily correspond to anything in the environment, and even when they do, we don’t have a good way to say what they correspond to.
So, an AI trying to help the agent is stuck: if the AI uses the human’s world-model, then it may just be wrong outright (in predictive terms). But if the AI doesn’t use the human’s world-model, then the latents on which the utility function depends may not be defined at all.
Thus, the pointers problem, in the Bayesian context: figure out which things in the world (if any) correspond to the latent variables in a model. What do latent variables in my model “point to” in the real world?
Why This Post Is Interesting
This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise.
Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.
Warning: Inductive Gap
This post builds on top of two important pieces for modelling embedded agents which don't have their own posts (to my knowledge). The pieces are:
In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.
Lazy World Models
One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They're embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up.
That sounds tricky at first, but if you've done some functional programming before, then data structures like this actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily - i.e. only query for list elements which we actually need, and never actually iterate over the whole thing.
In the same way, we can represent a large world (potentially even an infinite world) using a smaller amount of memory. We specify the model via a generator, and then evaluate queries against the model lazily. If we're thinking in terms of probabilistic models, then our generator could be e.g. a function in a probabilistic programming language, or (equivalently but through a more mathematical lens) a probabilistic causal model leveraging recursion. The generator compactly specifies a model containing many random variables (potentially even infinitely many), but we never actually run inference on the full infinite set of variables. Instead, we use lazy algorithms which only reason about the variables necessary for particular queries.
Once we know to look for it, it's clear that humans use some kind of lazy world models in our own reasoning. We never directly estimate the state of the entire world. Rather, when we have a question, we think about whatever "variables" are relevant to that question. We perform inference using whatever "generator" we already have stored in our heads, and we avoid recursively unpacking any variables which aren't relevant to the question at hand.
Lazy Utility/Values
Building on the notion of lazy world models: it's not very helpful to have a lazy world model if we need to evaluate the whole data structure in order to make a decision. Fortunately, even if our utility/values depend on lots of things, we don't actually need to evaluate utility/values in order to make a decision. We just need to compare the utility/value across different possible choices.
In practice, most decisions we make don't impact most of the world in significant predictable ways. (More precisely: the impact of most of our decisions on most of the world is wiped out by noise.) So, rather than fully estimating utility/value we just calculate how each choice changes total utility/value, based only on the variables significantly and predictably influenced by the decision.
A simple example (from here): if we have a utility function ∑if(Xi), and we're making a decision which only effects X3, then we don't need to estimate the sum at all; we only need to estimate f(X3) for each option.
Again, once we know to look for it, it's clear that humans do something like this. Most of my actions do not effect a random person in Mumbai (and to the extent there is an effect, it's drowned out by noise). Even though I value the happiness of that random person in Mumbai, I never need to think about them, because my actions don't significantly impact them in any way I can predict. I never actually try to estimate "how good the whole world is" according to my own values.
Where This Post Came From
In the second half of 2020, I was thinking about existing real-world analogues/instances of various parts of the AI alignment problem and embedded agency, in hopes of finding a case where someone already had a useful frame or even solution which could be translated over to AI. "Theory of the firm" (a subfield of economics) was one promising area. From wikipedia:
To the extent that we can think of companies as embedded agents, these mirror a lot of the general questions of embedded agency. Also, alignment of incentives is a major focus in the literature on the topic.
Most of the existing literature I read was not very useful in its own right. But I generally tried to abstract out the most central ideas and bottlenecks, and generalize them enough to apply to more general problems. The most important insight to come out of this process was: sometimes we cannot tell what happened, even in hindsight. This is a major problem for incentives: for instance, if we can't tell even in hindsight who made a mistake, then we don't know where to assign credit/blame. (This idea became the post When Hindsight Isn't 20/20: Incentive Design With Imperfect Credit Allocation.)
Similarly, this is a major problem for bets: we can't bet on something if we cannot tell what the outcome was, even in hindsight.
Following that thread further: sometimes we cannot tell how good an outcome was, even in hindsight. For instance, we could imagine paying someone to etch our names on a plaque on a spacecraft and then launch it on a trajectory out of the solar system. In this case, we would presumably care a lot that our names were actually etched on the plaque; we would be quite unhappy if it turned out that our names were left off. Yet if someone took off the plaque at the last minute, or left our names off of it, we might never find out. In other words, we might not ever know, even in hindsight, whether our values were actually satisfied.
There's a sense in which this is obvious mathematically from Bayesian expected utility maximization. The "expected" part of "expected utility" sure does suggest that we don't know the actual utility. Usually we think of utility as something we will know later, but really there's no reason to assume that. The math does not say we need to be able to figure out utility in hindsight. The inputs to utility are random variables in our world model, and we may not ever know the values of those random variables.
Once I started actually paying attention to the idea that the inputs to the utility function are random variables in the agent's world model, and that we may never know the values of those variables, the next step followed naturally. Of course those variables may not correspond to anything observable in the physical world, even in principle. Of course they could be latent variables. Then the connection to the Pointer Problem became clear.
It seems like "generators" should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.
First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a "city" abstraction, let's call it P(Λ), which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically, let's call th... (read more)