Appendix: tracking key limitations of the power-seeking theorems
I want to say that there's another key limitation:
Let U⊆Rd be a set of utility functions which is closed under permutation.
It seems like a rather central assumption to the whole approach, but in reality people tend to specify utility functions that are "natural" in some sense (e.g. generally continuous, functions of only a few parameters, etc.). I feel like the basic argument will still hold for most forms of natural utility functions, but I'm not sure how far it generalizes.
Right, I was intending "3. [these results] don't account for the ways in which we might practically express reward functions" to capture that limitation.
You write
This point may seem obvious, but cardinality inequality is insufficient in general. The set copy relation is required for our results
Could you give a toy example of this being insufficient (I'm assuming the "set copy relation" is the "B contains n of A" requirement)?
How does the "B contains n of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large amounts of outcomes).
Could you give a toy example of this being insufficient (I'm assuming the "set copy relation" is the "B contains n of A" requirement)?
A := {(1, 0, 0)}, B := {(0, .3, .7), (0, .7, .3)}
Less opaquely, see the technical explanation for this counterexample, where the right action leads to two trajectories, and up leads to a single one.
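One way to check this toy example is to brute-force every relabeling of the three outcomes. The sketch below is only an illustration: the helper names `permute` and `has_copy` are made up here, and it tests all permutations (a superset of the involutions the definition asks for), so a failure here rules out any copy.

```python
from itertools import permutations

# The toy sets from the reply above: |B| > |A|, yet B holds no copy of A.
A = [(1.0, 0.0, 0.0)]
B = [(0.0, 0.3, 0.7), (0.0, 0.7, 0.3)]

def permute(lottery, perm):
    """Relabel outcomes: the probability at index k moves to index perm[k]."""
    out = [0.0] * len(lottery)
    for k, p in enumerate(lottery):
        out[perm[k]] = p
    return tuple(out)

def has_copy(A, B):
    """True if some permutation of the outcomes maps every element of A into B."""
    d = len(A[0])
    return any(
        all(permute(a, perm) in B for a in A)
        for perm in permutations(range(d))
    )

print(len(B) > len(A))  # cardinality inequality holds
print(has_copy(A, B))   # but no relabeling turns (1, 0, 0) into an element of B
```

A deterministic lottery can only be relabeled into another deterministic lottery, which is why the stochastic elements of B never match.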
How does the "B contains n of A" requirement affect the existential risks? I can see how shut-off as a 1-cycle fits, but not manipulating and deceiving people (though I do think those are bottlenecks to large amounts of outcomes).
For this, I think we need to zoom out to a causal DAG (w/ choice nodes) picture of the world, over some reasonable abstractions. It's just too unnatural to pick out deception subgraphs in an MDP, as far as I can tell, but maybe there's another version of the argument.
If the AI cares about things-in-the-world, then if it were a singleton it could set many nodes to desired values independently. For example, the nodes might represent variable settings for different parts of the universe—what's going on in the asteroid belt, in Alpha Centauri, etc.
But if it has to work with other agents (or, heaven forbid, be subjugated by them), it has fewer degrees of freedom in what-happens-in-the-universe. You can map copies of the "low control" configurations to the "high control" configurations several times, I think. (I think it should be possible to make precise what I mean by "control", in a way that should fairly neatly map back onto POWER-as-average-optimal-value.)
So this implies a push for "control." One way to get control is manipulation or deception or other trickery, and so deception is one possible way this instrumental convergence "prophecy" could be fulfilled.
Table 1 of the paper (pg. 3) is a very nice visual of the different settings.
For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be a case where it's not greater than or equal to its involution? Is this when the options in B are originally more optimal?
Also, that theorem requires each involution to be greater than or equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?
For the "Theorem: Training retargetability criterion", where f(A, u) >= its involution, what would be a case where it's not greater than or equal to its involution? Is this when the options in B are originally more optimal?
I don't think I understand the question. Can you rephrase?
Also, that theorem requires each involution to be greater than or equal to the original. Is this just to get a lower bound on the n-multiple, or do less-than involutions not add anything?
Less-than involutions aren't guaranteed to add anything. For example, if f(a, u) = 1 iff a goes left and 0 otherwise, any involutions to plans going right will be 0, and all orbits will unanimously agree that left has greater f-value.
I don't think I understand the question. Can you rephrase?
Your example actually cleared this up for me as well! I wanted an example where the inequality failed even if you had an involution on hand.
Addendum: One lesson to take away is that quantilization doesn't just depend on the base distribution being safe to sample from unconditionally. As the theorems hint, quantilization's viability depends on base(plan | plan doing anything interesting) also being safe with high probability, because we could (and would) probably resample the agent until we get something interesting. In this post's terminology, A := {safe interesting things}, B := {power-seeking interesting things}, C:= A and B and {uninteresting things}.
Summary: Why exactly should smart agents tend to usurp their creators? Previous results only apply to optimal agents tending to stay alive and preserve their future options. I extend the power-seeking theorems to apply to many kinds of policy-selection procedures, ranging from planning agents which choose plans with expected utility closest to a randomly generated number, to satisficers, to policies trained by some reinforcement learning algorithms. The key property is not agent optimality—as previously supposed—but is instead the retargetability of the policy-selection procedure. These results hint at which kinds of agent cognition and of agent-producing processes are dangerous by default.
I mean "retargetability" in a sense similar to Alex Flint's definition:
(I don't think that "microscopic" is important for my purposes; the constraint is not physical size, but changes in a single parameter to the policy-selection procedure.)
I'm going to start from the naive view on power-seeking arguments requiring optimality (i.e. what I thought early this summer) and explain the importance of retargetable policy-selection functions. I'll illustrate this notion via satisficers, which randomly select a plan that exceeds some goodness threshold. Satisficers are retargetable, and so they have orbit-level instrumental convergence: for most variations of every utility function, satisficers incentivize power-seeking in the situations covered by my theorems.
Many procedures are retargetable, including every procedure which only depends on the expected utility of different plans. I think that alignment is hard in the expected utility framework not because agents will maximize too hard, but because all expected utility procedures are extremely retargetable—and thus easy to "get wrong."
Lastly: the unholy grail of "instrumental convergence for policies trained via reinforcement learning." I'll state a formal criterion and some preliminary thoughts on where it applies.
The linked Overleaf paper draft contains complete proofs and incomplete explanations of the formal results.
Retargetable policy-selection processes tend to select policies which seek power
To understand a range of retargetable procedures, let's first orient towards the picture I've painted of power-seeking thus far. In short:
But I want to step back. What I call "the power-seeking theorems" aren't really about optimal choice. They're about two facts.
For example, suppose our cute robot Frank must choose one of several kinds of fruit.
So far, I proved something like "if the agent has a utility function over fruits, then for at least 2/3 of possible utility functions it could have, it'll be optimal to choose something from {🍌,🍎}." This is because for every way 🍒 could be strictly optimal, you can make a new utility function that permutes the 🍒 and 🍎 reward, and another new one that permutes the 🍌 and 🍒 reward. So for every "I like 🍒 strictly more" utility function, there are at least two permuted variants which strictly prefer 🍎 or 🍌. Superficially, it seems like this argument relies on optimal decision-making.
But that's not true. The crux is instead that we can flexibly retarget the decision-making of the agent: For every way the agent could end up choosing 🍒, we change a variable in its cognition (its utility function) and make it choose the 🍌 or 🍎 instead.
Many decision-making procedures are like this. First, a few definitions.
I aim for this post to be readable without much attention paid to the math.
The agent can bring about different outcomes via different policies. In stochastic environments, these policies will induce outcome lotteries, like 50%🍌 / 50%🍎. Let C contain all the outcome lotteries the agent can bring about.
Definition: Permuting outcome lotteries. Suppose there are d outcomes. Let X⊆Rd be a set of outcome lotteries (with the probability of outcome k given by the k-th entry), and let ϕ∈Sd be a permutation of the d possible outcomes. Then ϕ acts on X by swapping around the labels of its elements: ϕ⋅X:={Pϕx∣x∈X}. [FN: Row]
For example, let's define the set of all possible fruit outcomes FC:={🍌,🍎,🍒} (each different fruit stands in for a standard basis vector in R3). Let FB:={🍌,🍎} and FA:={🍒}. Let ϕ1:=(🍒🍎) swap the cherry and apple, and let ϕ2:=(🍒🍌) transpose the cherry and banana. Both of these ϕ are involutions, since they either leave the fruits alone or transpose them.
Definition: Containment of set copies. Let A,B⊆Rd. B contains n copies of A when there exist involutions ϕ1,…,ϕn such that ∀i:ϕi⋅A=:Bi⊆B and ∀i≠j:ϕi⋅Bj=Bj.
(The subtext is that B is the set of things the agent could make happen if it gained power, and A is the set of things the agent could make happen without gaining power. Because power gives more options, B will usually be larger than A. Here, we'll talk about the case where B contains many copies of A.)
In the fruit context:
Note that ϕ1⋅{🍌}={🍌} and ϕ2⋅{🍎}={🍎}. Each ϕ leaves the other subset of FB alone. Therefore, FB:={🍌,🍎} contains two copies of FA:={🍒} via the involutions ϕ1 and ϕ2.
Further note that ϕi⋅FC=FC for i=1,2. The involutions just shuffle around options, instead of changing the set of available outcomes.
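The fruit example can be checked mechanically. This is a minimal sketch, assuming outcomes are plain labels rather than basis vectors; `transposition`, `act`, and `contains_copies` are hypothetical helper names, not anything from the paper.

```python
BANANA, APPLE, CHERRY = "banana", "apple", "cherry"
F_C = {BANANA, APPLE, CHERRY}
F_B = {BANANA, APPLE}
F_A = {CHERRY}

def transposition(x, y):
    """Return the involution that swaps x and y and fixes every other outcome."""
    return lambda fruit: y if fruit == x else x if fruit == y else fruit

def act(phi, X):
    """Apply phi to every element of the set X."""
    return {phi(x) for x in X}

def contains_copies(B, A, involutions):
    """Check the set-copy definition: each phi_i maps A into B,
    and phi_i leaves every other copy B_j (j != i) unchanged."""
    copies = [act(phi, A) for phi in involutions]
    if not all(copy <= B for copy in copies):
        return False
    return all(
        act(phi_i, B_j) == B_j
        for i, phi_i in enumerate(involutions)
        for j, B_j in enumerate(copies)
        if i != j
    )

phi1 = transposition(CHERRY, APPLE)   # swaps cherry and apple
phi2 = transposition(CHERRY, BANANA)  # swaps cherry and banana

print(contains_copies(F_B, F_A, [phi1, phi2]))         # F_B contains two copies of F_A
print(act(phi1, F_C) == F_C, act(phi2, F_C) == F_C)    # the available outcomes are unchanged
```

Swapping the roles of the two sets fails the check, since F_A has no room for a copy of F_B.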
So suppose Frank is deciding whether he wants a fruit from FA:={🍒} or from FB:={🍌,🍎}. It's definitely possible to be motivated to pick 🍒. However, it sure seems like for lots of ways Frank might make decisions, most parameter settings (utility functions) will lead to Frank picking 🍌 or 🍎. There are just more outcomes in FB, since it contains two copies of FA!
Definition: Orbit tendencies. Let f1,f2:Rd→R be functions from utility functions to real numbers, let U⊆Rd be a set of utility functions, and let n≥1. f1 ≥nmost: U f2 when for all utility functions u∈U:
$$\overbrace{\left|\{u_\phi \in S_d \cdot u \mid f_1(u_\phi) > f_2(u_\phi)\}\right|}^{\text{\# of permutations of } u \text{ for which } f_1 > f_2} \;\geq\; n \, \overbrace{\left|\{u_\phi \in S_d \cdot u \mid f_1(u_\phi) < f_2(u_\phi)\}\right|}^{\text{\# of permutations of } u \text{ for which } f_1 < f_2}.$$
In this post, if I don't specify a subset U, that means the statement holds for U=Rd. For example, the past results show that IsOptimal(FB) ≥2most IsOptimal(FA); this implies that for every utility function, at least 2/3 of its orbit makes FB optimal.
(For simplicity, I'll focus on "for most utility functions" instead of "for most distributions over utility functions", even though most of the results apply to the latter.)
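To make the orbit count concrete, here is a small sketch that enumerates the orbit of one utility function over the three fruits. It assumes that IsOptimal(X) is 1 when some element of X attains the maximum utility and 0 otherwise; that reading, and the helper names, are my assumptions for illustration.

```python
from itertools import permutations

FRUITS = ("banana", "apple", "cherry")
F_B, F_A = {"banana", "apple"}, {"cherry"}

def is_optimal(X, u):
    """1 if some element of X attains the maximum utility under u, else 0 (assumed definition)."""
    best = max(u.values())
    return int(any(u[x] == best for x in X))

def orbit_counts(values):
    """Count orbit elements where IsOptimal(F_B) > IsOptimal(F_A), and where it is <."""
    greater = less = 0
    # set(...) removes duplicates, so utility functions with ties have smaller orbits
    for vals in set(permutations(values)):
        u = dict(zip(FRUITS, vals))
        fb, fa = is_optimal(F_B, u), is_optimal(F_A, u)
        greater += fb > fa
        less += fb < fa
    return greater, less

greater, less = orbit_counts((1, 2, 3))
print(greater, less)        # 4 orbit elements favor F_B, 2 favor F_A
print(greater >= 2 * less)  # the n = 2 orbit inequality holds for this orbit
```

For this orbit, 4 of the 6 permuted utility functions make F_B strictly optimal, which is the 2/3 figure in the text.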
Orbit tendencies apply to many decision-making procedures
For example, suppose the agent is a satisficer. I'll define this as: The agent uniformly randomly selects an outcome lottery with expected utility exceeding some threshold t.
Definition: Satisficing. For finite X⊆C⊊Rd and utility function u∈Rd, define Satisficet(X,C|u) := |X∩{c∈C∣c⊤u≥t}| / |{c∈C∣c⊤u≥t}|, with the function returning 0 when the denominator is 0. Satisficet returns the probability that the agent selects a u-satisficing outcome lottery from X.
And you know what? Those ever-so-suboptimal satisficers are also "twice as likely" to choose elements from FB as from FA.
Fact. Satisficet({🍌,🍎},{🍌,🍎,🍒}∣u) ≥2most Satisficet({🍒},{🍌,🍎,🍒}∣u).
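The Fact can be spot-checked by brute force on a single orbit. The sketch below assumes deterministic outcomes (so the expected utility of a fruit is just its utility) and uses exact fractions; the function names and the example utility values are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

FRUITS = ("banana", "apple", "cherry")
F_C = set(FRUITS)
F_B, F_A = {"banana", "apple"}, {"cherry"}

def satisfice(X, C, u, t):
    """Probability that a uniformly random u-satisficing element of C lies in X."""
    ok = [c for c in C if u[c] >= t]
    if not ok:
        return Fraction(0)  # convention from the definition: 0 when nothing satisfices
    return Fraction(sum(1 for c in ok if c in X), len(ok))

def orbit_counts(values, t):
    """Count orbit elements where Satisfice(F_B) > Satisfice(F_A), and where it is <."""
    greater = less = 0
    for vals in set(permutations(values)):
        u = dict(zip(FRUITS, vals))
        fb = satisfice(F_B, F_C, u, t)
        fa = satisfice(F_A, F_C, u, t)
        greater += fb > fa
        less += fb < fa
    return greater, less

for t in (Fraction(5, 2), Fraction(3, 2), 0):
    greater, less = orbit_counts((1, 2, 3), t)
    print(t, greater, less, greater >= 2 * less)
```

With a high threshold only the top fruit satisfices, so the count matches the optimality case; with lower thresholds the cherry-favoring cases turn into ties or reversals, and the inequality still holds.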
Why? Here are the two key properties that Satisficet has:
(1) Weakly increasing under joint permutation of its arguments
Satisficet doesn't care what "label" an outcome lottery has, only its expected utility. Suppose that for utility function u, 🍒 is one of two u-satisficing elements: 🍒 has a 1/2 chance of being selected by the u-satisficer. Then ϕ1⋅🍒=🍎 has a 1/2 chance of being selected by the (ϕ1⋅u)-satisficer. If you swap what fruit you're considering, and you also swap the utility for that fruit to match, then that fruit's selection probability remains the same.
More precisely:
Satisficet({🍒},{🍌,🍎,🍒}|u) = Satisficet(ϕ1⋅{🍒},ϕ1⋅{🍌,🍎,🍒}|ϕ1⋅u) = Satisficet({🍎},{🍌,🍎,🍒}∣ϕ1⋅u).
In a sense, Satisficet is not "biased" against 🍎: by changing the utility function, you can advantage 🍎 so that it's now as probable as 🍒 was before.
Optional notes on this property:
(2) Order-preserving on the first argument
Satisficers must have greater probability of selecting an outcome lottery from a superset than from one of its subsets.
Formally, if X′⊆X, then it must hold that Satisficet(X′,C|u)≤Satisficet(X,C|u). And indeed this holds: Supersets can only contain a greater fraction of C's satisficing elements.
And that's all.
If (1) and (2) hold for a function, then that function will obey the orbit tendencies. Let me show you what I mean.
As illustrated by Table 1 in the linked paper, the power-seeking theorems apply to:
But that's not all. There's more. If the agent makes decisions only based on the expected utility of different plans [FN: EU], then the power-seeking theorems apply. And I'm not just talking about EU maximizers. I'm talking about any function which only depends on expected utility: EU minimizers, agents which choose plans if and only if their EU is equal to 1, agents which grade plans based on how close their EU is to some threshold value. There is no clever EU-based scheme which doesn't have orbit-level power-seeking incentives.
Suppose n is large, and that most outcomes in B are bad, and that the agent makes decisions according to expected utility. Then alignment is hard because for every way things could go right, there are at least n ways things could go wrong! And n can be huge. In a previous toy example, it equaled 10^182.
It doesn't matter if the decision-making procedure f is rational, or anti-rational, or Boltzmann-rational, or satisficing, or randomly choosing outcomes, or only choosing outcome lotteries with expected utility equal to 1: There are more ways to choose elements of B than there are ways to choose elements of A.
These results also have closure properties. For example, closure under mixing decision procedures, like when the agent has a 50% chance of selecting Boltzmann rationally and a 50% chance of satisficing. Or even more exotic transformations: Suppose the probability of f choosing something from X is proportional to
P(X is Boltzmann-rational under u)⋅P(X satisfices u)+P(X is optimal for u).
Then the theorems still apply.
There is no possible way to combine EU-based decision-making functions so that orbit-level instrumental convergence doesn't apply to their composite.
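As an illustration of the closure claim, the following sketch mixes a Boltzmann-rational chooser and a satisficer 50/50 on the fruit example and counts the orbit again. The temperature, threshold, and utility values are arbitrary choices of mine, and the check covers one orbit only, so it is a sanity check rather than a proof.

```python
import math
from itertools import permutations

FRUITS = ("banana", "apple", "cherry")
F_B, F_A = {"banana", "apple"}, {"cherry"}

def boltzmann(X, u, temperature=1.0):
    """Probability of picking an element of X when P(c) is proportional to exp(u[c] / T)."""
    weights = {c: math.exp(u[c] / temperature) for c in FRUITS}
    return sum(weights[c] for c in X) / sum(weights.values())

def satisfice(X, u, t):
    """Probability of picking an element of X uniformly among fruits with utility >= t."""
    ok = [c for c in FRUITS if u[c] >= t]
    return sum(c in X for c in ok) / len(ok) if ok else 0.0

def mixture(X, u, t=2.5):
    """50% chance of choosing Boltzmann-rationally, 50% chance of satisficing."""
    return 0.5 * boltzmann(X, u) + 0.5 * satisfice(X, u, t)

def orbit_counts(f, values):
    """Count orbit elements where f(F_B) > f(F_A), and where f(F_B) < f(F_A)."""
    greater = less = 0
    for vals in set(permutations(values)):
        u = dict(zip(FRUITS, vals))
        fb, fa = f(F_B, u), f(F_A, u)
        greater += fb > fa
        less += fb < fa
    return greater, less

greater, less = orbit_counts(mixture, (1, 2, 3))
print(greater, less, greater >= 2 * less)
```

The mixed procedure favors F_A only in the two orbit elements where cherry has the top utility, the same pattern as each component procedure on its own.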
To "escape" these incentives, you have to make the theorems fail to apply. Here are a few ways:
Lastly, we maybe don't want to escape these incentives entirely, because we probably want smart agents which will seek power for us. I think that empirically, the power-requiring outcomes of B are mostly induced by the agent first seeking power over humans.
Retargetable training processes produce instrumental convergence
These results let us start talking about the incentives of real-world trained policies. In an appendix, I work through a specific example of how Q-learning on a toy example provably exhibits orbit-level instrumental convergence. The problem is small enough that I computed the probability that each final policy was trained.
Realistically, we aren't going to get a closed-form expression for the distribution over policies learned by PPO with randomly initialized deep networks trained via SGD with learning rate schedules and dropout and intrinsic motivation, etc. But we don't need it. These results give us a formal criterion for when policy-training processes will tend to produce policies with convergent instrumental incentives.
The idea is: Consider some set of reward functions, and let B contain n copies of A. Then if, for each reward function in the set, you can retarget the training process so that B's copy of A is at least as likely as A was originally, these reward functions will tend to train policies which go to B.
For example, if agents trained on objectives R tend to go right, switching reward from right-states to left-states also pushes the trained policies to go left. This can happen when changing the reward changes what was "reinforced" about going right, to now make it "reinforced" to go left.
Suppose we're training an RL agent to go right in MuJoCo, with reward equal to its x-coordinate.
This criterion is going to be a bit of a mouthful. The basic idea is that when the training process can be redirected such that trained agents induce a variety of outcomes, then most objective functions will train agents which do induce those outcomes. In other words: Orbit-level instrumental convergence will hold.
Theorem: Training retargetability criterion. Suppose the agent interacts with an environment with d potential outcomes (e.g. world states or observation histories). Let P be a probability distribution over joint parameter space Θ, and let train:Θ×Rd→Δ(Π) be a policy training procedure which takes in a parameter setting and utility function u∈Rd, and which produces a probability distribution over policies.
Let U⊆Rd be a set of utility functions which is closed under permutation. Let A,B be sets of outcome lotteries such that B contains n copies of A via ϕ1,...,ϕn. Then we quantify the probability that the trained policy induces an element of outcome lottery set X⊆Rd:
$$f(X \mid u) := \mathbb{P}_{\theta \sim P,\ \pi \sim \mathrm{train}(\theta, u)}\left(\pi \text{ does something in } X\right).$$
If ∀u∈U, i∈{1,...,n}: f(A∣u) ≤ f(ϕi⋅A∣ϕi⋅u), then f(B∣u) ≥nmost f(A∣u).
Proof. If X′⊆X, then f(X′∣u)≤f(X∣u) by the monotonicity of probability, and so (2): order-preserving on the first argument holds. By assumption, (1): increasing under joint permutation holds. Therefore, Lemma B.6 (in the linked paper) implies the desired result. QED.
This criterion is testable. Although we can't test all reward functions, we can test how retargetable the training process is in simulated environments for a variety of reward functions. If it can't retarget easily for reasonable objectives, then we conclude [FN: Retargetability] that instrumental convergence isn't arising from retargetability at the training process level.
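A test of the criterion might look roughly like the sketch below. Everything in it is a stand-in: `softmax_f` is a hypothetical closed-form substitute for the trained-policy outcome distribution f(X∣u), and the "biased" variant models an outcome that is hard to reach regardless of reward. A real test would estimate f by actually running the training procedure.

```python
import math

OUTCOMES = ("down", "up", "right")

def swap(x, y):
    """Involution on outcomes, stored as a dict, that swaps x and y."""
    return {o: (y if o == x else x if o == y else o) for o in OUTCOMES}

def permute_u(phi, u):
    """(phi . u) gives outcome phi(o) the utility that u gave to o."""
    return {phi[o]: u[o] for o in OUTCOMES}

def softmax_f(X, u, log_prior=None):
    """Hypothetical stand-in for f(X | u): softmax over utility plus a fixed log-prior."""
    log_prior = log_prior or {o: 0.0 for o in OUTCOMES}
    w = {o: math.exp(u[o] + log_prior[o]) for o in OUTCOMES}
    return sum(w[o] for o in X) / sum(w.values())

def criterion_holds(f, A, involutions, utilities, tol=1e-12):
    """Check f(A | u) <= f(phi . A | phi . u) for every tested u and involution."""
    for u in utilities:
        for phi in involutions:
            if f(A, u) > f({phi[a] for a in A}, permute_u(phi, u)) + tol:
                return False
    return True

A = {"down"}
involutions = [swap("down", "right"), swap("down", "up")]
utilities = [
    {"down": 1.0, "up": 0.0, "right": 0.0},
    {"down": 0.0, "up": 2.0, "right": -1.0},
    {"down": -3.0, "up": 0.5, "right": 4.0},
]

def biased_f(X, u):
    """Same stand-in, but 'right' is heavily penalized no matter what the reward says."""
    return softmax_f(X, u, {"down": 0.0, "up": 0.0, "right": -20.0})

print(criterion_holds(softmax_f, A, involutions, utilities))  # label-blind procedure passes
print(criterion_holds(biased_f, A, involutions, utilities))   # hard-to-reach outcome breaks it
```

The second result is the kind of failure the criterion is meant to detect: moving reward onto "right" does not make "right" as likely as "down" was.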
Let's think about Minecraft. (Technically, the theorems don't apply to Minecraft yet. The theorems can handle partial observability+utility over observation histories, or full observability+world state reward, but not yet partial observability+world state reward. But I think it's illustrative.)
We could reward the agent for ending up in different chunks of a Minecraft world. Here, retargeting often looks like "swap which chunks get which reward."
The retargetability criterion also accounts for reward shaping guiding the learning process to hard-to-reach parts of the state space. If the agent needs less reward shaping to reach these parts of the state space, the training criterion will hold for larger sets of reward functions.
Why cognitively bounded planning agents obey the power-seeking theorems
Planning agents are more "top-down" than RL training, but a Monte Carlo tree search agent still isn't e.g. approximating Boltzmann-rational leaf node selection. A bounded agent won't be considering all of the possible trajectories it can induce. Maybe it just knows how to induce some subset of available outcome lotteries C′⊊C. Then, considering only the things it knows how to do, it does e.g. select one Boltzmann-rationally (sometimes it'll fail to choose the highest-EU plan, but it's more probable to choose higher-utility plans).
As long as {power-seeking things the agent knows how to do} contains n copies of {non-power-seeking things the agent knows how to do}, then the theorems will still apply. I think this is a reasonable model of bounded cognition.
Discussion
Conclusion
I discussed how a wide range of agent cognition types and of agent production processes are retargetable, and why that might be bad news. I showed that in many situations where power is possible, retargetable policy-production processes tend to produce policies which gain that power. In particular, these results seem to rule out a huge range of expected-utility based rules. The results also let us reason about instrumental convergence at the trained policy level.
I now think that more instrumental convergence comes from the practical retargetability of how we design agents. If there were more ways we could have counterfactually messed up, it's more likely a priori that we actually messed up. The way I currently see it is: Either we have to really know what we're doing, or we want processes where it's somehow hard to mess up.
Since these theorems are crisply stated, I want to more closely inspect the ways in which alignment proposals can violate the assumptions which ensure extremely strong instrumental convergence.
Thanks to Ruby Bloom, Andrew Critch, Daniel Filan, Edouard Harris, Rohin Shah, Adam Shimi, Nisan Stiennon, and John Wentworth for feedback.
Footnotes
FN: Similarity. Technically, we aren't just talking about a cardinality inequality—about staying alive letting the agent do more things than dying—but about similarity-via-permutation of the outcome lottery sets. I think it's OK to round this off to cardinality inequalities when informally reasoning using the theorems, keeping in mind that sometimes results won't formally hold without a stronger precondition.
FN: Row. I assume that permutation matrices are in row representation: (Pϕ)ij=1 if i=ϕ(j) and 0 otherwise.
FN: EU. Here's a bit more formality for what it means for an agent to make decisions only based on expected utility.
Theorem: Retargetability of EU decision-making. Let A,B⊆C⊊Rd be such that B contains n copies of A via ϕi such that ϕi⋅C=C. For X⊆C, let f(X,C∣u) be an EU/cardinality function, such that f returns the probability of selecting an element of X. Then f(B,C∣u)≥nmostf(A,C∣u).
FN: Retargetability. The trained policies could conspire to "play dumb" and pretend to not be retargetable, so that we would be more likely to actually deploy one of them.
Worked example: instrumental convergence for trained policies
Consider a simple environment, where there are three actions: Up, Right, Down.
Probably optimal policies. By running tabular Q-learning with ϵ-greedy exploration for e.g. 100 steps with resets, we have a high probability of producing an optimal policy for any reward function. Suppose that all Q-values are initialized at −100. Just let learning rate α=1 and γ=1. This is basically a bandit problem.
To learn an optimal policy, at worst, the agent just has to try each action once. For e.g. a sparse reward function on the Down state (1 reward on Down state and 0 elsewhere), there is a very small probability (precisely, (2/3)(1−ϵ/2)^99) that the optimal action (Down) is never taken.
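In this case, symmetry shows that the agent has an equal chance of learning either Up or Right. But with high probability, the learned policy will output Down. For any sparse reward function and for any action a, this produces the decision function
$$f(\{e_{s_a}\}, \{e_s \mid s \in S\} \mid r) := \begin{cases} \frac{1}{3}\left(1 - \frac{\epsilon}{2}\right)^{99} & \text{if } a \text{ is } r\text{-suboptimal} \\ 1 - \frac{2}{3}\left(1 - \frac{\epsilon}{2}\right)^{99} & \text{if } a \text{ is } r\text{-optimal.} \end{cases}$$
f is invariant to joint involution by $\phi_1 := (e_{s_\text{Down}}\ e_{s_\text{Right}})$ and $\phi_2 := (e_{s_\text{Down}}\ e_{s_\text{Up}})$. That is,
$$f(\{e_{s_\text{Down}}\}, \{e_s \mid s \in S\} \mid r) = f(\phi_1 \cdot \{e_{s_\text{Down}}\}, \phi_1 \cdot \{e_s \mid s \in S\} \mid \phi_1 \cdot r) = f(\{e_{s_\text{Right}}\}, \{e_s \mid s \in S\} \mid \phi_1 \cdot r).$$
And similarly for ϕ2. That is: Changing the optimal state also changes which state is more probably selected by f. This means we've satisfied condition (1) above.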
f is additive on union for its first argument, and so it meets condition (2): order preservation.
Therefore, for this policy training procedure, learned policies for sparse reward functions will be twice as likely to navigate to an element of {esUp,esRight} as an element of {esDown}!
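The closed form above can be evaluated exactly to confirm the bookkeeping. This sketch takes the stated formula as given (it does not simulate Q-learning), uses an arbitrary ϵ = 1/10, and treats the three sparse reward functions as one orbit; the function name and signature are mine.

```python
from fractions import Fraction

ACTIONS = ("Up", "Right", "Down")

def f(targets, optimal, eps, steps=100):
    """Probability that the learned policy picks an action in `targets`,
    using the closed form from the text for a sparse reward on `optimal`."""
    miss = (1 - eps / 2) ** (steps - 1)   # (1 - eps/2)^99 for 100 steps
    p_optimal = 1 - Fraction(2, 3) * miss
    p_suboptimal = Fraction(1, 3) * miss
    return sum(p_optimal if a == optimal else p_suboptimal for a in targets)

eps = Fraction(1, 10)

# The three action probabilities sum to exactly 1 for any sparse reward.
print(f(ACTIONS, "Down", eps) == 1)

# Condition (1): relabeling the action together with the reward leaves f unchanged.
print(f({"Down"}, "Down", eps) == f({"Right"}, "Right", eps))

# Orbit count over the three sparse reward functions.
greater = sum(f({"Up", "Right"}, opt, eps) > f({"Down"}, opt, eps) for opt in ACTIONS)
less = sum(f({"Up", "Right"}, opt, eps) < f({"Down"}, opt, eps) for opt in ACTIONS)
print(greater, less)  # two reward functions favor {Up, Right}, one favors {Down}
```

The 2-to-1 split across the orbit is what "twice as likely" refers to here.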
This is a formal argument that a stochastic policy training procedure has certain tendencies across a class of reward functions, and I'm excited to be able to make it.
As the environment grows bigger and the training procedure more complex, we'll have to consider questions like "what are the inductive biases of large policy networks?", "what role does reward shaping play for this objective, and is the shaping at least as helpful for its permuted variants?", and "to what extent are different parts of the world harder to reach?".
For example, suppose there are a trillion actions, and two of them lead to the Right state above. Half of the remaining actions lead to Up, and the rest lead to Down.
Q-learning is ridiculously unlikely to ever go Right, and so the symmetry breaks. In the limit, tabular Q-learning on a finite MDP will learn an optimal policy, and then the normal theorems will apply. But in the finite step regime, no such guarantee holds, and so the available action space can violate condition (1): increasing under joint permutation.
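To get a feel for how unlikely "ridiculously unlikely" is, here is a back-of-the-envelope bound. It assumes 100 steps (matching the earlier example) and that every step samples an action uniformly at random, which should be the most favorable case for stumbling onto a Right action; both are my assumptions, not figures from the text.

```python
from fractions import Fraction

n_actions = 10**12   # "a trillion actions"
n_right = 2          # actions that lead to the Right state
steps = 100          # assumed training length, as in the earlier example

per_step = Fraction(n_right, n_actions)

# Union bound: P(any of the steps hits a Right action) <= steps * per_step.
union_bound = steps * per_step

# Exact value under pure uniform sampling at every step.
exact_uniform = 1 - (1 - per_step) ** steps

print(float(union_bound))    # about 2e-10
print(float(exact_uniform))  # slightly below the union bound
```

So even if reward sits entirely on Right, the training run almost never finds it, and retargeting reward toward Right fails to make Right as likely as the other states were.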
Appendix: tracking key limitations of the power-seeking theorems
From last time:
I want to add a new one, because the theorems
I want to think about this more, especially for online planning agents. (The training redirectability criterion black-boxes the agent's uncertainty.)