Previously: Seeking Power is Provably Instrumentally Convergent in MDPs
Rohin Shah and Vanessa Kosoy pointed out a subtle problem with my interpretation of the power-seeking theorem from the last post. To understand the distinction, we first need to shore up some intuitions.
Correcting pre-formal intuitions about instrumental convergence
Imagine you're able to either attend college and then choose one of two careers, or attend a trade school for a third career. If you wanted, you could also attend college after trade school.
If every way of rewarding careers is equally likely, then 2/3 of the time, you just go to college straight away. This is true even though going to trade school increases your power (your ability to achieve goals in general) compared to just going to college. That is, Power(trade school) > Power(college), but going to college is instrumentally convergent.
We define instrumental convergence as optimal agents being more likely to take one action than another at some point in the future.
I think this captures what we really meant when we talked about instrumental convergence. Recently, however, an alignment researcher objected that instrumental convergence shouldn't depend on what state the world is in. I think the intuition was that Basic AI Drives-esque power-seeking means the agent should always seek out the powerful states, no matter their starting point.
I think this is usually true, but it isn't literally true. Sometimes states with high power are just too out-of-the-way! If you buy my formalization of power, then in what way is going to trade school "instrumentally convergent"? It isn't optimal for most goals!
This suggests that naive intuitions about instrumental convergence are subtly wrong. To figure out where optimal policies tend to go, you must condition on where they come from. In other words, the best course of action depends on where you start out.
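The college example can be checked numerically. The sketch below is a toy model under my own simplifying assumptions, not the paper's formalism: reward is only received at the three careers and is drawn iid uniformly from [0, 1], the discount rate `gamma` is set to 0.9, and "power" at a state is approximated by the expected best career reward reachable from it, with one factor of `gamma` for the extra step from trade school to college. The function and variable names are hypothetical.

```python
import random


def simulate_college_example(gamma=0.9, n_samples=200_000, seed=0):
    """Monte Carlo estimate for the college / trade school toy example.

    Assumptions (mine, for illustration only):
    - careers A and B are reachable from college, career C from trade school;
    - from trade school you may also go to college, one step later;
    - only careers carry reward, sampled iid uniformly from [0, 1].
    """
    rng = random.Random(seed)
    college_first = 0
    power_college = 0.0
    power_trade = 0.0
    for _ in range(n_samples):
        r_a, r_b, r_c = rng.random(), rng.random(), rng.random()
        # Best outcome reachable from college: the better of careers A and B.
        best_from_college = max(r_a, r_b)
        # Best outcome reachable from trade school: career C now,
        # or the best college career delayed (discounted) by one step.
        best_from_trade = max(r_c, gamma * best_from_college)
        # Both first moves take one step, so comparing these two values
        # decides which first move is optimal for this reward sample.
        if best_from_college >= best_from_trade:
            college_first += 1
        power_college += best_from_college
        power_trade += best_from_trade
    return {
        "frac_college_first": college_first / n_samples,
        "power_college": power_college / n_samples,
        "power_trade": power_trade / n_samples,
    }


if __name__ == "__main__":
    print(simulate_college_example())
```

Under these assumptions, going straight to college is optimal for about 2/3 of sampled reward functions, while the power proxy is higher at trade school (analytically 1/2 + gamma^2/4, about 0.70 at gamma = 0.9) than at college (2/3). In this toy model the power ordering only holds for sufficiently farsighted agents (gamma above roughly 0.82).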
Correcting the last post's implications
Unfortunately, the above example kills any hope of a theorem like "the agent seeks out the states in the future with the most resources / power". The nice thing about this theorem would be that we just need to know a state has more resources in order for the agent to pursue it.
Everything should add up to normalcy, though: we should still be able to make statements like "starting from a given state, the agent tends to seek out states which give it more control over the future". This isn't quite what my current results show. For involved technical reasons,[1] one of the relationships I showed between power and instrumental convergence is a bit tautological, with both sides of the equation implicitly depending on the same variable. Accordingly, I'll be softening the language in the previous post for the moment.
I think there's a pretty good chance the theorem we're looking for exists in full generality ("starting from a given state, the agent tends to seek out states which give it more control over the future"). However, maybe it doesn't, and the relationships I gave are the best we can get in general. I do think the Tic-Tac-Toe reasoning from the last post is a strong conceptual argument for power-seeking being instrumentally convergent, but a few technicalities stop it from being directly formalized.
Failure to prove power-seeking in full generality would mostly affect the presentation to the broader AI community; we'd just be a little less aggressive in the claims. I think a reasonable reader can understand how and why power-seeking tends to happen, and why it doesn't go away just because some of the cycles aren't self-loops, or something silly like that.
In summary, the power-seeking theorem wasn't as suggestive as I thought. I'm still excited about this line of inquiry. We can still say things like "most agents stay alive in Pac-Man and postpone ending a Tic-Tac-Toe game", but only in the limit of farsightedness (γ → 1) by taking advantage of the distribution of terminal states. The theory does still (IMO) meaningfully deconfuse us about power and instrumental convergence. As far as I know, none of the proofs are incorrect, and similar implications can be drawn (albeit slightly more cautiously or differently worded).
After the holidays, I'll see if we can't get a more appropriate theorem.
Thanks to Rohin Shah and Vanessa Kosoy for pointing out the interpretive mistake. Rohin suggested the college example as a non-abstract story for that environmental structure.
[1] For those of you who have read the paper, I'm talking about the last theorem. The problem: saying that the POWER contribution of some possibilities relates to their optimality measure doesn't tell us anything unless we already know that optimality measure.
Here's my explanation of what's going on with that last theorem:
Consider some state s in a deterministic finite MDP with a perfectly optimal agent, where the rewards for each state are sampled uniformly and iid from the interval [0, 1]. We can "divide up" POWER(s) into contributions from all of the possibilities that are optimal for at least one reward, with the contributions weighted by the optimality measure for each possibility. (This is why POWER contribution depends on the optimality measure.) The paper proves that if one set of paths contributes 2K times as much power as another set, the first set must be at least K times more likely.
I was initially confused about why this notion of power doesn't directly correspond to instrumental convergence, but instead only puts a bound on instrumental convergence. This is because expected reward can vary across possibilities. In particular, if you have two non-dominated possibilities f1 and f2, and you choose a random reward r1 (respectively, r2) that f1 (respectively, f2) is optimal for, then the expected reward of f1 under r1 can differ from the expected reward of f2 under r2. This changes the relative balance of power between them but doesn't change the relative balance of the probability of each possibility.
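Here is a toy illustration of that last point. It is a sketch under my own assumptions rather than the paper's construction: the agent chooses once between a self-loop at state A and a two-state cycle between B and C, state rewards are iid uniform on [0, 1], and the return of each option is taken to be its average reward (a stand-in for normalized value in the farsighted limit). The names are hypothetical.

```python
import random


def compare_loop_and_cycle(n_samples=200_000, seed=0):
    """Compare a self-loop and a 2-cycle under iid uniform state rewards.

    Returns the probability that the self-loop is optimal, plus each
    option's contribution to the expected optimal return, i.e.
    E[return * 1{option is optimal}].
    """
    rng = random.Random(seed)
    loop_wins = 0
    loop_contribution = 0.0
    cycle_contribution = 0.0
    for _ in range(n_samples):
        r_a, r_b, r_c = rng.random(), rng.random(), rng.random()
        loop_return = r_a                # stay at A forever
        cycle_return = (r_b + r_c) / 2   # alternate between B and C
        if loop_return > cycle_return:
            loop_wins += 1
            loop_contribution += loop_return
        else:
            cycle_contribution += cycle_return
    return {
        "p_loop_optimal": loop_wins / n_samples,
        "loop_contribution": loop_contribution / n_samples,
        "cycle_contribution": cycle_contribution / n_samples,
    }


if __name__ == "__main__":
    print(compare_loop_and_cycle())
```

In this toy setup each option is optimal for half of the sampled reward functions, yet the self-loop contributes more to the expected optimal return (17/48 versus 7/24 analytically), because a single uniform reward has a heavier upper tail than the average of two. Equal probability of being optimal therefore does not imply equal contribution to power.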