LESSWRONG

Seeking Power is Often Convergently Instrumental in MDPs
Best of LessWrong 2019

Alex Turner lays out a framework for understanding how and why artificial intelligences pursuing goals often end up seeking power as an instrumental strategy, even if power itself isn't their goal. This tendency emerges from basic principles of optimal decision-making.

But he cautions that if you haven't internalized that Reward is not the optimization target, the concepts here, while technically accurate, may lead you astray in alignment research.

by TurnTrout
12 · TurnTrout
One year later, I remain excited about this post, from its ideas, to its formalisms, to its implications. I think it helps us formally understand part of the difficulty of the alignment problem. This formalization of power and the Attainable Utility Landscape have together given me a novel frame for understanding alignment and corrigibility.

Since last December, I’ve spent several hundred hours expanding the formal results and rewriting the paper; I’ve generalized the theorems, added rigor, and taken great pains to spell out what the theorems do and do not imply. For example, the main paper is 9 pages long; in Appendix B, I further dedicated 3.5 pages to exploring the nuances of the formal definition of ‘power-seeking’ (Definition 6.1).

However, there are a few things I wish I’d gotten right the first time around. Therefore, I’ve restructured and rewritten much of the post. Let’s walk through some of the changes.

‘Instrumentally convergent’ replaced by ‘robustly instrumental’

Like many good things, this terminological shift was prompted by a critique from Andrew Critch.

Roughly speaking, this work considered an action to be ‘instrumentally convergent’ if it’s very probably optimal, with respect to a probability distribution on a set of reward functions. For the formal definition, see Definition 5.8 in the paper.

This definition is natural. You can even find it echoed by Tony Zador in the Debate on Instrumental Convergence:

(Zador uses “set of scenarios” instead of “set of reward functions”, but he is implicitly reasoning: “with respect to my beliefs about what kind of objective functions we will implement and what the agent will confront in deployment, I predict that deadly actions have a negligible probability of being optimal.”)

While discussing this definition of ‘instrumental convergence’, Andrew asked me: “what, exactly, is doing the converging? There is no limiting process. Optimal policies just are.”

It would be more appropriate to say that an ac…
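To make the ‘very probably optimal’ definition concrete, here is a minimal sketch (an illustrative toy construction, not Definition 5.8 from the paper): it samples state-based reward functions uniformly at random in a small deterministic MDP and estimates how often each initial action is optimal. The MDP, the uniform reward distribution, and all identifiers are assumptions made for this example.

```python
# Minimal sketch: estimate how often each first action is optimal when
# reward functions are drawn uniformly at random over states.
import numpy as np

# Toy deterministic MDP: from state 0, "left" leads to the absorbing state 1,
# while "right" leads to state 2, which can stay or move on to state 3
# (i.e. "right" keeps more states reachable).
n_states = 4
transitions = {            # transitions[(state, action)] = next state
    (0, "left"): 1, (0, "right"): 2,
    (1, "stay"): 1,
    (2, "stay"): 2, (2, "up"): 3,
    (3, "stay"): 3,
}
actions = {s: [a for (s2, a) in transitions if s2 == s] for s in range(n_states)}
gamma = 0.9

def optimal_values(reward, iters=100):
    """Value iteration for a state-based reward r(s)."""
    v = np.zeros(n_states)
    for _ in range(iters):
        v = np.array([reward[s] + gamma * max(v[transitions[(s, a)]] for a in actions[s])
                      for s in range(n_states)])
    return v

rng = np.random.default_rng(0)
counts = {"left": 0, "right": 0}
n_samples = 1000
for _ in range(n_samples):
    reward = rng.uniform(size=n_states)   # reward sampled iid uniform on [0, 1]
    v = optimal_values(reward)
    # One-step lookahead from state 0 to find the optimal first action.
    q = {a: reward[0] + gamma * v[transitions[(0, a)]] for a in actions[0]}
    counts[max(q, key=q.get)] += 1

for a, c in counts.items():
    print(f"P({a!r} is optimal from state 0) ≈ {c / n_samples:.2f}")
```

In this toy environment, the action that keeps more states reachable is optimal for a larger share of sampled reward functions, which is the sense in which an action can be ‘robustly instrumental’ without being optimal for every reward function.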
65 · johnswentworth
This review is mostly going to talk about what I think the post does wrong and how to fix it, because the post itself does a good job explaining what it does right. But before we get to that, it's worth saying up-front what the post does well: the post proposes a basically-correct notion of "power" for purposes of instrumental convergence, and then uses it to prove that instrumental convergence is in fact highly probable under a wide range of conditions. On that basis alone, it is an excellent post.

I see two (related) central problems, from which various other symptoms follow:

1. POWER offers a black-box notion of instrumental convergence. This is the right starting point, but it needs to be complemented with a gears-level understanding of what features of the environment give rise to convergence.
2. Unstructured MDPs are a bad model in which to formulate instrumental convergence. In particular, they are bad for building a gears-level understanding of what features of the environment give rise to convergence.

Some things I've thought a lot about over the past year seem particularly well-suited to address these problems, so I have a fair bit to say about them.

Why Unstructured MDPs Are A Bad Model For Instrumental Convergence

The basic problem with unstructured MDPs is that the entire world-state is a single, monolithic object. Some symptoms of this problem:

* it's hard to talk about "resources", which seem fairly central to instrumental convergence
* it's hard to talk about multiple agents competing for the same resources
* it's hard to talk about which parts of the world an agent controls/doesn't control
* it's hard to talk about which parts of the world agents do/don't care about
* ... indeed, it's hard to talk about the world having "parts" at all
* it's hard to talk about agents not competing, since there's only one monolithic world-state to control
* any action which changes the world at all changes the entire world-state; there's no built-in w…
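For readers who want a concrete handle on the notion of "power" the review refers to, here is a rough, illustrative proxy (not the paper’s exact normalized definition): estimate a state’s power as its average optimal value over reward functions drawn uniformly at random, in a small unstructured MDP. The MDP and all identifiers are assumptions made for this example.

```python
# Rough proxy for POWER: a state's average optimal value under
# uniformly random state-based reward functions.
import numpy as np

gamma = 0.9
# Toy MDP given as successor states: state 0 chooses between the absorbing
# state 1 and state 2; state 2 can stay or move on to state 3.
transitions = {0: [1, 2], 1: [1], 2: [2, 3], 3: [3]}

def optimal_value(reward, iters=100):
    """Value iteration; returns the optimal value of every state."""
    v = np.zeros(len(transitions))
    for _ in range(iters):
        v = np.array([reward[s] + gamma * v[transitions[s]].max() for s in transitions])
    return v

rng = np.random.default_rng(0)
avg_value = np.mean([optimal_value(rng.uniform(size=4)) for _ in range(1000)], axis=0)
for s, p in enumerate(avg_value):
    print(f"average optimal value at state {s}: {p:.2f}")
```

States from which more futures remain reachable come out with higher average optimal value, which is the "keeping options open" intuition behind POWER; the review's complaint is that this single number says nothing about which features of the environment (resources, control over parts of the world) produce it.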
472 · Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 74 comments
946 · AGI Ruin: A List of Lethalities [Ω] · Eliezer Yudkowsky · 3y · 711 comments
902 · Where I agree and disagree with Eliezer [Ω] · paulfchristiano · 3y · 224 comments
867 · Eight Short Studies On Excuses · Scott Alexander · 15y · 253 comments
845 · Preface · Eliezer Yudkowsky · 10y · 17 comments
787 · The Best Textbooks on Every Subject · lukeprog · 15y · 416 comments
680 · What an actually pessimistic containment strategy looks like · lc · 3y · 138 comments
678 · SolidGoldMagikarp (plus, prompt generation) [Ω] · Jessica Rumbelow, mwatkins · 2y · 206 comments
656 · AI 2027: What Superintelligence Looks Like [Ω] · Daniel Kokotajlo, Thomas Larsen, elifland, Scott Alexander, Jonas V, romeo · 3mo · 222 comments
650 · Simulators [Ω] · janus · 3y · 168 comments
132 · An Opinionated Guide to Using Anki Correctly · Luise · 2d · 46 comments
141 · Comparing risk from internally-deployed AI to insider and outsider threats from humans [Ω] · Buck · 5d · 20 comments
493 · A case for courage, when speaking of AI danger · So8res · 8d · 121 comments
269 · Foom & Doom 1: “Brain in a box in a basement” [Ω] · Steven Byrnes · 11d · 102 comments
97 · Proposal for making credible commitments to AIs. · Cleo Nardo · 15d · 43 comments
156 · X explains Z% of the variance in Y · Leon Lang · 18d · 33 comments
224 · Do Not Tile the Lightcone with Your Confused Ontology [Ω] · Jan_Kulveit · 21d · 27 comments
173 · Futarchy's fundamental flaw · dynomight · 24d · 48 comments
168 · Estrogen: A trip report · cube_flipper · 1mo · 41 comments
81 · A Straightforward Explanation of the Good Regulator Theorem · Alfred Harwood · 1mo · 29 comments
142 · Broad-Spectrum Cancer Treatments · sarahconstantin · 1mo · 10 comments
150 · The Best Reference Works for Every Subject · Parker Conley · 1mo · 27 comments