I basically agree with the argument here. I think approaches to alignment that try to avoid instrumental convergence are generally unlikely to succeed, precisely because avoiding it removes the usefulness of AGI.[1] I also agree with jacob_cannell that the terminology choice of "power seeking" is unfortunate and misleading in this regard.
I think this is (at least for me) also one of the core generators of why alignment is so hard: AGI is dangerous for exactly the same reason it is useful. The danger comes not from one specific kind of failure or one specific module in the model, but from the fact that the things we want and the things we don't want fall out of the exact same kind of cognition.
[1]: I do think some work might be able to weasel out of this by exploiting the surprising effectiveness of less-general intelligence, plus the fact that capabilities research currently mostly pushes this kind of work; but that route hinges on a lot of specific assumptions, and I wouldn't bet on it.
I basically agree with the argument here. I think approaches to alignment that try to avoid instrumental convergence are generally unlikely to succeed, precisely because avoiding it removes the usefulness of AGI.
Note that this doesn't need to be a philosophical point; it's a physical fact that appears self-evident if you look at it through the lens of Active Inference: Active Inference as a formalisation of instrumental convergence.
The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence.
Yes, this is still underappreciated in most alignment discourse, perhaps because power-seeking has unfortunate negative connotations. A better, less loaded term might be Optionality-seeking. For example, human friendships increase long-term optionality (more social invites, social support, dating and business opportunities, etc.), so a human trading some wealth for activities that create and strengthen friendships can be instrumentally rational in an optionality-maximizing (empowerment) sense, even though that doesn't fit the (incorrect) stereotype of 'power-seeking'.
The problem is that we don't know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don't want.
Well, if humans are also agents to which instrumental convergence applies, as you suggest here:
Imitation learning is useful due to Aumann's Agreement Theorem and because instrumental convergence also applies to human intelligence
Then that suggests we can use instrumental convergence to help solve alignment, because optimizing for human empowerment becomes equivalent to optimizing for our unknown long-term values.
There are some caveats, of course: we may still need to incorporate some model of short-term values like hedonic reward, and it's also important to identify the correct agency to empower, which is probably not as simple as individual human brains. Humans are not purely selfishly rational but partially altruistic; handling that probably requires something like empowering humanity or generic agency more broadly, or empowering distributed software simulacra minds instead of brains.
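To make "empowerment" slightly more concrete, here is a minimal toy sketch under the simplifying assumption of deterministic dynamics, where n-step empowerment reduces to the log of the number of distinct states reachable within n steps. The gridworld and numbers are invented for illustration, and real proposals would measure the empowerment of the human, not of the AI.

```python
# Toy n-step empowerment in a deterministic gridworld: with deterministic
# dynamics, empowerment reduces to log2 of the number of distinct states the
# agent can reach within n steps. The 5x5 world is invented for illustration;
# fuller treatments use the channel capacity between actions and future states.
import math
from itertools import product

GRID = 5
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # up, down, right, left, stay

def step(state, action):
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def empowerment(state, n):
    reachable = set()
    for plan in product(ACTIONS, repeat=n):
        s = state
        for a in plan:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))

print(empowerment((2, 2), 2))  # centre of the grid: many reachable states
print(empowerment((0, 0), 2))  # corner: fewer options, lower empowerment
```

The point of the sketch is just that "optionality" has a standard formalisation: states from which more distinct futures are reachable score higher.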
I totally agree that the choice of "power seeking" is very unfortunate, for the same reasons you describe. I don't think optionality is quite it, though. I think "consequentialist" or "goal seeking" might be better (or we could just stick with "instrumental convergence"--it at least has neutral affect).
As for it being underappreciated, that's possibly true, though anecdotally I already strongly believed this, and in fact a large part of my generator for why I think alignment is difficult is based on it.
I think I disagree about leveraging this for alignment but I'll read your proposal in more detail before commenting on that further.
Power-seeking has unfortunate negative connotations. A better, less loaded term might be Optionality-seeking.
I think some people use the term "power-seeking" to refer specifically to the negative connotations of the term (hacking into a data center, developing harmful bioweapons and deploying them to retain control, etc).
I think to maximize legibility to different kinds of people, it helps to include "instrumental convergence usually involves all things converging on the path of accumulating resources (including information) and self-preservation, and, by extension, anything that contributes to self-preservation and accumulating resources".
Instrumental Convergence might seem like common language, but it certainly is not, and it really should be. Evocative examples increase word count, but they're so helpful that cutting them to lower word count is basically Goodharting. At minimum, they will help an unfamiliar reader quickly and conveniently explain the idea to others in casual conversation.
Also, this post is very helpful; I have archived it in my open-source intelligence/research folder and plan to cite it in the future. The way this post is written is actually the ideal way to explain AI safety to someone for the first time, since it allows a wide variety of intelligent people to really dive into the problem in a way they understand. I recommend submitting it to the AI Safety Public Materials Contest, which I recently confirmed will still read, evaluate, and implement good papers for explaining AI safety to people for the first time. Regardless of whether they have any money to award, they will still credit the author with a ranking that will build credibility.
Hmm. I see what you mean about how the post could be expanded into an intro to AI safety if I gave more examples and such. That's not exactly the target audience I had in mind while writing it; I was thinking more of someone who already has the basic intro but might not have a strong framework for thinking definitively about broader questions.
I'll look into making a beginner-friendly version of it, and ping you if I have anything. I'll likely be busy, though, so I would be open to you adding the examples etc. for beginners and then splitting the prize for the contest if you want.
I basically agree with this post but want to push back a little bit here:
The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence. The problem is that we don't know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don't want.
Yes, some level of power-seeking-like behavior is necessary for the AI to do impressive stuff. But I don't think that means giving up on the idea of limiting power-seeking. One model could look like this: for a given task, some level of power-seeking is necessary (e.g. to build working nanotech, you need to do a bunch of experiments and simulations, which requires physical resources, compute, etc.). But by default, the solution an optimization process would find might do even more power-seeking than that (killing all humans to ensure they don't intervene, turning the entire earth into computers). This higher level of power-seeking does increase the success probability (e.g. humans interfering is a genuine issue in terms of the goal of building nanotech). But this increase in success probability clearly isn't necessary from our perspective: if humans try to shut down the AI, we're fine with the AI letting itself be shut off (we want that, in fact!). So the argument "we want power-seeking" isn't strong enough to imply "we want arbitrary amounts of power-seeking, and trying to limit it is misguided".
I think of this as two complementary approaches to AI safety:
1. Align the power-seeking: direct the instrumentally convergent behavior toward what we actually want.
2. Limit the power-seeking: keep it close to the level the task actually requires, rather than the maximum that would marginally increase success probability.
I see this post as a great write-up for "We need some power-seeking/instrumentally convergent behavior, so AI safety isn't about avoiding that entirely" (a rock would solve that problem; it doesn't seek any power). I just want to add that my best guess is we'll want to do some mix of 1. and 2. above, not just 1. (or at least, we should currently pursue both strategies, because it's unclear how tractable each one is).
I don't totally disagree, but two points:
No Free Lunch theorems only apply to a system that is at maximum entropy. Generally intelligent systems (e.g. AIXI) are possible because the simplicity prior is useful in our own universe (in which entropy is not at a maximum). Instrumental convergence isn't at the root of intelligence; simplicity is.
As an example, consider two tasks with no common subgoals: say factoring large integers and winning at Go. Imagine we are trying to find an algorithm that will excel at both of these while running on a Turing machine. There are no real-world resources to acquire, hence instrumental convergence isn't even relevant. However, an algorithm that assumes a simplicity prior (like AIXI) will still outperform one that doesn't (say, sampling all possible Go-playing/number-factoring algorithms and then picking the one that performs best).
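A minimal sketch of what a simplicity prior buys you, in this spirit (the hypotheses and their "description lengths" are invented for illustration; this is not AIXI, just the weighting idea):

```python
# Toy sketch of a simplicity prior: weight each hypothesis that fits the data
# by 2 ** (-description_length) and predict with the resulting posterior.
# Hypotheses and their "description lengths" are invented for illustration.

observed = [2, 4, 6, 8]

# (name, description length in bits, prediction function given a prefix)
candidates = [
    ("add 2 each step",                    5, lambda xs: xs[-1] + 2),
    ("double the previous number",         7, lambda xs: xs[-1] * 2),
    ("memorised table: 2, 4, 6, 8, 100",  40, lambda xs: [4, 6, 8, 100][len(xs) - 1]),
]

def fits(predict, xs):
    """A hypothesis fits if it retrodicts every observed item from its prefix."""
    return all(predict(xs[:i]) == xs[i] for i in range(1, len(xs)))

weights = {name: 2.0 ** -length
           for name, length, predict in candidates if fits(predict, observed)}
posterior = {name: w / sum(weights.values()) for name, w in weights.items()}
print(posterior)
# "add 2 each step" dominates the long memorised table even though both fit
# the data perfectly; that preference for short hypotheses is the simplicity prior.
```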
No Free Lunch theorems only apply to a system that is at maximum entropy. Generally intelligent systems (e.g. AIXI) are possible because the simplicity prior is useful in our own universe (in which entropy is not at a maximum). Instrumental convergence isn't at the root of intelligence; simplicity is.
I think "maximum entropy" is mostly just another way of saying "no common tasks". One can probably artificially construct cases where this isn't the case, but the cases that show up in reality seem to follow that structure.
As an example, consider two tasks with no common subgoals: say factoring large integers and winning at Go.
Existing general-purpose methods for winning at Go abstract the state space and action space down to what is specified by the rules, i.e. a 19x19 grid where you can place black or white stones under certain conditions. Thus a wide variety of different Go games will, for instance, start with the same state according to these general-purpose methods, and therefore be reduced to common subgoals.
They go further. There are patterns such as "ladders" that can be learned to inform strategy in many different states, and there are simple rules that can be used to figure out the winner from a final state.
These are the sorts of things that make it possible to create general-purpose Go algorithms.
Now you are right that these things don't show up with factoring large integers at all. Under my model, this predicts that the algorithms we use for solving Go are basically distinct from the algorithms that we use for factoring large integers. Which also seems true. That said, there are some common subgoals between factoring large integers and winning at Go, such as having a lot of computation power; and indeed we often seem to use computers for both.
Imagine we are trying to find an algorithm that will excel at both of these while running on a Turing machine. There are no real-world resources to acquire, hence instrumental convergence isn't even relevant. However, an algorithm that assumes a simplicity prior (like AIXI) will still outperform one that doesn't (say, sampling all possible Go-playing/number-factoring algorithms and then picking the one that performs best).
I think AIXI would underperform the algorithm of "figure out whether you are playing Go or factoring numbers, and then either play Go or factor the number"? Not sure what you are getting at here.
TL;DR: General intelligence is possible because solving real-world problems requires solving common subtasks. Common subtasks are what give us instrumental convergence. Common subtasks are also what make AI useful; you want AIs to pursue instrumentally convergent goals. Capabilities research proceeds by figuring out algorithms for instrumentally convergent cognition. Consequentialism and search are fairly general ways of solving common subtasks.
General intelligence is possible because solving real-world problems requires solving common subtasks
No-free-lunch theorems assert that any cognitive algorithm is equally successful when averaged over all possible tasks. This might sound strange, so here's an intuition pump. Suppose you get a test like:

5 + 5 = ?

19 - 3 = ?

7 × 8 = ?
and so on. One cognitive algorithm would be to evaluate each arithmetic expression and fill in the result. This algorithm seems so natural that it's hard to imagine how the no-free-lunch theorem could apply to it; what possible task could ever make arithmetic score poorly on questions like the above?
Easy: while an arithmetic evaluator would score well on a task where you get 1 point for each expression you evaluate correctly, it would score very poorly on a task where you lose 1 point for each expression you evaluate correctly.
This doesn't matter much in the real world because you are much more likely to encounter situations where it's useful to do arithmetic right than you are to encounter situations where it's useful to do arithmetic wrong. No-free-lunch theorems point out that when you average all tasks, useful tasks like "do arithmetic correctly" are perfectly cancelled out by useless tasks like "do arithmetic wrong"; but in reality you don't average over all conceivable tasks.
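A minimal sketch of this averaging argument (the questions, answers, and scoring rules are invented for illustration):

```python
# Toy illustration of the no-free-lunch averaging argument.
# The same "evaluate the arithmetic correctly" algorithm is scored on two
# mirror-image tasks: one rewards correct evaluation, the other penalizes it.

QUESTIONS = ["2 + 2", "7 * 6", "10 - 3"]
ANSWERS = [4, 42, 7]

def arithmetic_evaluator(expression: str) -> int:
    """Cognitive algorithm under test: just do the arithmetic."""
    return eval(expression)  # fine for trusted toy inputs like "2 + 2"

def score(algorithm, reward_correct: bool) -> int:
    """+1 per correct answer on the 'do arithmetic right' task,
    -1 per correct answer on the 'do arithmetic wrong' task."""
    sign = 1 if reward_correct else -1
    return sum(sign for q, a in zip(QUESTIONS, ANSWERS) if algorithm(q) == a)

print(score(arithmetic_evaluator, reward_correct=True))   # 3
print(score(arithmetic_evaluator, reward_correct=False))  # -3
# Averaged over both tasks the evaluator scores 0; in reality we almost only
# ever face the first kind of task, which is why the evaluator is useful.
```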
If there were no correlations between subtasks, there would be no generally useful algorithms. And if every goal required a unique algorithm, general intelligence would not exist in any meaningful sense; the generally useful cognitions are what constitute general intelligence.
Common subtasks are what give us instrumental convergence
Instrumental convergence basically reduces to acquiring and maintaining power (when including resources under the definition of power). And this is an instance of common subtasks: lots of strategies require power, so a step in lots of strategies is to accumulate or preserve power. Therefore, just about any highly capable cognitive system is going to be good at getting power.
"Common subtasks" views instrumental convergence somewhat more generally than is usually emphasized. For instance, instrumental convergence is not just about goals, but also about cognitive algorithms. Convolutions and big matrix multiplications seem like a common subtask, so they can be considered instrumentally convergent in a more general sense. I don't think this is a major shift from how it's usually thought of; computation and intelligence are usually considered as instrumentally convergent goals, so why not algorithms too?
Common subtasks are also what make AI useful; you want AIs to pursue instrumentally convergent goals
The logic is simple enough: if you have an algorithm that solves a one-off task, then it is at most going to be useful once. Meanwhile, if you have an algorithm that solves a common task, then that algorithm is commonly useful. An algorithm that can classify images is useful; an algorithm that can classify a single image is not.
This applies even to power-seeking. One instance of power-seeking would be earning money; indeed an AI that can autonomously earn money sounds a lot more useful than one that cannot. It even applies to "dark" power-seeking, like social manipulation. For instance, I bet the Chinese police state would really like an AI that can dissolve rebellious social networks.
The problem is not that we don't know how to prevent power-seeking or instrumental convergence, because we want power-seeking and instrumental convergence. The problem is that we don't know how to align this power-seeking, how to direct the power towards what we want, rather than having side-effects that we don't want.
Capabilities research proceeds by figuring out algorithms for instrumentally convergent cognition
Instrumentally convergent subgoals are actually fairly nontrivial. "Acquire resources" isn't a primitive action; it needs a lot of supporting cognition. The core of intelligence isn't "simple" per se; rather, it is complex algorithms distilled from experience (or evolution) against common tasks. A form of innate wisdom, if you will.
In principle it might seem simple; we have basic theorems showing that ideal agency looks somewhat like π = argmax_p E[u | do(p)], or something roughly like that. The trouble is that this includes an intractable maximum and an intractable expected value. Thus we need to break it down into tractable subproblems; these subproblems exploit lots of detail about the structure of reality, and so they are themselves highly detailed.
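As a toy illustration of the intractability point (the three-state MDP below is invented): a literal reading of the formula enumerates every deterministic policy and every branch of the expectation, which is fine here and hopeless at realistic sizes.

```python
# Literal reading of  pi = argmax_p E[u | do(p)]  on a tiny invented MDP:
# enumerate every deterministic policy and compute its expected utility exactly.
# The policy space has |A| ** |S| entries and the expectation branches over every
# trajectory, which is what makes the idealized formula intractable at scale.
from itertools import product

STATES = [0, 1, 2]
ACTIONS = [0, 1]
HORIZON = 3

def transition_probs(state, action):
    """Invented toy dynamics: the chosen step mostly succeeds, sometimes overshoots."""
    nxt = (state + action) % len(STATES)
    return {nxt: 0.8, (nxt + 1) % len(STATES): 0.2}

def expected_utility(policy, state, depth):
    """Exact expectation of 'time spent in state 2' under the policy."""
    if depth == HORIZON:
        return 0.0
    total = 0.0
    for nxt, p in transition_probs(state, policy[state]).items():
        reward = 1.0 if nxt == 2 else 0.0
        total += p * (reward + expected_utility(policy, nxt, depth + 1))
    return total

# Every deterministic policy is a mapping state -> action: |A| ** |S| = 8 of them here.
policies = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]
best = max(policies, key=lambda pi: expected_utility(pi, state=0, depth=0))
print(best, expected_utility(best, 0, 0))
```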
The goal of capabilities research is basically to come up with algorithms that do well on commonly recurring subproblems. 2D CNNs are commonly useful due to the way light interacts with the world. Self-supervised learning from giant scrapes of the internet is useful because the internet scrapes are highly correlated with the rest of reality. Imitation learning is useful due to Aumann's Agreement Theorem and because instrumental convergence also applies to human intelligence. And so on.
Maybe we find a way to skip past all the heuristics and unleash a fully general learner that can independently figure out all the tricks, without needing human capabilities researchers to help further. This is not a contradiction to common subtasks being what drives general intelligence, since "figure out generally useful tricks" seems like a generally useful subtask to be able to solve. However, the key point is that even if there is no efficient "simple core of intelligence", the "common tasks" perspective still gives a reason why capabilities research would discover instrumentally convergent general intelligence, through accumulating tons of little tricks.
Consequentialism and search are fairly general ways of solving common subtasks
Reality seems to have lots of little subproblems that you can observe, model, analyze, and search for solutions to. This is basically what consequentialism is about. It gives you a very general way of solving problems, as long as you have sufficiently accurate models. There are good reasons to expect consequentialism to be pervasive.
AI researchers are working on implementing general consequentialist algorithms, e.g. in the reinforcement learning framework. So far, the search methods their algorithms use are often of the naive "try lots of things and do what seems to work" form, but this is not the only form of search that exists. Efficient general-purpose search instead tends to involve reasoning about abstract constraints rather than particular policies. Because search and consequentialism are so commonly useful, we have lots of reason to expect them to exist in general intelligences.
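A minimal sketch of that naive form of search, random shooting over action sequences in an invented toy environment (the dynamics and reward are made up for illustration):

```python
# "Try lots of things and do what seems to work": random-shooting search over
# action sequences in an invented one-dimensional toy environment.
import random

TARGET = 5

def rollout(actions, start=0):
    """Invented dynamics: each action moves the agent by -1 or +1;
    return is higher the closer the final position is to TARGET."""
    position = start
    for a in actions:
        position += a
    return -abs(position - TARGET)

def naive_search(num_candidates=1000, horizon=7):
    """Sample many random plans and keep whichever scored best."""
    best_plan, best_return = None, float("-inf")
    for _ in range(num_candidates):
        plan = [random.choice([-1, 1]) for _ in range(horizon)]
        ret = rollout(plan)
        if ret > best_return:
            best_plan, best_return = plan, ret
    return best_plan, best_return

print(naive_search())  # with enough samples, usually finds a plan ending at TARGET
```

Efficient search would instead exploit the structure of the problem (here, that each +1 moves you one step closer to the target) rather than sampling plans blindly.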
Thanks to Justis Mills for proofreading and feedback.