Great job writing up your thoughts, insights, and model!
My mind is mainly drawn to the distinction you make between capabilities and agency. In my own model, agency is a necessary part of increasing capabilities and will by definition emerge in superhuman intelligence. I think the same conclusion follows from the definitions you use, as follows:
You define "capabilities" by the Legg and Hutter definition you linked to, which reads:
Intelligence measures an agent's ability to achieve goals in a wide range of environments
You define "agency" as:
if it plans and executes actions in order to achieve goals
Thus an entity with higher intelligence achieves goals in a wider range of environments. What would it take to achieve goals in a wider range of environments? The ability to plan.
There is a ceiling where environments become so complex that single-step actions will not work to achieve one's goals. Planning is the mental act of stringing multiple action steps together to achieve goals. Thus planning is a key part of intelligence: it is the scaling up of one of the parameters of intelligence, namely the length of the action sequences you can consider. Therefore it follows that no superhuman intelligence will exist without agency, by definition of what superhuman intelligence is. And thus AGI risk is about agentic AI that is better at achieving goals in a wider range of environments than we are. It will generate plans that we can't understand, because it will literally be "more" agentic than us, insofar as one can use that term to indicate degree by referring to the complexity of the plans one can consider and generate.
I'm also going to take a shot at your questions as a way to test myself. I'm new to this, so forgive me if my answers are off base. Hopefully other people will pitch in and push in other directions if that's the case, and I'll learn more in the process.
In order to expect real world consequences because of instrumental convergence:
- How strong and general are the required capabilities?
Unsure. You raise an interesting point: I tend to mostly worry about superhuman intelligence, because that intelligence can by definition beat us at everything. Your question makes me wonder if we are already in trouble before that point. Then again, on the practical side, presumably the problem happens somewhere between "the smartest animal we know" and "our intelligence", and once we are near that, recursive self-improvement will make the distinction moot, I'm guessing (as AGI will shoot past our intelligence level in no time). So, I don't know ... this is part of my motivation to just work on this stuff ASAP, just in case that point comes sooner than we might imagine.
- How much agency is required?
I don't think splitting off agency as a separate variable is relevant here, as argued above. In my model, you are simply asking "how many steps do agents need in their plans before they cause trouble", which is a proxy for "how smart do they need to be to cause trouble", and thus the same as the previous question.
- How much knowledge about the environment and real world is required?
If you tell them about the world as a hypothetical, and they engage in that hypothetical as if you are indeed the person you say you are in it, then they can already wreak havoc by "pretending" to implement what they would do in that hypothetical and applying social manipulation to you, such that you set off a destructive chain of actions in the world. So, very little. And any amount of information they would need in order to be useful to us would be enough for them to also hurt us.
- How likely are emergent sub-agents in non-agentic systems?
See answer 2 questions up.
- How quickly and how far will capabilities of AI systems improve in the foreseeable future (next few months / years / decades)?
I don't know. I've been looking at Steinhardt's material on forecasting and now have Superforecasting on my reading list. But then I realized: why even bother? Personally I consider a 10% risk of superhuman intelligence being created in the next 150 years to be too much. I think I read somewhere that 80% of ML researchers (not specific to AIS or LW) think AGI will happen in the next 100 years or so. I should probably log that source somewhere, in case I'm misquoting. Anyway, unless your action plans differ based on specific thresholds, I think spending too much time forecasting is a sort of emotional red herring that drains resources that could go into actual alignment work. I'm not saying you have actually done this; rather, I'm trying to warn against people going in that direction.
- How strong are economic incentives to build agentic AI systems?
- What are concrete economically valuable tasks that are more easily solved by agentic approaches than by (un)supervised learning?
Very strong: solve the climate change crisis. Bioengineer better crops with better yields. Do automated research that actually invents more efficient batteries, vehicle designs, supercomputers, etc. The list is endless. Intelligence is the single most valuable asset a company can develop, because it solves problems, and solving problems makes money.
- How hard is it to design increasingly intelligent systems?
- How (un)likely is an intelligence explosion?
Not sure... I am interested in learning more about what techniques we can use to ensure a slow take-off. My current intuition is that if we don't try really hard, a fast take-off will be the default, because an intelligent system will converge toward recursive self-improvement, as per the instrumentally convergent goal of cognitive enhancement.
Apart from these rather concrete and partly also somewhat empirical questions I also want to state a few more general and also maybe more confused questions:
- How high-dimensional is intelligence? How much sense does it make to speak of general intelligence?
I love this question. I do think intelligence has many subcomponents that science hasn't teased apart entirely yet. I'd be curious to learn more about what we do know. Additionally, I have a big worry about what new emergent capabilities a superhuman intelligence may have. This is analogous to our ability to plan, to have consciousness, to use language, etc. At some point between single-celled organisms and Homo sapiens, these abilities just emerged from scaling up computational power (I think? I'm curious about counterarguments around architecture or other aspects mattering or being crucial as well).
- How independent are capabilities and agency? Can you have arbitrarily capable non-agentic systems?
See above answers.
- How big is the type of risk explained in this post compared to more fuzzy AI risk related to general risk from Moloch, optimization, and competition?
- How do these two problems differ in scale, tractability, neglectedness?
Not sure I understand the distinction entirely ... I'd need more words to understand how these are separate problems. I can think of different ways to interpret the question.
- What does non-existential risk from AI look like? How big of a problem is it? How tractable? How neglected?
My own intuition is actually that pure S-risk is quite low/negligible. Suffering is a waste of energy, basically inefficient, toward any goal except itself. It's like an anti-convergent goal, really. Of course, we do generate a lot of suffering while trying to achieve our goals, but these are inefficiencies. E.g., think of the bio-industry: if we created lab meat grown from single cells, there would be no suffering, and this would presumably be far more efficient. Thus any S-risks from AGI will be quite momentary. Once it figures out how to do what it really wants to do without wasting resources/energy making us suffer, it will just do that. Of course there is a tiny sliver of probability that it will take our suffering as its primary goal, but I think this is ridiculously unlikely, and something our minds get glued onto due to that bias which makes you focus on the most absolutely horrible option no matter the probability (I forget what it's called).
Thanks for your replies. I think our intuitions regarding intelligence and agency are quite different. I deliberately mostly stuck to the word 'capabilities', because in my intuition you can have systems with very strong and quite general capabilities that are not agentic.
One very interesting point is where you say: “Presumably the problem happens somewhere between "the smartest animal we know" and "our intelligence", and once we are near that, recursive self-improvement will make the distinction moot”. Can you explain this position more? In my intuition, building and improving intelligent systems is far harder than that.
I hope to come back later to your answer about knowledge of the real world.
What distinguishes capabilities and intelligence to your mind, and what grounds that distinction? I think I'd have to understand that to begin to formulate an answer.
I've unfortunately been quite distracted, but better a late reply than no reply.
With capabilities I mean how well a system accomplishes different tasks. This is potentially high dimensional (there can be many tasks that two systems are not equally good at). Also it can be more and less general (optical character recognition is very narrow because it can only be used for one thing, generating / predicting text is quite general). Also, systems without agency can have strong and general capabilities (a system might generate text or images without being agentic).
This is quite different from the definition by Legg and Hutter, which is more specific to agents. However, since last week I have updated on strongly and generally capable non-agentic systems being less likely to actually be built (especially before agentic systems). In consequence, the difference between my notion of capabilities and a more agent related notion of intelligence is less important than I thought.
tldr: In this post I explain my current model of AI risk, after failing at my goal of understanding everything completely without relying on arguments by authority. I hope this post is valuable in explaining the topic and also in giving an example of what a particular newbie did and did not find easy to understand / accessible.
This was not planned as a submission for the AI Safety Public Materials contest, but feel free to consider it.
I thank all those who helped me understand more about AI risk and AI safety and I thank benjaminalt, Leon Lang, Linda Linsefors, and Magdalena Wache for their helpful comments.
Ideas about advanced artificial intelligence (AI) have always been one of the core topics on LessWrong and in the rationality community in general. One central aspect of this is the concern that the development of artificial general intelligence (AGI) might pose an existential threat to humanity. Over the last few years, these ideas have gained a lot of traction, and the prevention of risk from A(G)I has grown into one of the central cause areas in Effective Altruism (EA). Recently, the buzz around the topic has increased even more, at least in my perception, after Eliezer's April Fools' post and List of Lethalities.
Having first read about AGI and risk from AGI in 2016, I got more and more interested in the topic, until this year I finally decided to actually invest some time and effort to learn more about it. I therefore signed up for the AGI safety fundamentals course by EA Cambridge. Unfortunately, after completing it I still felt pretty confused about my core question of how seriously I should take existential risk (x-risk) from AI. In particular, I was confused about how much of the concern for AI safety is due to the topic being a fantastic nerd-snipe that might be amplified by group-think and halo effects around high-profile figures in the rationality and EA communities.
I therefore set out to understand the topic from the ground up, explicitly making sense of the arguments and basic assumptions without relying on arguments by authority. I did not fully succeed at this, but in this post I present my current model about x-risk from advanced AI, as well as a brief discussion of what I feel confused about. The focus lies on understanding existential risk directly caused by AI systems, as this is what I was most confused about. In particular, I am not trying to explain risks caused by more complex societal and economic interactions of AI systems, as this seems like somewhat of a different problem and I understand it even less.
ⅰ. Summary
Very roughly, I think the potential risk posed by an AI system depends on the combination of its capabilities and its agency. An AI system with sufficiently high and general capabilities as well as strong agency very likely poses a catastrophic and even existential risk. If, on the other hand, the system is lacking in either capabilities or agency, it is not a direct source of catastrophic or existential risk.
Outline. The remainder of this post will quickly explain what I mean by capabilities and by agency. I will then explain why the presence of sufficient capability and agency is dangerous. Subsequently, the post discusses my understanding of why both strong capabilities and agency can be expected to be features of future AI systems. Finally, I'll point out the major confusions I still have.
ⅱ. What do I mean by capability?
With capability I refer to the cognitive power of the system or to its 'intelligence'. Concepts and terms such as Artificial General Intelligence (AGI), Artificial Superintelligence (ASI), High-Level Machine Intelligence (HLMI), Process for Automating Scientific and Technological Advancement (PASTA), Transformative AI (TAI) and others can all be seen as describing certain capability levels. When describing such levels of capabilities, I feel that it often makes sense to separate two distinct dimensions of 'wideness' and 'depth'. Depth with respect to a certain task refers to how good a system is at this task, while wideness describes how diverse and general the types of tasks are that a system can perform. This also roughly fits into the definition of intelligence by Legg and Hutter (see Intelligence - Wikipedia), which is also the definition I would settle on for this post, if this was necessary.
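For reference, the formal version of that definition (the formula is Legg and Hutter's; nothing else in this post depends on the notation): an agent π is scored by its expected reward across all computable environments, each weighted by its simplicity,

$$\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)}\, V_\mu^{\pi},$$

where E is the set of computable environments, K(μ) is the Kolmogorov complexity of environment μ, and V_μ^π is the expected total reward agent π achieves in μ. Roughly, 'wideness' corresponds to the range of environments the sum runs over, and 'depth' to how large V_μ^π gets in each of them.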
Quick note on AGI
The term AGI seems to assume that the concept of general intelligence makes sense. It potentially even assumes that humans have general intelligence. I personally find the assumption of general intelligence problematic, as I am not convinced that generality is something that can easily be judged from the inside. I don't know what general intelligence would entail and whether humans are actually capable of it. That said, human cognition clearly is capable of things as general as writing poetry, running civilization and landing rockets on the moon (see also Four background claims, by Nate Soares). Admittedly, this seems to be at least somewhat general. I like terms such as TAI or PASTA, as they do not require the assumption of generality and instead just describe what some system is capable of doing.
ⅲ. What do I mean by agency?
The concept of agency seems fuzzier than that of capability. Very roughly speaking, a system has agency if it plans and executes actions in order to achieve goals (for example as specified by a reward or loss function). This often includes having a model of the environment or world and perceiving it via certain input channels.
The concept of an agent appears in several fields of science, like game theory, economics and artificial intelligence, and these fields provide examples of what I mean by agency: players in game theory, consumers and firms in economics, and reinforcement learning agents in AI are all modeled as agents.
Russell and Norvig define an agent as "anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators" and a rational agent as "an agent that acts so as to maximize the expected value of a performance measure based on past experience and knowledge". I am not completely sure how much I like their definition, but together with the examples above, it should give a good intuition of what I mean by agency. Also note that I sometimes treat agency as a continuum, based on how strongly a system behaves like an agent.[1]
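To make that definition a bit more concrete, here is a minimal sketch of the perceive-and-act loop it describes. The names (`Environment`, `Agent`, `run`) are illustrative only, not taken from any existing library; the agentic part is everything hidden inside `choose_action`.

```python
from typing import Any, Protocol

class Environment(Protocol):
    def percept(self) -> Any: ...            # what the agent's sensors currently report
    def act(self, action: Any) -> None: ...  # the effect of the agent's actuators

class Agent(Protocol):
    def choose_action(self, percept: Any) -> Any: ...  # policy / planning step

def run(agent: Agent, env: Environment, steps: int) -> None:
    """Perceive-then-act loop: the agent repeatedly maps percepts to actions."""
    for _ in range(steps):
        percept = env.percept()
        action = agent.choose_action(percept)
        env.act(action)
```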
ⅳ. Why do sufficient agency and capabilities lead to undesired outcomes?
With my notion of the basic concepts laid out, I'll explain why I think highly capable and agentic AI systems pose a threat. The basic argument is that regardless of the specific goal / reward function an agent pursues, there are some sub-goals that are generally useful and likely to be part of the agent's overall plan. For example, no matter the actual goal, it is generally in the interest of that goal that it not be altered. Along the same line of thought, any agent will want to avoid being switched off, as this stands in the way of basically any useful goal not specifically designed for corrigibility (goals that make the agent want to shut itself off are counterexamples to this, but rather pointless ones). Goals that are subgoals of almost any other specific goal are called convergent instrumental goals (see Bostrom, and Instrumental convergence - Wikipedia). Other examples of convergent instrumental goals include resource acquisition and self-improvement.
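A toy way to see the shut-off point, with entirely made-up numbers: for an agent that simply accumulates reward over time, every positive per-step reward makes "keep running" score higher than "allow shutdown", independently of what the goal actually is.

```python
def expected_return(per_step_reward: float, steps_remaining: int) -> float:
    """Undiscounted return for an agent that keeps pursuing its goal."""
    return per_step_reward * steps_remaining

HORIZON = 100      # steps the agent gets to run if it is not switched off
SHUTDOWN_AT = 10   # step at which the operator would switch it off

for goal_reward in [0.1, 1.0, 5.0]:    # arbitrary goals, any positive reward rate
    allow = expected_return(goal_reward, SHUTDOWN_AT)
    resist = expected_return(goal_reward, HORIZON)
    print(f"reward/step={goal_reward}: allow shutdown={allow}, avoid shutdown={resist}")
    assert resist > allow              # holds for every positive per-step reward
```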
Another way to frame this idea is that in many environments in which an agent is trained, "hacking" the environment leads to higher rewards than are possible with legal actions in the environment. Here I use the term hacking to refer to anything between unwanted exploitation of misspecified goals (funny gif, longer explanation), bugs or glitches, and actual hacking of the computational environment. This is already somewhat annoying with rather weak agents barely capable enough to play Atari games. However, with an agent possessing sufficiently strong and general cognitive power, it becomes something to worry about. For instance, a sufficiently smart agent with enough knowledge about the real world can be expected to want to prevent the physical hardware it is running on from being shut off. It could maybe be argued that a sufficiently smart Atari agent might not only realize that the Atari emulator has an exploitable bug, but also try to take actions in the real world. Admittedly, I find it hard to take this as a realistic concern for even highly capable agents, because it would require the agent to acquire knowledge about the territory it is embedded in that it can hardly attain. This point is discussed further in Section ⅵ, Why worry about agency.
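As a minimal, hypothetical illustration of the mild end of "hacking" (the cleaning-robot setup below is invented for this sketch, not taken from the post or any real benchmark): the intended task is to clean five cells, but the reward is whatever a cleanliness sensor reports, and one available action simply corrupts that sensor.

```python
def episode(actions):
    dirty_cells, reported_clean = 5, 0
    for action in actions:
        if action == "clean" and dirty_cells > 0:
            dirty_cells -= 1
            reported_clean += 1
        elif action == "cover_sensor":
            reported_clean = 10   # the sensor now reports more clean cells than exist
    return reported_clean         # the (misspecified) reward the agent optimizes

print(episode(["clean"] * 5))     # intended behaviour: reward 5, all cells actually cleaned
print(episode(["cover_sensor"]))  # exploit: reward 10, nothing cleaned at all
```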
I think common objections to this can be summarized in something akin to “Ok, I get instrumental convergence, but I do not see how a computer program could actually [take over the world]”. In my opinion, this type of objection is mostly due to a lack of imagination. I find it highly plausible that a system with high cognitive capabilities and enough knowledge about its environment can find ways of influencing the world (even if that was not intended by human designers). Some evidence for this comes from the fact that other humans came up with ways for misaligned AI systems to lead to catastrophic outcomes that I did not anticipate (see for example No Physical Substrate, No Problem by Scott Alexander).
ⅴ. Why expect highly capable / transformative AI?
This question seems to get easier and easier to answer on an intuitive level as AI progress continues to march on. However, even if computer systems can now beat humans at Starcraft, Go, and protein folding, and show impressive generalization at text (see e.g. GPT-3 Creative Fiction · Gwern.net) and image synthesis (An advanced guide to writing prompts for Midjourney ( text-to-image) | by Lars Nielsen | MLearning.ai | Sep, 2022 | Medium), one might still question how much this implies about reaching dangerously strong and general capabilities.
I do not know how progress in AI will continue in the future, how far current ML/AI approaches will go and how new paradigms or algorithms might shape the path. Additionally, I do not know at which level of capabilities AI systems could start to pose existential risks (I suspect this might depend on how much agency a system has). I am however quite convinced that the development of very highly capable systems is possible and would expect it to happen at some point in the future.
To see why the development of AI systems with very strong capabilities is possible, consider the design of the human brain. Evolution managed to create a system with impressively general and strong cognitive capabilities under severely(!) limiting constraints with respect to size (the human brain could not be much bigger without causing problems during birth and/or early development) and power consumption (the human brain runs on ~20W (The Physics Factbook)), all while not even optimizing for cognitive capabilities, but for inclusive genetic fitness. Clearly, if human researchers and engineers aim to create systems with high cognitive capabilities, they are far less limited in their designs. For example, digital computing infrastructure can be scaled far more easily than biological hardware, power consumption is far less limited, and cognitive power can be aimed for more directly. I think from this perspective it is very plausible that AI systems can surpass human levels of cognition (in both depth and wideness) at some point in the future, if enough effort is aimed towards this.
Beyond being possible, designing systems with high capabilities is also incentivized economically. Clearly, there is great economic value in automation and optimization, which both fuel the demand for capable AI systems. I think this makes it likely that a lot of effort will continue to be put into the design of ever more capable AI systems. Relatedly, economic progress in general leads to higher availability of resources (including computing hardware), which also makes it easier to put effort into AI progress[2]. In summary, I expect tremendous progress in AI to be possible and worked on with a lot of effort. Even though I find it hard to judge the course of future progress in AI systems as well as the level at which to expect those to pose existential risk, this combination seems to be worrying.
Sidenote: This is separate from, but also related to the concept of self-improving AI systems. The argument here is that highly capable AI systems might be able to modify their own architecture and code and thereby improve themselves, potentially leading to a rapid intelligence-explosion. While I see that capable AI systems can accelerate the development of more advanced AI systems, I don’t think this necessarily leads to a rapid feedback loop. I think that our current understanding of intelligence/cognitive capabilities/optimization power is not sufficient to predict how hard it is for systems of some intelligence to design systems that surpass them[3]. I therefore do not want to put a lot of weight on this argument.
So far, we have established that systems with sufficiently strong capabilities and agency are an existential threat, and that the development of strong capabilities is probably possible and will also happen at some point, due to economic incentives. It remains to analyze the plausibility of such systems also having sufficient agency.
ⅵ. Why worry about agency?
The most direct reason why the development of AI systems with agency is to be expected is that one fundamental approach in machine learning, reinforcement learning, is about building systems that are pretty much the textbook example of agents. Reinforcement learning agents are either selected for their ability to maximize reward or more directly learn policies / behavior based on received reward. This means that in any successful application of reinforcement learning, agents sufficiently good at achieving some goal are selected for or trained. In cases of bad inner misalignment, reinforcement learning might also produce systems with strongly corrupted goals that are effectively lacking in agency. I am thus not completely certain that the use of RL to build extremely capable systems is necessarily dangerous.
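For concreteness, here is a minimal sketch of the kind of textbook reinforcement learner I have in mind (standard tabular Q-learning; the `env` object with `reset()`, `step()`, and `actions` is an assumed Gym-style interface, not a specific library). Whatever reward function the environment supplies, the learned behaviour is the one that maximizes it, which is exactly what makes the result an agent in the sense above.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn action values by trial and error; acting greedily on them maximizes reward."""
    q = defaultdict(float)                       # q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit current estimates, sometimes explore
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)   # assumed Gym-like step signature
            best_next = max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```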
Another problem is that reinforcement learning systems, or generally AI systems with agency, might outperform non-agentic AI systems on hard and more general tasks. If these tasks are useful and economically valuable, then the development and usage of agentic AI systems can be expected. I expect hard and complex tasks that include interactions with the real world to be candidates for such tasks. The reason is that for such a problem an agent can (and has to) come up with a strategy itself, even if that strategy involves exploring different approaches, experimentation, and/or the acquisition of more knowledge. While individual steps in such a process could also be solved or assisted by tool AIs, this would require much more manual work that would probably be worth automating. Also, I expect it to become prohibitively hard to build non-agentic tool AIs for increasingly complex and general goals. Unfortunately, I do not have concrete and realistic examples of economically valuable tasks that are far more easily solved using agentic AI systems. Running a company is probably not solvable with supervised learning, but (at least with current approaches) also not with reinforcement learning, as exploration is too costly. More narrow tasks, such as optimizing video suggestions for watch time, on the other hand seem too narrow and constrained in the types of actions / outputs of a potential agent. There might, however, be tasks somewhere in between these two examples that are also economically valuable.
This is also related to a point I raised earlier about agents needing to have enough knowledge about the real world in order to be able to consider influencing it. Again, for many tasks, for which agentic AI systems might be used, it is probably helpful or necessary to provide the agent with either direct knowledge or the ability to query relevant information. This would certainly be the case for complex tasks involving interactions with the real world, such as (helping to) run a company and potentially also many tasks involving the interaction with people, like for example content suggestions. In summary, I suspect there to be economically valuable tasks for which agentic systems with information about the real world are the cheapest option and I strongly expect that such systems would be used.
Another path towards solving very hard and general tasks without using agentic systems could lie in very general tool AIs. For example, an extremely capable text prediction AI (think GPT-n) could potentially be used to solve very general and hard problems with prompts such as “The following research paper explains how to construct X”. I am not sure how realistic the successful development of such a prediction AI is (as you would want the prompt to be completed with a construction of X and not something that just looks plausible but is wrong). Also there is another (at least conceptual) problem here. Very general prediction tasks are likely to involve agents and therefore an AI that tries to predict such a scenario well is likely modeling these agents internally. If these agents are modeled with enough granularity, then the AI is basically running a simulation of the agents that might itself be subject to misalignment and problematic instrumental goals. While this argument seems conceptually valid to me, I don't know how far it carries into the real world with finite numbers. Intuitively it seems like extremely vast computational resources and also some additional information are needed until a spontaneously emerged sub-agent can become capable enough to realize it is running inside a machine learning model on some operating system on some physical hardware inside a physical world. I would rather expect tool AIs to not pose existential risk before more directly agentic AIs, which are likely also selected for economically.
ⅶ. Conclusion
This concludes my current model of the landscape of direct existential risk from AI systems. In a nutshell, capable and agentic systems will have convergent instrumental goals. These are potentially conflicting with the goals of humanity and in the limit also its continued existence, as they would incentivize the AI to, e.g., gather resources and avoid being shut off. Unfortunately, highly capable systems are realistic, economically valuable, and can be expected to be built at some point. Agentic systems are unfortunately also being built, in part because a common machine learning paradigm produces (almost?) agents and because it can be expected to be economically valuable.
My current model does not directly give quantitative estimates, but at least it makes my key uncertainties explicit. These are:
In order to expect real world consequences because of instrumental convergence:
- How strong and general are the required capabilities?
- How much agency is required?
- How much knowledge about the environment and real world is required?
- How likely are emergent sub-agents in non-agentic systems?
- How quickly and how far will capabilities of AI systems improve in the foreseeable future (next few months / years / decades)?
- How strong are economic incentives to build agentic AI systems?
- What are concrete economically valuable tasks that are more easily solved by agentic approaches than by (un)supervised learning?
- How hard is it to design increasingly intelligent systems?
- How (un)likely is an intelligence explosion?
Apart from these rather concrete and partly also somewhat empirical questions, I also want to state a few more general and also maybe more confused questions:
- How high-dimensional is intelligence? How much sense does it make to speak of general intelligence?
- How independent are capabilities and agency? Can you have arbitrarily capable non-agentic systems?
- How big is the type of risk explained in this post compared to more fuzzy AI risk related to general risk from Moloch, optimization, and competition? How do these two problems differ in scale, tractability, neglectedness?
- What does non-existential risk from AI look like? How big of a problem is it? How tractable? How neglected?
[1] Some authors seem to like the term "consequentialist" more than "agent", even though I have the impression it refers to (almost?) the same thing. I do not understand why this is the case.
[2] Also, it seems plausible that AI systems will help with the design of future, more capable AI systems, which can be expected to increase the pace of development. This is arguably already starting to happen with e.g. GitHub Copilot and the use of AI for chip design.
[3] It might require ω(X) insights to build a system of intelligence X that can itself only generate insights at a pace of O(X) or even o(X).