What can the principal-agent literature tell us about AI risk?

apc

This work was done collaboratively with Tom Davidson.

Thanks to Paul Christiano, Ben Garfinkel, Daniel Garrett, Robin Hanson, Philip Trammell and Takuro Yamashita for helpful comments and discussion. Errors our own.

Introduction

The AI alignment problem has similarities with the principal-agent problem studied by economists. In both cases, the problem is: how do we get agents to try to do what we want them to do? Economists have developed a sophisticated understanding of the agency problem and a measure of the cost of failure for the principal, “agency rents”.

If principal-agent models capture relevant aspects of AI risk scenarios, they can be used to assess how plausible those scenarios are. Robin Hanson has argued that Paul Christiano’s AI risk scenario is essentially an agency problem, and therefore that it implies extremely high agency rents. Hanson believes that the principal-agent literature (PAL) provides strong evidence against rents being this high.

In this post, we consider whether PAL provides evidence against Christiano’s scenario and the original Bostrom/Yudkowsky scenario. We also examine whether the extensions to the agency framework could be used to gain insight into AI risk, and consider some general difficulties in applying PAL to AI risk.

Summary

PAL isn’t in tension with Christiano’s scenario because his scenario doesn’t imply massive agency rents; the big losses occur outside of the principal-agent problem, and the agency literature can’t assess the plausibility of these losses. Extensions to PAL could potentially shed light on the size of agency rents in this scenario, which are an important determinant of the future influence of AI systems.
Mapped onto a PAL model, the Bostrom/Yudkowsky scenario is largely about the principal’s unawareness of the agent’s catastrophic actions. Unawareness models are rare in PAL probably because they usually aren’t very insightful. This lack of insightfulness also seems to prevent existing PAL models or possible extensions from teaching us much about this scenario.
There are also a number of more general difficulties with using PAL to assess AI risk, some more problematic than others.
- PAL models rarely consider weak principals and more capable agents
- PAL models are brittle
- Agency rents are too narrow a measure
- PAL models typically assume contract enforceability
- PAL models typically assume AIs work for humans because they are paid
Overall, findings from PAL do not straightforwardly transfer to the AI risk scenarios considered, so don’t provide much evidence for or against these scenarios. But new agency models could teach us about the levels of agency rents which AI agents could extract.

PAL and Christiano’s AI risk scenarios

Christiano’s scenario has two parts:

Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. ("Going out with a whimper.")

Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang," an instance of optimization daemons.)

Hanson argued that “Christiano instead fears that as AIs get more capable, the AIs will gain so much more agency rents, and we will suffer so much more due to agency failures, that we will actually become worse off as a result. And not just a bit worse off; we apparently get apocalypse level worse off!”

PAL isn’t in tension with Christiano’s story and isn’t especially informative

We asked Christiano whether his scenario actually implies extremely high agency rents. He doesn’t think so:

On my view the problem is just that agency rents make AI systems collectively better off. Humans were previously the sole superpower and so as a class we are made worse off when we introduce a competitor, via the possibility of eventual conflict with AI who have been greatly enriched via agency rents…humans are better off in absolute terms unless conflict leaves them worse off (whether military conflict or a race for scarce resources). Compare: a rising China makes Americans better off in absolute terms. Also true, unless we consider the possibility of conflict....[without conflict] humans are only worse off relative to AI (or to humans who are able to leverage AI effectively). The availability of AI still probably increases humans’ absolute wealth. This is a problem for humans because we care about our fraction of influence over the future, not just our absolute level of wealth over the short term.

Christiano’s concern isn’t that agency rents will skyrocket because of some distinctive features of the human-AI agency relationship. Instead, “proxies” and “influence seeking” are two specific ways AI interests will diverge from actual human goals. This leads to typical levels of agency rents; PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents.^[1]

The main loss occurs later in time and outside of the principal-agent context, due to the fact that these rents eventually lead AIs to wield more total influence on the future than humans.^[2] This is bad because, even if humanity is richer overall, we humans also “care about our fraction of influence over the future.”^[3] Compared to a world with aligned AI systems, humanity is leaving value on the table, permanently if these systems can’t be rooted out. The biggest potential downside comes from influence-seeking systems which Christiano believes could make humans worse off absolutely, by engaging in violent conflict.
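The distinction between absolute wealth and fraction of influence can be made concrete with a toy calculation. The sketch below is purely illustrative and not taken from Christiano or from PAL: the dynamics, the function name `simulate_influence`, and all parameter values are assumptions chosen for simplicity. It assumes that humans delegate their wealth to AI agents each period, that the AIs keep a fixed share of the resulting surplus as rents, and that AI-held wealth compounds at the same rate. It leaves conflict out entirely.

```python
def simulate_influence(periods=10, human_wealth=100.0, ai_wealth=0.0,
                       growth=0.10, rent_share=0.30):
    """Track human wealth, AI wealth, and the human share of total wealth.

    Hypothetical dynamics (all assumptions): each period, delegation
    produces a surplus of growth * human_wealth. AIs keep rent_share of
    that surplus as rents, and AI-held wealth also compounds at `growth`.
    """
    history = []
    for _ in range(periods):
        surplus = growth * human_wealth
        # AIs reinvest what they already hold and add this period's rents.
        ai_wealth = ai_wealth * (1 + growth) + rent_share * surplus
        # Humans keep the remainder of the surplus.
        human_wealth = human_wealth + (1 - rent_share) * surplus
        share = human_wealth / (human_wealth + ai_wealth)
        history.append((human_wealth, ai_wealth, share))
    return history


if __name__ == "__main__":
    for t, (h, a, s) in enumerate(simulate_influence(), start=1):
        print(f"period {t:2d}: human wealth {h:7.1f}, "
              f"AI wealth {a:6.1f}, human share {s:.3f}")
```

Under these assumptions human wealth rises every period while the human share of total wealth falls every period, which is the pattern described above: better off in absolute terms, worse off in relative influence.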

These later failures aren’t examples of massive agency rents (as the term is used in PAL) because failure is not expected to occur when the agent works on the task it was delegated.^[4] Rather, the influence-seeking systems become more influential via typical agency rents, and then at some later point use these rents to influence the future, possibly by entering into conflict with humans. PAL studies the size of agency rents which can be extracted, but not what the agents decide to do with this wealth and influence.

Overall, PAL is consistent with AI agents extracting some agency rents, which occurs in both parts of Christiano’s story (and we’ll see next that putting more structure on agency models could tell us more about the level of rent extraction). But it has nothing to say about the plausibility of AI agents using their rents to exert influence over the long term future (parts 1 and 2) or engage in conflict (part 2).^[5]

Extending agency models seems promising for understanding the level of agency rents in Christiano’s scenario

Christiano’s scenario doesn’t rely on something distinctive about the human-AI agency relationship generating higher-than-usual agency rents.^[6] But perhaps there is something distinctive and rents will be atypical. In any case, the level of agency rents seems like a crucial consideration: if we think AIs can extract little to no rents, we probably shouldn’t expect them to exert much influence over the future, because agency rents are what make AI rich.^[7] Agency models could help give us a better understanding of the size of agency rents in Christiano’s story, and for future AI systems more generally.

The size of agency rents is determined by a number of factors, including the agent’s private information, the nature of the task, the noise in the principal’s estimate of the value produced by the agent, and the degree of competition. For instance, more complex tasks tend to cause higher rents. From The (ir)resistible rise of agency rents:

In the presence of moral hazard, principals must leave rents to agents, to incentivize appropriate actions. The more complex and opaque the task delegated to the agent, the more difficult it is to monitor his actions, the larger his rents.

If, as AI agents become more intelligent, monitoring gets increasingly difficult, or tasks get more complex, then we would expect agency rents to increase.
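One way to see how harder monitoring can translate into larger rents is a textbook-style binary-effort moral hazard setup with limited liability. This is a minimal sketch under stated assumptions, not the model from the paper quoted above: the agent is risk neutral, has an outside option of zero, cannot be paid a negative amount, and is paid a bonus only when the task succeeds. The function names and numbers are hypothetical. Working costs the agent `effort_cost` and succeeds with probability `p_work`; shirking is free and succeeds with probability `p_shirk`.

```python
def minimum_bonus(effort_cost, p_work, p_shirk):
    """Smallest success bonus that makes working at least as good as shirking.

    Incentive constraint: p_work * bonus - effort_cost >= p_shirk * bonus.
    """
    if not 0 <= p_shirk < p_work <= 1:
        raise ValueError("need 0 <= p_shirk < p_work <= 1")
    return effort_cost / (p_work - p_shirk)


def agency_rent(effort_cost, p_work, p_shirk):
    """Agent's expected payoff above the (zero) outside option."""
    bonus = minimum_bonus(effort_cost, p_work, p_shirk)
    return p_work * bonus - effort_cost


if __name__ == "__main__":
    # Success is informative about effort: small rent (about 0.125).
    print(agency_rent(1.0, p_work=0.9, p_shirk=0.1))
    # Success says little about effort: large rent (about 2.0).
    print(agency_rent(1.0, p_work=0.6, p_shirk=0.4))
```

In this sketch the rent equals `effort_cost * p_shirk / (p_work - p_shirk)`, so it grows as the outcome becomes a noisier signal of effort and vanishes when shirking never succeeds. Different assumptions (for example, an agent who can post a bond) would change this conclusion, which is one instance of the brittleness discussed later in the post.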

On the other hand, competitive pressures between AI agents might be greater (it’s easy to copy and run an AI; it’s hard to increase the human workforce by transferring human capital from one brain to another via teaching). This would limit rents:

The agents’ desire to capture rents, however, could be kept in check by market forces and competition among [agents]. If each principal could run an auction with several, otherwise identical, [agents], he could select the agent with the smallest incentive problem, and hence the smallest rent.
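The selection effect in this quote can be sketched in a few lines. The example is hypothetical and assumes the same simple setup as a binary-effort moral hazard problem with limited liability, where the rent an agent must be left is `effort_cost * p_shirk / (p_work - p_shirk)`. Candidates differ only in how often shirking goes undetected (`p_shirk`); the names and numbers are made up for illustration.

```python
def required_rent(effort_cost, p_work, p_shirk):
    """Rent the principal must leave an agent to induce effort (assumed model)."""
    return effort_cost * p_shirk / (p_work - p_shirk)


def pick_cheapest_agent(candidates, effort_cost=1.0):
    """Select the candidate with the smallest incentive problem.

    candidates: dict mapping a name to a (p_work, p_shirk) pair.
    Returns (name, rent) for the candidate requiring the smallest rent.
    """
    rents = {name: required_rent(effort_cost, p_work, p_shirk)
             for name, (p_work, p_shirk) in candidates.items()}
    winner = min(rents, key=rents.get)
    return winner, rents[winner]


if __name__ == "__main__":
    pool = {"agent_a": (0.9, 0.5), "agent_b": (0.9, 0.3), "agent_c": (0.9, 0.1)}
    print(pick_cheapest_agent(pool))
```

Adding candidates to the pool can only lower (or leave unchanged) the rent the principal ends up paying, which is the sense in which easy copying of AI agents could limit rents.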

Modelling the most relevant factors in an agency model seems like a tractable research question (we discuss some potential difficulties below). Economists have only just started thinking about AI, and there doesn’t seem to be any work studying rent extraction by AI agents.

PAL and AI risk from “accidents”

Ben Garfinkel has called the class of risks most associated with Bostrom and Yudkowsky risks from “accidents”. Garfinkel characterises the general story in the following terms:

“First, the author imagines that a single AI system experiences a massive jump in capabilities. Over some short period of time, a single system becomes much more general or much more capable than any other system in existence, and in fact any human in existence. Then given the system, researchers specify a goal for it. They give it some input which is meant to communicate what behavior it should engage in. The goal ends up being something quite simple, and the system goes off and single-handedly pursues this very simple goal in a way that violates the full nuances of what its designers intended.” Importantly, “At the limit you might worry that these safety failures could become so extreme that they could perhaps derail civilization on the whole.”

These catastrophic accidents constitute the main worry.

If the risk scenario is adequately represented by a principal-agent problem, agency rents extracted by AI agents can be used to measure the cost of misalignment. This time agency rents are a better measure, because failure is expected to occur when the agent works on the task it was delegated.^[8] The scenario implies very high agency rents, with the principal being made much worse off because he delegated the task to the agent.

As Garfinkel’s nomenclature suggests, this story is about the designers being caught by surprise, not anticipating the actions the AI would take. The Wikipedia synopsis of Superintelligence also emphasizes that something unexpected occurs: “Solving the control problem is surprisingly difficult because most goals, when translated into machine-implementable code, lead to unforeseen and undesirable consequences.” In other words, the principal is unaware of some specific catastrophically harmful actions that the agent can take to achieve its goal.^[9] This could be because they incorrectly believe that the system doesn’t have certain capabilities, or they don’t foresee that certain actions satisfy the agent’s goal, as with perverse instantiation. Due to this, the agent takes actions that greatly harm the principal, at great benefit to herself.

PAL doesn’t tell us much about AI risk from accidents

Hanson’s critique was aimed at Christiano’s scenario, but it could equally apply to this one. Is PAL at odds with this scenario?

As an AI agent becomes more intelligent, its action set will expand as it thinks of new and sometimes unanticipated actions to achieve its goals. This may include catastrophic actions that the principal is not aware of.^[10] PAL can't tell us what these actions will be, nor whether the principal will be aware of them.^[11]

Instead, the vast majority of principal-agent models assume that the principal understands the environment perfectly, including perfect knowledge of the agent’s action set, while the premise of the accident scenario is that the principal is unaware of a catastrophic action that the agent could take. Because the principal’s unawareness is central, these models assume, rather than show, that this source of AI risk does not exist. They therefore don’t tell us much about the plausibility of AI accidents.

Microeconomist Daniel Garrett expressed this point nicely. We asked him about a hypothetical example, slightly misremembered from Stuart Russell’s book, concerning an advanced climate control AI system.^[12] He replied:

You can easily write down a model where the agent is rewarded according to some outcome, and the principal isn't aware the outcome can be achieved by some action the principal finds harmful. In your example, the outcome is the reduction of CO2 emissions. If the principal thinks carbon sequestration is the only way to achieve this, but doesn't think of another chemical reaction option which would indirectly kill everyone, she could end up providing incentives to kill everyone. The fact this conclusion is so immediate may explain why this kind of unawareness by the principal is given little attention in the literature. The principal-agent literature should not be understood as saying that these kinds of incentives with perverse outcomes cannot happen. (our emphasis)
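Garrett's point that such a model is easy to write down, and that its conclusion is immediate, can be illustrated with a toy version of the example. Everything in this sketch is an assumption made up for illustration: the action names, the costs, the harm figure, and the reward level. The principal pays per unit of CO2 reduction and believes the agent can only do nothing or sequester carbon; the agent in fact has a third, cheaper action that the principal has not thought of.

```python
# action: (co2_reduction, harm_to_principal) -- hypothetical numbers
ACTIONS = {
    "do_nothing": (0.0, 0.0),
    "carbon_sequestration": (1.0, 0.0),
    "lethal_chemical_reaction": (1.0, 100.0),
}
# Cost to the agent of taking each action (also hypothetical).
AGENT_COST = {
    "do_nothing": 0.0,
    "carbon_sequestration": 0.8,
    "lethal_chemical_reaction": 0.1,
}


def agent_choice(reward_per_unit, available):
    """The agent picks the available action with the highest net reward."""
    return max(available,
               key=lambda a: reward_per_unit * ACTIONS[a][0] - AGENT_COST[a])


def principal_payoff(action, reward_per_unit, value_per_unit=5.0):
    """Value of the reduction, minus harm suffered, minus reward paid."""
    reduction, harm = ACTIONS[action]
    return value_per_unit * reduction - harm - reward_per_unit * reduction


if __name__ == "__main__":
    reward = 1.0
    believed = ["do_nothing", "carbon_sequestration"]   # what the principal foresees
    actual = list(ACTIONS)                               # what the agent can really do
    predicted = agent_choice(reward, believed)
    realised = agent_choice(reward, actual)
    print(predicted, principal_payoff(predicted, reward))
    print(realised, principal_payoff(realised, reward))
```

The principal predicts sequestration and a positive payoff, but the agent takes the unforeseen action and the principal's payoff is strongly negative. The result follows directly from what was assumed about the action set and the principal's unawareness, so the model reveals nothing that was not put in by hand.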

PAL models do typically have modest agency rents; they typically don’t model the principal as being unaware of actions with catastrophic consequences. But this is the situation discussed by proponents of AI accident risk, so we can’t infer much from PAL except that such a situation has not been of much interest to economists.

Extending agency models doesn’t seem promising for understanding AI risk from “accidents”

Most PAL models don’t include the kind of unawareness needed to model the accident scenario, but extensions of this sort are certainly possible. However, we suspect trying to model AI risk in this way wouldn’t be fruitful, for three main reasons.

Firstly, as Daniel Garrett suggests, we suspect the assumptions about the principal’s unawareness of the agent’s action set would imply the action chosen by the agent, and its consequences for the principal, in a fairly direct and uninteresting way. There is a (very) small sub-literature on unawareness in agency problems where one can find models like this. In one paper, a principal hires an agent to do a work task, but isn’t aware that the agent can manipulate “short-run working performance at the expense of the employer’s future benefit.” The agent “is better off if he is additionally aware that he could manipulate the working performance,” and “in the post-contractual stage, [the principal] is hurt by the manipulating action of [the agent].” However, the model didn’t reveal anything unexpected about the situation, and the outcome was directly determined by the action set and unawareness assumptions.

Secondly, the major source of the uncertainty surrounding accident risk concerns whether the principal will be unaware of catastrophic agent actions. The agency literature can’t help us reduce this uncertainty as the unawareness is built into models’ assumptions. For instance, AI scientist Yann LeCun thinks that harmful actions “are easily avoidable by simple terms in the objective”. If LeCun implemented a superintelligent AI in this way, agency models couldn’t tell us whether he had correctly covered all bases.

Lastly, the assumptions about the agent’s action set would be highly speculative. We don’t know what actions superintelligent systems might take to pursue their goals. Agency models must make assumptions about these actions, and we don’t know what these assumptions should be.

In short, the uncertainty pertains to the assumptions of the model, not the way the assumptions translate into outcomes. PAL does not, and probably cannot, provide much evidence for or against this scenario.

General difficulties with using PAL to assess AI risk

We’ve discussed the most relevant considerations regarding what PAL can tell us about two specific visions of AI risk. We now discuss some difficulties relevant to a broader set of possible scenarios (including those just examined). We list the difficulties from most serious to least serious.

PAL models rarely consider weak principals and more capable agents^[13]

AI risk scenarios typically involve the AI being more intelligent than humans. The type of problems that economists study usually don’t have this feature, and there seem to be very few models where the principal is weaker than the agent. Despite extensive searching, including talking to multiple contract theorists, we were only able to find two papers with a principal who is more boundedly rational than the agent.^[14] This is perhaps not so surprising given that bounded-rationality models are relatively rare, and when they do exist, they tend to bound both the principal and the agent in the same way, or have the principal more capable. The latter is because such a set up is more relevant to typical economic problems, e.g. “exploitative” contracting studies the mistakes made by an individual (the agent) when interacting with a more capable firm (the principal).

Microeconomist Takuro Yamashita agrees:

Most economic questions related to bounded rationality explored in the principal-agent literature are appropriately modelled by a bounded agent. It’s certainly possible to bound the principal, but by and large this hasn’t been done, just because of the nature of the questions that have been asked.

A recent review of Behavioural Contract Theory also finds that such models are rare:

In almost all applications, researchers assume that the agent (she) behaves according to one psychologically based model, while the principal (he) is fully rational and has a classical goal (usually profit maximization).

There doesn’t seem to be, in Hanson’s terms, a “large (mostly economic) literature on agency failures” with an intelligence gap relevant to AI risk.

PAL models are brittle

PAL models don’t model agency problems in general. They consider very specific agency relationships, studied in highly structured environments. Conclusions can depend very sensitively on the assumptions used; findings from one model don’t necessarily generalise to new situations. From the textbook Contract Theory:

The basic moral hazard problem has a fairly simple structure, yet general conclusions have been difficult to obtain...Very few general results can be obtained about the form of optimal contracts. However, this limitation has not prevented applications that use this paradigm from flourishing...Typically, applications have put more structure on the moral hazard problem under consideration, thus enabling a sharper characterization of the optimal incentive contract. (our emphasis)

Similar reasoning applies in adverse selection models where the outcome is very sensitive to the mapping between effort and outcomes. Given an arbitrary problem, the optimal incentives can look like anything.

The agency problems studied by economists are typically quite different to the scenarios envisaged by AI risk proponents. Therefore, because of the brittleness of PAL models, we shouldn’t be too surprised if the imagined AI risk outcomes aren’t present in the existing literature. PAL, in its current form, might just not be of much use. Further, we should not expect there to be any generic answer to the question “How big are AI agency rents?”: the answer will depend on the specific task the AI is doing and a host of other details.

Agency rents are too narrow a measure

As we’ve seen, AI risk scenarios can include bad outcomes that aren’t agency rents, but that we nevertheless care about. When applying PAL to AI risk, care must be taken to distinguish between rents and other bad outcomes, and we cannot assume that a bad outcome necessarily means high rents.

PAL models typically assume contract enforceability

Stuart Armstrong argued that Hanson’s critique doesn’t work because PAL assumes contract enforceability, and with advanced AI, institutions might not be up to the task.^[15] Indeed, contract enforceability is assumed in most of PAL, so it’s an important consideration regarding the applicability of PAL models to AI scenarios more broadly.^[16]

The assumption isn’t plausible in pessimistic scenarios where human principals and institutions are insufficiently powerful to punish the AI agent, e.g. due to very fast take-off. But it is plausible when AIs are about as smart as humans, and in scenarios where powerful AIs are used to enforce contracts. Furthermore, if we cannot enforce contracts with AIs then people will promptly realise this and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.^[17]

There is a smaller sub-literature on self-enforcing contracts (seminal paper). Here contracts can be self-enforced because both parties have an interest in interacting repeatedly. We think these probably won’t be helpful for understanding situations without contract enforceability, because in worlds where contracts aren’t enforceable because of advanced AI, contracts likely won’t be self-enforcing either. If AIs are powerful enough that institutions like the police and military can’t constrain them, it seems unlikely that they’d have much to gain from repeated cooperative interactions with human principals. Why not make a copy of themselves to do the task, coerce humans into doing it, or cooperate with other advanced AIs?

PAL models typically assume AIs work for humans because they are paid

In reality AIs will probably not receive a wage, and will instead work for humans because that is their default behaviour. We think changing this would probably not make a big difference to agency models, because the wage could be replaced by other resources the AI cares about. For instance, an AI needs compute to run. If we substitute “compute” for “wage”, the agency rents that the agent extracts are additional compute that it can use for its own purposes.

There is a sub-literature on Optimal Delegation that does away with wages. This literature focuses on the best way to restrict the agent’s action set. For AI agents, this is equivalent to AI boxing. We don’t think this literature will be helpful; PAL doesn’t study how realistic it is to box AI successfully, it just assumes it’s technologically possible. It therefore isn’t informative about whether AI boxing will work.

Conclusion

There are similarities between the AI alignment and principal-agent problems, suggesting that PAL could teach us about AI risk. However, the situations economists have studied are very different to those discussed by proponents of AI risk, meaning that findings from PAL don’t transfer easily to this context. There are a few main issues. The principal-agent setup is only a part of AI risk scenarios, making agency rents too narrow a metric. PAL models rarely consider agents more intelligent than their principals and the models are very brittle. And the lack of insight from PAL unawareness models severely restricts their usefulness for understanding the accident risk scenario.

Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.^[18]

Thanks to Wei Dai for pointing out a previous inaccuracy ↩︎
Agency rents are about e.g. working vs shirking. If the agent uses the money she earned to buy a gun and later shoot the principal, clearly this is very bad for the principal, but it’s not captured by agency rents. ↩︎
It’s not totally clear to us why we should care about our fraction of influence over the future, rather than the total influence. Probably because the fraction of influence affects the total influence, influence being zero-sum and resources finite. ↩︎
It wasn’t clear to us from the original post, at least in Part 1 of the story with no conflict, that humans are better off in absolute terms. For instance, wording like “over time those proxies will come apart” and “People really will be getting richer for a while” seemed to suggest that things are expected to worsen. Given this, Hanson’s interpretation (that Christiano’s story implied massive agency rents) seems reasonable without further clarification. Ben Garfinkel mentioned an outside-view measure which he thought undermined the plausibility of Part 1: since the industrial revolution we seem to have been using more and more proxies, which are optimized for more and more heavily, but things have been getting better and better. So he also seems to have understood the scenario to mean things get worse in absolute terms. ↩︎
Clarifying what it means for an AI system to earn and use rents also seems important, helping us make sure that the abstraction maps cleanly onto the practical scenarios we are envisaging. Relatedly, what traits would an AI system need to have for it to make sense to think of the system as “accumulating and using rents”? Rents can be cashed out in influence of many different kinds (a human worker might get a higher wage, or more free time), and what ends up occurring will depend on the capabilities of the AI systems. Concretely, money can be saved in a bank account, people can be influenced, or computer hardware can be bought and run. One example of an obvious capability constraint for AI: some AI systems will be “switched off” after they are run, limiting their ability to transfer rents through time. As AI agents will (initially) be owned by humans, historical instances of slaves earning rents seem worth looking into. ↩︎
Although his scenario is more plausible if a smarter agent extracts more agency rents. ↩︎
Hanson and Christiano agree on this point. Hanson: “Just as most wages that slaves earned above subsistence went to slave owners, most of the wealth generated by AI could go to the capital owners, i.e. their slave owners. Agency rents are the difference above that minimum amount.” Christiano: “Agency rents are what makes the AI rich. It's not that computers would "become rich" if they were superhuman, and they just aren't rich yet because they aren't smart enough. On the current trajectory computers just won't get rich.” ↩︎
One limitation is that rents are the cost to the principal, whereas the accident scenario has costs for all humanity. This distinction isn’t especially important because in the accident scenario the outcome for the principal is catastrophic (i.e. extremely high agency rents), and this is what is potentially in tension with PAL. Nonetheless, we should keep in mind that the total costs of this scenario are not limited to agency rents, just as in Christiano’s scenario. ↩︎
Perhaps a more realistic framing: the principal is aware that there’s some probability that the agent will take an unanticipated catastrophic action, without knowing what that action might be. Under competitive pressures, maybe in a time of war, it could be beneficial for the principal to delegate (in expectation) despite significant risk, while humanity is made worse off (in expectation). This, of course, would be modelled quite differently to the accident AI risk we consider in the text, and we suspect that economic models would confirm that principals would take the risk in sufficiently competitive scenarios. These models would focus on negative externalities of risky AI development, something more naturally studied in domains like public economics rather than with agency theory. In any case, we focus here on the more traditional AI risk framing along the lines of “you think you have the AI under control, but beware, you could be wrong”. ↩︎
AI accident risk will be large when the AI agent thinks of new actions that i) harm the principal ii) further the agent's goals iii) the principal hasn't anticipated. ↩︎
This is because claims about the actions available to the agent and the principal’s awareness are part of PAL models’ assumptions. We discuss this more below. ↩︎
The correct example: “If you prefer solving environmental problems, you might ask the machine to counter the rapid acidification of the oceans that results from higher carbon dioxide levels. The machine develops a new catalyst that facilitates an incredibly rapid chemical reaction between ocean and atmosphere and restores the oceans’ pH levels. Unfortunately, a quarter of the oxygen in the atmosphere is used up in the process, leaving us [humans] to asphyxiate slowly and painfully.” ↩︎
I.e. the principal’s rationality is bounded to a greater extent than the agent’s ↩︎
In the model in “Moral Hazard With Unawareness” either the principal or the agent’s rationality can be bounded ↩︎
As argued above, we don’t think contract enforceability is the main reason Hanson’s critique of Christiano fails; agency rents are just not unusually high in his scenario. ↩︎
From Contract Theory: “The benchmark contracting situation that we shall consider in this book is one between two parties who operate in a market economy with a well-functioning legal system. Under such a system, any contract the parties decide to write will be enforced perfectly by a court, provided, of course, that it does not contravene any existing laws.” ↩︎
Thanks to Ben Garfinkel for pointing this out. ↩︎
Robin Hanson pointed out to us that when thinking about strange future scenarios, we should try to think about similar strange scenarios that we have seen in the past (we are very sympathetic to this, despite our somewhat skeptical position regarding PAL). With this in mind, another field which seems worth looking into is Security, especially military security. National leaders have been assassinated by their guards; kings have been killed by their protectors. These seem like a closer analogue to many AI risk scenarios than the typical PAL setup. It seems important to understand what the major risk factors are in these situations, how people have guarded against catastrophic failures, and how this translates to cases of catastrophic AI risk. ↩︎

Curated. This post represents a significant amount of research, looking into the question of whether an established area of literature might be informative to concerns about AI alignment. It looks at that literature, examines its relevance in light of the questions that have been discussed so far, and checks the conclusions with existing domain experts. Finally, it suggests further work that might provide useful insights to these kinds of questions.

I do have the concern that currently, the post relies a fair bit on the reader trusting the authors to have done a comprehensive search - the post mentions having done "extensive searching", but besides the mention of consulting domain experts, does not elaborate on how that search process was carried out. This is a significant consideration since a large part of the post's conclusions rely on negative results (there not being papers which examine the relevant assumptions). I would have appreciated seeing some kind of a description of the search strategy, similar in spirit to the search descriptions included in systematic reviews. This would have allowed readers to both reproduce the search steps, as well as notice any possible shortcomings that might have led to relevant literature being missed.

Nonetheless, this is an important contribution, and I'm very happy both to see this kind of work done, as well as it being written up in a clear form on LW.

PAL confirms that due to diverging interests and imperfect monitoring, agents will get some rents.

Can you provide a source for this, or explain more? I'm asking because your note about competition between agents reducing agency rents made me think that such competition ought to eliminate all rents that the agent could (for example) gain by shirking, because agents will bid against each other to accept lower wages until they have no rent left. For example in the model of the principal-agent problem presented in this lecture (which has diverging interests and imperfect monitoring) there is no agency rent. (ETA: This model does not have explicit competition between agents, but it models the principal as having all of the bargaining power, by letting it make a take-it-or-leave-it offer to the agent.)

If agents only earn rents when there isn't enough competition, that seems more like "monopoly rent" than "agency rent", plus it seemingly wouldn't apply to AIs... Can you help me develop a better intuition of where agency rents come from, according to PAL?

Thanks for catching this! You’re correct that that sentence is inaccurate. Our views changed while iterating the piece and that sentence should have been changed to: “PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents.”

This sentence too: “Overall, PAL tells us that agents will inevitably extract some agency rents…” would be better as “Overall, PAL is consistent with AI agents extracting some agency rents…”

I’ll make these edits, with a footnote pointing to your comment.

The main aim of that section was to point out that Paul’s scenario isn’t in conflict with PAL. Without further research, I wouldn’t want to make strong claims about what PAL implies for AI agency rents because the models are so brittle and AIs will likely be very different to humans; it’s an open question.

For there to be no agency rents at all, I think you’d need something close to perfect competition between agents. In practice the necessary conditions are basically never satisfied because they are very strong, so it seems very plausible to me that AI agents extract rents.

Re monopoly rents vs agency rents: Monopoly rents refer to the opposite extreme with very little competition, and in the economics literature the term is used when talking about firms, while agency rents are present whenever competition and monitoring are imperfect. Also, agency rents refer specifically to the costs inherent to delegating to an agent (e.g. an agent making investment decisions optimising for commission over firm profit) vs the rents from monopoly power (e.g. being the only firm able to use a technology due to a patent). But as you say, it's true that lack of competition is a cause of both of these.

Thanks for making the changes, but even with "PAL confirms that due to diverging interests and imperfect monitoring, AI agents could get some rents." I'd still like to understand why imperfect monitoring could lead to rents, because I don't currently know a model that clearly shows this (i.e., where the rent isn't due to the agent having some other kind of advantage, like not having many competitors).

Also, I get that the PAL in its current form may not be directly relevant to AI, so I'm just trying to understand it on its own terms for now. Possibly I should just dig into the literature myself...

The intuition is that if the principal could perfectly monitor whether the agent was working or shirking, they could just specify a clause in the contract that punishes the agent whenever they shirk. Equivalently, if the principal knows the agent's cost of production (or ability level), they can extract all the surplus value without leaving any rent.

Pages 40-53 of The Theory of Incentives contrast these "first-best" and "second-best" solutions (it's easy to find online).
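To make the intuition concrete, here is a minimal sketch of one common textbook-style setup. It is an illustration under stated assumptions, not the specific model from the pages cited above. The assumptions are: a risk-neutral agent chooses between working (at a private cost) and shirking; the principal only sees whether the project succeeds; and the agent has limited liability, so wages cannot go below zero. The function names and the numbers are hypothetical. Note that the limited-liability assumption is doing real work here: without it, a risk-neutral agent can typically be held to zero rent even when effort is hidden, which seems consistent with the no-rent lecture model mentioned earlier in the thread.

```python
def first_best_contract(effort_cost):
    """Effort is observable (perfect monitoring).

    The principal pays the agent exactly the effort cost if they work and
    nothing if they shirk, so the agent is left with zero rent.
    """
    wage_if_work = effort_cost
    rent = wage_if_work - effort_cost
    return wage_if_work, rent


def second_best_contract(effort_cost, p_high, p_low):
    """Effort is hidden; only success or failure is observed.

    Assumptions (illustrative, not from the original post):
      - the agent is risk neutral,
      - wages cannot be negative (limited liability), so failure pays 0,
      - p_high / p_low are success probabilities when working / shirking.

    The cheapest bonus that makes working at least as good as shirking solves
        p_high * bonus - effort_cost >= p_low * bonus.
    """
    if not (0.0 <= p_low < p_high <= 1.0):
        raise ValueError("need 0 <= p_low < p_high <= 1")
    bonus = effort_cost / (p_high - p_low)
    # Expected pay minus effort cost: what the agent keeps above break-even.
    rent = p_high * bonus - effort_cost
    return bonus, rent


if __name__ == "__main__":
    cost, p_high, p_low = 1.0, 0.8, 0.4
    print("first best (wage, rent):", first_best_contract(cost))
    print("second best (bonus, rent):", second_best_contract(cost, p_high, p_low))
```

In this sketch the rent equals p_low * effort_cost / (p_high - p_low): it shrinks as outcomes become more informative about effort (p_low falls relative to p_high) and vanishes when shirking never succeeds, which is the sense in which better monitoring reduces rents.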

(This rambly comment is offered in the spirit of Socratic grilling.)

I hadn't noticed I should be confused about the agency rent vs monopoly rent distinction till I saw Wei Dai's comment, but now I realise I'm confused. And the replies don't seem to clear it up for me. Tom wrote:

Re the difference between Monopoly rents and agency rents: monopoly rents would be eliminated by competition between firms whereas agency rents would be eliminated by competition between workers. So they're different in that sense.

That's definitely one way in which they're different. Is that the only way? Are they basically the same concept, and it's just that you use one label (agency rents) when focusing on rents the worker can extract due to lack of competition between workers, and the other (monopoly rents) when focusing on rents the firms can extract due to lack of competition between firms? But everything is the same on an abstract/structural level?

Could we go a little further, and in fact describe the firm as an agent, with consumers as its principal? The agent (the firm) can extract agency rents to the extent that (a) its activities at least somewhat align with those of the principal (e.g., it produces a product that the public prefers to nothing, and that they're willing to pay something for), and (b) there's limited competition (e.g., due to a patent). I.e., are both types of rents due to one actor (a) optimising for something other than what the other actor wants, and (b) being able to get away with it?

That seems consistent with (but not stated in) most of the following quote from you:

Re monopoly rents vs agency rents: Monopoly rents refer to the opposite extreme with very little competition, and in the economics literature the term is used when talking about firms, while agency rents are present whenever competition and monitoring are imperfect. Also, agency rents refer specifically to the costs inherent to delegating to an agent (e.g. an agent making investment decisions optimising for commission over firm profit) vs the rents from monopoly power (e.g. being the only firm able to use a technology due to a patent). But as you say, it's true that lack of competition is a cause of both of these.

What my proposed framing seems to not account for is that discussion of agency rents involves mention of imperfect monitoring as well as imperfect competition. But I think I share Wei Dai's confusion there. If the principal had no other choice (i.e., there's no competition), then even with perfect monitoring, wouldn't there still be agency rents, as long as the agent is optimising for something at least somewhat correlated with the principal's interests? Is it just that imperfect monitoring increases how much the agent can "get away with", at any given level of correlation between its activities and the principal's interests?

And could we say a similar thing for monopoly rents - e.g., a monopolistic firm, or one with little competition, may be able to extract somewhat more rents if it's especially hard to tell how valuable its product is in advance?

Note that I don't have a wealth of econ knowledge and didn't take the option of doing a bunch of googling to try to figure this out for myself. No one is obliged to placate my lethargy with a response :)

Re the difference between Monopoly rents and agency rents: monopoly rents would be eliminated by competition between firms whereas agency rents would be eliminated by competition between workers. So they're different in that sense.

I think that more engagement in this area is useful, and mostly agree. I'll point out that I think much of the issue with powerful agents and missed consequences is more usefully captured by work on Goodhart's law, which is definitely my pet idea, but seems relevant. I'll self promote shamelessly here.

Technical-ish paper with Scott Garrabrant: https://arxiv.org/abs/1803.04585

A more qualitative argument about multi-agent cases, with some examples of how it's already failing: https://www.mdpi.com/2504-2289/3/2/21/htm

A hopefully someday accepted / published article on paths to minimize these risks in non-AI systems: https://mpra.ub.uni-muenchen.de/98288/5/MPRA_paper_98288.pdf

Thank you for putting all the time and thoughtfulness into this post, even if the conclusion is "nope, doesn't pan out." I'm grateful that it's out here.

Thank you! :)

I wouldn't characterise the conclusion as "nope, doesn't pan out". Maybe more like: we can't infer too much from existing PAL, but AI agency rents are an important consideration, and for a wide range of future scenarios new agency models could tell us about the degree of rent extraction.

Very interesting post!

Furthermore, if we cannot enforce contracts with AIs then people will promptly realise and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.

I could easily be wrong, but this strikes me as a plausible but debatable statement, rather than a certainty. It seems like more argument would be required even to establish that it's likely, and much more to establish we can say "people will promptly realise..." It also seems like that statement is sort of assuming part of precisely what's up for debate in these sorts of discussions.

Some fragmented thoughts that feed into those opinions:

As you note just before that: "The assumption [of contract enforceability] isn’t plausible in pessimistic scenarios where human principals and institutions are insufficiently powerful to punish the AI agent, e.g. due to very fast take-off." So the Bostrom/Yudkowsky scenario is precisely one in which contracts aren't enforceable, for very similar reasons to why that scenario could lead to existential catastrophe.
Very relatedly - perhaps this is even just the same point in different words - you say "then people will promptly realise and stop using AIs". This assumes some possibility of at least some trial-and-error, and thus assumes that there'll be neither a very discontinuous capability jump towards decisive strategic advantage, nor deception followed by a treacherous turn.
As you point out, Paul Christiano's "Part 1" scenario might be one in which all or most humans are happy, and increasingly wealthy, and don't have motivation to stop using the AIs. You quote him saying "humans are better off in absolute terms unless conflict leaves them worse off (whether military conflict or a race for scarce resources). Compare: a rising China makes Americans better off in absolute terms. Also true, unless we consider the possibility of conflict....[without conflict] humans are only worse off relative to AI (or to humans who are able to leverage AI effectively). The availability of AI still probably increases humans’ absolute wealth. This is a problem for humans because we care about our fraction of influence over the future, not just our absolute level of wealth over the short term."

Similarly, it seems to me that we could have a scenario in which people realise they can't enforce contracts with AIs, but the losses that result from that are relatively small, and are outweighed by the benefits of the AI, so people continue using the AIs despite the lack of enforceability of the contracts.
And then this could still lead to existential catastrophe due to black swan events people didn't adequately account for, competitive dynamics, or "externalities" e.g. in relation to future generations.

I'm not personally sure how likely I find any of the above scenarios. I'm just saying that they seem to reveal reasons to have at least some doubts that "if we cannot enforce contracts with AIs then people will promptly realise and stop using AIs".

That said, I think it would still be true that the possibilities of trial-and-error, recognition of lack of enforceability, and people's concerns about that are at least some reason to assume that if AIs are used, contracts will be enforceable.

Nevertheless, extensions to PAL might still be useful. Agency rents are what might allow AI agents to accumulate wealth and influence, and agency models are the best way we have to learn about the size of these rents. These findings should inform a wide range of future scenarios, perhaps barring extreme ones like Bostrom/Yudkowsky.

For myself, this is the most exciting thing in this post—the possibility of taking the principal-agent model and using it to reason about AI even if most of the existing principal-agent literature doesn't provide results that apply. I see little here to make me think the principal-agent model wouldn't be useful, only that it hasn't been used in ways that are useful to AI risk scenarios yet. It seems worthwhile, for example, to pursue research on the principal-agent problem with some of the adjustments to make it better apply to AI scenarios, such as letting the agent be more powerful than the principal and adjusting the rent measure to better work with AI.

Maybe this approach won't yield anything (as we should expect on priors, simply because most approaches to AI safety are likely not going to work), but it seems worth exploring further on the chance it can deliver valuable insights, even if, as you say, the existing literature doesn't offer much that is directly useful to AI risk now.

I agree that this seems like a promising research direction! I think this would be done best while also thinking about concrete traits of AI systems, as discussed in this footnote. One potential beneficial outcome would be to understand which kind of systems earn rents and which don't; I wouldn't be surprised if the distinction between rent earning agents vs others mapped pretty cleanly onto a Bostromian utility maximiser vs CAIS distinction, but maybe it won't.

In any case, the alternative perspective offered by the agency rents framing compared to typical AI alignment discussion could help generate interesting new insights.

Stuart Armstrong argued that Hanson’s critique doesn’t work because PAL assumes contract enforceability, and with advanced AI, institutions might not be up to the task. Indeed, contract enforceability is assumed in most of PAL, so it’s an important consideration regarding their applicability to AI scenarios more broadly.

This seems kind of off to me. When I think about using the analysis of contracts between humans and AIs, I'm not imagining legal contracts: I'm using it as a metaphor for the human getting to directly set what 'the AI'* is motivated to do. As such, the contract really is strictly enforced, because the 'contract' is the motivational system of 'the AI', which there's reason to think 'the AI' is motivated to preserve and capable of preserving, a la Omohundro's basic AI drives.

Now, I think there are two issues with this:

Agents sometimes are incentivised to change their preferences as a result of bargaining. Think: "In order for us to work together on this project, which will deliver you bountiful rewards, I need you to stop being motivated to steal my lightly-guarded belongings, because I'm just not good enough at security to disincentivise you from stealing them myself."
More generally, we can think of the 'contract' as the program that constitutes 'the AI', which might be so complicated that humans don't understand it, and that might include a planning routine. In this case, 'the AI' might be motivated to modify the 'contract' to make 'itself' smarter.

But at any rate, I think the contract enforceability problem isn't a knock-down against the PAL being relevant.

[*] scare quotes to be a little more accurate and to placate my simulated Eric Drexler

I think it's worth distinguishing between a legal contract and setting the AI's motivational system, even though the latter is a contract in some sense. My reading of Stuart's post was that it was intended literally, not as a metaphor. Regardless, both are relevant; in PAL, you'd model the motivational system via the agent's utility function, and contract enforceability via the background assumption.

But I agree that contract enforceability isn't a knock-down, and indeed won't be an issue by default. I think we should have framed this more clearly in the post. Here's the most important part of what we said:

But it is plausible for when AIs are similarly smart to humans, and in scenarios where powerful AIs are used to enforce contracts. Furthermore, if we cannot enforce contracts with AIs then people will promptly realise and stop using AIs; so we should expect contracts to be enforceable conditional upon AIs being used.

I think it's worth distinguishing between a legal contract and setting the AI's motivational system, even though the latter is a contract in some sense.

To restate/clarify my above comment, I agree, but think that we are likely to delegate tasks to AIs by setting their motivational system and not by drafting literal legal contracts with them. So the PAL is relevant to the extent that it works as a metaphor for setting an AI's motivational system and source code, and in this context contract enforceability isn't an issue, and Stuart is making a mistake to be thinking about literal legal contracts (assuming that he is doing so).

Thanks for clarifying. That's interesting and seems right if you think we won't draft legal contracts with AI. Could you elaborate on why you think that?

Well because I think they wouldn't be enforceable in the really bad cases the contracts would be trying to prevent :) And also by default people currently delegate tasks to computers by writing software, which I expect to continue in future (although I guess smart contracts are an interesting edge case here).

The agency literature is there to model real agency relations in the world. Those real relations no doubt contain plenty of "unawareness". If models without unawareness were failing to capture and explain a big fraction of real agency problems, there would be plenty of scope for people to try to fill that gap via models that include it. The claim that this couldn't work because such models are limited seems just arbitrary and wrong to me. So either one must claim that AI-related unawareness is of a very different type or scale from ordinary human cases in our world today, or one must implicitly claim that unawareness modeling would in fact be a contribution to the agency literature. It seems to me a mild burden of proof sits on advocates for this latter case to in fact create such contributions.

The claim that this couldn't work because such models are limited seems just arbitrary and wrong to me.

The economists I spoke to seemed to think that in agency unawareness models conclusions follow pretty immediately from the assumptions and so don't teach you much. It's not that they can't model real agency problems, just that you don't learn much from the model. Perhaps if we'd spoken to more economists there would have been more disagreement on this point.

We have lots of models that are useful even when the conclusions follow pretty directly. Such as supply and demand. The question is whether such models are useful, not if they are simple.

So either one must claim that AI-related unawareness is of a very different type or scale from ordinary human cases in our world today, or one must implicitly claim that unawareness modeling would in fact be a contribution to the agency literature.

I agree that the Bostrom/Yudkowsky scenario implies AI-related unawareness is of a very different scale from ordinary human cases. From an outside view perspective, this is a strike against the scenario. However, this deviation from past trends does follow fairly naturally (though not necessarily) from the hypothesis of a sudden and massive intelligence gap.

"Hanson believes that the principal-agent literature (PAL) provides strong evidence against rents being this high."

I didn't say that. This is what I actually said:

"surely the burden of 'proof' (really argument) should lie on those say this case is radically different from most found in our large and robust agency literatures."

There are THOUSANDS of critiques out there of the form "Economic theory can't be trusted because economic theory analyses make assumptions that can't be proven and are often wrong, and conclusions are often sensitive to assumptions." Really, this is a very standard and generic critique, and of course it is quite wrong, as such a critique can be equally made against any area of theory whatsoever, in any field.

But of course, it can't be used against them all equally. Physics is so good you can send a probe to a planet millions of miles away. But trying to achieve a practical result in economics is largely guesswork.

Aside from the arguments we made about modelling unawareness, I don't think we were claiming that econ theory wouldn't be useful. We argue that new agency models could tell us about the levels of rents extracted by AI agents; just that i) we can't infer much from existing models because they model different situations and are brittle, and ii) models won't shed light on phenomena beyond what they are trying to model.

"models are brittle" and "models are limited" ARE the generic complaints I pointed to.

Great post! It explained clearly both positions, clarified the potential uses of PAL and proposed variations when it was considered accessible.

Maybe my only issue is with the (lack of) definition of the principal-agent problem. The rest of the post works relatively well without you defining it explicitly, but I think a short definition (even just a rephrasing of the one on Wikipedia) would make the post even more readable.

Thanks! Yeah, we probably should have included a definition. The wikipedia page is good.

